---
summary: Memory usage grows at a near-constant unbounded rate
---
created: 2015-03-17 11:47:55.0
creator: leonerd
description: |-
  Synapse on matrix.org seems to grow linearly in memory usage, without bound, until it is killed and restarted.

  Interestingly, my homeserver at leonerd.org.uk shows totally flat memory usage when no clients are connected to it, even though it is still receiving federation messages. It just has no clients, and is not generating any outbound federation traffic of its own.
id: '11246'
key: SYN-315
number: '315'
priority: '3'
project: '10000'
reporter: leonerd
status: '1'
type: '4'
updated: 2016-11-07 18:27:51.0
votes: '0'
watches: '5'
workflowId: '11346'
---
actions:
- author: leonerd
  body: |-
    Some further analysis here on the results of `syload` under various 2-minute tests.

    Hammering it with:

    `/initialSync` requests alone, with no messages or eventstreams ==> no growth.

    Sending a message with no local viewers at all (not even the sending user) ==> grows ~8k per message:
    {noformat}
    tested [2223 in 2m00s; 18.52/sec]...
    process_resource_maxrss | 245891072 | 264699904 +18808832 (8461.01 /call)
    {noformat}

    Sending a message with only the sending user viewing the room ==> grows ~52k per message:
    {noformat}
    tested [1869 in 2m00s; 15.57/sec]...
    process_resource_maxrss | 285736960 | 384155648 +98418688 (52658.47 /call)
    {noformat}

    Sending a message with 3 other local viewers as well as the sender ==> grows ~54k per message:
    {noformat}
    tested [621 in 2m00s; 5.17/sec]...
    process_resource_maxrss | 401752064 | 435781632 +34029568 (54798.02 /call)
    {noformat}

    Sending a message with the same 3 local viewers + remote viewers over federation ==> grows ~52k per message:
    {noformat}
    tested [509 in 2m00s; 4.24/sec]...
    process_resource_maxrss | 442925056 | 469762048 +26836992 (52724.94 /call)
    {noformat}

    Receiving a message over federation with all 4 local viewers, but no local senders ==> grows ~52k per message:
    {noformat}
    tested [572 in 2m00s; 4.77/sec]...
    process_resource_maxrss | 472530944 | 502366208 +29835264 (52159.55 /call)
    {noformat}

    Receiving a message over federation with no local viewers ==> grows ~13k per message:
    {noformat}
    tested [535 in 2m00s; 4.46/sec]...
    process_resource_maxrss | 504758272 | 511885312 +7127040 (13321.57 /call)
    {noformat}
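
    For reference, `process_resource_maxrss` is presumably just `getrusage`'s `ru_maxrss`, and the "/call" figure is simply the delta divided by the request count. A minimal sketch of the arithmetic (assuming Linux's convention that `ru_maxrss` is in KiB; `run_test()` is a hypothetical driver, not syload's actual code):
    {noformat}
    import resource

    def maxrss_bytes():
        # On Linux, ru_maxrss is reported in kilobytes; scale to bytes so it
        # lines up with the process_resource_maxrss figures above. This has to
        # run inside the server process (or you read the exported metric instead).
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024

    before = maxrss_bytes()
    n_requests = run_test()   # hypothetical: drives the 2-minute test, returns request count
    after = maxrss_bytes()

    print("%d -> %d, %+d (%.2f /call)" % (before, after, after - before,
                                          float(after - before) / n_requests))
    {noformat}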
  created: 2015-03-18 19:10:10.0
  id: '11404'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-03-18 19:10:10.0
- author: leonerd
  body: It's also interesting to note that with four local actors (1 sender + 3 viewers, or 4 viewers), the leak rate of ~52k/message is around four times that of having no local actors and just receiving over federation, where it's ~13k/message.
  created: 2015-03-18 19:11:54.0
  id: '11405'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-03-18 19:11:54.0
- author: erikj
  body: Is this with or without the caches?
  created: 2015-03-18 19:32:40.0
  id: '11406'
  issue: '11246'
  type: comment
  updateauthor: erikj
  updated: 2015-03-18 19:32:40.0
- author: leonerd
  body: With the caches; basically stock code off `develop`. I'll put in a quick one-line hack to just bypass all of the caches and try it again.
  created: 2015-03-19 10:27:13.0
  id: '11407'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-03-19 10:27:13.0
- author: leonerd
  body: |-
    With the caching layer disabled (by having it not store anything in the dict, so every request looks like a miss):

    Sending (no viewers):
    {noformat}
    process_resource_maxrss | 244072448 | 266092544 +22020096 (9483.25 /call)
    {noformat}

    Sending (only sender viewing):
    {noformat}
    process_resource_maxrss | 282673152 | 379551744 +96878592 (53642.63 /call)
    {noformat}

    Sending (other local viewers):
    {noformat}
    process_resource_maxrss | 389906432 | 420167680 +30261248 (53089.91 /call)
    {noformat}

    Sending (local and remote viewers):
    {noformat}
    process_resource_maxrss | 423378944 | 449429504 +26050560 (56022.71 /call)
    {noformat}

    Receiving (local viewers):
    {noformat}
    process_resource_maxrss | 457654272 | 526893056 +69238784 (128457.86 /call)
    {noformat}

    Receiving (no local viewers):
    {noformat}
    process_resource_maxrss | 534740992 | 534757376 +16384 (29.90 /call)
    {noformat}

    This very last one is somewhat interesting: it suggests that there isn't actually a leak when receiving with no viewers, other than the extra usage the cache consumes. This also agrees with my observation on `leonerd.org.uk`.
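
    For the record, the bypass is nothing more elaborate than something of this shape, where every lookup misses and nothing is retained (an illustrative stand-in, not the actual Synapse cache code):
    {noformat}
    # Illustrative cache stub: get() always misses and set() discards, so the
    # same code paths are exercised but nothing is kept between requests.
    class BypassedCache(object):
        def get(self, key, default=None):
            return default

        def set(self, key, value):
            pass

        def invalidate(self, key):
            pass
    {noformat}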
  created: 2015-03-19 12:52:41.0
  id: '11408'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-03-19 12:52:41.0
- author: leonerd
  body: |-
    Now doing some `manhole+objgraph`-based debugging around the first leaking test (local sender + just itself viewing):

    Before the test:
    {noformat}
    >>> objgraph.show_most_common_types()
    dict 21171
    tuple 13741
    function 10282
    frozendict 7025
    list 3647
    weakref 3057
    builtin_function_or_method 2045
    instance 1757
    PointerType 1492
    wrapper_descriptor 1315
    {noformat}

    After having done {{tested [1777 in 2m00s; 14.81/sec]}}:
    {noformat}
    >>> objgraph.show_most_common_types()
    dict 46240
    tuple 27002
    frozendict 21746
    function 10992
    list 6524
    weakref 3183
    _EventInternalMetadata 3108
    FrozenEvent 3108
    instance 2589
    cell 2131
    {noformat}
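
    Next step is to ask objgraph what is keeping those FrozenEvents alive; the usual recipe, run from the same manhole, is something like this (needs graphviz for the rendering):
    {noformat}
    import random
    import objgraph

    # Pick one of the lingering events and walk backwards to whatever is
    # still referencing it; the reference chain is written out as a graph.
    evt = random.choice(objgraph.by_type('FrozenEvent'))
    chain = objgraph.find_backref_chain(evt, objgraph.is_proper_module)
    objgraph.show_chain(chain, filename='frozenevent-chain.png')
    {noformat}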
  created: 2015-03-19 17:59:17.0
  id: '11411'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-03-19 17:59:17.0
- author: leonerd
  body: Huh; oops. Turns out a lot of those were still coming from the main *getEvents* LruCache. I'll disable that and try it again...
  created: 2015-03-19 18:17:30.0
  id: '11412'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-03-19 18:17:30.0
- author: leonerd
  body: |-
    Before:

    {noformat}
    >>> objgraph.show_growth()
    function 10282 +10282
    dict 10108 +10108
    tuple 7596 +7596
    list 3587 +3587
    weakref 3234 +3234
    builtin_function_or_method 2045 +2045
    instance 1759 +1759
    PointerType 1492 +1492
    wrapper_descriptor 1315 +1315
    type 1251 +1251
    {noformat}

    After doing {{tested [476 in 2m00s; 3.97/sec]}}:

    {noformat}
    >>> objgraph.show_growth()
    dict 11606 +1498
    frozendict 780 +780
    list 4356 +769
    tuple 8332 +736
    cell 1013 +326
    function 10436 +154
    FrozenEvent 130 +130
    _EventInternalMetadata 130 +130
    instance 1886 +127
    instancemethod 584 +85
    {noformat}

    Amusingly, therefore, we could only possibly have leaked 1 list, 1 frozendict and 3 dicts per event, maximum. Everything else grew too little to account for it.

    Hmmmmm.
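
    (Those per-event figures come from eyeballing the show_growth deltas against the 476 events; the same diff can also be computed directly rather than by eyeball - a sketch, using objgraph's typestats(), which returns the type-to-count dict:)
    {noformat}
    import objgraph

    before = objgraph.typestats()      # type name -> live object count
    # ... run the 2-minute syload send test here ...
    after = objgraph.typestats()

    n_events = 476                     # from the "tested [476 in 2m00s]" line
    deltas = sorted(((after[n] - before.get(n, 0), n) for n in after), reverse=True)
    for delta, name in deltas[:10]:
        print("%-25s %+6d  (%.2f per event)" % (name, delta, float(delta) / n_events))
    {noformat}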
  created: 2015-03-19 18:31:45.0
  id: '11413'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-03-19 18:31:45.0
- author: leonerd
  body: |-
    Turns out that in my previous memory-leak tests I had forgotten about the *getEvents* LruCache. Running again with that disabled too is much slower, but leaks much less. Confusingly, some of the tests that do more things don't appear to leak, while subsets of them do. I suspect at this point we're not even looking at a leak at all, just extra memory temporarily used for results and so on, combined with the extra usage in that cache.

    Annoyingly, maxrss isn't the most useful metric to be looking at here, but I can't work out how to ask Python itself how much memory it thinks it has allocated to internal objects :(
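
    The closest I can get is summing getsizeof over everything the GC can see, which is only ever a rough lower bound (it misses memory held by C extensions and allocator fragmentation, both of which maxrss does include); Python 3.4's tracemalloc would do this properly, but that's not an option here:
    {noformat}
    import gc
    import sys

    # Rough lower bound on "memory Python thinks it has allocated": the sum of
    # sys.getsizeof over all gc-tracked objects. Objects that don't report a
    # size are counted as 0.
    objs = gc.get_objects()
    total = sum(sys.getsizeof(o, 0) for o in objs)
    print("%.1f MiB across %d gc-tracked objects" % (total / (1024.0 * 1024.0), len(objs)))
    {noformat}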
  created: 2015-03-19 19:16:17.0
  id: '11414'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-03-19 19:16:17.0
- author: leonerd
  body: |-
    Well, that doesn't seem to pan out.
    We tried an experiment capping the getEvents cache at 1024 entries, a limit it hit well within a few hours, while memory grew from a restart-time level of 185MiB to 2.3GiB. After that it stayed at 1024 entries for almost a day, by which point memory usage had grown to 5.5GiB.

    So yes, more leakage than just the cache.
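
    (For context, the cap is conceptually nothing more than a bounded LRU along these lines - a minimal sketch, not Synapse's actual LruCache:)
    {noformat}
    from collections import OrderedDict

    # Minimal bounded LRU: once max_entries is reached, the least-recently-used
    # entry is evicted, which is why the entry count plateaus at 1024 while the
    # process itself can keep growing for other reasons.
    class CappedLru(object):
        def __init__(self, max_entries=1024):
            self.max_entries = max_entries
            self._cache = OrderedDict()

        def get(self, key, default=None):
            if key not in self._cache:
                return default
            value = self._cache.pop(key)
            self._cache[key] = value   # mark as most recently used
            return value

        def set(self, key, value):
            if key in self._cache:
                self._cache.pop(key)
            elif len(self._cache) >= self.max_entries:
                self._cache.popitem(last=False)   # evict the oldest entry
            self._cache[key] = value
    {noformat}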
  created: 2015-03-25 15:31:35.0
  id: '11435'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-03-25 15:31:35.0
- author: leonerd
  body: |-
    Looking again at today's graphs, after almost a week of continuous uptime, shows another curious fact.

    It seems like the notifiers list grows more or less without bound, and sharp increases in memory usage do look correlated with increases in notifiers. This is also supported by my home server, which has no notifiers because no client is connected to it, and whose memory usage hasn't grown a byte in two weeks.

    We currently seem to be missing an "open file descriptors" metric; I'll add one, and then we can compare it against the notifier list - maybe we're failing to notice some cases of clients disconnecting?
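
    The counting itself is cheap, at least on Linux - something along these lines is enough for the metric (illustrative; not necessarily how the final metrics hook will look):
    {noformat}
    import os

    def count_open_fds():
        # /proc/self/fd has one entry per open file descriptor (sockets
        # included), so its length is the number to export and compare
        # against the size of the notifier list.
        return len(os.listdir("/proc/self/fd"))
    {noformat}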
  created: 2015-04-01 17:46:15.0
  id: '11465'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-04-01 17:46:15.0
- author: markjh
  body: |-
    We are attempting to reproduce the memory leak using syload so that we can study the heap with objgraph. Latest develop (5b89052d2f) doesn't appear to leak memory doing basic message sending, so in order to try to reproduce the leak we intend to expand the range of things syload tests.

    We will need to test:
    * -Sending messages with listening clients-
    * -Sending messages with transaction ids-
    * -Sending messages over federation-
    * -Receiving messages over federation-
    * Joining/leaving rooms
    * -Typing/presence with listening clients-
    * Sending messages with pushers

    We should test Application Servers at some point, but we believe the memory leak was present before the application server code was added, so we shouldn't need to include tests for them in order to find the leak.
  created: 2015-04-08 16:14:21.0
  id: '11477'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-04-09 19:02:46.0
- author: leonerd
  body: |-
    Have added:
    [master c91770a] Perform a room send test with client-specified transaction IDs
  created: 2015-04-09 17:44:40.0
  id: '11480'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-04-09 17:44:40.0
- author: leonerd
  body: '[master d8ccb02] Some load tests involving EDUs - presence and typing notifications'
  created: 2015-04-09 19:02:32.0
  id: '11482'
  issue: '11246'
  type: comment
  updateauthor: leonerd
  updated: 2015-04-09 19:02:32.0
- author: indrora
  body: |-
    I can confirm that this is definitely happening as well.

    I'm running on a Ramnode instance (256M real, 128M swap) and Synapse is getting OOM-killed about every 4 hours. At the moment I've got my settings scaled way back.

    One thing I've found that eats *lots* of memory is the webclient. Removing the webclient from my config has pushed the time-to-OOM-kill back fairly heftily. I've been unable to disable the non-HTTPS listener without crashing (that's fodder for another ticket), but I'd assume that would cut the memory footprint by about a third or so.

    Note that my instance ( cortex.zaibatsutel.net ) is only joined to two rooms: #matrix:matrix.org and #matrix-dev:matrix.org
  created: 2015-09-26 16:42:01.0
  id: '12174'
  issue: '11246'
  type: comment
  updateauthor: indrora
  updated: 2015-09-26 16:42:01.0
- author: richvdh
  body: 'Migrated to github: https://github.com/matrix-org/synapse/issues/1299'
  created: 2016-11-07 18:27:51.0
  id: '13620'
  issue: '11246'
  type: comment
  updateauthor: richvdh
  updated: 2016-11-07 18:27:51.0