matrix.org/static/jira/browse/SYN-315

---
summary: Memory usage grows at a near-constant unbounded rate
---
created: 2015-03-17 11:47:55.0
creator: leonerd
description: |-
Synapse on matrix.org seems to grow linearly in memory usage, without bound, until it is killed and restarted.
Interestingly, my homeserver at leonerd.org.uk shows totally flat memory usage when no clients are connected to it, even though it is still receiving federation messages. It simply has no clients, nor does it generate outbound federation traffic of its own.
id: '11246'
key: SYN-315
number: '315'
priority: '3'
project: '10000'
reporter: leonerd
status: '1'
type: '4'
updated: 2016-11-07 18:27:51.0
votes: '0'
watches: '5'
workflowId: '11346'
---
actions:
- author: leonerd
body: |-
Some further analysis here on the results of `syload` under various 2-minute tests.
Hammering it with:
`/initialSync` requests alone, with no messages or eventstreams ==> no growth.
Sending a message with no local viewers at all (not even the sending user) ==> grows ~8k per message:
{noformat}
tested [2223 in 2m00s; 18.52/sec]...
process_resource_maxrss | 245891072 | 264699904 +18808832 (8461.01 /call)
{noformat}
Sending a message with only the sending user viewing the room ==> grows ~52k per message
{noformat}
tested [1869 in 2m00s; 15.57/sec]...
process_resource_maxrss | 285736960 | 384155648 +98418688 (52658.47 /call)
{noformat}
Sending a message with 3 other local viewers as well as the sender ==> grows ~54k per message
{noformat}
tested [621 in 2m00s; 5.17/sec]...
process_resource_maxrss | 401752064 | 435781632 +34029568 (54798.02 /call)
{noformat}
Sending a message with the same 3 local viewers + remote viewers over federation ==> grows ~52k per message
{noformat}
tested [509 in 2m00s; 4.24/sec]...
process_resource_maxrss | 442925056 | 469762048 +26836992 (52724.94 /call)
{noformat}
Receiving a message over federation with all 4 local viewers, but no local senders ==> grows ~52k per message
{noformat}
tested [572 in 2m00s; 4.77/sec]...
process_resource_maxrss | 472530944 | 502366208 +29835264 (52159.55 /call)
{noformat}
Receiving a message over federation with no local viewers ==> grows ~13k per message
{noformat}
tested [535 in 2m00s; 4.46/sec]...
process_resource_maxrss | 504758272 | 511885312 +7127040 (13321.57 /call)
{noformat}
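For reference, the "/call" figures above appear to be just the maxrss delta divided by the number of requests made in the 2-minute run. A minimal sketch of that arithmetic (assuming the two samples come from the server's process_resource_maxrss metric; the helper below is illustrative, not syload's actual reporting code):
{noformat}
# Sketch only: reproduce the "+growth (N /call)" arithmetic from two
# process_resource_maxrss samples and a request count.
def report_growth(rss_before, rss_after, calls):
    growth = rss_after - rss_before
    print("process_resource_maxrss | %d | %d +%d (%.2f /call)"
          % (rss_before, rss_after, growth, float(growth) / calls))

# e.g. the first test above:
report_growth(245891072, 264699904, 2223)   # +18808832 (8461.01 /call)
{noformat}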
created: 2015-03-18 19:10:10.0
id: '11404'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-03-18 19:10:10.0
- author: leonerd
body: It's also interesting to note that with four local actors (1 sender or 3 viewers, or 4 viewers), the leak rate of 52k/message is around four times that of having no local actors, receiving on federation, where it's 13k/message.
created: 2015-03-18 19:11:54.0
id: '11405'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-03-18 19:11:54.0
- author: erikj
body: Is this with or without the caches?
created: 2015-03-18 19:32:40.0
id: '11406'
issue: '11246'
type: comment
updateauthor: erikj
updated: 2015-03-18 19:32:40.0
- author: leonerd
body: With; basically stock code off `develop`. I'll put in a quick oneline hack to just bypass all of the caches and try it again.
created: 2015-03-19 10:27:13.0
id: '11407'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-03-19 10:27:13.0
- author: leonerd
body: |-
With the caching layer disabled (by having it not store anything in the dict, so every request looks like a miss):
Sending (no viewers):
{noformat}
process_resource_maxrss | 244072448 | 266092544 +22020096 (9483.25 /call)
{noformat}
Sending (only sender viewing):
{noformat}
process_resource_maxrss | 282673152 | 379551744 +96878592 (53642.63 /call)
{noformat}
Sending (other local viewers):
{noformat}
process_resource_maxrss | 389906432 | 420167680 +30261248 (53089.91 /call)
{noformat}
Sending (local and remote viewers):
{noformat}
process_resource_maxrss | 423378944 | 449429504 +26050560 (56022.71 /call)
{noformat}
Receiving (local viewers):
{noformat}
process_resource_maxrss | 457654272 | 526893056 +69238784 (128457.86 /call)
{noformat}
Receiving (no local viewers):
{noformat}
process_resource_maxrss | 534740992 | 534757376 +16384 (29.90 /call)
{noformat}
This very last one is somewhat interesting - it suggests that there isn't actually a leak when receiving with no viewers, beyond the extra usage the cache consumes. This also agrees with my observation on `leonerd.org.uk`.
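(For the record, the cache-bypass hack amounts to something like the following sketch - not the exact patch - where the backing dict simply drops every insert, so each lookup misses and falls through to the database:)
{noformat}
# Sketch of the cache-bypass hack (not the exact patch): the backing
# store drops every insert, so each lookup is a miss.
class NoopCacheStore(dict):
    def __setitem__(self, key, value):
        pass  # never retain anything

cache = NoopCacheStore()
cache["$event_id"] = "event"
assert "$event_id" not in cache  # every request looks like a miss
{noformat}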
created: 2015-03-19 12:52:41.0
id: '11408'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-03-19 12:52:41.0
- author: leonerd
body: |-
Now doing some `manhole+objgraph` based debugging around the first leaking test (local sender + just itself viewing):
Before the test:
{noformat}
>>> objgraph.show_most_common_types()
dict 21171
tuple 13741
function 10282
frozendict 7025
list 3647
weakref 3057
builtin_function_or_method 2045
instance 1757
PointerType 1492
wrapper_descriptor 1315
{noformat}
After having done {{tested [1777 in 2m00s; 14.81/sec]}}:
{noformat}
>>> objgraph.show_most_common_types()
dict 46240
tuple 27002
frozendict 21746
function 10992
list 6524
weakref 3183
_EventInternalMetadata 3108
FrozenEvent 3108
instance 2589
cell 2131
{noformat}
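(A natural next step with objgraph - hypothetical session, assuming the same manhole - is to pick one of the lingering FrozenEvent objects and walk its referrers to see what is keeping it alive:)
{noformat}
# Hypothetical manhole session: find what keeps a FrozenEvent alive.
import objgraph, random

events = objgraph.by_type('FrozenEvent')
obj = random.choice(events)

# chain of referrers from a module-level object down to this event
chain = objgraph.find_backref_chain(obj, objgraph.is_proper_module)

# or render the referrer graph to a file for offline inspection
objgraph.show_backrefs([obj], max_depth=5, filename='frozenevent-backrefs.png')
{noformat}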
created: 2015-03-19 17:59:17.0
id: '11411'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-03-19 17:59:17.0
- author: leonerd
body: Huh; oops. Turns out a lot of those were still coming from the main *getEvents* LruCache. I'll disable that and try it again...
created: 2015-03-19 18:17:30.0
id: '11412'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-03-19 18:17:30.0
- author: leonerd
body: |-
Before:
{noformat}
>>> objgraph.show_growth()
function 10282 +10282
dict 10108 +10108
tuple 7596 +7596
list 3587 +3587
weakref 3234 +3234
builtin_function_or_method 2045 +2045
instance 1759 +1759
PointerType 1492 +1492
wrapper_descriptor 1315 +1315
type 1251 +1251
{noformat}
After doing {{tested [476 in 2m00s; 3.97/sec]}}:
{noformat}
>>> objgraph.show_growth()
dict 11606 +1498
frozendict 780 +780
list 4356 +769
tuple 8332 +736
cell 1013 +326
function 10436 +154
FrozenEvent 130 +130
_EventInternalMetadata 130 +130
instance 1886 +127
instancemethod 584 +85
{noformat}
Amusingly, therefore, we could only possibly have leaked 1 list, 1 frozendict and 3 dicts per event, at most. Everything else grew too little to account for it.
Hmmmmm.
created: 2015-03-19 18:31:45.0
id: '11413'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-03-19 18:31:45.0
- author: leonerd
body: |-
Turns out that in my previous memory-leak tests I forgot about the *getEvents* LruCache. Running again with that disabled too is much slower, but leaks much less. Confusingly, some of the tests that do more things don't appear to leak, while subsets of them do. I suspect at this point we're not even looking at a leak at all, just extra memory temporarily used for results and so on, combined with the extra usage in that cache.
Annoyingly, maxrss isn't the most useful metric to be looking at here, but I can't work out how to ask Python itself how much memory it thinks it has allocated to internal objects :(
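(One way to ask the interpreter directly, assuming a Python 3.4+ interpreter - which may not match the deployment at the time - is the stdlib tracemalloc module:)
{noformat}
# Sketch, assuming Python 3.4+: tracemalloc reports memory allocated
# by the interpreter itself, independent of maxrss.
import tracemalloc

tracemalloc.start()
# ... run the test ...
current, peak = tracemalloc.get_traced_memory()
print("currently allocated: %d bytes (peak %d)" % (current, peak))

# top allocation sites, grouped by source line
for stat in tracemalloc.take_snapshot().statistics('lineno')[:10]:
    print(stat)
{noformat}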
created: 2015-03-19 19:16:17.0
id: '11414'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-03-19 19:16:17.0
- author: leonerd
body: |-
Well, that doesn't seem to pan out.
We tried an experiment capping the getEvents cache at 1024 entries, a limit it hit well within a few hours, during which memory grew from a restart-time level of 185MiB to 2.3GiB. After that, the cache stayed at 1024 entries for almost a day, by which point memory usage had grown to 5.5GiB.
So yes, there is more leakage than just the cache.
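(For reference, the cap amounts to bounding the cache at a fixed number of entries; a minimal sketch of that idea, not Synapse's actual LruCache implementation:)
{noformat}
# Minimal sketch of a size-capped LRU cache (illustrative only; not
# Synapse's actual LruCache).
from collections import OrderedDict

class CappedLruCache(object):
    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def get(self, key, default=None):
        try:
            value = self._cache.pop(key)
        except KeyError:
            return default
        self._cache[key] = value          # re-insert as most recently used
        return value

    def set(self, key, value):
        self._cache.pop(key, None)
        self._cache[key] = value
        while len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
{noformat}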
created: 2015-03-25 15:31:35.0
id: '11435'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-03-25 15:31:35.0
- author: leonerd
body: |-
Looking again at today's graphs after almost a week of continuous uptime shows another curious fact.
It seems like the notifiers list grows essentially without bound, and sharp increases in memory usage do look correlated with increases in notifiers. This is also supported by my home server, which has no notifiers because there's no client connected to it, and whose memory usage hasn't grown a byte in two weeks.
We currently seem to be missing an "open file descriptors" metric; I'll add one and then we can compare it to the notifier list - maybe we're failing to notice some cases of clients disconnecting?
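(A sketch of the kind of open-file-descriptor count that could be exported as a metric, assuming a Linux host with /proc mounted; not the exact metric as later added:)
{noformat}
# Sketch: count this process's open file descriptors on Linux.
import os

def count_open_fds():
    return len(os.listdir("/proc/self/fd"))

print("open fds: %d" % count_open_fds())
{noformat}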
created: 2015-04-01 17:46:15.0
id: '11465'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-04-01 17:46:15.0
- author: markjh
body: |-
We are attempting to reproduce the memory leak using syload so that we can study the heap with objgraph. The latest develop (5b89052d2f) doesn't appear to leak memory during basic message sending. To try to reproduce the leak, we intend to expand the range of things syload tests.
We will need to test:
* -Sending messages with listening clients-
* -Sending messages with transaction ids-.
* -Sending messages over federation-.
* -Receiving messages over federation-.
* Joining/leaving rooms.
* -Typing/presence with listening clients-.
* Sending messages with pushers.
We should test Application Servers at some point, but we believe that the memory leak was present before the application server code was added, so we shouldn't need to include tests for them in order to find the leak.
created: 2015-04-08 16:14:21.0
id: '11477'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-04-09 19:02:46.0
- author: leonerd
body: |-
Have added:
[master c91770a] Perform a room send test with client-specified transaction IDs
created: 2015-04-09 17:44:40.0
id: '11480'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-04-09 17:44:40.0
- author: leonerd
body: '[master d8ccb02] Some load tests involving EDUs - presence and typing notifications'
created: 2015-04-09 19:02:32.0
id: '11482'
issue: '11246'
type: comment
updateauthor: leonerd
updated: 2015-04-09 19:02:32.0
- author: indrora
body: |-
I can confirm that this is definitely happening as well.
I'm running on a Ramnode instance (256M real, 128M swap) and Synapse is getting OOMKilled about every 4 hours. At the moment, I've got my settings scaled way back.
One thing I've found that eats *lots* of memory is the webclient. Removing the webclient from my config has extended the time-to-OOM-kill fairly heftily. I've been unable to disable the non-HTTPS listener without crashing - that's fodder for another ticket - but I'd assume that would cut the memory footprint by about a third or so.
Note that my instance ( cortex.zaibatsutel.net ) is only joined up to two rooms: #matrix:matrix.org and #matrix-dev:matrix.org
created: 2015-09-26 16:42:01.0
id: '12174'
issue: '11246'
type: comment
updateauthor: indrora
updated: 2015-09-26 16:42:01.0
- author: richvdh
body: 'Migrated to github: https://github.com/matrix-org/synapse/issues/1299'
created: 2016-11-07 18:27:51.0
id: '13620'
issue: '11246'
type: comment
updateauthor: richvdh
updated: 2016-11-07 18:27:51.0