OOM crashing several times per day
Steps to reproduce
Unclear. I am just using the app as normal.
Crashes feel more frequent after sending messages in encrypted rooms though. I've noticed it mostly while marking rooms as read by clicking through several in my room list.
Outcome
What did you expect?
n/a
What happened instead?
Crashing to blank page. Started happening within the last week. This is annoying because it takes 5-10 minutes for the app to start back up again.
Operating system
Windows 11
Application version
Element Nightly version: 0.0.1-nightly.2024090101 Crypto version: Rust SDK 0.7.1 (c8c9d15), Vodozemac 0.6.0
How did you install the app?
The Internet
Homeserver
t2l.io
Will you send logs?
Yes
I think there's a memory leak somewhere in the timeline/room view. Crashes seem to become more frequent the more rooms I switch between. The speed of switching doesn't appear to affect the bug (clicking through my room list at 5Hz vs 0.25Hz makes no difference - it crashes eventually after ~50 or so clicks). It may be one room in particular that's causing the crashes, but I've not yet found the pattern.
Many of my rooms are encrypted.
This has now got as bad as crashing within minutes of restarting. I didn't even get to read a message before it crashed again.
My account is too large for a heap/allocation profile, but I can run the detached nodes profile multiple times. It appears we're accumulating RoomView references, possibly from clicking through rooms too quickly.
The message preview tiles tend to clean themselves up eventually (probably just some React thing). The RoomViews persist, however.
I've done some local poking in https://github.com/element-hq/element-web/commit/1324181a596bbb0ee245ef0def223c2621aef92c and had mild success: about 10% fewer rooms got stuck, but that's hardly a fix. I haven't been able to find anything consistent between the rooms that get stuck which would explain what's keeping them around.
The legacyCallHandler remove listener is called twice, but that doesn't appear to have any consequences.
I can't confirm it with a memory profile; however, by never clicking on Matrix HQ I'm able to keep my Element running for 2-3 days instead of 2-3 minutes. I suspect that something in Element is pulling down the whole room state and storing it in live memory on startup, as there's not really any other reason for that room to be problematic compared to other rooms.
Lazily loading room state details might help overall memory usage? We probably don't need to know all the users in HQ all the time, for example.
edit: the reason I think lazily loading state will help is that, in testing, a Clear Cache and Reload is the only thing that fixes "accidentally clicked on HQ, so now it crashes within a few minutes of starting up". This would imply that clicking the room causes full state (or something else memory-heavy) to be pulled into the local cache and put into live memory on startup, perpetuating the OOM cycle.
This has been biting me for a year or so too, and it got to the point where the app would OOM during launch, so I had to manually clean the cache using await mxMatrixClientPeg.get().stopClient(); await mxMatrixClientPeg.get().deleteAllData() or similar at the console to fake a cache flush. I got to the point where I was having to do this roughly once a day, and given an initial sync takes ~30 minutes on my account, I cracked and investigated further.
Findings are as follows:
- On a fresh login, my account idles at around 900MB of v8 heap for 5,400 rooms. This is high; last time I went on a mission like this (https://github.com/element-hq/element-web/issues/12417 iirc) my heap was around 350MB for around 4,500 rooms.
- A quick initial analysis of this showed:
- it doesn't seem amazing that it has 1,230,797 DanglingReceipt objects hanging around (albeit only retaining 40MB)
- it "only" has 68,571 room members, which isn't so bad
- and 30,372 FiberNodes from react
- I added a very basic automatic profiler to Electron using the v8 debugger API to try to catch either snapshots before/during blowup or heap allocation profiles: https://gist.github.com/ara4n/0a834d14f6e59f55b4df197647ea2fbc
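The shape of it is roughly this (a minimal sketch rather than the gist itself; the 1200MB threshold, 1s poll interval and file naming are placeholders):

const { app } = require('electron');
const fs = require('fs');
const path = require('path');

// Poll Runtime.getHeapUsage over the debugger API and grab a ~2s sampling
// allocation profile once the heap crosses a hard-coded threshold.
async function watchHeap(webContents) {
    const dbg = webContents.debugger;
    dbg.attach('1.3'); // throws if DevTools is already attached
    await dbg.sendCommand('HeapProfiler.enable');

    let sampling = false;
    setInterval(async () => {
        const { usedSize, totalSize } = await dbg.sendCommand('Runtime.getHeapUsage');
        const usedMB = usedSize / 1024 / 1024;
        console.log(`${new Date().toISOString()}: Heap: ${usedMB.toFixed(2)}MB / ${(totalSize / 1024 / 1024).toFixed(2)}MB`);

        if (usedMB > 1200 && !sampling) { // placeholder "explosion imminent" threshold
            sampling = true;
            console.log(`${new Date().toISOString()}: Explosion imminent, sampling...`);
            await dbg.sendCommand('HeapProfiler.startSampling');
            setTimeout(async () => {
                const { profile } = await dbg.sendCommand('HeapProfiler.stopSampling');
                const file = path.join(app.getPath('userData'), `allocation-${Date.now()}.heapprofile`);
                fs.writeFileSync(file, JSON.stringify(profile));
                console.log(`Allocation profile saved to: ${file}`);
                sampling = false;
            }, 2000);
        }
    }, 1000);
}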
- Heap usage during a typical explosion looks like this:
- The actual heap usage reported by the debugger API when it finally explodes jumps from ~1.3GB to... 2.xGB or so, which seems crazy and made me think there was a gigantic allocation happening that finally tips it over the edge. In retrospect, I think this may be a red herring: when v8 starts to hit the limit (around 1.4GB), perhaps it doesn't have enough memory space to actually GC properly (or it ends up with two generations in RAM at the same time), so it stops freeing anything and blows up as it OOMs.
A typical out-of-the-blue blowup (this one happened for reasons unknown, having been switching between rooms a bit, albeit with one room's thread panel open, and then using seshat to search around a bit) looks like this, jumping suddenly from 965MB of heap to ~1980MB:
2025-11-27T08:25:58.093Z: Heap: 960.38MB / 1112.41MB
2025-11-27T08:25:59.024Z: Heap: 961.97MB / 1112.41MB
2025-11-27T08:26:00.058Z: Heap: 963.56MB / 1112.41MB
2025-11-27T08:26:01.104Z: Heap: 965.16MB / 1112.41MB
<--- Last few GCs --->
[98592:0x11402090000] 22355891 ms: Scavenge (during sweeping) 1980.7 (2030.2) -> 1975.1 (2044.4) MB, pooled: 0.0 MB, 4.96 / 0.00 ms (average mu = 0.999, current mu = 0.970) allocation failure;
[98592:0x11402090000] 22355914 ms: Scavenge (during sweeping) 1995.0 (2044.4) -> 1991.0 (2057.2) MB, pooled: 0.0 MB, 6.61 / 0.00 ms (average mu = 0.999, current mu = 0.970) allocation failure;
[98592:1127/082601.941521:ERROR:third_party/blink/renderer/bindings/core/v8/v8_initializer.cc:855] V8 javascript OOM (MarkCompactCollector: young object promotion failed).
[98592:1127/082601.941543:ERROR:third_party/blink/renderer/bindings/core/v8/v8_initializer.cc:855] V8 javascript OOM (MarkCompactCollector: young object promotion failed).
- I can't heap snapshot during blowups, even via the debugger API, because a 1.3GB heap takes minutes to capture on my M1 Max and by the time it finishes, the process has already OOMed.
- Allocation profiling similarly failed at first: trying to sample for ~20s either side of an explosion silently failed to save a profile.
- However, I ended up reliably reproing an OOM on launch by peeking into old Matrix HQ (!OGEhHVWSdvArJzumhm:matrix.org), which has ~73K joined users (and probably many more state events, given spam)
- So I ended up triggering the alloc profile for 2s on receipt of a v1 /room/initialSync used for peeking and finally caught it in the act:
2025-11-27T01:02:32.553Z: Explosion imminent, sampling...
2025-11-27T01:02:33.072Z: Heap: 873.37MB / 1049.41MB
2025-11-27T01:02:33.719Z: Heap: 990.89MB / 1166.94MB
2025-11-27T01:02:37.557Z: ⚠️ Trying to stop profiling...
2025-11-27T01:02:38.127Z: Heap: 1608.03MB / 1702.52MB
2025-11-27T01:02:40.013Z: ⚠️ Got profile...
Allocation profile saved to: /Users/matthew/Library/Application Support/Element/allocation-1764205360013.heapprofile
Profile being: allocation-1764205360013.heapprofile
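Wiring that trigger up from the Electron main process can be as simple as a webRequest hook on the peek endpoint; a sketch only (the URL pattern and structure are assumptions rather than what the gist actually does, reusing the HeapProfiler commands from the sketch above):

// When a room peek (/rooms/{roomId}/initialSync) goes out, sample allocations
// for ~2s. Assumes the debugger is already attached, as above.
function sampleOnPeek(webContents) {
    const dbg = webContents.debugger;
    const filter = { urls: ['*://*/_matrix/client/*/rooms/*/initialSync*'] }; // guessed pattern
    webContents.session.webRequest.onBeforeRequest(filter, (details, callback) => {
        callback({}); // let the request through untouched
        console.log(`${new Date().toISOString()}: Explosion imminent, sampling...`);
        dbg.sendCommand('HeapProfiler.startSampling').then(() => {
            setTimeout(async () => {
                const { profile } = await dbg.sendCommand('HeapProfiler.stopSampling');
                require('fs').writeFileSync(`allocation-${Date.now()}.heapprofile`, JSON.stringify(profile));
            }, 2000);
        });
    });
}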
- This shows that it allocated 500MB of heap over the 2 seconds in js-sdk, mainly RoomMembers but also loads of reemitted events and state events.
So, my best bets here are:
- Something is still causing explosions (aside from peeking in rooms) - could be threads, dangling receipts, retained RoomViews or similar.
- We never clear out js-sdk RAM, so if you peek into old Matrix HQ you'll lose 500MB forever. We should flush state events, and certainly member info, from the room store for rooms which aren't used very often.
- We should probably also flush live timeline history for rooms which haven't been viewed in a while (although I'm not sure that's contributing that much)
- RoomMembers are big - 74K members take up 50MB of allocations, which seems huge. They have a bunch of fields.
- We should switch to SSS for peeking.
- We should not use members for profiles; we should do MSC4218 or similar in order for tab completion to work.
- We should never lazy-load members in for non-encrypted rooms
Next steps:
- Find out why the baseline heap usage is so high
- Flush the room store if you haven't looked at a room in a bit.
So, before I forget to capture conclusions:
Find out why the baseline heap usage is so high
This is due to 500MB(!) of heap retained by read receipts. Apparently I have 800K read receipts in my sync accumulator, and some bug in the threaded RR implementation means that ends up as 1.2M of them in RAM, and also doubles all the state events on the heap. For now I've commented out the receipt accumulator to stop them ever hitting the heap of my main process.
Something is still causing explosions
The reason the RAM suddenly spikes from nowhere when it OOMs is due to persisting the sync accumulator - allocating a huge blob of JSON to then postMessage to the indexeddb service worker. I think v8 then fails to have enough room to GC, causing a runaway explosion in heap size before it OOMs (perhaps the beginning of the GC storm mentioned in the --detect-ineffective-gcs-near-heap-limit "trigger out-of-memory failure to avoid GC storm near heap limit" v8 option, which defaults to true):
2025-11-28T19:18:24.458Z: [VERBOSE]: Persisting sync data up to s6460312421_757284974_22305924_4183883604_4801008804_267616778_1553125848_11221601866_0_572540_34
Source: vector://vector/webapp/bundles/7996f33588679dd6b681/default-vendors-node_modules_matrix-js-sdk_src_indexeddb-worker_ts-node_modules_matrix-js-sdk-467139.js:827
2025-11-28T19:18:25.002Z: Heap: 964.40MB / 1143.11MB
2025-11-28T19:18:26.036Z: Heap: 964.81MB / 1143.11MB
2025-11-28T19:18:27.083Z: Heap: 965.44MB / 1143.11MB
2025-11-28T19:18:28.020Z: Heap: 966.77MB / 1143.11MB
2025-11-28T19:18:29.060Z: Heap: 967.20MB / 1143.11MB
2025-11-28T19:18:30.104Z: Heap: 967.61MB / 1143.11MB
<--- Last few GCs --->
[52053:0x13409570000] 736160 ms: Scavenge (during sweeping) 1965.7 (2026.0) -> 1961.2 (2039.0) MB, pooled: 0.0 MB, 8.30 / 0.00 ms (average mu = 0.999, current mu = 0.970) allocation failure;
[52053:0x13409570000] 736215 ms: Scavenge (during sweeping) 1981.0 (2040.7) -> 1971.8 (2054.0) MB, pooled: 0.0 MB, 7.55 / 0.00 ms (average mu = 0.999, current mu = 0.970) allocation failure;
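To spell out why that persist is so expensive, the shape of the problem is roughly this (an illustration only, not the js-sdk's actual code or message format):

// Persisting means handing the whole accumulated sync state to the indexeddb
// worker in one postMessage, so v8 briefly needs room for the live accumulator
// *and* the structured-clone copy in flight - hundreds of MB of headroom right
// when the heap is already near its limit.
const worker = new Worker('indexeddb-worker.js'); // stand-in for the js-sdk worker
function persistSyncData(accumulatedSyncState) {
    worker.postMessage({ command: 'persistSyncData', payload: accumulatedSyncState });
}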
The reason it took me ages to realise this is:
- I had totally blanked all the work I did on https://github.com/element-hq/element-desktop/issues/680
- Heap snapshotting based on profiling wasn't fast enough to react to the spike (or the spike crashed v8 before the snapshot was complete)
- Alloc profiling similarly couldn't complete before the OOM kicked in.
- In the end, I tracked it down using
app.commandLine.appendSwitch('js-flags', '--trace-allocation-stack-interval=16384');
as a poor man's profiler, literally eyeballing the stacktraces once it finally OOMed. This is much saner and easier than the previous approach of extracting v8 stacktraces out of a native coredump. Once I caught it in the act:
==== JS stack trace =========================================
0: ApiCallbackExitFrame put(this=0x37d92367e6bd <IDBObjectStore map = 0x37d91f7b8549>#0#,0x37d92367e6cd <Object map = 0x37d94b7a5811>#1#)
1: /* anonymous */(aka /* anonymous */) [0x37d92367e11d] [vector://vector/webapp/bundles/7996f33588679dd6b681/default-node_modules_matrix-js-sdk_src_store_indexeddb-local-backend_ts.js:37774] [bytecode=0x13800c035c5 offset=64](this=0x37d900000011 <undefined>)
2: promiseTry(aka promiseTry) [0x37d90af1947d] [vector://vector/webapp/bundles/7996f33588679dd6b681/default-vendors-node_modules_matrix-js-sdk_src_indexeddb-worker_ts-node_modules_matrix-js-sdk-467139.js:1621] [bytecode=0x13800c00b25 offset=9](this=0x37d900000011 <undefined>,0x37d92367e11d <JSFunction (sfi = 0x37d94b7a568d)>#2#)
3: persistSyncData [0x37d90af01999] [vector://vector/webapp/bundles/7996f33588679dd6b681/default-node_modules_matrix-js-sdk_src_store_indexeddb-local-backend_ts.js:37771] [bytecode=0x13800c00e45 offset=59](this=0x37d90af2f535 <LocalIndexedDBStoreBackend map = 0x37d91f7b3585>#3#,0x37d92b13ff5d <String[137]: "m6459894381~36.6459894383~1.6459894387~3.6459894383_757284974_21613873_4183519081_4800543611_267616002_1553074096_11221500426_0_572469_34">,0x37d93eb7a71d <Object map = 0x37d91bbf5ad9>#4#)
4: doSyncToDatabase [0x37d90af0197d] [vector://vector/webapp/bundles/7996f33588679dd6b681/default-node_modules_matrix-js-sdk_src_store_indexeddb-local-backend_ts.js:37757] [bytecode=0x13800bf91c5 offset=111](this=0x37d90af2f535 <LocalIndexedDBStoreBackend map = 0x37d91f7b3585>#3#,0x37d93a37fff1 <JSArray[88]>#5#)
5: syncToDatabase [0x37d90af01961] [vector://vector/webapp/bundles/7996f33588679dd6b681/default-node_modules_matrix-js-sdk_src_store_indexeddb-local-backend_ts.js:37751] [bytecode=0x13800bf911d offset=95](this=0x37d90af2f535 <LocalIndexedDBStoreBackend map = 0x37d91f7b3585>#3#,0x37d93a37fff1 <JSArray[88]>#5#)
6: /* anonymous */ [0x37d90aec126d] [vector://vector/webapp/bundles/7996f33588679dd6b681/default-vendors-node_modules_matrix-js-sdk_src_indexeddb-worker_ts-node_modules_matrix-js-sdk-467139.js:~1042] [pc=0x1701d16d4](this=0x37d90a6e56d9 <JSGlobalProxy>#6#,0x37d9296fffdd <MessageEvent map = 0x37d9293c0eb1>#7#)
7: InternalFrame [pc: 0x177e0b3e8]
8: EntryFrame [pc: 0x177e0b034]
=====================
- Meanwhile, there's a bug(?) in Electron where service worker logging gets discarded from the JS console, so I couldn't spot the sync accumulator persists which precede the OOMs.
- Adding in stdout logging for all v8 contexts made the OOM spike root cause very obvious:
app.on('web-contents-created', (event, webContents) => {
    webContents.on('console-message', (event, level, message, line, sourceId) => {
        // forward every renderer/worker console line to stdout with a timestamp
        const levelMap = ['VERBOSE', 'INFO', 'WARNING', 'ERROR'];
        console.log(`${new Date().toISOString()}: [${levelMap[level]}]:`, message);
        console.log(`  Source: ${sourceId}:${line}`);
    });
});
So, next steps are:
- [ ] Actually lazy-load read receipts (or at least apply a heuristic to only accumulate into the main v8 heap the receipts we care about, skipping those that reference events which aren't in our timeline or members we haven't lazy-loaded; see the sketch below)
- [ ] Fix the leak in the receipt sync accumulator which is somehow doubling the number of RoomMembers on our heap
- [ ] Save heap by unloading state events (or lazy-loaded members) from RAM for rooms we haven't looked at in a while.
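For the first checkbox, the heuristic could be as simple as something like this (field and function names are invented for illustration, not the sync accumulator's real API):

// Only accumulate receipts we could actually render: ones whose event is in a
// timeline we hold and whose sender is a member we've already lazy-loaded.
function shouldAccumulateReceipt(room, receipt) {
    const haveEvent = room.timelineEventIds.has(receipt.eventId);  // invented field
    const haveMember = room.loadedMemberIds.has(receipt.userId);   // invented field
    return haveEvent && haveMember;
}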