faasm
Remove TSAN and ASAN runs
ASAN and TSAN builds of the tests are consuming upwards of 15 GB and 20 GB of memory, respectively. This is bricking the GitHub-hosted runners.
If we run a self-hosted runner on one of our readily available servers, we are severely limited in terms of concurrency by two factors: (i) having to deploy one runner per concurrent job, and (ii) running out of the 32 GB of memory. This means that we would have to run the sanitised tests serially, accounting for more than 1 h of execution time. As a consequence, we disable them altogether.
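To make the memory argument concrete, here is a back-of-the-envelope check of how many sanitised jobs fit on a single 32 GB host at once. It assumes the peak figures quoted above (15 GB for ASAN, 20 GB for TSAN) and ignores OS and runner overhead, so real headroom would be smaller still.

```python
# Rough capacity check for a single self-hosted runner machine.
# Assumption: the peak per-job memory figures quoted in the PR description.
HOST_MEMORY_GB = 32
PEAK_MEMORY_GB = {"asan": 15, "tsan": 20}


def fits_concurrently(jobs, host_gb=HOST_MEMORY_GB):
    """Return True if the combined peak memory of `jobs` fits in host_gb."""
    return sum(PEAK_MEMORY_GB[job] for job in jobs) <= host_gb


def max_copies(job, host_gb=HOST_MEMORY_GB):
    """Largest number of identical jobs that can run side by side."""
    return host_gb // PEAK_MEMORY_GB[job]


print(fits_concurrently(["asan", "tsan"]))      # False: 35 GB > 32 GB
print(max_copies("asan"), max_copies("tsan"))   # 2 1
```

Since ASAN and TSAN together need about 35 GB, they cannot share the machine, which is what forces the serial schedule.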
For reference, TSAN was already removed in #650. Also for future reference, here are the TSAN runtime flags and the ASAN runtime flags.
Pinging @eigenraven to pick their brains. Is there something we are completely overlooking here?
Hi! For TSAN, have you tried reducing the history size? The memory used grows exponentially with the history_size parameter in the options. As for ASAN, I'm not sure why there was a sudden jump; did you change anything in memory management that could have caused more allocations? It usually doesn't have that high an overhead: it mostly consumes virtual memory, which is almost free.
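The exponential growth can be sketched from the ThreadSanitizer flags documentation: history_size=0 keeps 32K memory accesses per thread, each increment doubles that, and 7 is the maximum (4M accesses). This is a sketch based on that documented description; newer TSAN runtimes may interpret the flag differently.

```python
# Per-thread access history kept by TSAN for a given history_size,
# following the documented rule: 0 -> 32K accesses, each step doubles, max 7.
BASE_ACCESSES = 32 * 1024


def tsan_history_accesses(history_size: int) -> int:
    """Number of memory accesses remembered per thread."""
    if not 0 <= history_size <= 7:
        raise ValueError("history_size must be in [0, 7]")
    return BASE_ACCESSES << history_size  # 32K * 2**history_size


for h in (0, 2, 7):
    print(h, tsan_history_accesses(h))
# 0 32768
# 2 131072
# 7 4194304
```

Going from the documented default of 2 down to 0 cuts the per-thread history by a factor of four.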
@eigenraven thanks for the tips! For ASAN, as you say, I think there might be another problem. For TSAN, changing history_size to 0 does not help. I am a bit puzzled by that, as I was expecting a big reduction in memory consumption. However, it seems that TSAN's runtime flags need to be provided as a space-separated (and not colon-separated) list, so we may have been using the default value of 2 all this time.
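A small helper can keep the TSAN_OPTIONS string in the space-separated form that worked here, so a separator mistake cannot silently fall back to defaults. This is a minimal sketch: the helper name and the flag values are illustrative assumptions (history_size and flush_memory_ms are real TSAN flags, but the numbers are not the ones used in this PR).

```python
import os


def tsan_options(**flags) -> str:
    """Join flags as space-separated `name=value` pairs for TSAN_OPTIONS."""
    return " ".join(f"{name}={value}" for name, value in flags.items())


# Illustrative values only; tune them for the actual test suite.
opts = tsan_options(history_size=0, flush_memory_ms=1000)

# Copy the current environment and add TSAN_OPTIONS for a child process.
env = dict(os.environ, TSAN_OPTIONS=opts)
print(opts)  # history_size=0 flush_memory_ms=1000
```

The resulting `env` can then be passed to `subprocess.run(..., env=env)` when launching the TSAN-instrumented test binary (whose path is left out here).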
@csegarragonz looking at the current failed run, there seem to be 1) a lot of useless TLS access logging, and 2) a lot of warnings about unreclaimed memory https://github.com/faasm/faasm/runs/7000258660?check_suite_focus=true#step%3A23%3A5773=. These look like faasm logs rather than TSAN logs.
Looks like this check is failing repeatedly https://github.com/faasm/faasm/blob/main/src/wasm/WasmModule.cpp#L826-L833=. I remember having to modify the brk mechanism locally because it didn't work properly when the wasm module allocated memory itself via memory.grow, and led to a lot of memory leaks if that function was called, maybe that's also what's happening now?
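The failure mode described above, where the module grows its memory via memory.grow behind the host's back, can be illustrated with a toy model. This is a hypothetical sketch, not Faasm's actual brk implementation: `LinearMemory` and `BrkTracker` are invented names, and only the 64 KiB WebAssembly page size comes from the spec. The tracker remembers the memory size it last saw; if the real size has changed, the module must have grown memory itself, so the break is moved past that region instead of handing out memory the module already owns.

```python
WASM_PAGE_SIZE = 64 * 1024  # bytes per WebAssembly page (spec-defined)


class LinearMemory:
    """Toy wasm linear memory that can only grow in whole pages."""

    def __init__(self, pages: int):
        self.pages = pages

    def grow(self, delta_pages: int) -> int:
        """Like memory.grow: returns the previous size in pages."""
        old = self.pages
        self.pages += delta_pages
        return old

    @property
    def size_bytes(self) -> int:
        return self.pages * WASM_PAGE_SIZE


class BrkTracker:
    """Hypothetical host-side brk tracker (illustration only)."""

    def __init__(self, memory: LinearMemory):
        self.memory = memory
        self.brk = memory.size_bytes
        self.known_size = memory.size_bytes  # size the host last observed

    def sync(self) -> None:
        # If the size changed without the host's involvement, the module
        # called memory.grow itself: treat the grown region as in use.
        current = self.memory.size_bytes
        if current != self.known_size:
            self.brk = current
            self.known_size = current

    def sbrk(self, increment: int) -> int:
        """Extend the break by `increment` bytes and return the old break."""
        self.sync()
        old_brk = self.brk
        new_brk = old_brk + increment
        if new_brk > self.memory.size_bytes:
            needed = new_brk - self.memory.size_bytes
            self.memory.grow(-(-needed // WASM_PAGE_SIZE))  # ceiling division
            self.known_size = self.memory.size_bytes
        self.brk = new_brk
        return old_brk


mem = LinearMemory(pages=1)
tracker = BrkTracker(mem)
mem.grow(2)              # the module grows memory directly: now 3 pages
old = tracker.sbrk(100)  # host detects the grow before extending brk
print(old == 3 * WASM_PAGE_SIZE, mem.pages)  # True 4
```

Without the `sync` step, `sbrk` would return an address inside the region the module had already grown and was possibly using.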
@eigenraven thanks for pointing out the typos. FYI, flush_memory_ms really made the difference for TSAN. Is there anything else that you think needs amending?
Good question. I removed it and the tests pass as well. I can't remember why I included it in the first place...