
Workers consume a lot of RAM on query

Open yankovs opened this issue 1 year ago • 2 comments

While querying a sample, the memory usage of a single worker often jumps to tens of GBs, sometimes even more than 60 GB. There seems to be no limit at all and the workers are greedy with memory: if some file requires 200 GB of RAM, the worker will try to allocate it and will probably crash due to lack of memory. As a result, in a setup with multiple workers it happens quite often that one worker hoards all the memory and starves the other workers until they crash.

Here is a list of hashes for samples that consistently consume a lot of RAM on a worker (on our MCRIT instance with 20 million functions):

cea60afdae38004df948f1d2c6cb11d2d0a9ab52950c97142d0a417d5de2ff87
d92f6dd996a2f52e86f86d870ef30d8c80840fe36769cb825f3e30109078e339
bab77145165ebe5ab733487915841c23b29be7efec9a4f407a111c6aa79b00ce
97f1ea0a143f371ecf377912cbe4565d1f5c6d60ed63742ffa0b35b51a83afa2
94433566d1cb5a9962de6279c212c3ab6aa5f18dbff59fe489ec76806b09b15f
a5b38fa9a0031e8913e19ef95ac2bd21cb07052e0ef64abb8f5ef03cf11cb4d5
085b68fa717510f527f74025025b6a91de83c229dc1080c58f0f7b13e8a39904
043aac85af1bda77c259b56cd76e4750c8c10c382d7b6ec29be48ee6e40faa00
84ad84a1f730659ac2e227b71528daec5d59b361ace00554824e0fddb4b453cf
1c4bdd70338655f16cd6cf1eb596cd82a1caaf51722d0015726ec95e719f7a27
29bd1ffe07d8820c4d34a7869dbd96c8a4733c496b225b1caf31be2a7d4ff6df
f72bb91a4569fb9ba2aa40db2499f39bb7aba4d20a5cb5f6dd1e2a9a4ce9af98
9119213b617e203fbc44348eb91150a4db009d78a4123a5cbce6dc6421982a91
a614ed116edc46301a4b3995067d5028af14c8949f406165d702496630cb02ce
0c9edded5ff2ac86b06c1b9929117eab3be54ee45d44fcdb0b416664c7183cbf

I am not sure what the correct way to handle this is, but I think there should at least be a way to limit each worker to some amount of memory.
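To illustrate the kind of limit being asked for: on Linux, a process can cap its own address space via the standard library's resource module, so a runaway allocation raises MemoryError inside that worker instead of starving its siblings. This is only a sketch of the idea, not something MCRIT currently does; the function name is made up for illustration.

```python
import resource

def limit_memory(max_bytes: int) -> None:
    """Hypothetical helper: cap this process's address space (Linux)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Never raise the soft limit above an existing hard limit.
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))

# e.g. each worker could call this at startup to cap itself at 8 GB
limit_memory(8 * 1024**3)
```

With such a cap, an oversized allocation fails fast in the one worker that made it, which is arguably better than the OOM killer taking down an arbitrary process on the host.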

yankovs avatar Mar 11 '24 11:03 yankovs

I think the issue is less about limiting memory and more about the workers not freeing what they allocated once they are done.

I did some experiments with cea60afdae38004df948f1d2c6cb11d2d0a9ab52950c97142d0a417d5de2ff87 and the sample itself does not seem to cause excessive load (smda finds 30k functions, but on my MCRIT instance with 10 million functions, processing did not consume more than 10 GB in total).

What would probably have to be done, and what seems like the cleanest solution, is moving the actual processing into child processes so that all memory is cleanly freed when they terminate. I'll have a look at how that could be addressed using something like a ProcessPoolExecutor with a single worker.
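The idea above can be sketched roughly as follows. This is not MCRIT's actual code, just a minimal illustration of why child processes help: when the child exits, all its memory is returned to the OS, regardless of what the Python heap did internally. The job function here is a placeholder for the real disassembly/matching work.

```python
from concurrent.futures import ProcessPoolExecutor

def process_job(payload_size: int) -> int:
    """Placeholder for the real job: allocates memory only in the child."""
    big_buffer = bytearray(payload_size)  # freed with the child process
    return len(big_buffer)

def run_in_child(payload_size: int) -> int:
    # A fresh single-worker pool per job guarantees the child exits
    # (and releases its memory) once the job completes. On Python 3.11+
    # a longer-lived pool with max_tasks_per_child=1 achieves the same.
    with ProcessPoolExecutor(max_workers=1) as pool:
        return pool.submit(process_job, payload_size).result()
```

Spawning a process per job adds some startup overhead, but for long-running disassembly/matching jobs that cost is negligible compared to the guarantee that memory is actually released.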

danielplohmann avatar Mar 12 '24 08:03 danielplohmann

Okay, I have now implemented a first draft for job processing with subprocesses in MCRIT v1.3.14.

You can try it out by running python -m mcrit spawningworker instead of the usual python -m mcrit worker.

Downside: for now, there is no longer any interactive output of what is happening during job processing. Upsides: this should generally open the door to splitting off low-load and heavy-load jobs between workers (i.e. minhashing/collection jobs versus disassembly/matching jobs) and to setting timeouts for jobs that appear to have stalled or become stuck.
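The timeout upside follows directly from the subprocess model: the parent can wait on the child's future with a deadline and abandon it if the deadline passes. A hedged sketch of that pattern (function names are illustrative, not MCRIT's API):

```python
from concurrent.futures import ProcessPoolExecutor, TimeoutError

def slow_job(seconds: float) -> str:
    """Stand-in for a job that may stall."""
    import time
    time.sleep(seconds)
    return "done"

def run_with_timeout(budget_seconds: float, job_seconds: float) -> str:
    pool = ProcessPoolExecutor(max_workers=1)
    future = pool.submit(slow_job, job_seconds)
    try:
        return future.result(timeout=budget_seconds)
    except TimeoutError:
        # Give up on the stuck job; don't wait for the child to finish.
        pool.shutdown(wait=False, cancel_futures=True)
        return "timed out"
```

This was impossible with in-process job handling, where a stuck job simply blocked the worker forever.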

Please try it out and let me know if this helps in any way with the memory issues you are/were experiencing.

danielplohmann avatar Apr 02 '24 14:04 danielplohmann