PushshiftDumps

multiprocessing.pool vs ProcessPoolExecutor

taygetea opened this issue 2 years ago · 2 comments

I've been having an issue using multiprocessing to filter down the entire 2005-2022 dataset (I won't be able to limit it to just one subreddit). Right now I'm working through a problem where combine_folder_multiprocess hangs. I ran into that a few times with smaller chunks of the Reddit data, but there I could just kill and restart the script and it would pick up where it left off; that hasn't worked with the 2 TB dataset. Debugging this is made harder by multiprocessing.Pool's tendency to fail silently (especially if the OOM killer kicks in), whereas ProcessPoolExecutor raises a BrokenProcessPool exception. The two have effectively the same features, but ProcessPoolExecutor is probably what's going to get the most updates going forward.

https://stackoverflow.com/questions/65115092/occasional-deadlock-in-multiprocessing-pool
https://bugs.python.org/issue22393#msg315684
https://stackoverflow.com/questions/24896193/whats-the-difference-between-pythons-multiprocessing-and-concurrent-futures
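For illustration, here's a minimal sketch of the difference (process_file and the file names are hypothetical placeholders, not anything from the script): with ProcessPoolExecutor, a worker killed by the OOM killer surfaces as a BrokenProcessPool exception instead of a silent hang.

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def process_file(path):
    # Placeholder for the per-file filtering work.
    return path

if __name__ == "__main__":
    files = ["RS_2022-01.zst", "RS_2022-02.zst"]  # illustrative file names

    try:
        with ProcessPoolExecutor(max_workers=20) as executor:
            for result in executor.map(process_file, files):
                print(result)
    except BrokenProcessPool:
        # Raised when a worker dies unexpectedly (e.g. killed by the OOM killer),
        # rather than the silent hang multiprocessing.Pool can fall into.
        print("a worker process died; try fewer workers or more memory")
```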

Other than that suggestion (and I'll send a PR if I end up porting it over and it works well), I'll update this with what works. But how much RAM does the system where you process the entire dataset have? Right now the machine I'm using has 32 GB, and I gave it 20 workers because I have 24 cores and wanted to use my computer while it was running. I could easily give the machine more; it's a WSL VM currently assigned half my system memory. Would you expect 10 vs 20 workers, 32 vs 64 GB of RAM, etc., to have major effects on whether the script completes?

taygetea · Aug 24 '23 20:08

I wouldn't mind a pull request to switch to ProcessPoolExecutor if you put it together.

The memory-intensive part of the script is that it reads in a chunk of the compressed data, tries to decompress it, and if that fails, it reads in more, appends it to the previous data, and tries again. There's a limit of 512 MB, and most chunks don't require anywhere near that much. So while in theory it could be up to 512 MB per worker (plus a bit of overhead), in practice it should be much less.
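Roughly, the grow-and-retry pattern looks something like this (illustrative names, sizes, and file path, not the actual code from the script; it assumes the zstandard package's streaming reader and retries at the UTF-8 decode step):

```python
import zstandard

def read_and_decode(reader, chunk_size=2**27, max_bytes=2**29, previous=b""):
    # Pull more decompressed bytes and append them to what we already have;
    # keep growing the buffer until it decodes cleanly as UTF-8 or we pass
    # the 512 MB (2**29 byte) ceiling.
    chunk = previous + reader.read(chunk_size)
    try:
        return chunk.decode()
    except UnicodeDecodeError:
        if len(chunk) > max_bytes:
            raise
        return read_and_decode(reader, chunk_size, max_bytes, chunk)

with open("RS_2022-01.zst", "rb") as file_handle:
    reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(file_handle)
    text = read_and_decode(reader)
```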

But it seems like you could just use fewer workers so you don't run out of memory.

I store the dumps on a network drive backed by spinning disks, so my limit has always been read/write speed rather than processing power; I don't gain anything from more workers once the read/write saturates.

Watchful1 · Aug 25 '23 03:08