EQcorrscan
Speed-up Tribe and Party reading
What does this PR do?
Implements a few speed-ups for Party IO, and ~~will/may include shifting to sparse event-style objects~~. The main IO patches should be non-breaking and are:
- Write chunked catalog files to the party archive, with a user-definable maximum number of events per file.
- Read detection catalog files in parallel (optional), requires 1
- Move away from searching exhaustively (and stupidly) through the catalog of detection events for a matching id - just use a dictionary instead.
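The dictionary lookup replacing the exhaustive search could look roughly like this. The names below (`event_lookup`, `detection_id`, the dict-based stand-in events) are illustrative only, not the actual EQcorrscan internals:

```python
# Stand-in "catalog" of events; in practice these would be obspy Event objects.
catalog = [{"id": f"smi:local/event_{i}"} for i in range(5)]

# Build the lookup once: O(n), rather than re-scanning the whole catalog
# for every detection (O(n * m) overall).
event_lookup = {ev["id"]: ev for ev in catalog}

detection_id = "smi:local/event_3"
matching_event = event_lookup[detection_id]  # O(1) per detection
```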
Why was it initiated? Any relevant Issues?
#445 mentions some of these points. Mostly speed for large catalogs was a real issue, and the catalog reading wasn't the main slow point! Scanning through the whole catalog to assign events to detections was dumb (my bad), and dictionary lookups speed this up a lot.
Writing out large catalogs can be really expensive for memory - it SUCKS getting to the end of a big matched-filter run, only to run out of memory when writing out the results... The main memory cost is holding both the catalog and the serialized catalog in memory at the same time. Limiting the number of events written to a single catalog file (hidden from the user in the tar archive) means only a chunk of the catalog is serialized at a time, which reduces the memory cost without an obvious major slow-down. In theory this could be parallelized.
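A minimal sketch of the chunked-serialization idea, assuming a hypothetical `write_func` callback in place of the real archive-writing code:

```python
def write_chunked(events, max_events_per_file, write_func):
    """Serialize at most max_events_per_file events at a time, so the full
    serialized catalog never sits in memory alongside the catalog itself."""
    for i in range(0, len(events), max_events_per_file):
        chunk = events[i:i + max_events_per_file]
        write_func(chunk, f"catalog_{i // max_events_per_file}.xml")

# Toy usage: record what would be written instead of touching disk.
written = []
write_chunked(list(range(10)), 4,
              lambda chunk, name: written.append((name, len(chunk))))
# written == [('catalog_0.xml', 4), ('catalog_1.xml', 4), ('catalog_2.xml', 2)]
```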
PR Checklist
- [x] `develop` base branch selected?
- [x] This PR is not directly related to an existing issue (which has no PR yet).
- [ ] All tests still pass.
- [ ] Any new features or fixed regressions are covered via new tests.
- [ ] Any new or changed features are fully documented.
- [ ] Significant changes have been added to `CHANGES.md`.
- ~~[ ] First time contributors have added your name to `CONTRIBUTORS.md`.~~
Thanks for those comments @flixha I agree and will get onto them! I'm not fond of setting the process spawning method, even in a context manager, mostly to ensure that the user can (but probably won't) set the spawn method.
Quick question, what OS and Python version are you running to get those deadlocks? I pretty much just develop on Linux and haven't run into deadlocks, but I know different OS's support multiprocessing in different ways. Would you also be able to share a script that deadlocks?
Finally, I may switch to using `concurrent.futures` to more easily allow `ProcessPoolExecutor` and `ThreadPoolExecutor` options for the reading of QuakeML and MSEED respectively.
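That split might look roughly like the sketch below: processes for the CPU-bound QuakeML parsing, threads for the I/O-bound MSEED reads. The two functions are stand-ins, not EQcorrscan's actual readers:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def parse_quakeml(path):
    # Stand-in for CPU-bound QuakeML parsing.
    return f"catalog:{path}"

def read_mseed(path):
    # Stand-in for I/O-bound MSEED reading.
    return f"stream:{path}"

if __name__ == "__main__":
    # Processes sidestep the GIL for parsing; threads are cheaper for I/O.
    with ProcessPoolExecutor(max_workers=2) as executor:
        catalogs = list(executor.map(parse_quakeml, ["a.xml", "b.xml"]))
    with ThreadPoolExecutor(max_workers=2) as executor:
        streams = list(executor.map(read_mseed, ["a.ms", "b.ms"]))
```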
An additional speed-up for party reading could be to read the parts of the party (Tribe, detection catalogs, detection stats) in separate processes, before finally reconstructing the party at the end.
I should provide some benchmarks for this at some point to actually demonstrate any speed-ups. One of the main speed-ups was actually the change from the silly lookup of template streams by an exhaustive search, to using a dictionary, which doesn't need any parallelism.
Initial testing showed that reading a Party of c. 300k detections and 2k templates went from taking > 9 hours to < 30 minutes. Still not fast, but at least more manageable...
I'm on python 3.8.8, and as OS I'm using RHEL 8.3 and RHEL 7.9 depending on the machine. I have seen many deadlocks especially when reading files in parallel on these. I've spent too much time trying to debug that without really understanding why it's happening. From what I read, python on Windows and Mac uses the spawn-method by default, and only on Linux uses forking by default, which is less robust in terms of package imports in the workers.
I wasn't fully aware of the context-managing abilities of multiprocessing. I tried to use `multiprocessing.set_start_method('spawn')` at the start of the script below~~, but that will unfortunately result in a `RuntimeError: context has already been set`~~, which will work if I guard it within `if __name__ == "__main__":`.
Example that will deadlock on the last line for me on the Linux machines, while it works on Mac 10.13.6:

```python
from obspy import UTCDateTime
from obspy.clients.fdsn import Client
from obspy.core.event import Pick
from eqcorrscan.core.match_filter import Tribe
from eqcorrscan.utils import catalog_utils

client = Client('GEONET')
t1 = UTCDateTime(2016, 9, 4)
t2 = t1 + 86400
catalog = client.get_events(
    starttime=t1, endtime=t2, minmagnitude=4, minlatitude=-49,
    maxlatitude=-35, minlongitude=175.0, maxlongitude=-175.0)
catalog = catalog_utils.filter_picks(
    catalog, channels=['EHZ'], top_n_picks=5)
for event in catalog:
    extra_pick = Pick()
    extra_pick.phase_hint = 'S'
    extra_pick.time = event.picks[0].time + 10
    extra_pick.waveform_id = event.picks[0].waveform_id
    event.picks.append(extra_pick)
tribe = Tribe()
tribe.construct(
    method='from_client', catalog=catalog, client_id='GEONET',
    lowcut=2.0, highcut=9.0, samp_rate=50.0, filt_order=4,
    length=3.0, prepick=0.15, swin='all', process_len=3600)
tribe.write('test.tgz', max_events_per_file=4)
# Deadlocks here:
Tribe().read('test.tgz', cores=4)
```
Are you running that as a script or from an iPython session? In general, when running scripts that use multiprocessing libraries you should always encapsulate your code in an `if __name__ == "__main__":` block. Check the multiprocessing docs here - the section on "Safe importing of main module" is what you need.
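That pattern, sketched minimally (`work` and `main` are illustrative names, not anything from EQcorrscan):

```python
import multiprocessing as mp

def work(x):
    return x * x

def main():
    with mp.Pool(2) as pool:
        return pool.map(work, range(4))

if __name__ == "__main__":
    # "Safe importing of main module": everything that starts workers lives
    # below this guard, so child processes re-importing this file do not
    # recursively spawn more workers.
    mp.set_start_method("spawn")
    print(main())  # [0, 1, 4, 9]
```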
The `if __name__ == "__main__":` did solve it indeed here!
I just ran that script without issue on my Ubuntu machine using Python 3.8.8 from conda... Curious that you run into locks. I have seen similar issues on Windows - there should probably be a prominent note somewhere in the docs about encapsulating everything in an `if __name__ == "__main__":` block.
If you have the time/chance, if you run into these locks again without the block can you try that and report if it does/does not work?
Shameful admittance on my part, but I either run everything in iPython (for writing), Jupyter (for sharing with people), or write everything in functions by default because I tend to reuse a lot of code. I don't think I ever run EQcorrscan components in a script like this, hence not making any note of it!
I get the deadlock both in ipython (in a window in VScode) and when running with python from the terminal. I have not generally had the deadlock-problem with EQcorrscan's functions (if I do run into that I can report it of course), but I have experienced it in several of my functions that wrap around EQcorrscan-functions (e.g., creating templates in parallel). In these cases, the deadlock still occurs even though all parallelization happens in the functions.
For the test above, if I put that into a function, it will run even without the `if __name__ == "__main__":` in front of the function call.
Interesting - I also don't use iPython within VSCode or PyCharm; I seem to get odd hangs in them in general. But if you are getting the same locks from the terminal then I don't know!
RE making templates in parallel - that may have something to do with pre-processing trying to process the streams in parallel, and nested parallelism not working well. At some point I will redesign pre-processing to move away from multiprocessing - hopefully through some obspy speed-ups I'm playing with. If you can, for parallel template creation, or anything else that doesn't require any kind of memory sharing (e.g. looping through days of detections), I recommend not using Python multiprocessing. I usually write a Python script with start and end arguments, and a little bash script that calls the Python script multiple times and puts the jobs into the background.
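The bash-driver pattern described above might look like this; `detect.py` and its `--start`/`--end` flags are hypothetical stand-ins for your own script:

```shell
#!/usr/bin/env bash
# Hypothetical driver: detect.py is assumed to take --start/--end day offsets.
# Each invocation is an independent OS process, so no Python-level
# multiprocessing (and none of its deadlocks) is involved.
for start in 0 10 20; do
    python detect.py --start "${start}" --end "$((start + 10))" &
done
wait  # block until all background jobs have finished
```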
Thanks for your thoughts on this!
When doing the template creation in parallel, I use `parallel=False, cores=1` for the `pre_processing`. In total that makes it fast, because most time in my scripts is spent on organizing picks and traces.
For the template-creation parallelization, I now use joblib's `Parallel(n_jobs=cores)(delayed(function)(parameters))`, which defaults to a robust pool creation in loky through fork + exec, instead of a simple fork or spawn in multiprocessing. That's more stable in this situation, though it comes with some other inconveniences, e.g. in handling errors in the workers.
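For reference, the joblib pattern mentioned above looks roughly like this; `make_template` is a stand-in for the real per-event work:

```python
from joblib import Parallel, delayed

def make_template(event_id):
    # Stand-in for per-event template creation.
    return event_id * 2

# loky (joblib's default backend) starts workers via fork + exec, which
# avoids many of the deadlocks seen with plain fork-based multiprocessing.
results = Parallel(n_jobs=2)(delayed(make_template)(i) for i in range(4))
# results == [0, 2, 4, 6]
```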
Hi Calum, I noticed an issue with the Resource-ID links when reading Quakeml files in parallel in subprocesses, see https://github.com/obspy/obspy/issues/2842 . I haven't seen the problem in this implementation here yet, or it may not be particularly relevant, but thought it's worth mentioning in case it has a "hidden" effect.