[Feat] ParallelModuleQueue (python multiprocessing): don't wait for entire block to finish before pulling new processes

Open griembauer opened this issue 2 years ago • 2 comments

In Python, GRASS modules can be run in parallel via the ParallelModuleQueue class. The standard way (?) is to create a processing queue with an nprocs parameter, add the GRASS modules to be executed in parallel via the put() method, and finally start the parallel processing with the wait() method. As currently implemented, the queue seems to run nprocs processes at a time and to wait for all of them to finish before starting the next "block" of processes. This means that the longest process determines the duration of its entire "block", while the other slots sit idle once their own processes are done. Ideally, a free slot would be filled immediately with the next pending process from the queue instead.
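
To make the cost of block-wise scheduling concrete, here is a minimal sketch that only simulates the two strategies; it does not call GRASS or ParallelModuleQueue. The function names and the runtimes are hypothetical, chosen purely for illustration.

```python
import heapq


def blockwise_makespan(durations, nprocs):
    """Total time if each block of nprocs tasks must finish before the next starts."""
    return sum(
        max(durations[i:i + nprocs])
        for i in range(0, len(durations), nprocs)
    )


def refill_makespan(durations, nprocs):
    """Total time if a free slot immediately takes the next pending task."""
    slots = [0] * nprocs  # time at which each slot becomes free
    heapq.heapify(slots)
    for duration in durations:
        free_at = heapq.heappop(slots)  # earliest available slot
        heapq.heappush(slots, free_at + duration)
    return max(slots)


# Hypothetical runtimes in seconds: one long task and five short ones.
durations = [10, 1, 1, 1, 1, 1]
print(blockwise_makespan(durations, nprocs=2))  # 12
print(refill_makespan(durations, nprocs=2))     # 10
```

With two slots, the block-wise schedule pays for the long task and then for two further blocks, while the refilling schedule finishes all short tasks on the second slot while the long one is still running.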

griembauer, Jul 05 '22 08:07

I agree that this is a problem, and it is partly why I usually just use the standard Python multiprocessing.Pool methods (like map_async) with run_command. Just curious, do you prefer ParallelModuleQueue for some specific reason?
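
A minimal sketch of that pool-based pattern, under a few assumptions: run_module is a hypothetical stand-in that sleeps instead of calling run_command, so the example runs without a GRASS session, and it uses multiprocessing.pool.ThreadPool rather than multiprocessing.Pool. ThreadPool exposes the same map_async interface and avoids pickling and start-method concerns in a short example; whether threads or processes suit a real GRASS workflow is not something this sketch settles.

```python
import time
from multiprocessing.pool import ThreadPool


def run_module(seconds):
    """Hypothetical stand-in for one GRASS module call (e.g. via run_command)."""
    time.sleep(seconds)
    return seconds


def run_all(durations, nprocs):
    """Run all tasks with at most nprocs at a time, refilling free slots."""
    with ThreadPool(processes=nprocs) as pool:
        # chunksize=1 hands out tasks one by one, so a worker that finishes
        # early picks up the next pending task instead of waiting for a block.
        async_result = pool.map_async(run_module, durations, chunksize=1)
        return async_result.get()  # results come back in input order


if __name__ == "__main__":
    print(run_all([0.3, 0.1, 0.1, 0.1], nprocs=2))
```

In a real script, the body of run_module would be replaced by the actual module call and its parameters.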

petrasovaa, Jul 05 '22 15:07

No, not at all, I am just used to using it since it is the pygrass way ;) Also, some GRASS modules from the temporal framework use ParallelModuleQueue, e.g. for aggregation: https://github.com/OSGeo/grass/blob/1961472afeb7633c9b744b0a60c923fb9b1d4411/python/grass/temporal/aggregation.py#L267

griembauer, Jul 18 '22 06:07