parmap
Are you a parmap user? Please enter
Hi,
I'm curious to know who is using parmap and for what purpose. Sometimes I believe there are no users out there, and then I feel happy when someone pops by and opens an issue. If you are using parmap and want to leave a note, please do so here; I would be very happy to know what parmap is being used for. Once you have answered, feel free to click "Unsubscribe" on the right if you don't want to receive further notifications from other parmap users.
For instance here is one user that wrote me about his paper on spinning black hole binaries where he had used parmap:
- Davide Gerosa and Michael Kesden, "PRECESSION: Dynamics of spinning black-hole binaries with python." Phys. Rev. D 93, 124066 (27 June 2016). arXiv:1605.01067
Thanks!
I actually found parmap on Stack Overflow whilst looking for a nice py2+py3 way to provide constant variables to map. Finding that it supported tqdm was very pleasant. I'm using it to help me process about 300GB of seismic data, which I hand off to parmap for analysis calculations. Thanks for the useful library!
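For anyone curious, that is the core convenience: extra positional arguments to parmap.map are passed as constants to every call, and pm_pbar enables the tqdm progress bar. A minimal sketch (the function and data here are invented for illustration):

```python
import parmap

def scale(x, factor):
    # 'factor' is a constant argument shared by every call
    return x * factor

data = list(range(1000))

# One call replaces pool.map(partial(scale, factor=2), data);
# pm_pbar=True shows a tqdm progress bar while mapping.
results = parmap.map(scale, data, 2, pm_pbar=True)
```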
I'm using it for custom scikit-learn estimators.
You could attract potential users by adding parmap as an answer to related questions on Stack Overflow (e.g. https://stackoverflow.com/q/9911819). Indeed, I found it the best solution of those I tested. You should state that you're the author, though.
Thanks for the tip. I am not actively searching for more users, though. It's great if they find parmap and like it, and I will talk about parmap to anyone who might be interested. However, I can't spend time looking for users who might like parmap right now, and if those users came I would need to spend even more time fixing issues. So, when I have the time I will start actively looking for more users. Until then they will have to find parmap on their own. Feel free to tell others about parmap if you want, though.
I am currently using parmap for my master's thesis about emotion detection in tweets.
Just found parmap and am loving it; it saved me a lot of partial and pool calls! As for the application: signal analysis for single-photon detectors.
One line of code for parallel computation, with a progress bar. I love this tiny tool very much and use it everywhere I need parallelization.
Hi - I am using parmap for generating nodes in knowledge graphs. A couple of questions:
- If `pm_processes` is not passed, does the number of processes scale to the maximum available?
- If each item in the list spawns a long process, is chunking a good way to speed things up further?
@gryBox
Empty pm_processes
If `pm_processes` is not passed, parmap follows the multiprocessing.Pool defaults and therefore uses os.cpu_count().
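As a quick illustration of that default (the worker function is invented for the example):

```python
import os
import parmap

def square(x):
    return x * x

print(os.cpu_count())  # the pool size parmap will use by default

# No pm_processes given: one worker per CPU, as in multiprocessing.Pool()
results = parmap.map(square, range(100))

# Explicitly capping the pool size instead
results = parmap.map(square, range(100), pm_processes=2)
```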
About chunksize values
By default, the chunksize is len(iterable)/(4*pm_processes), rounded up if necessary. This is also the default in multiprocessing. If you have 200 tasks and 5 parallel processes, chunksize = 200/(4*5) = 10.
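For reference, this is essentially the round-up computation multiprocessing.pool performs internally; a standalone sketch:

```python
def default_chunksize(num_tasks, num_processes):
    # Mirrors multiprocessing.pool.Pool._map_async:
    # chunksize, extra = divmod(len(iterable), len(pool) * 4)
    chunksize, extra = divmod(num_tasks, num_processes * 4)
    if extra:
        chunksize += 1  # round up so no tasks are left over
    return chunksize

print(default_chunksize(200, 5))  # 10, as in the example above
print(default_chunksize(201, 5))  # 11 (rounded up)
```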
I will try to explain why that default is reasonable by going to the extremes:
chunksize = 1
Using a chunksize of 1 would mean that each task is submitted individually. As soon as one task is finished, the main process submits another one. This would be fine if submitting a task had no overhead, which is not the case. If each task takes a short time to finish, such a small chunksize means that multiprocessing spends a comparatively large amount of time submitting data and collecting results. In this case, parallelizing with chunksize=1 could make the code run slower than the serial version.
chunksize = number of tasks
If you create just one big chunk, you can only send it to one process, so you can't parallelize at all. This is an absurdly high value; instead of using it, simply disable parallelization.
chunksize = num_tasks/num_processes
You split your tasks into as many groups as there are parallel processes. This minimizes the number of submissions, so the overhead is minimal. It may seem like a very smart approach, but what happens if tasks take different amounts of time to complete? With bad luck, one of your processes may get one or several long tasks, and while the other processes have finished, you will have to wait for that one process to work through multiple tasks. Since all tasks have already been submitted, the idle processes can't help the one that was given too much work.
chunksize = num_tasks/(4*num_processes)
This is a reasonable tradeoff. Each process gets on average 4 submissions of tasks. If one task is much longer than the rest, the process holding it will probably get only 2 or 3 submissions while the other processes get 5 each. The overhead is a little bigger, but the benefit in the general case is much larger.
chunksize summary
In summary, the default is usually good enough. If you have a huge number of uniformly very short tasks, a larger chunksize may be significantly beneficial. I haven't done any formal benchmark; feel free to do so if you want.
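If you do want to experiment, parmap forwards a pm_chunksize keyword to the underlying pool (per its README); a hedged sketch with an invented workload:

```python
import parmap

def tiny_task(x):
    # An extremely short task: submission overhead dominates
    return x + 1

data = range(1_000_000)

# Default chunking: len(data) / (4 * processes), rounded up
results = parmap.map(tiny_task, data)

# Forcing larger chunks to cut per-submission overhead
# (pm_chunksize is passed on to the underlying Pool.map)
results = parmap.map(tiny_task, data, pm_chunksize=50_000)
```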
@zeehio Thank you, that is a clear and easy explanation. Leaving things at the default for now. Wonderful tool!
tagbase-server uses parmap to asynchronously process biologging data from electronic tags deployed on various marine animals. This is an excellent utility library. Thank you @zeehio 👍
@zeehio I used parmap to target 24 million GitHub repos for their language dependency files a few years ago. This was part of some security analysis I was doing during my Master's. Very glad this tool existed, especially since I didn't want to move to a compiled language for multiprocessing stuff.