Parallelize downloading from Zenodo and plotting
My paper requires a large dataset from Zenodo, and plotting the figures takes a long time. Is it possible to run the two processes in parallel to save some time?
Are you using the "caching" feature here (dynamic datasets) or just static datasets? If the latter, I believe that it is already parallel, but the former will be hard to parallelize because the whole caching infrastructure is quite the hack!
Thank you for the reply. I am using static datasets only. I didn't notice that it was already running in parallel (maybe I set something wrong?), because the files only seem to start downloading after a plotting script has finished during the build. In this GitHub Action build, for example, it seems to start downloading another file only after plotting a figure.
It's not obvious from that build log whether or not it's running in parallel, but I wouldn't expect a significant speedup on GitHub Actions regardless, since there are generally only 2 virtual threads and you're probably I/O limited. I'd be happy to look into it more closely if you wanted to do some local benchmarking. I think snakemake will sensibly parallelize these downloads because we have separate rules for each, but I'm not 100% sure how it handles parallelizing script directives.
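To make the idea of independent rules running side by side concrete, here is a toy sketch in plain Python. It is not how snakemake schedules jobs internally. The functions `download_dataset`, `plot_figure`, and `run_jobs` are hypothetical stand-ins, and the sleeps only imitate a download and a plotting script that do not depend on each other.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def download_dataset():
    """Hypothetical stand-in for an I/O-bound download rule."""
    time.sleep(0.3)
    return "dataset"


def plot_figure():
    """Hypothetical stand-in for a plotting rule that does not need the dataset above."""
    time.sleep(0.3)
    return "figure"


def run_jobs(jobs, workers):
    """Run independent jobs on `workers` slots; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(job) for job in jobs]
        results = [future.result() for future in futures]
    return results, time.perf_counter() - start


if __name__ == "__main__":
    jobs = [download_dataset, plot_figure]
    for workers in (1, 2):
        _, elapsed = run_jobs(jobs, workers)
        print(f"{workers} worker(s): {elapsed:.2f} s")
```

With a single slot the two jobs run back to back. With two slots they can overlap, but only because neither needs the other's output. A figure that reads a dataset still has to wait for that download to finish.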
How should I do the local benchmarking that we need?
I think that --cores n is necessary to get things running in parallel, and I am not sure that the showyourwork action sets this by default.
@arm61 Thanks for the reply. I am not familiar with this. Where should I apply the --cores option?
It is a snakemake option (showyourwork is built on top of snakemake). To run locally, you would use
showyourwork build --cores X
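For the local benchmarking asked about earlier, one possible approach is to time the build command above for a few core counts. The sketch below is only an illustration: the helper names (`build_command`, `time_command`, `benchmark`, `main`) are hypothetical, the core counts are arbitrary, and it assumes that `showyourwork clean` resets the build outputs between runs so that each timing starts from the same state.

```python
import shutil
import subprocess
import time


def build_command(cores):
    """Return the local build command from this thread for a given core count."""
    return ["showyourwork", "build", "--cores", str(cores)]


def time_command(cmd):
    """Run cmd to completion and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def benchmark(core_counts, make_cmd, setup_cmd=None):
    """Time one run per core count, optionally running setup_cmd before each."""
    timings = {}
    for cores in core_counts:
        if setup_cmd is not None:
            subprocess.run(setup_cmd, check=True)
        timings[cores] = time_command(make_cmd(cores))
    return timings


def main():
    if shutil.which("showyourwork") is None:
        print("showyourwork not found on PATH; nothing to benchmark")
        return
    # Assumption: run from the root of the article repository.
    # Assumption: `showyourwork clean` removes previous build outputs.
    timings = benchmark([1, 2, 4], build_command, setup_cmd=["showyourwork", "clean"])
    for cores, seconds in timings.items():
        print(f"--cores {cores}: {seconds:.1f} s")


if __name__ == "__main__":
    main()
```

Timings from a single run are noisy, especially when downloads dominate, so repeating each measurement a few times gives a fairer comparison. If downloaded files survive the clean step, later runs will look faster for reasons unrelated to the core count.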
I am unsure whether the --cores option can be used in the GitHub Action (I am less familiar with the action implementation). Perhaps @dfm knows?
Good point! It does look like the default argument there is "1" for some reason:
https://github.com/showyourwork/showyourwork/blob/64fb8cc7dbde5a0ee68ba5b4731372ae6492e5d4/src/showyourwork/cli/main.py#L19-L25
I really thought it was set to "all", which would probably be a better default. Unfortunately I'm not sure that there is any way to use this option on GitHub right at the moment.
I tried --cores all. It works! Thank you so much, @arm61 @dfm.
Just to note it here: the progress bars are not displayed correctly. Because the download tasks and rules start at different times, the progress bars get quite messy. I also couldn't figure out how to use this option with the GitHub Action. It would be nice if the setting worked there later on.
@dfm I would recommend against setting --cores all as a default. It can cause problems with uploading to Zenodo (you get banned from uploading if you upload too many things in a minute) and if a user suddenly has all their cores running showyourwork jobs, that might be annoying. Perhaps just a documentation page on --cores n instead?