showyourwork Parallelize downloading from Zenodo and plotting

My paper requires a large dataset from Zenodo and a long runtime to plot the figures. Is it possible to run the two processes in parallel to save some time?

Oct 06 '23 06:10 thomasckng

Are you using the "caching" feature here (dynamic datasets) or just static datasets? If the latter, I believe that it is already parallel, but the former will be hard to parallelize because the whole caching infrastructure is quite the hack!

Oct 06 '23 16:10 dfm

Thank you for the reply. I am using static datasets only. I didn't notice that it was already parallel (maybe I set something up wrong?), since the files seem to start downloading only after a plotting script has finished during the build. In this GitHub Actions build, for example, it seems to start downloading another file only after plotting a figure.

Oct 06 '23 16:10 thomasckng

It's not obvious from that build log whether or not it's running in parallel, but I wouldn't expect a significant speedup on GH Actions regardless, since there are generally only 2 virtual threads and you're probably I/O limited. I'd be happy to look into it more closely if you wanted to do some local benchmarking. I think snakemake will sensibly parallelize these downloads because we have separate rules for each, but I'm not 100% sure how it handles parallelizing script directives.

Oct 06 '23 18:10 dfm

How should I do the local benchmarking that we need?

Oct 07 '23 06:10 thomasckng

I think that --cores n is necessary to get things going in parallel, and I am not sure that the showyourwork action sets this by default.
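To give a rough feel for why the core count matters, here is a toy sketch (not how snakemake actually schedules jobs). It assumes the jobs are independent and simply hands each one to whichever core frees up first. The job durations are made up for illustration. A real build also has dependencies (a figure cannot be plotted before its dataset is downloaded) and I/O limits, which this ignores.

```python
import heapq


def makespan(durations, cores):
    """Total wall time when each job goes to the core that frees up first.

    Toy model only: jobs are assumed independent, with no I/O contention.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    free_at = [0.0] * cores  # time at which each core becomes idle
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available core
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Hypothetical job lengths in seconds: two downloads, then two plotting scripts.
jobs = [30, 30, 60, 60]
for n in (1, 2, 4):
    print(f"cores={n}: {makespan(jobs, n):.0f} s")
```

With a single core the jobs just queue up one after another, which would match the behaviour described above.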

Dec 08 '23 07:12 arm61

@arm61 Thanks for the reply. I am not familiar with this. Where should I apply this --cores option?

Dec 11 '23 04:12 thomasckng

It is a snakemake option (showyourwork is built on top of snakemake). To run locally, you would use:

showyourwork build --cores X
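For the local benchmarking mentioned earlier, something like the following might work as a starting point. It is only a sketch: the helper names (time_command, benchmark) are made up here, and it assumes that running showyourwork clean between builds is enough to force the downloads and figures to be regenerated, which I have not verified for every setup.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd to completion and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def benchmark(core_counts,
              build_cmd=("showyourwork", "build"),
              clean_cmd=("showyourwork", "clean")):
    """Time one build per core count, cleaning first so nothing is reused.

    Assumption: clean_cmd resets the build state; pass None to skip it.
    """
    results = {}
    for n in core_counts:
        if clean_cmd is not None:
            subprocess.run(list(clean_cmd), check=True)
        print(f"building with --cores {n}", file=sys.stderr)
        results[n] = time_command([*build_cmd, "--cores", str(n)])
    return results


# Example (run from the root of the article repository):
# for n, seconds in benchmark([1, 2, 4]).items():
#     print(f"--cores {n}: {seconds:.1f} s")
```

A single run per setting is noisy, especially when download speed varies, so it is probably worth repeating each measurement a few times.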

I am unsure if it can be used on the GitHub action (I am less familiar with the action implementation). Perhaps @dfm knows?

Dec 11 '23 08:12 arm61

Good point! It does look like the default argument there is "1" for some reason:

https://github.com/showyourwork/showyourwork/blob/64fb8cc7dbde5a0ee68ba5b4731372ae6492e5d4/src/showyourwork/cli/main.py#L19-L25

I really thought it was set to "all", which would probably be a better default. Unfortunately I'm not sure that there is any way to use this option on GitHub right at the moment.

Dec 11 '23 16:12 dfm

I tried --cores all. It works! Thank you so much. @arm61 @dfm

Just to note it here: the progress bars are not displaying correctly. Because the download tasks and rules start at different times, the progress bars are quite messy. I also couldn't figure out how to use this option with GitHub Actions. It would be nice if the setting worked there later on.

Dec 11 '23 19:12 thomasckng

@dfm I would recommend against setting --cores all as a default. It can cause problems with uploading to Zenodo (you get banned from uploading if you upload too many things in a minute) and if a user suddenly has all their cores running showyourwork jobs, that might be annoying. Perhaps just a documentation page on --cores n instead?
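As a middle ground between 1 and all, one could pick a bounded core count explicitly. A minimal sketch, where the function name and the cap of 4 are arbitrary choices of mine and not anything showyourwork provides:

```python
import os


def capped_cores(cap=4, available=None):
    """Return a core count that never exceeds cap or the machine's CPU count."""
    if cap < 1:
        raise ValueError("cap must be at least 1")
    if available is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        available = os.cpu_count() or 1
    return max(1, min(available, cap))


# The result can then be passed on the command line, for example:
# showyourwork build --cores <value printed below>
print(capped_cores())
```

This keeps some cores free for other work and limits how many jobs can hit Zenodo at once, although it does not by itself guarantee staying under any rate limit.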

Dec 13 '23 10:12 arm61