
Feature Request: Restarting Partway Through a Step

Open · opened by alexkrohn 3 years ago · 5 comments

It would be really nice to be able to restart ipyrad from the middle of a step rather than having to restart the entire step.

This is probably most important on Step 3. For example, I was running ipyrad on a large dataset. Steps 1-2 took a few hours, but Step 3 took > 7 days on my machine. When clustering (the longest action in Step 3) was ~85% done after about 6 days of calculations, the power went out.

To my knowledge, in order to restart that ipyrad run, I would have to rerun with -s 34567 and the same params file as before. That would restart Step 3 from the beginning. Given that all of the 85%-complete tmp and cluster files are already there, it would be great if there were enough information in the JSON to restart Step 3 where it failed, rather than from the beginning.
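
For concreteness, that restart (with a placeholder params filename and core count; -p, -s, and -c are the standard ipyrad CLI flags) would look something like this:

```bash
# Re-run steps 3 through 7 against the same params file. As described
# above, this currently restarts step 3 from the beginning rather than
# reusing the ~85%-complete tmp and cluster files.
ipyrad -p params-myassembly.txt -s 34567 -c 40
```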

Thanks!

alexkrohn · Jan 04 '22 14:01

Thanks for the suggestion, and yes, I agree that this would be nice. Unfortunately, it would be a significant amount of work, so we have not prioritized it.

isaacovercast · Jan 04 '22 14:01

That's what I suspected. I figured I'd at least make an official suggestion :-)

alexkrohn · Jan 04 '22 14:01

Adding my voice in again for checkpointing at Step 3. This time I'm running a salamander 3RAD dataset of 125 individuals, where most individuals have 10-20 million reads but one has 230 million. Most individuals finished clustering in 7 days (with 500 GB RAM, 40 cores, and 3 TB of allocated hard drive space), but that largest individual is still clustering after 15 days. I'm coming up on the deadline for this analysis. It would be wonderful to be able to restart the run with all individuals minus the 230-million-read individual, without having to redo the 7 days of clustering.
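
One possible workaround with current ipyrad, sketched below, is its branching feature: create a new named assembly that keeps only a subset of samples. Whether the 7 days of finished clustering carry over depends on when ipyrad last saved the assembly JSON before the interruption, so treat that as an assumption to verify; the branch name and keep-samples.txt file here are hypothetical:

```bash
# Branch the assembly, keeping only the samples named in keep-samples.txt
# (one sample name per line; everything except the 230-million-read
# individual). Branching copies the per-sample state into a new assembly.
ipyrad -p params-salamanders.txt -b salamanders-sub keep-samples.txt

# Continue from step 3 on the new branch. Samples whose saved state is
# already step-3-complete should be skipped; any that were not saved as
# complete will re-run step 3 from scratch.
ipyrad -p params-salamanders-sub.txt -s 34567 -c 40
```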

alexkrohn · Dec 16 '22 16:12

Oh man, I feel your pain. This is still a good idea, I am with you on that. At the moment I still can't make any promises (it's faculty job season), but I will try to think about whether there's a quick and dirty way to hotwire it.

You have probably already thought of this, but an alternative is to carve off the first 20 million reads from the huge sample and just go forward with that. The extreme number of extra reads for that one sample isn't really going to do much for you except burn CPU time (10-20 million reads per sample is a LOT). I know it would mean throwing away data, but I do not think it would meaningfully change any of the results. Just an idea.
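
For the record, a minimal sketch of that carve-off, assuming standard 4-line-per-read gzipped fastq files (filenames are placeholders; 20 million reads is 80 million lines):

```bash
# Take the first 20 million reads (4 lines per read) from the huge
# sample; head closing the pipe early is expected and harmless.
zcat huge_sample_R1.fastq.gz | head -n 80000000 | gzip > huge_sample_sub_R1.fastq.gz

# For paired-end 3RAD data, do the same to R2 so the mates stay in sync.
zcat huge_sample_R2.fastq.gz | head -n 80000000 | gzip > huge_sample_sub_R2.fastq.gz
```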

isaacovercast · Dec 16 '22 16:12

I totally understand that this would be a significant lift. I figured I'd just put it out there that this would still be a useful feature 😃

With a genome likely bigger than 50 Gb, 10-20 million reads per individual rarely gets us more than 1-2k orthologous loci 😅 We're actually using this big run to design baits around SNPs of interest, to cut down on the sequencing depth needed. Since clustering the final individual is only using 1-2 cores, I'll probably start another run elsewhere with just the first 20 million reads of the huge individual.

alexkrohn · Dec 16 '22 18:12