
Checkpointing in parallel

Open · HaleySchuhl opened this issue 1 year ago · 0 comments

Is your feature request related to a problem? Please describe. In very large datasets, significant amounts of data and compute time can be wasted re-running workflows just to recover a subset of failures. For example, if a user runs out of memory in the middle of a parallel run, half of the images in the dataset may already have been analyzed, but because the process could not complete, all images must be rerun rather than only those that had not yet been processed. We want to create a checkpoint log that lets users see which files have already been processed versus those that still need analysis. Additionally, process_results should ideally handle incomplete parallel runs by failing gracefully on problematic images while still concatenating data from the successful reps (see the sketch below).
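A minimal sketch of what graceful concatenation could look like, assuming one JSON result file per image (which is how plantcv parallel workflows typically save data). The function name `concat_results` and the return shape are hypothetical, not part of the plantcv API:

```python
import glob
import json

def concat_results(result_dir):
    """Concatenate per-image JSON results, skipping failed or partial files."""
    combined, skipped = [], []
    for path in glob.glob(f"{result_dir}/*.json"):
        try:
            with open(path, "r") as f:
                combined.append(json.load(f))
        except (json.JSONDecodeError, OSError):
            # A truncated or unreadable file means the rep failed mid-run;
            # record it for re-analysis rather than crashing the merge.
            skipped.append(path)
    return combined, skipped
```

The `skipped` list could then be reported to the user or fed back into the checkpoint log described below.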

Highly related is the behavior of pcv.analyze_object and brainstorming how to make that function fail gracefully and informatively. Currently, people often code around this step since it can fail fatally on empty-pot or extremely-small-plant images.
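The workaround users write today usually looks something like the sketch below: guard the call with a mask check and a try/except so an empty pot produces a recorded skip instead of a crash. The threshold value is illustrative, and `img`, `obj`, `mask`, and `filename` are assumed to come from earlier workflow steps:

```python
import numpy as np
from plantcv import plantcv as pcv

MIN_PLANT_PIXELS = 50  # illustrative cutoff for "extremely small plant"

if np.count_nonzero(mask) < MIN_PLANT_PIXELS:
    # Fail informatively: record the empty/near-empty rep and move on.
    print(f"Skipping {filename}: no plant detected in mask")
else:
    try:
        shape_img = pcv.analyze_object(img=img, obj=obj, mask=mask)
    except Exception as err:
        print(f"analyze_object failed on {filename}: {err}")
```

Building equivalent guards into analyze_object itself would remove the need for this boilerplate in every user workflow.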

Describe the solution you'd like Create a checkpoint log: write each filepath to the log during job building (a list of all jobs), then remove filepaths from the log as their jobs complete and save data out successfully. At the end of the workflow, the filepaths remaining in the checkpoint log are the reps that need re-analysis.
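A minimal sketch of that mechanism, assuming a plain-text log with one filepath per line; the log filename and helper function names are hypothetical, not existing plantcv API:

```python
import os

def init_checkpoint(log_path, filepaths):
    """Record every queued image filepath before any jobs run."""
    with open(log_path, "w") as log:
        log.write("\n".join(filepaths) + "\n")

def mark_complete(log_path, filepath):
    """Remove a filepath from the log once its results are saved."""
    with open(log_path, "r") as log:
        remaining = [line.strip() for line in log if line.strip() != filepath]
    with open(log_path, "w") as log:
        log.write("\n".join(remaining) + ("\n" if remaining else ""))

def pending_jobs(log_path):
    """Filepaths still in the log are the reps that need re-analysis."""
    if not os.path.exists(log_path):
        return []
    with open(log_path, "r") as log:
        return [line.strip() for line in log if line.strip()]
```

One design consideration: rewriting the whole file on every completion is O(n) per job and racy with many parallel workers, so an append-only "completed" log (with pending reps computed as the set difference between queued and completed filepaths) may be safer in practice.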

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context, sample data, or code relevant to the feature request here.

HaleySchuhl · May 11 '23 18:05