Added percentage complete and expected time remaining to `ProgressMeter`

Open jonn-smith opened this issue 1 year ago • 5 comments

Added percentage complete and expected time remaining to ProgressMeter. This behavior requires a sequence dictionary to be passed to ProgressMeter.

Connected GATKTool to use the new functionality in the case where a SAMSequenceDictionary is defined.
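To make the idea concrete, here is a minimal sketch of how a percentage and a remaining-time estimate could be derived from contig lengths. It is not the code in this PR: `GenomeProgressEstimator` and its methods are hypothetical names, and a plain `Map` of contig lengths stands in for the sequence dictionary.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not GATK's ProgressMeter: estimates progress from a genomic position.
class GenomeProgressEstimator {
    // Number of reference bases that precede each contig, in dictionary order.
    private final Map<String, Long> contigStartOffsets = new LinkedHashMap<>();
    private long totalLength = 0;

    // contigLengths must iterate in reference order (for example a LinkedHashMap).
    GenomeProgressEstimator(Map<String, Long> contigLengths) {
        for (Map.Entry<String, Long> entry : contigLengths.entrySet()) {
            contigStartOffsets.put(entry.getKey(), totalLength);
            totalLength += entry.getValue();
        }
    }

    // Fraction of the reference at or before (contig, 1-based position).
    double fractionComplete(String contig, long position) {
        Long offset = contigStartOffsets.get(contig);
        if (offset == null) {
            throw new IllegalArgumentException("Unknown contig: " + contig);
        }
        return Math.min(1.0, (offset + position) / (double) totalLength);
    }

    // Linear extrapolation: assumes the rest of the run proceeds at the average rate so far.
    static double estimateSecondsRemaining(double elapsedSeconds, double fraction) {
        if (fraction <= 0.0) {
            return Double.POSITIVE_INFINITY;
        }
        return elapsedSeconds * (1.0 - fraction) / fraction;
    }
}
```

With contigs of 1000 and 3000 bases, position 1000 on the second contig gives a fraction of 0.5, and 60 seconds elapsed at 25% complete extrapolates to 180 seconds remaining.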

jonn-smith avatar Aug 18 '23 19:08 jonn-smith

Github actions tests reported job failures from actions build 5906500357. Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| cloud | 17.0.6+10 | 5906500357.10 | logs |
| unit | 17.0.6+10 | 5906500357.12 | logs |
| integration | 17.0.6+10 | 5906500357.11 | logs |
| conda | 17.0.6+10 | 5906500357.3 | logs |
| unit | 17.0.6+10 | 5906500357.1 | logs |
| variantcalling | 17.0.6+10 | 5906500357.2 | logs |
| integration | 17.0.6+10 | 5906500357.0 | logs |

gatk-bot avatar Aug 18 '23 20:08 gatk-bot

Github actions tests reported job failures from actions build 5906807871. Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| cloud | 17.0.6+10 | 5906807871.10 | logs |
| unit | 17.0.6+10 | 5906807871.12 | logs |
| integration | 17.0.6+10 | 5906807871.11 | logs |
| unit | 17.0.6+10 | 5906807871.1 | logs |
| conda | 17.0.6+10 | 5906807871.3 | logs |
| variantcalling | 17.0.6+10 | 5906807871.2 | logs |
| integration | 17.0.6+10 | 5906807871.0 | logs |

gatk-bot avatar Aug 18 '23 20:08 gatk-bot

We had something similar in GATK3 but it ended up causing confusion since it was often wildly inaccurate.

I don't think this is necessarily a bad idea, but I'm not sure this implementation is sufficient. As far as I can tell, it only looks at the tool's current position and reports that as the fraction of the whole reference traversed so far. I think if we wanted to enable % completion by default, we'd have to take into account the actual intervals that are specified and track how many bases we've covered out of how many bases we expect to cover.
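The interval-aware bookkeeping described above might look roughly like this. It is a sketch under assumptions, not GATK code: `IntervalProgressEstimator` and `Interval` are hypothetical names, and the intervals are assumed to be sorted in traversal order and non-overlapping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not GATK code: progress measured against the requested intervals only.
class IntervalProgressEstimator {
    // A closed interval [start, end] on one contig, using 1-based inclusive coordinates.
    static final class Interval {
        final String contig;
        final long start;
        final long end;

        Interval(String contig, long start, long end) {
            this.contig = contig;
            this.start = start;
            this.end = end;
        }

        long length() {
            return end - start + 1;
        }
    }

    private final List<Interval> intervals; // assumed sorted in traversal order, non-overlapping
    private long totalBases = 0;

    IntervalProgressEstimator(List<Interval> intervals) {
        this.intervals = new ArrayList<>(intervals);
        for (Interval interval : this.intervals) {
            totalBases += interval.length();
        }
    }

    // Fraction of requested bases covered once traversal has reached `position`
    // inside the interval at index `intervalIndex`.
    double fractionComplete(int intervalIndex, long position) {
        long covered = 0;
        for (int i = 0; i < intervalIndex; i++) {
            covered += intervals.get(i).length();
        }
        Interval current = intervals.get(intervalIndex);
        // Clamp so positions outside the current interval cannot over- or under-count.
        long clamped = Math.max(current.start - 1, Math.min(position, current.end));
        covered += clamped - current.start + 1;
        return totalBases == 0 ? 1.0 : covered / (double) totalBases;
    }
}
```

For intervals covering 100 and 300 bases, being 100 bases into the second interval counts 200 of 400 requested bases, i.e. 50%, regardless of where those intervals sit on the reference.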

It's also going to be inaccurate for things like small files that don't actually cover the entire reference space / interval list.

A different and possibly more accurate approach for whole file scans would be to look at how much of the file has been chewed through already, and compare it to the total file size. That would need some additional instrumentation in various places though and would be really tricky to connect to the interval list.
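As a rough illustration of the file-size idea, a wrapper stream could count the bytes consumed so far. This is only a sketch, not GATK code: `ByteCountingInputStream` is a hypothetical name, and it assumes the total size is known up front. For compressed inputs it would presumably have to wrap the raw file stream, beneath any decompression layer, for the ratio to be meaningful.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch, not GATK code: reports progress as bytesRead / totalBytes.
class ByteCountingInputStream extends FilterInputStream {
    private final long totalBytes;
    private long bytesRead = 0;

    ByteCountingInputStream(InputStream in, long totalBytes) {
        super(in);
        this.totalBytes = totalBytes;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            bytesRead++; // -1 signals end of stream and is not counted
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        bytesRead += skipped; // skipped bytes still count as consumed
        return skipped;
    }

    double fractionComplete() {
        return totalBytes <= 0 ? 0.0 : Math.min(1.0, bytesRead / (double) totalBytes);
    }
}
```

This says nothing about intervals: with an index-driven query that seeks around the file, bytes consumed would no longer track the fraction of work done.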

lbergelson avatar Aug 18 '23 21:08 lbergelson

Yeah - this makes two assumptions: that the data are evenly distributed and that progress is constant. This isn't the most accurate way to do it, but something is better than nothing.

There is another implementation that I considered - base the remaining time on the time it has taken for the last N updates. This would account for bursty processing times, but would also result in wildly fluctuating estimates (because it would still assume a uniform distribution of data). We can also do something like this with a sliding window average to smooth it out. If you prefer another implementation I can change it, but again - something is better than nothing.
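A sliding-window variant could be sketched like this. Again, it is an illustration rather than anything in the PR: `SlidingWindowEta` is a hypothetical name, and the caller is assumed to supply a timestamp and a fraction-complete value at each update.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not GATK code: ETA from the average rate over the last N progress updates.
class SlidingWindowEta {
    private final int windowSize;
    // Each entry is {timeSeconds, fractionComplete}, oldest first.
    private final Deque<double[]> samples = new ArrayDeque<>();

    SlidingWindowEta(int windowSize) {
        if (windowSize < 2) {
            throw new IllegalArgumentException("windowSize must be at least 2");
        }
        this.windowSize = windowSize;
    }

    void update(double timeSeconds, double fractionComplete) {
        samples.addLast(new double[] {timeSeconds, fractionComplete});
        if (samples.size() > windowSize) {
            samples.removeFirst(); // drop the oldest sample to keep the window bounded
        }
    }

    // Remaining seconds, extrapolated from the rate between the oldest and newest samples in the window.
    double secondsRemaining() {
        if (samples.size() < 2) {
            return Double.NaN; // not enough data yet
        }
        double[] oldest = samples.peekFirst();
        double[] newest = samples.peekLast();
        double dt = newest[0] - oldest[0];
        if (dt <= 0.0) {
            return Double.NaN;
        }
        double rate = (newest[1] - oldest[1]) / dt;
        if (rate <= 0.0) {
            return Double.POSITIVE_INFINITY; // no forward progress inside the window
        }
        return (1.0 - newest[1]) / rate;
    }
}
```

A larger window gives steadier estimates but reacts more slowly to a change in speed; a window spanning the whole run reduces to the plain linear extrapolation.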

I don't want to have to scan the input data to make the progress bar work - that seems way too heavy-handed and would slow everything down. The tradeoff doesn't seem worth it.

For small files it doesn't matter anyway, so I'm not too concerned.

This arose for me because I've been needing to wait many hours for jobs to finish and I would like an estimate of when I can expect them to finish.

jonn-smith avatar Aug 18 '23 22:08 jonn-smith

Github actions tests reported job failures from actions build 5930144661. Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| integration | 17.0.6+10 | 5930144661.11 | logs |
| integration | 17.0.6+10 | 5930144661.0 | logs |

gatk-bot avatar Aug 21 '23 19:08 gatk-bot