gatk
Added percentage complete and expected time remaining to `ProgressMeter`
Added percentage complete and expected time remaining to `ProgressMeter`. This behavior requires a sequence dictionary to be passed to `ProgressMeter`.

Connected `GATKTool` to use the new functionality in the case where a `SAMSequenceDictionary` is defined.
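To make the discussion below concrete, here is a minimal sketch of the kind of estimate being added: percent complete from the current position over a known total (e.g. total bases in a `SAMSequenceDictionary`), and remaining time by extrapolating the average rate so far. This is illustrative only, not the actual `ProgressMeter` code; class and method names are invented.

```java
// Sketch (not the GATK implementation): percent complete and naive ETA
// from a known total and elapsed time. All names here are illustrative.
public class ProgressEstimate {

    /** Fraction complete in [0, 1], given processed and total units. */
    public static double fractionComplete(long processed, long total) {
        return total <= 0 ? 0.0 : Math.min(1.0, (double) processed / total);
    }

    /** Remaining-time estimate assuming a constant processing rate. */
    public static double estimatedSecondsRemaining(long processed, long total, double elapsedSeconds) {
        if (processed <= 0 || total <= 0 || elapsedSeconds <= 0) return Double.NaN;
        double unitsPerSecond = processed / elapsedSeconds; // average rate so far
        return (total - processed) / unitsPerSecond;
    }

    public static void main(String[] args) {
        // 25% done after 10s -> ~30s remaining under the constant-rate assumption
        System.out.println(fractionComplete(250, 1000));                 // 0.25
        System.out.println(estimatedSecondsRemaining(250, 1000, 10.0));  // 30.0
    }
}
```

The constant-rate assumption is what the review comments below push back on.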
Github actions tests reported job failures from actions build 5906500357. Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
cloud | 17.0.6+10 | 5906500357.10 | logs |
unit | 17.0.6+10 | 5906500357.12 | logs |
integration | 17.0.6+10 | 5906500357.11 | logs |
conda | 17.0.6+10 | 5906500357.3 | logs |
unit | 17.0.6+10 | 5906500357.1 | logs |
variantcalling | 17.0.6+10 | 5906500357.2 | logs |
integration | 17.0.6+10 | 5906500357.0 | logs |
Github actions tests reported job failures from actions build 5906807871. Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
cloud | 17.0.6+10 | 5906807871.10 | logs |
unit | 17.0.6+10 | 5906807871.12 | logs |
integration | 17.0.6+10 | 5906807871.11 | logs |
unit | 17.0.6+10 | 5906807871.1 | logs |
conda | 17.0.6+10 | 5906807871.3 | logs |
variantcalling | 17.0.6+10 | 5906807871.2 | logs |
integration | 17.0.6+10 | 5906807871.0 | logs |
We had something similar in GATK3, but it ended up causing confusion since it was often wildly inaccurate.
I don't think this is necessarily a bad idea, but I'm not sure this implementation is sufficient. As far as I can tell, it only looks at the tool's current position and compares it to the fraction of the reference traversed. If we wanted to enable % completion by default, we'd have to take into account the actual intervals that were specified and track how many bases we've covered out of how many bases we expect to cover.
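The interval-aware variant suggested here could be sketched roughly as follows: sum the lengths of the requested intervals up front, then report progress as bases covered out of that total rather than position over the whole reference. `Interval` below is a stand-in record, not the htsjdk/GATK interval class.

```java
import java.util.List;

// Hedged sketch of interval-aware progress: the denominator is the total
// span of the requested intervals, not the full reference length.
public class IntervalProgress {

    /** Stand-in for a genomic interval; 1-based, inclusive on both ends. */
    public record Interval(long start, long end) {
        public long length() { return end - start + 1; }
    }

    private final long totalBases;
    private long coveredBases = 0;

    public IntervalProgress(List<Interval> intervals) {
        // Precompute the expected number of bases once, at startup.
        this.totalBases = intervals.stream().mapToLong(Interval::length).sum();
    }

    /** Record that another chunk of bases within the intervals was traversed. */
    public void addCovered(long bases) { coveredBases += bases; }

    public double fractionComplete() {
        return totalBases == 0 ? 0.0 : (double) coveredBases / totalBases;
    }
}
```

This still assumes covered bases are counted accurately as the traversal proceeds, which is the extra bookkeeping the comment alludes to.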
It's also going to be inaccurate for things like small files that don't actually cover the entire reference space / interval list.
A different and possibly more accurate approach for whole-file scans would be to look at how much of the file has been chewed through already and compare it to the total file size. That would need some additional instrumentation in various places, though, and would be really tricky to connect to the interval list.
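The byte-counting idea could look something like the following: wrap the input stream so bytes consumed are counted, then divide by the file length. `CountingInputStream` here is a minimal hand-rolled wrapper for illustration, not an existing GATK class.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of file-size-based progress: fraction of bytes consumed so far.
public class ByteProgress {

    /** Wraps a stream and counts how many bytes have been read through it. */
    public static class CountingInputStream extends FilterInputStream {
        private long bytesRead = 0;

        public CountingInputStream(InputStream in) { super(in); }

        @Override public int read() throws IOException {
            int b = super.read();
            if (b >= 0) bytesRead++;
            return b;
        }

        @Override public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) bytesRead += n;
            return n;
        }

        public long getBytesRead() { return bytesRead; }
    }

    public static double fractionOfFileConsumed(long bytesRead, long fileLength) {
        return fileLength <= 0 ? 0.0 : Math.min(1.0, (double) bytesRead / fileLength);
    }
}
```

As the comment notes, this measures traversal of the file, not of the interval list, so the two denominators don't line up when intervals restrict the traversal.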
Yeah - this makes two assumptions: that the data are evenly distributed and that the processing rate is constant. This isn't the most accurate way to do it, but something is better than nothing.
There is another implementation that I considered - basing the remaining time on the time taken for the last N updates. This would account for bursty processing times, but would also produce wildly fluctuating estimates (because it would still assume a uniform distribution of data). We can also smooth that out with a sliding-window average. If you prefer another implementation I can change it, but again - something is better than nothing.
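A rough sketch of that sliding-window alternative, assuming updates arrive as (position, timestamp) pairs: keep the last N updates and derive the rate from the span of the window, so old bursts age out. This is an illustration of the idea, not proposed code for the PR.

```java
import java.util.ArrayDeque;

// Sketch: ETA from the processing rate over a window of recent updates.
public class SlidingWindowEta {
    private final int windowSize;
    // Each entry is {position, timeMillis}.
    private final ArrayDeque<long[]> window = new ArrayDeque<>();

    public SlidingWindowEta(int windowSize) { this.windowSize = windowSize; }

    public void update(long position, long timeMillis) {
        window.addLast(new long[]{position, timeMillis});
        if (window.size() > windowSize) window.removeFirst(); // drop stale updates
    }

    /** Remaining millis from the windowed rate, or -1 if not yet estimable. */
    public long estimateRemainingMillis(long totalUnits) {
        if (window.size() < 2) return -1;
        long[] first = window.peekFirst();
        long[] last = window.peekLast();
        long dUnits = last[0] - first[0];
        long dTime = last[1] - first[1];
        if (dUnits <= 0 || dTime <= 0) return -1;
        double unitsPerMilli = (double) dUnits / dTime; // rate over the window only
        return (long) ((totalUnits - last[0]) / unitsPerMilli);
    }
}
```

Even smoothed, this still extrapolates a uniform distribution of data ahead of the current position, as noted above.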
I don't want to have to scan the input data to make the progress bar work - that seems way too heavy-handed and would slow everything down. The tradeoff doesn't seem worth it.
For small files it doesn't matter anyway, so I'm not too concerned.
This arose for me because I've been waiting many hours for jobs to finish, and I would like an estimate of when I can expect them to finish.
Github actions tests reported job failures from actions build 5930144661. Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 17.0.6+10 | 5930144661.11 | logs |
integration | 17.0.6+10 | 5930144661.0 | logs |