Add a PoC disk_estimator feature to the vcf_to_bq preprocessor.
For each input VCF file, the disk_estimator pipeline uses 1) the raw file size, 2) the raw size of the first 50 variant lines, and 3) the encoded size of those same lines to estimate the disk space a shuffle step in the pipeline would require. This estimate is useful because a pipeline that is run with insufficient disk and MergeVariants enabled as a step can fail hours or days after the Dataflow pipeline is kicked off, while the customer is still billed for the compute.
The estimated disk size is emitted in the preprocessor report file as well as the PrintEstimate step of the pipeline.
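The per-file estimate described above can be sketched as follows. This is a minimal illustration of the extrapolation idea, not the actual implementation; the function and variable names are hypothetical:

```python
def estimate_disk_usage(raw_file_size, raw_sample_size, encoded_sample_size):
    """Extrapolate the whole-file encoded size from a small sample.

    The encoded/raw ratio measured on the first 50 sampled variant
    lines is applied to the full raw file size. Illustrative only.
    """
    if raw_sample_size == 0:
        return 0
    encoding_ratio = encoded_sample_size / float(raw_sample_size)
    return int(raw_file_size * encoding_ratio)

# A 10 GB file whose sampled lines grew 4x when encoded needs ~40 GB.
print(estimate_disk_usage(10 * 2**30, 50_000, 200_000))
```

Summing this per-file figure across all inputs gives the pipeline-wide number that is emitted in the report.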
At least three significant things are missing from this pull request, which is why the feature is currently disabled by default:
- Needs unit tests
- Needs integration/e2e tests
- When reading the snippets from files, there needs to be support for a pattern similar to ReadAllFromVcf that can handle reading from a large set (tens of thousands or more) of input files.
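The snippet-reading step from the last bullet can be sketched in plain Python; the real pipeline would do this inside a Beam transform (hence the need for a ReadAll-style pattern), and the names here are illustrative:

```python
import glob
from itertools import islice

SNIPPET_LINES = 50  # variant lines sampled per file, matching the PoC


def read_snippet(path, n=SNIPPET_LINES):
    """Return up to the first n non-header lines of a VCF file."""
    with open(path) as f:
        data_lines = (line for line in f if not line.startswith('#'))
        return list(islice(data_lines, n))


def snippets_for_pattern(pattern):
    """Map each file matching a glob pattern to its snippet."""
    return {path: read_snippet(path) for path in glob.glob(pattern)}
```

The scaling problem in the bullet is that expanding the pattern and opening every file at pipeline-construction time does not work for tens of thousands of inputs; a ReadAll-style transform instead takes file patterns as pipeline elements and fans out the reads.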
The disk estimator is triggered by adding the `--estimate_disk_usage` flag to the `vcf_to_bq_preprocess` pipeline invocation, e.g.:

```
python -m gcp_variant_transforms.vcf_to_bq_preprocess \
  --input_pattern "gs://genomics-public-data/1000-genomes/vcf/*.vcf" \
  --estimate_disk_usage \
  --report_path report.log
```
Output in preprocessor report for 1000-genomes VCFs (invocation above): "Estimated disk usage by Dataflow: 4847 GB. The total raw file sizes summed up to 1231 GB."
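As a sanity check on the numbers above, the reported estimate corresponds to an encoded/raw expansion ratio of roughly 3.9x:

```python
estimated_gb = 4847  # estimated disk usage reported by the preprocessor
raw_gb = 1231        # summed raw input file sizes
print(round(estimated_gb / raw_gb, 2))  # -> 3.94
```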
Issue #67
Thank you so much for adding the estimator, John! This is so exciting! It will be very useful when processing large datasets!
Before I do the review, can you make sure the unit tests pass? You can run all unit tests with `python setup.py test`. Meanwhile, please also ensure Pylint passes (just run `pylint --rcfile=.pylintrc gcp_variant_transforms/`). More details can be found in our development guide.
BTW, please reference Issue #67 in the PR comment (just add something like `Issues: #67`)
to provide more context and make the issue easy to trace.
Pull Request Test Coverage Report for Build 1703
- 72 of 129 (55.81%) changed or added relevant lines in 6 files are covered.
- 1 unchanged line in 1 file lost coverage.
- Overall coverage decreased (-0.5%) to 87.315%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| gcp_variant_transforms/beam_io/vcfio.py | 13 | 16 | 81.25% |
| gcp_variant_transforms/vcf_to_bq_preprocess.py | 0 | 8 | 0.0% |
| gcp_variant_transforms/beam_io/vcf_file_size_io.py | 34 | 80 | 42.5% |
| Total: | 72 | 129 | 55.81% |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| gcp_variant_transforms/vcf_to_bq_preprocess.py | 1 | 0.0% |
| Total: | 1 | |
| Totals | |
|---|---|
| Change from base Build 1697: | -0.5% |
| Covered Lines: | 7083 |
| Relevant Lines: | 8112 |
💛 - Coveralls
Hi Allie, sorry about that, done; thanks!
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.
Let me know how this looks; if it looks good, then I can squash these commits and rebase with the newest changes to master. Thanks again!
Rebased and fixed with the final fixes; thanks so much for your patience!
Thanks for the comments! I had some trouble developing locally on Mac, but it seems to work again after a package update.
Travis says it's failing, but I can reproduce the failure in an essentially clean client by creating a new PR; it seems like something is wrong with the cbuild at the moment: https://travis-ci.org/googlegenomics/gcp-variant-transforms/builds/580488773?utm_source=github_status&utm_medium=notification