GenomicsDB excessive logging "[E::faidx_adjust_position] The sequence "chrX" was not found"
Bug Report
Affected tool(s) or class(es)
GenomicsDBImport
Affected version(s)
- [ ] Latest public release version [version?]
- [x] Latest master branch as of Apr 4, 2022
Description
[E::faidx_adjust_position] The sequence "chrX" was not found
[E::faidx_adjust_position] The sequence "chrX" was not found
[E::faidx_adjust_position] The sequence "chrX" was not found
[E::faidx_adjust_position] The sequence "chrX" was not found
Steps to reproduce
Run the first test case for GnarlyGenotyperIntegrationTest::testUsingGenomicsDB() on the branch https://github.com/broadinstitute/gatk/pull/7750
The test contains the argument --intervals chrX:1000000-5000000, but I'm not sure why that would be an issue. The tool runs fine and the output is valid.
Expected behavior
An informative warning or a single output of the existing warning
Actual behavior
Excessive logging
> The test contains the argument --intervals chrX:1000000-5000000, but I'm not sure why that would be an issue.
This is from htslib::faidx_fetch_seq_into_buffer because the reference for the test does not contain the contig chrX. We could just log this once and continue. Is this what you want? Or do you want an exception at this point?
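For illustration, the "log once and continue" option could look roughly like the sketch below. It is written in Java only to show the idea; the message in this issue is actually emitted by htslib inside GenomicsDB, so a real fix would live in that native code. The class name `MissingContigWarner`, the method `warnOnce`, and the wording of the warning are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of GATK, GenomicsDB, or htslib):
 * prints a "contig not found" warning only the first time a contig is seen.
 */
final class MissingContigWarner {
    // Thread-safe set of contigs for which a warning has already been printed.
    private final Set<String> alreadyWarned = ConcurrentHashMap.newKeySet();

    /**
     * Returns true if a warning was printed for this contig,
     * false if it was suppressed because the contig was already reported.
     */
    boolean warnOnce(String contig) {
        // Set.add returns true only when the element was not already present.
        if (alreadyWarned.add(contig)) {
            System.err.println("WARNING: the sequence \"" + contig
                    + "\" was not found in the reference; "
                    + "further warnings for this contig are suppressed");
            return true;
        }
        return false;
    }
}
```

With this shape, repeated lookups of chrX would produce a single warning line rather than one per query, while a second missing contig would still get its own warning.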
The VCF sequence dictionary does contain a chrX -- is that enough? A lot of our tools only need a dictionary and can get one from the header of various file types.
Otherwise I think an exception would be appropriate. If that was the only reference a user had, would they be able to query the GenomicsDB successfully?
Sequence dictionary is not enough -- we actually need the reference because GenomicsDB uses that to fill in the reference base in some cases. For this reason, the reference is a required argument when reading from GenomicsDB, but as this issue outlines we probably should go one step further and validate that the intervals being queried are in the reference. We can add this to GenomicsDB but it's probably better to have a check done in GATK so that we fail fast.
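A minimal sketch of what such a fail-fast check might look like is below. It assumes the set of contig names in the reference is already available (in GATK that would presumably come from the reference's sequence dictionary) and that intervals are given as `contig:start-end` strings. The class `IntervalReferenceCheck` and its methods are hypothetical names for illustration, not existing GATK API.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical fail-fast check: every queried interval must name a contig present in the reference. */
final class IntervalReferenceCheck {

    /**
     * Extracts the contig name from "contig:start-end", or returns the whole
     * string if there is no colon. Simplification: a bare contig name that
     * itself contains ':' would be misparsed by this sketch.
     */
    static String contigOf(String interval) {
        int colon = interval.lastIndexOf(':');
        return colon < 0 ? interval : interval.substring(0, colon);
    }

    /** Throws IllegalArgumentException on the first interval whose contig is not in the reference. */
    static void validate(List<String> intervals, Set<String> referenceContigs) {
        for (String interval : intervals) {
            String contig = contigOf(interval);
            if (!referenceContigs.contains(contig)) {
                throw new IllegalArgumentException(
                        "Interval " + interval + " refers to contig \"" + contig
                        + "\", which is not present in the supplied reference");
            }
        }
    }
}
```

Run before the GenomicsDB query starts, a check like this would turn the repeated faidx messages for `--intervals chrX:1000000-5000000` into a single upfront error when the reference has no chrX.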
It is interesting, though, that the results seem valid. Presumably having the reference base as 'N' in some cases doesn't affect it?