GenomicsDB excessive logging "[E::faidx_adjust_position] The sequence "chrX" was not found"

Open ldgauthier opened this issue 2 years ago • 3 comments

Bug Report

Affected tool(s) or class(es)

GenomicsDBImport

Affected version(s)

  • [ ] Latest public release version [version?]
  • [x] Latest master branch as of Apr 4, 2022

Description

[E::faidx_adjust_position] The sequence "chrX" was not found
[E::faidx_adjust_position] The sequence "chrX" was not found
[E::faidx_adjust_position] The sequence "chrX" was not found
[E::faidx_adjust_position] The sequence "chrX" was not found

Steps to reproduce

Run the first test case for GnarlyGenotyperIntegrationTest::testUsingGenomicsDB() on the branch https://github.com/broadinstitute/gatk/pull/7750

The test contains the argument --intervals chrX:1000000-5000000, but I'm not sure why that would be an issue. The tool runs fine and the output is valid.

Expected behavior

An informative warning or a single output of the existing warning

Actual behavior

Excessive logging

ldgauthier, Apr 04 '22 19:04

> The test contains the argument --intervals chrX:1000000-5000000, but I'm not sure why that would be an issue.

This is from htslib::faidx_fetch_seq_into_buffer because the reference for the test does not contain the contig chrX. We could just log this once and continue. Is this what you want? Or do you want an exception at this point?
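For illustration, the "log this once and continue" option could look roughly like the sketch below. It is plain Java with no GATK or GenomicsDB dependencies, and `MissingContigWarner` and `warnOnce` are hypothetical names invented for this example, not existing code. The message in this issue originates in native code, so an actual fix would need the equivalent logic there; this only shows the deduplication idea.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not part of GATK, GenomicsDB, or htslib):
// remembers which contigs have already been reported so that each
// missing contig produces a single warning instead of one per lookup.
final class MissingContigWarner {
    // Thread-safe set of contig names that have already triggered a warning.
    private final Set<String> alreadyWarned = ConcurrentHashMap.newKeySet();

    /**
     * Logs a warning the first time a contig is reported as missing.
     *
     * @return true if a warning was emitted, false if it was suppressed
     */
    boolean warnOnce(String contig) {
        // Set.add returns false when the element was already present.
        boolean firstTime = alreadyWarned.add(contig);
        if (firstTime) {
            System.err.println("WARN: sequence \"" + contig
                    + "\" was not found in the reference; "
                    + "further messages for this contig are suppressed");
        }
        return firstTime;
    }
}
```

Calling `warnOnce("chrX")` repeatedly would print one line for chrX, while a different missing contig would still get its own single warning.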

nalinigans, Apr 07 '22 16:04

The VCF sequence dictionary does contain a chrX -- is that enough? A lot of our tools only need a dictionary and can get one from the header of various file types.

Otherwise I think an exception would be appropriate. If that was the only reference a user had, would they be able to query the GenomicsDB successfully?

ldgauthier, Apr 07 '22 19:04

Sequence dictionary is not enough -- we actually need the reference because GenomicsDB uses that to fill in the reference base in some cases. For this reason, the reference is a required argument when reading from GenomicsDB, but as this issue outlines we probably should go one step further and validate that the intervals being queried are in the reference. We can add this to GenomicsDB but it's probably better to have a check done in GATK so that we fail fast.
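A minimal sketch of what such a fail-fast check might look like, assuming the reference's contig names are already available as a set of strings and the intervals are given as "contig:start-end" strings. `IntervalContigCheck`, `contigOf`, and `requireContigsInReference` are hypothetical names for illustration only; a real GATK check would presumably work from the reference's sequence dictionary and GATK's own interval parsing rather than raw strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch (not GATK code): reject queries whose interval
// contigs are absent from the reference before any GenomicsDB read starts.
final class IntervalContigCheck {

    // Extracts the contig from "contig:start-end"; a string with no colon
    // is treated as a bare contig name. Simplification: a bare contig name
    // that itself contains ':' would be misparsed here.
    static String contigOf(String interval) {
        int colon = interval.lastIndexOf(':');
        return colon < 0 ? interval : interval.substring(0, colon);
    }

    // Throws if any interval refers to a contig the reference lacks,
    // listing every missing contig once.
    static void requireContigsInReference(List<String> intervals,
                                          Set<String> referenceContigs) {
        List<String> missing = new ArrayList<>();
        for (String interval : intervals) {
            String contig = contigOf(interval);
            if (!referenceContigs.contains(contig) && !missing.contains(contig)) {
                missing.add(contig);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                    "Interval contig(s) not present in the reference: " + missing);
        }
    }
}
```

With a reference that lacks chrX, passing the interval chrX:1000000-5000000 would raise an exception up front instead of producing repeated faidx errors during the query.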

It is interesting, though, that the results seem valid. Presumably having the reference base as 'N' in some cases doesn't affect them?

mlathara, Apr 11 '22 15:04