gatk icon indicating copy to clipboard operation
gatk copied to clipboard

when I use gatk to identify variants in a bam file converted from gam using vg surject, I get error

Open ChenDepp opened this issue 8 months ago • 7 comments

@jmarshall when i use (vg surject) convert gam to bam, then use gatk detect variants, it report error as bleow Image

the genome no chromosome is longer than 512M,i can't solve it by myself, can you give me some suggestion, waiting for your reply!

ChenDepp avatar Mar 13 '25 03:03 ChenDepp

Hi @ChenDepp This is probably an error with HTSJDK's CSI decoder for BAM files. Can you reindex your bam file with BAI index and try again. Looks like an error when calculating bins using min_shift given CSI index. It may be an easy fix but may need additional work on HTSJDK and later on GATK.

gokalpcelik avatar Mar 13 '25 19:03 gokalpcelik

Hmn, we might have an off by one error somewhere...

lbergelson avatar Mar 13 '25 20:03 lbergelson

Hi @ChenDepp This is probably an error with HTSJDK's CSI decoder for BAM files. Can you reindex your bam file with BAI index and try again. Looks like an error when calculating bins using min_shift given CSI index. It may be an easy fix but may need additional work on HTSJDK and later on GATK.

@gokalpcelik hi,I had tried bai and csi index, it still report error,That's why I asked you for help.

ChenDepp avatar Mar 16 '25 06:03 ChenDepp

Hmn, we might have an off by one error somewhere...

@lbergelson hi,Are you saying it's gatk's problem, or is it my problem when dealing with the file?

ChenDepp avatar Mar 16 '25 07:03 ChenDepp

How did you create index files for your bam?

Looking at the issue that you opened below your problem may not be lying with GATK at all. https://github.com/vgteam/vg/issues/4538

One thing that I can tell is that if "min_shift" required for the index file is not asserted correctly for the writer in whatever tool you are using then it is normal to have index out of bounds errors with the bin array within the index. Unless you are using tools that rely on htslib or htsjdk it may be possible that there is a bug somewhere along the lines of tools that generate that index for you. We suggest you to check the integrity of the bam file you generate as well as index with samtools or GATK directly.

gokalpcelik avatar Mar 16 '25 07:03 gokalpcelik

The BAM here is initially coming from our vg tool, but vg doesn't generate either .bai or .csi indexes for BAM files. Nor does it sort BAM files itself. So there should be a sorting and indexing tool involved between vg and GATK.

There are a number of things that might be wrong with the BAM file:

  1. Not actually sorted?
  2. Not indexed properly?
  3. Actually has alignments outside the sequence bounds in the @SQ headers? (This would be a bug we could presumably fix in vg, if we could show it was happening for a particular named read.)
  4. Has @SQ headers that describe contigs that are shorter than the corresponding contigs in the FASTA? (We know of circumstances under which vg is expected to do this. It doesn't always have access to the original reference FASTA's sequence lengths when making BAMs, so it can only report the length up to the latest interval on the contig that it knows exists. Is this known to not be allowed by HTSJDK?)
  5. Is missing @SQ headers for contigs that are in the FASTQ?

As the vg author, I need GATK to produce a validation error here (or I need a validation command line I can recommend people try) so I can find out what has gone wrong for the user(s) and fix it upstream in vg, or write documentation explaining a recommended downstream sorting/indexing/reheadering pipeline.

I've gotten an email from another user having a similar out-of-range problem with IGV, on a BAM generated for GRCh38 from a graph that only partially includes it. So I would suspect point number 4, about the shorter contig lengths in the BAM header, might be the problem, and GATK doesn't produce a validation error pointing to that and just does an out-of-bounds access.

adamnovak avatar Mar 19 '25 16:03 adamnovak

Hi @adamnovak GATK and Picard have ValidateSamFile tool which can generate any validations based on reference, reads, etc.

GATK walkers tend to keep most debugging messages silent by default so verbosity can be set to DEBUG to generate even more verbose logs. Additionally the below parameter is available to throw more detailed stack traces for user errors --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'

It is possible that there may be reads extending or mapping (without clipping) outside of the contig in query therefore it ends up being an invalid BAM.

Regards.

gokalpcelik avatar Mar 19 '25 17:03 gokalpcelik