vg surjected bam file does not open correctly in igv
1. What were you trying to do? I was trying to open a surjected BAM file in IGV to compare with the mapping results of BWA mem.
2. What did you want to happen?
3. What actually happened?
I got the errror Error querying alignments for : AB1012.ab.sorted.bam Error message: Index 4965 out of bounds for length 4695.
4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:
Place stacktrace here.
5. What data and command can the vg dev team use to make the problem happen?
I have a gbz file created from minigraph-cactus pipeline named ab.gbz and ab.d2.gbz. I had mapped paired-end reads using vg giraffe to ab.d2.gbz.
vg giraffe -Z ab.d2.gbz -f ${prefix}SRR11676271_1.fastq.gz -f ${prefix}SRR11676271_2.fastq.gz -o gaf | bgzip > ${outdir}/AB1012.ab.vg_giraffe.gaf.gz
which returned warnings:
Guessing that ab.d2.dist is Giraffe Distance Index
Guessing that ab.d2.min is Short Read Minimizers
Rebuilding minimizer index to include zipcodes
[IndexRegistry]: Constructing minimizer index and associated zipcodes.
use parameters -k 29 -w 11
warning[vg::giraffe]: Refusing to perform too-large rescue alignment of 251 bp against 35384 bp ordered subgraph for read SRR11676271.1988 which would use more than 1572864 cells and might exhaust Dozeu's allocator; suppressing further warnings.
warning[vg::giraffe]: Refusing to perform too-large tail alignment of 115 bp against 20962 bp tree which would use more than 1572864 cells and might exhaust Dozeu's allocator; suppressing further warnings.
Because I wanted to surject on a non-reference path, I first converted my gbz file's reference path using vg gbwt.
vg gbwt -Z --set-tag "reference_samples=ASM1844v1" --gbz-format -g ab.ASB1844v1.gbz ab.gbz
This was what I used to get the paths.
vg paths -x ab.ASB1844v1.gbz -S ASM1844v1 -L | tee ASM1844v1.paths.txt
Which outputted the following:
ASM1844v1#0#main
ASM1844v1#0#main[226855]
ASM1844v1#0#main[764017]
ASM1844v1#0#main[1167975]
ASM1844v1#0#main[1361768]
ASM1844v1#0#main[1603193]
ASM1844v1#0#main[1936106]
ASM1844v1#0#main[2044247]
ASM1844v1#0#main[2220280]
ASM1844v1#0#main[2413865]
ASM1844v1#0#main[2500620]
ASM1844v1#0#main[2544022]
ASM1844v1#0#main[2579178]
ASM1844v1#0#main[2878606]
ASM1844v1#0#main[3060855]
ASM1844v1#0#main[3131276]
ASM1844v1#0#main[3680370]
Which was worrying because it seems to be chunked, and I did not know if I have to do something about it.
For the actual projection, this was what I did:
vg surject -x ab.ASB1844v1.gbz -G AB1012.ab.vg_giraffe.gaf.gz --interleaved -F ASM1844v1.paths.txt -b -N ASM1844v1 -R 'ID:AB1012 LB:AB1012 SM:ASM1844v1 PL:illumina PU:unit1' > AB1012.ab.bam
which returned the warning.
warning:[vg::get_sequence_dictionary] Path ASM1844v1#0#main[226855] from sequence dictionary in ASM1844v1.paths.txt looks like part of a path. Output coordinates will be in ASM1844v1#0#main instead. Suppressing further warnings.
I wanted to visualize the surjection in IGV, so this was what I did.
samtools sort AB1012.ab.bam -o AB1012.ab.sorted.bam
samtools index AB1012.ab.sorted.bam
igv.sh AB1012.ab.sorted.bam
which returned the above error.
6. What does running vg version say?
vg version v1.64.0 "Vibbiana"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Using HTSlib headers 101990, library 1.19.1-29-g3cfe8769
Built by [email protected]
Thank you in advance, sorry if I am not being specific.
So, a weird fix I did was to change the bam file header. I changed the samtools header by using reheader to edit the LN to match the reference genome length. I found it matched the length of the chunked coordinate. This seemed to fix the issue with visualizing in IGV, but I am not sure this is the correct method.
We've been getting several people reporting what I think is this problem: IGV and other htsjdk-based software will crash when the sequence length in the BAM is shorter than the one in its FASTA. But when you surject with vg to a path where only some parts of the path are in the graph (like your ASM1844v1#0#main path), vg doesn't have access to the true length of the full path, so it spits out an LN tag that's the minimum possible length of the source path, given all the pieces it has and their lengths.
You should be able to avoid reheadering by passing in a .dict file from the FASTA that has a full-length >ASM1844v1#0#main record, when you do surjection, instead of a .txt path list.
It seems like surject-to-partial-paths and then view-with-IGV is a common use case, though. We might want to think about whether we can get a GBZ to store the full base path lengths somewhere so you don't have to do this. @jltsiren we could cram this into a GBZ "tag", right?
Hello, thanks for the fast reply, @adamnovak !
You should be able to avoid reheadering by passing in a
.dictfile from the FASTA that has a full-length>ASM1844v1#0#mainrecord, when you do surjection, instead of a.txtpath list.
Thank you for the solution!
It seems like surject-to-partial-paths and then view-with-IGV is a common use case, though. We might want to think about whether we can get a GBZ to store the full base path lengths somewhere so you don't have to do this. @jltsiren we could cram this into a GBZ "tag", right?
My use case was that I wanted to project reads to a non-reference path of my minigraph-cactus pangenome. It would be wonderful if the gbz file stores the full length information of each path.
@adamnovak I don't think GBZ tags are the right solution. We already have >10000 full paths in the latest HPRC graphs, and the number is going to rise in the future.
I think we need a two-level metadata model for haplotypes. Maybe something like this:
M GRCh38 0 chr1 LN:i:1000 RF:i:1
W GRCh38 0 chr1 0 400 <path>
W GRCh38 0 chr1 500 800 <path>
M CHM13 0 chr1 FI:i:1
W CHM13 0 chr1 0 * <path>
W CHM13 0 chr1 1 * <path>
First we have GRCh38#0#chr1, which has a known length and is marked as a reference. And then we have CHM13#0#chr1, which has unknown length and fragment identifiers instead of sequence offsets. The metadata lines would be optional, as there are no mandatory fields beyond those that identify the haplotype.
I think in the HPRC graphs we would end up knowing the lengths/offsets for both the GRCh38 and CHM13 base paths. But the ability to say in the GFA that we were using fragment IDs and not offsets would be useful.
Could we cram this into GFA "optional filed" tags on the W lines, instead of introducing a new line type that would break existing parsers? We might have to repeat metadata/have the possibility for metadata from different W lines to not match, but we could carry the data along in a file that e.g. Bandage could still read.
@jltsiren Another thing to consider tracking at base-path level would be md5sum of the full sequence, for generating the M5 tags that we need to write referenced CRAM.