vg output giraffe alignment when --hard-hit-cap exceeded for all minimizers

1. What were you trying to do?

Compare reads aligned with bwa and giraffe, where the bwa reference sequence is also a path found in the graph.

2. What did you want to happen?

Reads with a full-length (no gaps or mismatches) bwa alignment should also have a corresponding giraffe alignment.

3. What actually happened?

giraffe did not generate an alignment for some reads with "perfect" bwa alignments. Using giraffe --show-work I determined that this occurred when all read minimizers had exceeded the default --hard-hit-cap of 500, which caused all minimizers to be rejected and no seeds established. When the --hard-hit-cap was increased to a value above the repetitive minimizer hit counts, many seeds were selected and the read was aligned successfully. The alignment had mapping quality = 0, which seemed appropriate and confirmed that the read had aligned. (indicating the read was not a contaminant or missing from the assembly/graph) The giraffe alignment was then consistent with the bwa alignment.

Of course, the need to evaluate all seeds came at a significant performance penalty. It would be ideal for giraffe to output MQ = 0 alignments for reads described above without the need to increase the --hard-hit-cap. Perhaps an option could be added or default behavior defined to randomly select an (MQ = 0) alignment when all minimizers exceed the --hard-hit-cap and seeds can't be established.

6. What does running vg version say?

vg version v1.37.0 "Monchio"
Compiled with g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 on Linux
Linked against libstd++ 20200808
Built by anovak@octagon

Feb 17 '22 14:02 brianabernathy

I'm not sure how feasible this is.

BWA uses an FM-index to produce variable-length seeds. If there are too many hits (and hence probably too many equivalent alignments), it's quite likely that you get a reasonable alignment by extending an arbitrary occurrence of the longest seed.

Giraffe gets fixed-length seeds from a minimizer index. Because all seeds are of the same length, we can't tell which seeds are more likely to extend into good alignments without taking them all and clustering them (which is often the expensive part). Random sampling won't help much, because we would be sampling a few seeds from many clusters rather than multiple seeds from some clusters.

I guess it depends on whether a randomly chosen seed is likely to extend into a near-optimal alignment in a situation like this, or whether most seeds are from partial matches.

Feb 18 '22 05:02 jltsiren

It seems important to differentiate between reads lacking an alignment due to no minimizers and too many minimizers. If sub-sampling to reduce the search space is not feasible, it may be useful to include a flag in the alignment explicitly indicating excessive mimimizers. (I was initially confused by the "null" alignment records containing only read name, read length and quality along with '*' for the remaining fields.) It could also be useful to have a separate count of these reads in the vg stats output. I was drawn to this issue by discrepancies between the bwa and giraffe alignment stats.

Feb 18 '22 14:02 brianabernathy

I suppose we could skip all the clustering, pull one arbitrary (not necessarily random or evenly distributed!) minimizer, and produce an optimal local alignment based on just that at MAPQ 0.

The problem is, even MAPQ 0 comes with the expectation that the produced alignment is sampled uniformly from the set of globally optimal alignments. But with so many minimizer hits, we can't actually compute that in reasonable time.

@brianabernathy Would your application be well-served if, for example, these reads all ended up mapped with MAPQ 0 at wherever the first shared k-mer is between them and chr1?

Tagging reads that bump up against the hit caps wouldn't be too hard. We could do it in GAM and define SAM/GAF tags to represent the same idea.

Mar 14 '22 15:03 adamnovak

@adamnovak I'm primarily interested in reads that do not contain an alignment in the graph regardless of computational complexity. I'm not particularly interested in the MAPQ 0 alignments themselves. Any method that separates true unalignable reads from reads without an alignment due to excessive minimizers (which currently have identical gam/gaf records) and accurately reflect these counts via vg stats would be ideal.

Tagging repetitive read gam/gaf records seems like a good choice whether repetitive reads maintain all '*' alignment records or use the first shared kmer approach you described. I suppose the first shared kmer approach works seamlessly with vg stats as-is and repetitive reads would be added to the 'Total aligned' count. It may be useful to add a separate vg stats output category for repetitive/computationally-complex read counts though, which could make the first shared kmer positions optional.

Mar 15 '22 15:03 brianabernathy

I think tagging should certainly be added to differentiate contaminants from repeats (as discussed above), but I would be hesitant to create non-random repeat alignments. From an "interpretation of 0 coverage" point-of-view, perhaps a flag to output all graph nodes that spawn max_cap issues for giraffe would be the most useful thing for me. This would at least allow me to reason about 0 coverage nodes more effectively. A partial solution might be to use repeat masker GFF for each reference genome (or all MSA genomes) and vg annotate, but this clearly isn't going to coincide with giraffe behavior very well.

Mar 16 '22 21:03 jnvaughn