
PacBio full-length transcripts with low mapping rate

Open zhangyixing3 opened this issue 6 months ago • 11 comments

Dear developers,

Thank you for developing vg. Recently, I have been trying to analyze PacBio Iso-Seq long-read RNA-seq data against a graph pangenome, but I've found that the alignment rate to the graph genome is lower than to the linear genome. The linear reference is the backbone of the graph. I have 95,679 transcript sequences in total. When aligning to the graph genome, 77,021 were mapped, while 94,976 were successfully mapped to the linear genome. I noticed that during vg alignment there are multiple warnings such as:

T6:     warning[MinimizerMapper::map_from_chains]: No seeds found for transcript/38!
T9:     warning[MinimizerMapper::map_from_chains]: No seeds found for transcript/144!
T4:     warning[MinimizerMapper::map_from_chains]: No seeds found for transcript/201!
T7:     warning[MinimizerMapper::map_from_chains]: No seeds found for transcript/231!
T5:     warning[MinimizerMapper::map_from_chains]: No seeds found for transcript/188!
T9:     warning[MinimizerMapper::map_from_chains]: No seeds found for transcript/309!
T2:     warning[MinimizerMapper::map_from_chains]: No seeds found for transcript/249!
T6:     warning[MinimizerMapper::map_from_chains]: No seeds found for transcript/348!
T1:     warning[MinimizerMapper::map_from_chains]: No seeds found for transcript/360!

These warnings suggest that no seeds were found for many transcripts.

# for linear genome
minimap2 -t 30 -ax splice -uf --secondary=no  Srufi.v20210930.chr.fasta   \
movieX.flnc.polished1.hq.fasta.gz | samtools view -b > 1.bam
# graph pangenome
vg1.65   giraffe -t 10 --progress -b hifi --gbz-name pantran.gbz \
    --fastq-in movieX.flnc.polished1.hq.fasta.gz  --mapq-score-scale 0.01 > giraffe1.gam
$ vg stats  -a  giraffe1.gam --threads 4
vg [warning]: System's vm.overcommit_memory setting is 2 (never overcommit). vg does not work well under these conditions; you may appear to run out of memory with plenty of memory left. Attempting to unsafely reconfigure jemalloc to deal better with this situation.
Total alignments: 95679
Total primary: 95679
Total secondary: 0
Total aligned: 77021
Total perfect: 690
Total gapless (softclips allowed): 13335
Total paired: 0
Total properly paired: 0
Alignment score: mean 2002.36, median 1867, stdev 960.864, max 11393 (1 reads)
Mapping quality: mean 55.4283, median 60, stdev 12.8138, max 60 (65008 reads)
Insertions: 8226784 bp in 412010 read events
Deletions: 1174941 bp in 234091 read events
Substitutions: 895661 bp in 866823 read events
Softclips: 14699150 bp in 104029 read events
Total time: 279.773 seconds
Speed: 341.989 reads/second

Is there any way to improve the mapping rate to the graph pangenome?

zhangyixing3 avatar Jun 12 '25 09:06 zhangyixing3

Something you could do to debug is re-run giraffe with --track-provenance and then use scripts/giraffe-facts.py with its --track-last any option. This will produce a table showing how many reads are lost at each step/filter. For example, if most reads are lost at the hard hit cap filter, then you could try to adjust that cap upwards, or simplify the graph (e.g. via haplotype sampling) to reduce the number of hits for each seed.
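
For example, a minimal sketch of that workflow (file and graph names here are placeholders, not your actual paths):

# Re-run giraffe with provenance tracking, then summarize where reads are lost.
vg giraffe --progress -b hifi --gbz-name graph.gbz \
    --fastq-in reads.fasta.gz --track-provenance > tracked.gam
python scripts/giraffe-facts.py --vg $(which vg) --track-last any tracked.gam facts_out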

faithokamoto avatar Jun 23 '25 16:06 faithokamoto

Thanks, I will give it a try.

zhangyixing3 avatar Jun 26 '25 06:06 zhangyixing3

I followed the suggestion and ran scripts/giraffe-facts.py, but the result doesn't seem to contain much information. I still don't know why nearly 20,000 sequences failed to align to the graph.

$ python ./giraffe-facts.py ../giraffe1.gam default_parameter   --vg /public2/home/off_huangyumin/software/vg1.65  \
 --threads 1   --stages  --track-last any > 111
vg [warning]: System's vm.overcommit_memory setting is 2 (never overcommit). vg does not work well under these conditions; you may appear to run out of memory with plenty of memory left. Attempting to unsafely reconfigure jemalloc to deal better with this situation.
vg [warning]: System's vm.overcommit_memory setting is 2 (never overcommit). vg does not work well under these conditions; you may appear to run out of memory with plenty of memory left. Attempting to unsafely reconfigure jemalloc to deal better with this situation.
While generally herbivorous, giraffes have been observed eating meat and bone from carcasses.
┌─────────────────────────────────────────────────────┐
│                    Giraffe Facts                    │
├─────────────────────────────────────────────────────┤
│Reads                                           95679│
│Mapping speed                              341.99 RPS│
├────┬───────┬───────┬───────┬─────┬────┬───┬────┬────┤
│Item│ Filter│Passing│Failing│ Lost│Lost│Cut│ P  │ R  │
│    │       │(/Read)│(/Read)│reads│    │   │    │    │
├────┼───────┼───────┼───────┼─────┼────┼───┼────┼────┤
│    │Overall│       │       │0    │0   │0  │    │    │
└────┴───────┴───────┴───────┴─────┴────┴───┴────┴────┘

zhangyixing3 avatar Jun 30 '25 07:06 zhangyixing3

Oh, I think those improvements were in the v1.66 version of giraffe-facts. You might want to pull the most recent script from the vg repository. It's not part of the build process; just copy/replace your version of vg/scripts/giraffe-facts.py.
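
For example, something like this should fetch it (URL assumed from the usual raw.githubusercontent.com layout of the vgteam/vg repository):

# Fetch the latest giraffe-facts.py and replace the local copy.
curl -L -o scripts/giraffe-facts.py \
    https://raw.githubusercontent.com/vgteam/vg/master/scripts/giraffe-facts.py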

From running e.g.

vg giraffe --progress --parameter-preset hifi \
    --gbz-name $GRAPH --track-provenance \
    --gam-in $ALN.unmapped.gam > $ALN.remapped.gam 2> $ALN.remapped.log

python scripts/giraffe-facts.py --vg bin/vg --track-last any $ALN.remapped.gam remap

I got a table like:

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                        Giraffe Facts                                                         │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│Reads                                                                                                                     1135│
│Mapping speed                                                                                                         0.00 RPS│
├─────────────────────────────────────────────────────────────────────────────────┬───────┬───────┬─────┬────┬───────┬────┬────┤
│                                      Filter                                     │Passing│Failing│ Lost│Lost│  Cut  │ P  │ R  │
│                                                                                 │(/Read)│(/Read)│reads│    │       │    │    │
├─────────────────────────────────────────────────────────────────────────────────┼───────┼───────┼─────┼────┼───────┼────┼────┤
│                               window-downsampling                               │ 171.92│ 568.44│    0│ N/A│ 645182│ N/A│N/A │
│                                     any-hits                                    │ 171.06│   0.86│    0│ N/A│    973│ N/A│N/A │
│                               hard-hit-cap (13614)                              │  53.28│ 117.78│    0│ N/A│ 133680│ N/A│N/A │
│                        max-min||num-bp-per-min (79, 152)                        │  53.00│   0.28│    0│ N/A│    321│ N/A│N/A │
│                      zipcode-tree-coverage-threshold (0.5)                      │ 230.96│   0.00│    0│ N/A│      0│ N/A│N/A │
│                               max-to-fragment (15)                              │ 180.17│  50.79│    0│ N/A│  57650│ N/A│N/A │
│                        zipcode-tree-score-threshold (100)                       │   4.64│ 175.52│    0│ N/A│ 199219│ N/A│N/A │
│fragment-score-fraction||fragment-max-min-score||fragment-min-score (0, 50000, 2)│15704.52│   0.00│    0│ N/A│      0│ N/A│N/A │
│                              fragment-min-score (2)                             │15704.52│   0.00│    0│ N/A│      0│ N/A│N/A │
│                        fragment-set-score-threshold (70)                        │15605.36│  99.16│    0│ N/A│ 112551│ N/A│N/A │
│                        max-chaining-problems (2147483647)                       │15605.36│   0.00│    0│ N/A│      0│ N/A│N/A │
│            min-chain-score-per-base||max-min-chain-score (0.1, 1100)            │   0.00│   9.16│ 1135│ N/A│  10401│ N/A│N/A │
├─────────────────────────────────────────────────────────────────────────────────┼───────┼───────┼─────┼────┼───────┼────┼────┤
│                                     Overall                                     │       │       │1135 │0   │1159977│    │    │
└─────────────────────────────────────────────────────────────────────────────────┴───────┴───────┴─────┴────┴───────┴────┴────┘

Which indicates that all of these failed alignments were lost at the very last step due to poor chaining scores. (The above example is pulled from my notes when I ran this script to figure out why a set of reads wasn't being mapped - that's why the original GAM is called .unmapped.gam and I'm remapping them with --track-provenance.)

faithokamoto avatar Jun 30 '25 15:06 faithokamoto

Hi, I've identified two main reasons for the low mapping rate:

  1. Some reads have no hits (any-hits filter).
  2. Many reads are filtered by min-chain-score-per-base || max-min-chain-score. This second issue can be mitigated by relaxing the parameters, for example by setting --min-chain-score-per-base 0.1 and --max-min-chain-score 1100 (see the command sketch below).
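
As a sketch, the re-run might look like this (same inputs as my earlier command; I am assuming my vg build accepts these long option names, and the right threshold values still need tuning):

# Re-run giraffe, explicitly setting the chaining thresholds mentioned above
# (lowering them further makes the min-chain-score filter more permissive).
vg1.65 giraffe -t 10 --progress -b hifi --gbz-name pantran.gbz \
    --fastq-in movieX.flnc.polished1.hq.fasta.gz \
    --min-chain-score-per-base 0.1 --max-min-chain-score 1100 \
    --mapq-score-scale 0.01 > giraffe_relaxed.gam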

However, I'm unsure how to address the first issue. Do you have any suggestions?

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                        Giraffe Facts                                                         │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│Reads                                                                                                                    95679│
│Mapping speed                                                                                                        14.86 RPS│
├─────────────────────────────────────────────────────────────────────────────────┬───────┬───────┬─────┬────┬───────┬────┬────┤
│                                      Filter                                     │Passing│Failing│ Lost│Lost│  Cut  │ P  │ R  │
│                                                                                 │(/Read)│(/Read)│reads│    │       │    │    │
├─────────────────────────────────────────────────────────────────────────────────┼───────┼───────┼─────┼────┼───────┼────┼────┤
│                               window-downsampling                               │  51.24│  47.48│    0│ N/A│4542787│ N/A│N/A │
│                                     any-hits                                    │  24.94│  26.30│ 8330│ N/A│2516049│ N/A│N/A │
│                               hard-hit-cap (13614)                              │  24.94│   0.00│    0│ N/A│     82│ N/A│N/A │
│                        max-min||num-bp-per-min (79, 152)                        │  24.94│   0.00│    0│ N/A│    140│ N/A│N/A │
│                      zipcode-tree-coverage-threshold (0.5)                      │   2.69│   0.02│    0│ N/A│   2083│ N/A│N/A │
│                               max-to-fragment (15)                              │   2.17│   0.52│    0│ N/A│  49896│ N/A│N/A │
│                        zipcode-tree-score-threshold (100)                       │   1.36│   0.81│    0│ N/A│  77842│ N/A│N/A │
│fragment-score-fraction||fragment-max-min-score||fragment-min-score (0, 50000, 2)│   7.04│   0.00│    0│ N/A│      0│ N/A│N/A │
│                              fragment-min-score (2)                             │   7.04│   0.00│    0│ N/A│      0│ N/A│N/A │
│                        fragment-set-score-threshold (70)                        │   6.81│   0.23│    0│ N/A│  22414│ N/A│N/A │
│                        max-chaining-problems (2147483647)                       │   6.81│   0.00│    0│ N/A│      0│ N/A│N/A │
│            min-chain-score-per-base||max-min-chain-score (0.1, 1100)            │   1.11│   1.57│ 8648│ N/A│ 150568│ N/A│N/A │
│                                max-alignments (3)                               │   1.10│   0.01│    0│ N/A│   1092│ N/A│N/A │
│                                 no-chain-overlap                                │   0.87│   0.22│    0│ N/A│  21438│ N/A│N/A │
│                             max-chains-per-tree (3)                             │   0.87│   0.00│    0│ N/A│      0│ N/A│N/A │
│                           min-unique-node-fraction (0)                          │   0.85│   0.00│    0│ N/A│      0│ N/A│N/A │
│                                max-multimaps (1)                                │   0.80│   0.05│    0│ N/A│   4760│ N/A│N/A │
├─────────────────────────────────────────────────────────────────────────────────┼───────┼───────┼─────┼────┼───────┼────┼────┤
│                                     Overall                                     │       │       │16978│0   │7389151│    │    │
└─────────────────────────────────────────────────────────────────────────────────┴───────┴───────┴─────┴────┴───────┴────┴────┘

zhangyixing3 avatar Jul 02 '25 01:07 zhangyixing3

@adamnovak help, why would reads be failing to get any hits? (I haven't tried to fix this problem before and I'm about to leave on vacation.)

faithokamoto avatar Jul 02 '25 15:07 faithokamoto

I think that means that, for each minimizer selected in the read, the graph's minimizer index contained zero copies of that minimizer. In other words, none of the minimizer sequences selected in the read actually occur in the graph.

If the problem was that the minimizers occurred in the graph, but too frequently for it to be feasible for the program to consider all the occurrences, then it would have passed any-hits and failed at hard-hit-cap.

This is especially odd if your reads are ~2 kbp long: for -b hifi you have k=31, w=50, so any 80-base stretch (w + k - 1) that is identical between the read and the graph guarantees a seed hit, and across a 2 kbp read there are ~25 such windows (2000 / 80), each of which would need to contain a mismatch or indel between the read and the graph for you to get literally zero hits. And although for Illumina reads it's not uncommon for a read that maps to have all perfectly unique minimizers (so that removing the target part of the reference from the graph would leave it with 0 hits), that's less common with longer reads, so it's not obvious that the problem would be a missing chunk of your target reference. (Although it still could be.)

You can try building the minimizer index and zipcodes files yourself with vg minimizer, using lower -k and -w values, if you think there's a reason to expect your reads to be very noisy.
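
Roughly like this, though double-check the option names against vg minimizer --help and vg giraffe --help for your build, and treat the k/w values as purely illustrative:

# Build a smaller-k/smaller-w minimizer index (plus zipcodes) and point giraffe at it.
# The --zipcode-name / --minimizer-name options are assumptions about the current CLI.
vg minimizer -k 21 -w 30 -d pantran.dist -z pantran.k21w30.zipcodes \
    -o pantran.k21w30.min pantran.gbz
vg giraffe -t 10 --progress -b hifi --gbz-name pantran.gbz \
    --minimizer-name pantran.k21w30.min --zipcode-name pantran.k21w30.zipcodes \
    --fastq-in movieX.flnc.polished1.hq.fasta.gz > giraffe_k21w30.gam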

adamnovak avatar Jul 14 '25 20:07 adamnovak

A maximum alignment score of 11393 with a median score of 1867 is also very weird. I guess you are not really mapping reads, but rather full transcripts? Or sequences from transcripts with variable sequence length?

Are the unmapped reads noticeably shorter than the mapped reads?

adamnovak avatar Jul 14 '25 21:07 adamnovak

Sorry, the data I aligned are indeed full-length transcripts rather than raw reads. I did observe that the proportion of unmapped sequences tends to increase as the transcript length decreases from 7000 bp to 2000 bp.

zhangyixing3 avatar Jul 15 '25 07:07 zhangyixing3

You could try mapping individual transcripts with --show-work to see if you can convince it to report some minimizer sequences and their ostensibly-zero counts. But I think it only draws the nice ASCII-art diagrams of where the minimizers fall in the query when the query is pretty short, and 2 kbp might be too long. And I don't think we've exposed "find minimizers in this sequence" or "query the index for the count of this minimizer" on the command line.
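
For instance (the transcript name is just an example taken from the warnings above; samtools faidx needs the FASTA to be uncompressed or bgzip-compressed):

# Pull out one offending transcript and map it alone with --show-work.
samtools faidx movieX.flnc.polished1.hq.fasta.gz "transcript/38" > one_transcript.fa
vg giraffe -t 1 -b hifi --gbz-name pantran.gbz \
    --fastq-in one_transcript.fa --show-work > one_transcript.gam 2> show_work.log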

I would pull one of the offending no-hit transcripts and:

  • Look to make sure it doesn't have a weirdly high density of Ns or something
  • Look at the minimap2 alignment and make sure that that alignment doesn't explain the lack of seed hits (i.e. does it have a mismatch every 20 bases there?)
  • Using the minimap2 alignment, get some primary linear reference path intervals that are relevant. Pull them out of the graph with vg find -x graph.gbz --context 3 --path "GRCh38#0#chr1:1000-2000" --path "GRCh38#0#chr1:2500-3000" >subgraph.vg and then visualize the subgraph. For Bandage you can vg view subgraph.vg >subgraph.gfa. With vg itself you can vg view -dp subgraph.vg | dot -Tsvg -o subgraph.svg and then open the SVG in a browser. You want to see whether you can actually find where the query is meant to go in the graph. If the graph just doesn't have this part of the linear reference, or what it does have somehow doesn't match the query sequence over any long stretch, that could explain the problem. (A consolidated sketch follows below.)
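
Putting those steps together, a rough sketch (the transcript name, path name, and coordinates are all placeholders; substitute whatever path naming your graph actually uses):

# 1. Find where minimap2 placed the offending transcript on the linear reference.
samtools view 1.bam | awk '$1 == "transcript/38" {print $3, $4, $6}'
# 2. Check how that chromosome's path is named in the graph.
vg paths -x pantran.gbz --list | head
# 3. Pull the corresponding interval (plus context) out of the graph and visualize it.
vg find -x pantran.gbz --context 3 --path "CHR_PATH_NAME:1000000-1010000" > subgraph.vg
vg view subgraph.vg > subgraph.gfa                       # for Bandage
vg view -dp subgraph.vg | dot -Tsvg -o subgraph.svg      # open in a browser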

If you can boil it down to a reproducible example where some particular query demonstrably should have seed hits to a particular subgraph of the graph, but doesn't, then we can try and fix that.

Looking at all this would tell you whether there's something wrong with the mapper, or something wrong with the graph, or whether you just need to be using shorter minimizer lengths for this application.

adamnovak avatar Jul 15 '25 14:07 adamnovak

Oh, hey, I actually did put "find minimizers in this sequence" and "query the index for the count of this minimizer" into something command-line accessible. Specifically, use vg cluster --hit-cap 0 --hits-above 0 --sequences-only. That will output a GAM with a field high_hit_minimizers that stores, for each read, the sequence of each minimizer, the number of times it appears in the read, and the number of times it appears in the graph. (It's called high_hit_minimizers because I originally intended to use it to find minimizers with a high hit count, so it only reports minimizers whose hit count is above --hits-above.) You can then query this annotation with vg filter --tsv-out. In summary:

vg cluster -x pantran.gbz --gbz-format --dist-name pantran.dist \
    --hit-cap 0 --hits-above 0 --sequences-only reads.gam > clustered.gam
vg filter --tsv-out "name;annotation.high_hit_minimizers" clustered.gam > clustered.gam.tsv

(I think. I haven't actually tried the above. Mostly wanted to point out that this functionality exists.)

faithokamoto avatar Jul 15 '25 15:07 faithokamoto