
Polaris SVs as a graph

Open goranrakocevic opened this issue 1 year ago • 10 comments

Hi All,

First off apologies for posting this here, but the Biostars link appears dead.

Has anyone tried using Illumina's Polaris SV set as a graph? We are able to build and index the graph, but the mapping is very slow; for a smaller subset of the FASTQ (~first 5M reads) it seems to go fairly fast (~3K reads/s), but at some point it just bogs down, and, judging by the size of the output, there is very little progress for hours.

Looking at the VCF it doesn’t seem like there are any very dense regions, and the total number of variants is ~40K (after taking out anything that’s not PASS in the Filter column), but I’m unsure of what to look for as difficult for VG to tackle, so any pointers would be helpful.

The VCF is not phased, but the variants should mostly be far enough apart that it doesn't matter too much.

goranrakocevic avatar Jul 18 '22 22:07 goranrakocevic

For starters, which of the mapping subcommands are you using? Also, are you mapping with a single thread or with multiple threads?

jeizenga avatar Jul 18 '22 23:07 jeizenga

Forgot to say:

  • For making the graph we just used autoindex with the fasta + vcf
  • For mapping we are using giraffe with -b fast (or without, same behaviour), 64 threads (tried with 16 as well)

hg38-hgsvc from the SV paper works fine (runtime similar to what's stated in the paper, a few hours; we didn't time it exactly)

goranrakocevic avatar Jul 19 '22 07:07 goranrakocevic

A small update, if it helps: if I reduce --hard-hit-cap to 100, the problem seems to go away (i.e. mapping is very fast). I tried a few other parameters; this is the only one it seems sensitive to.

goranrakocevic avatar Jul 19 '22 11:07 goranrakocevic

I suspect there's a problem with a single read causing the slowdown. It's very possible that changing the computational parameters avoids hitting whatever extremely slow/infinite loop is responsible. If I'm right, we'd expect to see thread utilization drop down to just a few threads toward the end.

If it is a single read, it would definitely help us diagnose the problem if we had access to the input data. Are you able to share it?

jeizenga avatar Jul 19 '22 19:07 jeizenga

Yes, the FASTQs are public (1KG data); this sample: https://www.ebi.ac.uk/ena/browser/view/ERR3239276?show=reads

It doesn't seem to be a single read, because writes to the output GAM slow down (I didn't check the code, but I assume the output thread would still be writing alignments from the other threads).

Another thought: should I be masking out any sections of the reference? From the description of --hard-hit-cap, it sounds like it's something related to repeats.

goranrakocevic avatar Jul 19 '22 21:07 goranrakocevic

Okay, that's interesting. I'll shop it around the VG team.

And, no, masking shouldn't be necessary. You are correct that the hard hit cap is intended to handle repeats. Essentially, it functions to avoid expending effort aligning to sequences that are so repetitive that we don't really have any chance of mapping correctly anyway with a short read.
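To make the idea concrete, here is a minimal Python sketch of what a hard hit cap does conceptually; this is an illustration, not vg's actual implementation, and the function and variable names are made up:

```python
# Illustrative sketch (NOT vg's code) of a "hard hit cap": seed minimizers
# whose occurrence count in the graph exceeds the cap are discarded outright,
# so highly repetitive seeds never enter the alignment stage at all.

def filter_seeds(minimizer_hits, hard_hit_cap=500):
    """minimizer_hits maps each minimizer to its list of graph positions.

    Returns only the minimizers informative enough to keep as seeds.
    """
    return {
        mz: hits
        for mz, hits in minimizer_hits.items()
        if len(hits) <= hard_hit_cap
    }

# A minimizer with 1000 occurrences (a repeat) is dropped; a rare one is kept.
hits = {"ACGTACGT": [1, 2, 3], "TTTTTTTT": list(range(1000))}
kept = filter_seeds(hits, hard_hit_cap=100)
```

Lowering the cap, as in the comment above, simply makes this filter more aggressive, which is consistent with the observed speedup on repeat-heavy reads.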

jeizenga avatar Jul 19 '22 23:07 jeizenga

It sounds like there are a few reads that are slowing it down. When the file is big enough, each thread eventually encounters one of these nearly-impossible reads and gets stuck.

We could hook up the watchdog system to Giraffe to detect and report when threads seem to be taking an inordinately long time to map particular reads, and let us identify the reads.

Another approach, if any of these reads do eventually get mapped, is to look up the longest-mapping reads in the partial output file. To do this, we would map to GAM and then run it through something like:

vg view -aj output.gam | jq -r '[.time_used, .name] | @tsv' | sort -n | tail -n50

That would give us the mapping times in seconds and the names of the reads that are taking the longest to map, but do eventually finish. Then we could investigate if anything is unusual about the sequences or eventual mapping locations of those reads.
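For anyone without jq handy, roughly the same extraction can be done in Python on the JSON-per-line stream that `vg view -aj output.gam` produces; this is a sketch with a hypothetical function name, shown on two inline sample records rather than a real GAM:

```python
# Read GAM records as JSON lines (the output of `vg view -aj output.gam`)
# and report the reads with the largest time_used, slowest first.
import json


def slowest_reads(json_lines, n=50):
    records = (json.loads(line) for line in json_lines if line.strip())
    timed = [(r.get("time_used", 0.0), r.get("name", "")) for r in records]
    return sorted(timed, reverse=True)[:n]


# Two inline sample records standing in for a real GAM stream.
sample = [
    '{"name": "read_fast", "time_used": 0.0001}',
    '{"name": "read_slow", "time_used": 0.28}',
]
for t, name in slowest_reads(sample, n=2):
    print(f"{t}\t{name}")
```

In practice you would pipe the real stream in, e.g. feed `sys.stdin` to `slowest_reads` instead of the sample list.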

adamnovak avatar Jul 20 '22 14:07 adamnovak

They do map, they just take a long time. Is the time saved by default? (i.e. can we do it on any GAM?)

goranrakocevic avatar Jul 21 '22 08:07 goranrakocevic

I looked at the code, and the time_used field is set by Giraffe whether or not you use --track-provenance or --track-correctness. So it should be in any Giraffe GAM file.

adamnovak avatar Jul 21 '22 16:07 adamnovak

Hi All,

Sorry for the long silence.

So far it doesn't seem that a very small number of reads is very slow to map; rather, a larger number of reads is somewhat slow to map. (That said, this is from a subset taken from the beginning of the FASTQ file, but far enough into the file that the slowdown occurs, judging by the writes to the output.)

While most reads map in ~1e-4 seconds, there are a bunch that take ~0.3 s (example below).

{"annotation": {"fragment_length": 406, "fragment_length_distribution": "-I 422.645263 -D 97.748476", "mapq_applied_cap": 303.67512390409416, "mapq_explored_cap": 69.147046527023292, "mapq_score_group": 2147483647, "mapq_uncapped": 2147483647, "proper_pair": true, "secondary_scores": [319.98951889597208, 294.98951889597208, 294.98951889597208, 284.98951889597208]}, "fragment_next": {"name": "ERR3239306.2213273"}, "identity": 1.0, "mapping_quality": 60, "name": "ERR3239306.2213273", "path": {"mapping": [{"edit": [{"from_length": 13, "to_length": 13}], "position": {"node_id": "525012", "offset": "19"}, "rank": "1"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "525013"}, "rank": "2"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "525014"}, "rank": "3"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "525015"}, "rank": "4"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "525016"}, "rank": "5"}, {"edit": [{"from_length": 9, "to_length": 9}], "position": {"node_id": "525017"}, "rank": "6"}]}, "quality": "Hh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eCh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4e", "score": 160, "sequence": "TGTGCCTATAGGTCCTCCCTGTGGCAATGACATCTCTCAGCTCAGTAAGGGCCATTTGCAGTAGGAATATGACCCTAACCAGAAGACTCAGTGGATCCTTATCACCTTCATAGAAAGGTACTCACCATCCATGTCAAGAGCCCAGCCAAC", "time_used": 0.28015876499999998} {"annotation": {"fragment_length": 406, "fragment_length_distribution": "-I 422.645263 -D 97.748476", "mapq_applied_cap": 303.67512390409416, "mapq_explored_cap": 82.69051542502379, "mapq_score_group": 2147483647, "mapq_uncapped": 2147483647, "proper_pair": true, "secondary_scores": [319.98951889597208, 294.98951889597208, 294.98951889597208, 284.98951889597208]}, "fragment_prev": {"name": "ERR3239306.2213273"}, "identity": 1.0, "mapping_quality": 60, 
"name": "ERR3239306.2213273", "path": {"mapping": [{"edit": [{"from_length": 9, "to_length": 9}], "position": {"is_reverse": true, "node_id": "525025", "offset": "23"}, "rank": "1"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"is_reverse": true, "node_id": "525024"}, "rank": "2"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"is_reverse": true, "node_id": "525023"}, "rank": "3"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"is_reverse": true, "node_id": "525022"}, "rank": "4"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"is_reverse": true, "node_id": "525021"}, "rank": "5"}, {"edit": [{"from_length": 13, "to_length": 13}], "position": {"is_reverse": true, "node_id": "525020"}, "rank": "6"}]}, "quality": "Hh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHhQeHh4eHh4eHh4eHh4eHh4eHh4eHh4e", "score": 160, "sequence": "GTCCTTTTTCTTTTCAAACTCTTCCTTATGTTAGCCATGAAATCTAGCTGGGGCTGTGTGGTTTCTGATTCCCCCTGGCTTATTCTTTACTTTTTCCCACTGTTCCAGGCTCAGCAGGGAGCTGCTGGATGAGAAAGGGCCTGAAGTCTT", "time_used": 0.28015862000000002}
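One way to quantify the "many moderately slow reads vs. a few pathological ones" question is to bucket the time_used values from the GAM JSON on a log scale. Below is a hedged sketch (hypothetical function name, inline sample values in place of a real `vg view -aj` stream):

```python
# Bucket GAM time_used values by order of magnitude, e.g. 1e-4 s -> bucket -4,
# 0.3 s -> bucket -1. A long tail of -1 buckets suggests many moderately slow
# reads; a handful of huge outliers suggests a few pathological ones.
import json
import math
from collections import Counter


def time_histogram(json_lines):
    buckets = Counter()
    for line in json_lines:
        if not line.strip():
            continue
        t = json.loads(line).get("time_used", 0.0)
        if t > 0:
            buckets[math.floor(math.log10(t))] += 1
    return dict(buckets)


# Inline sample records standing in for the real stream.
sample = [
    '{"time_used": 0.0001}',
    '{"time_used": 0.00012}',
    '{"time_used": 0.28}',
]
hist = time_histogram(sample)
```

On the sample, this yields two reads in the 1e-4 s bucket and one in the 0.1 s bucket, mirroring the pattern described above.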

goranrakocevic avatar Aug 12 '22 10:08 goranrakocevic

There is a new feature (https://github.com/vgteam/vg/pull/3736) pending integration into the main codebase; it should speed up these slow alignments.

I'll update this thread when it has been added to a release.

akorhone avatar Nov 07 '22 18:11 akorhone