vg icon indicating copy to clipboard operation
vg copied to clipboard

forwardize_breakpoints error: failure,position 135211-3 is not inside node 135211

Open JD12138 opened this issue 3 years ago • 12 comments

Hello, I used the HPRC pangenome (https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph/hprc-v1.0-minigraph-grch38.gfa.gz) to do the analysis. My processes to do the analysis are listed below.

  1. build an index (vg autoindex -g hprc-v1.0-minigraph-grch38.gfa -T ./tmp )
  2. map my assembled contigs to the graph. (vg map -F my.contigs.fa -x hprc-v1.0-minigraph-grch38.xg -g hprc-v1.0-minigraph-grch38.csa >aln.gam)
  3. convert .gfa to .vg (vg convert -g hprc-v1.0-minigraph-grch38.gfa -v >hprc-v1.0-minigraph-grch38.vg)
  4. augment the .vg graph with my .gam file. (vg augment hprc-v1.0-minigraph-grch38.vg -s aln.gam >aug.vg)

Then I got the error message: forwardize_breakpoints error: failure, position 135211-3 is not inside node 135211 vg: src/augment.cpp:514: std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > > vg::forwardize_breakpoints(const HandleGraph, const std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > >&): Assertion `false' failed. ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug. Stack trace path: /tmp/vg_crash_NTCL3r/stacktrace.txt Please include the stack trace file in your bug report!*

Then I validate the gam file (vg validate hprc-v1.0-minigraph-grch38.vg -a aln.gam). The output file report so many invalid alignment. But I use the same file to do the mapping. Could you tell me the reason? Thank you so much!

JD12138 avatar Jan 06 '23 08:01 JD12138

I think I know what's going on, and it's a common point of confusion that we should really do something about. When sequences are significantly longer than short reads, vg map breaks the sequences into chunks and maps them independently, which can lead to discontiguous alignments. Many of the tools in the vg toolkit require contiguous alignments. For moderately long sequences, vg mpmap -n dna works pretty well, and it only produces contiguous alignments. You can also use GraphAligner for very long sequences.

jeizenga avatar Jan 06 '23 20:01 jeizenga

I have tried before to align with GraphAligner to get the gam file. Then using vg augment to embed the gam into my graph. But the same error was reported. Then I use vg validate. There are 89 invalid alignments. But I use the same vg file in GraphAligner and vg augment. I reported this issue at https://github.com/vgteam/vg/issues/3717 . Someone else has the same question as me.

JD12138 avatar Jan 07 '23 09:01 JD12138

I met the same questions when using vg augment , and I got the error message: forwardize_breakpoints error: failure, position 72522423+152 is not inside node 72522423 forwardize_breakpoints error: failure, position 66398702+10 is not inside node 66398702 vg: src/augment.cpp:514: std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > > vg::forwardize_breakpoints(const HandleGraph*, const std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > >&): Assertion false' failed. vg: src/augment.cpp:514: std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > > vg::forwardize_breakpoints(const HandleGraph*, const std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > >&): Assertion false' failed. ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug. Stack trace path: /tmp/vg_crash_m9CpNU/stacktrace.txt Please include the stack trace file in your bug report!

The gam file was generated using GraphAligner as well. Have you solved the problem? Any suggestions will be appreciated. Thank You.

Zoeyoungxy avatar Mar 28 '23 09:03 Zoeyoungxy

Hello. I am facing the same error with slightly different steps (vg giraffe instead of map). Should i post my errors and steps here or open a separate issue?

edit: i have done vg convert -p of both gfa and gbz files into pg. while using vg validate, only the pg converted from gbz shows valid. i'm confused if i need to rerun the pipeline or continue with gbz

smriti-135 avatar Aug 29 '24 09:08 smriti-135

@smriti-135 Many vg tools require that nodes are no longer than 1024 bp, for various practical reasons. Most published pangenome graphs have longer nodes, at least in the GFA file. If you use vg autoindex with a GFA graph or convert the GFA to GBZ with vg gbwt, the long nodes will be chopped to a sequence of shorter nodes (32 bp for vg autoindex, 1024 bp for GBZ). But if you convert the GFA to a non-GBZ graph, the long nodes will remain.

This means that you can get three different graphs from the same GFA graph, depending on the commands you use. These graphs are not directly compatible with each other, and they cannot be mixed in the same pipeline. You should use the graph you get from vg autoindex (or vg gbwt) as the basis for further work instead of using the original GFA.

jltsiren avatar Aug 29 '24 19:08 jltsiren

@jltsiren i had carried forward with vg autoindex as follows:

vg autoindex --workflow giraffe --prefix pangenome_graph --ref-fasta teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa --vcf maf135.vcf --target-mem 12GB --threads 8
vg giraffe -Z pangenome_graph.giraffe.gbz -m pangenome_graph.min -d pangenome_graph.dist -f ANP1_R1_paired_trimmed.fq.gz -f ANP1_R2_paired_trimmed.fq.gz > ANP1.gam -p

vg convert -p pangenome_graph.giraffe.gbz -t 8 > pangenome_graph_gbz_vc.pg
vg augment pangenome_graph_gbz_vc.pg ANP1.gam -A ANP1_vc.aug.gam -m 2 -q 10 -v -t 16 > ANP1_graph_gbz_vc.aug.pg -p
vg snarls ANP1_graph_gbz_vc.aug.pg > ANP1_snarls_gbz_vc.snarls
vg pack -x ANP1_graph_gbz_vc.aug.pg -g ANP1_vc.aug.gam -Q 5 -t 16 -o ANP1_genotype_gbz_vc.aug.pack

vg call ANP1_graph_gbz_vc.aug.pg -r ANP1_snarls_gbz_vc.snarls -k ANP1_genotype_gbz_vc.aug.pack -t 8 > ANP1_variants_gbz_vc_4.aug.vcf

The resulting vcf (723 MB) does not seem to contain SVs. I'm adding a snippet of the file

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
Pseudomolecule_01	6203	>5777134>23818067	A	G	49.499	lowdepth	AT=>5777134>23818066>23818067,>5777134>30505148>23818067;DP=2	GT:DP:AD:GL:GQ:GP:XD:MAD	1/1:2:0,2:-5.447250,-1.451595,-1.150565:3:-1.504111:0.795580:2
Pseudomolecule_01	6246	>5777146>5777151	T	C	49.499	lowdepth	AT=>5777146>5777148>5777150>5777151,>5777146>5777148>5777149>5777151;DP=2	GT:DP:AD:GL:GQ:GP:XD:MAD	1/1:2:0,2:-5.170265,-1.174610,-0.873579:3:-2.313027:1.924528:2
Pseudomolecule_01	6268	>5777154>25635207	C	T	49.499	lowdepth	AT=>5777154>25635206>25635207,>5777154>30505147>25635207;DP=2	GT:DP:AD:GL:GQ:GP:XD:MAD	1/1:2:0,2:-5.185855,-1.190200,-0.889169:3:-1.504111:1.637931:2
Pseudomolecule_01	6284	>5777160>25130643	CAT	TCG	49.499	lowdepth	AT=>5777160>25130642>25130643,>5777160>30505146>25130643;DP=2	GT:DP:AD:GL:GQ:GP:XD:MAD	1/1:2:0,2:-5.185855,-1.190200,-0.889169:3:-1.504111:1.637931:2
Pseudomolecule_01	6914	>5777219>26368344	T	C	49.499	lowdepth	AT=>5777219>26368343>26368344,>5777219>31277985>26368344;DP=2	GT:DP:AD:GL:GQ:GP:XD:MAD	1/1:2:0,2:-5.169738,-1.174083,-0.873052:3:-1.504111:1.968750:2
Pseudomolecule_01	8212	>5777294>5777297	A	G	26.4111	lowad	AT=>5777294>5777296>5777297,>5777294>5777295>5777297;DP=1	GT:DP:AD:GL:GQ:GP:XD:MAD	1/0:1:0,1:-2.736928,-0.739100,-1.080097:3:-1.481213:0.916667:0
Pseudomolecule_01	9110	>5777355>26518925	A	T	9.54243	lowad	AT=>5777355>26518924>26518925,>5777355>30556276>26518925;DP=0	GT:DP:AD:GL:GQ:GP:XD:MAD	1/1:0:0,0:-0.353163,-0.353163,-0.353163:0:-2.197225:0.813187:0
Pseudomolecule_01	9120	>5777356>24850492	T	C	9.54243	lowad	AT=>5777356>24850491>24850492,>5777356>30556277>24850492;DP=0	GT:DP:AD:GL:GQ:GP:XD:MAD	1/1:0:0,0:-0.477189,-0.477189,-0.477189:0:-2.197225:1.098765:0
Pseudomolecule_01	9134	>24850492>5777359	G	T	49.499	lowdepth	AT=>24850492>5777358>5777359,>24850492>29461530>5777359;DP=2	GT:DP:AD:GL:GQ:GP:XD:MAD	1/1:2:0,2:-5.298480,-1.302825,-1.001794:3:-1.504111:1.098765:2
Pseudomolecule_01	12073	>5777458>5777459	A	AG	12.5346	PASS	AT=>5777458>5777459,>5777458>30452372>5777459;DP=7	GT:DP:AD:GL:GQ:GP:XD:MAD	0/1:7:6,1:-2.989001,-2.292207,-13.899537:6:-1.281771:4.817708:1
Pseudomolecule_01	12621	>5777480>23306363	G	A	33.6934	PASS	AT=>5777480>23306362>23306363,>5777480>30235750>23306363;DP=7	GT:DP:AD:GL:GQ:GP:XD:MAD	0/1:7:5,2:-4.531480,-1.639815,-11.327922:28:-1.099895:7.661539:2
Pseudomolecule_01	12687	>5777482>26018630	A	T	7.17049	PASS	AT=>5777482>26018629>26018630,>5777482>28697073>26018630;DP=10	GT:DP:AD:GL:GQ:GP:XD:MAD	0/1:10:9,1:-2.814530,-2.946761,-20.198554:1:-1.616188:10.761194:1
Pseudomolecule_01	12690	>26018630>26018632	A	G	28.0416	PASS	AT=>26018630>26018631>26018632,>26018630>30235751>26018632;DP=9	GT:DP:AD:GL:GQ:GP:XD:MAD	0/1:9:7,2:-4.444100,-2.119110,-15.802872:23:-1.103333:10.761194:2
Pseudomolecule_01	12781	>5777490>27070101	TAAAAAAAA	AAAAAAAAA,T	209.761	PASS	AT=>5777490>27070099>27070100>27070101,>5777490>29068302>27070100>27070101,>5777490>27070099>27070101;DP=13	GT:DP:AD:GL:GQ:GP:XD:MAD	1/2:13:2,7,4:-25.097774,-8.895493,-14.888975,-10.349251,-4.899838,-15.075861:39:-1.791864:8.634561:4

Am I missing something? I also tried with the '-f' and '-c' options but only the file size decreased due to the filters.

smriti-135 avatar Sep 02 '24 04:09 smriti-135

@smriti-135 Did you have SVs in the original VCF? And did it contain phased haplotypes?

jltsiren avatar Sep 03 '24 23:09 jltsiren

@jltsiren saw that was the problem. It had only SNPs. But when I try to merge it with the one with SVs using bcftools merge ANP1.vcf.gz ANP1.delly.vcf.gz -o ANP1_merged.vcf.gz the resultant merged vcf does not recognise it as the same sample and instead adds it as a new column #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SO_11111_ANP1_SM 2:SO_11111_ANP1_SM any solutions for that?

Edit. I used a tool called delly to call the SVs and form a separate file ANP1.delly.vcf. When I repeat the above workflow with that ANP1.delly.vcf instead of maf135.vcf I am still getting an "empty" file with no SVs in spite of knowing it contains it!

smriti-135 avatar Sep 05 '24 06:09 smriti-135

I haven't had time to answer this, as there are so many things that can go wrong. My best guess is some kind of naming conflict between the VCF file(s) and the reference.

When you run the commands, do you get any warnings?

Does the SV-only graph actually contain any variants? What is the output of vg stats -z graph.gbz?

I'm not very familiar with vg call, and it's possible that there are some issues in the commands you run.

jltsiren avatar Oct 01 '24 02:10 jltsiren

When you run the commands, do you get any warnings?

No there were no warnings except a lot of this

warning:[vg::Constructor] Lowercase characters found in Scaffold_Un904; coercing to uppercase.

But when I tried again now, I got this now

error:[vg::Constructor] Variant/reference sequence mismatch: A vs pos: 20643581: T; do your VCF and FASTA coordinates match? Variant: Pseudomolecule_01 20643581 Pseudomolecule_01_20643581 A G 0 . PR zero ind: 20643580 1-indexed: 20643581

and there is no gbz output.

smriti-135 avatar Oct 01 '24 06:10 smriti-135

If you are getting different results in the graph construction part, it's likely that you either have corrupted files or other hardware errors or you were running different commands.

jltsiren avatar Oct 02 '24 05:10 jltsiren

was able to run vg stats pangenome_graph_ai.giraffe.gbz but its not giving any output

smriti-135 avatar Oct 04 '24 08:10 smriti-135