vg deconstruct error
Hello,
when I use "vg deconstruct" to get the genotype variants of the graph created by pggb, vg got crushed. could you please tell me where I was wrong? Thank you.
Here is the run command:
Here is the error message:
Can you please try with the current vg release? If it still crashes, please share the input data if you can. Thanks.
Hello, I tried downloading the latest version of vg, but my system is CentOS 7, and it does not have "apt-get". Could you please guide me on how to install the latest version of vg on CentOS 7?
We have not been able to use VG deconstruct for many versions now. It doesn't parse the pggb pan-sn naming correctly. There has been a regression.
The version installed by the pggb Dockerfile will work correctly.
On Sat, Jul 29, 2023, 13:56 hcsun1 @.***> wrote:
Hello, I tried downloading the latest version of vg, but my system is CentOS 7, and it does not have "apt-get". Could you please guide me on how to install the latest version of vg on CentOS 7?
— Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/4034#issuecomment-1656713825, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEL7HIN2CZ66ILGWZ7LXST263ANCNFSM6AAAAAA23DW5EE . You are receiving this because you are subscribed to this thread.Message ID: @.***>
In addition to pggb, we also used minigraph to get the GFA file and do other steps by VG that isn't the latest version, and also got the VCF file without genotype variants. Is this mean that VG also can't work with minigraph? Or if the problem comes from VG, how could we get the latest version on CentOS?
@hcsun1 To install vg, click the big green download button here : https://github.com/vgteam/vg/releases/tag/v1.49.0
Also: vg deconstruct does not work with minigraph, try gfatools bubble instead.
@ekg Can you please share a PGGB gfa and deconstruct command line so I can reproduce and fix this?
@glennhickey https://github.com/pangenome/pggb/blob/master/pggb#L620-L621
echo "[vg::deconstruct] making VCF with reference=$ref and delim=$delim"
( TEMPDIR=$(pwd) $timer -f "$fmt" vg deconstruct -P "$ref" \
-H "$delim" -e -a -t $threads "$prefix_smoothed_output".final.gfa >"$vcf" ) 2> >(tee -a "$log_file")
We're stuck on v1.40.0: https://github.com/pangenome/pggb/blob/master/Dockerfile#L97
wget https://github.com/vgteam/vg/releases/download/v1.40.0/vg ...
@ekg Could you send a link to a GFA too so I can reproduce?
@glennhickey sure http://hypervolu.me/~erik/yeast/cerevisiae.fa.gz.d1a145e.417fcdf.7493449.smooth.final.gfa.gz
So I try with v.49.0. There's a warning about -H (vg now uses only pan-sn so # is default), and the results look reasonable
rm -f vg; wget -q https://github.com/vgteam/vg/releases/download/v1.49.0/vg
chmod +x vg
wget -q http://hypervolu.me/~erik/yeast/cerevisiae.fa.gz.d1a145e.417fcdf.7493449.smooth.final.gfa.gz
gzip -d cerevisiae.fa.gz.d1a145e.417fcdf.7493449.smooth.final.gfa.gz
./vg deconstruct cerevisiae.fa.gz.d1a145e.417fcdf.7493449.smooth.final.gfa -H "#" -P S288C -e -a -t 8 > decon.vcf
bcftools view -H decon.vcf | head
chrI 7 >8472>8475 C CA,CACACCCACA 60 . AC=1,1;AF=0.5,0.5;AN=2;AT=>8472>8475,>8472>8473>8475,>8472>8473>8474>8475;NS=2;LV=1;PS=>98085>92806 GT . 1 . . 2 .
chrI 20 >8475>8478 ACA A,AC 60 . AC=1,1;AF=0.5,0.5;AN=2;AT=>8475>8476>8477>8478,>8475>8478,>8475>8476>8478;NS=2;LV=1;PS=>98085>92806 GT . 1 . . 2 .
chrI 29 >8479>8481 C CCA 60 . AC=2;AF=0.666667;AN=3;AT=>8479>8481,>8479>8480>8481;NS=3;LV=1;PS=>98085>92806 GT . 1 . . 1 0
chrI 35 >8481>8483 AC A 60 . AC=1;AF=0.333333;AN=3;AT=>8481>8482>8483,>8481>8483;NS=3;LV=1;PS=>98085>92806 GT . 0 . . 0 1
chrI 42 >8483>8485 CA C 60 . AC=2;AF=0.666667;AN=3;AT=>8483>8484>8485,>8483>8485;NS=3;LV=1;PS=>98085>92806 GT . 0 . . 1 1
chrI 44 >8485>8487 C CA 60 . AC=2;AF=0.666667;AN=3;AT=>8485>8487,>8485>8486>8487;NS=3;LV=1;PS=>98085>92806 GT . 1 . . 1 0
chrI 47 >8487>8489 CA C 60 . AC=1;AF=0.333333;AN=3;AT=>8487>8488>8489,>8487>8489;NS=3;LV=1;PS=>98085>92806 GT . 1 . . 0 0
chrI 50 >8489>8492 C A 60 . AC=1;AF=0.333333;AN=3;AT=>8489>8490>8492,>8489>8491>8492;NS=3;LV=1;PS=>98085>92806 GT . 1 . . 0 0
chrI 55 >8492>8494 C CCA 60 . AC=1;AF=0.333333;AN=3;AT=>8492>8494,>8492>8493>8494;NS=3;LV=1;PS=>98085>92806 GT . 0 . . 1 0
chrI 61 >8494>8496 C CA 60 . AC=2;AF=0.666667;AN=3;AT=>8494>8496,>8494>8495>8496;NS=3;LV=1;PS=>98085>92806 GT . 0 . . 1 1
Then again with v1.40.0
rm -f vg; wget -q https://github.com/vgteam/vg/releases/download/v1.40.0/vg
chmod +x vg
./vg deconstruct cerevisiae.fa.gz.d1a145e.417fcdf.7493449.smooth.final.gfa -H "#" -P S288C -e -a -t 8 > decon40.vcf
bcftools view -H decon40.vcf | head
The output seems simlar
S288C#1#chrI 7 >8472>8475 C CACACCCACA,CA 60 . AC=1,1;AF=0.5,0.5;AN=2;AT=>8472>8475,>8472>8473>8474>8475,>8472>8473>8475;NS=2;LV=1;PS=>98085>92806 GT . 2 . . 1 .
S288C#1#chrI 20 >8475>8478 ACA AC,A 60 . AC=1,1;AF=0.5,0.5;AN=2;AT=>8475>8476>8477>8478,>8475>8476>8478,>8475>8478;NS=2;LV=1;PS=>98085>92806 GT . 2 . . 1 .
S288C#1#chrI 29 >8479>8481 C CCA 60 . AC=2;AF=0.666667;AN=3;AT=>8479>8481,>8479>8480>8481;NS=3;LV=1;PS=>98085>92806 GT . 1 . . 1 0
S288C#1#chrI 35 >8481>8483 AC A 60 . AC=1;AF=0.333333;AN=3;AT=>8481>8482>8483,>8481>8483;NS=3;LV=1;PS=>98085>92806 GT . 0 . . 0 1
S288C#1#chrI 42 >8483>8485 CA C 60 . AC=2;AF=0.666667;AN=3;AT=>8483>8484>8485,>8483>8485;NS=3;LV=1;PS=>98085>92806 GT . 0 . . 1 1
S288C#1#chrI 44 >8485>8487 C CA 60 . AC=2;AF=0.666667;AN=3;AT=>8485>8487,>8485>8486>8487;NS=3;LV=1;PS=>98085>92806 GT . 1 . . 1 0
S288C#1#chrI 47 >8487>8489 CA C 60 . AC=1;AF=0.333333;AN=3;AT=>8487>8488>8489,>8487>8489;NS=3;LV=1;PS=>98085>92806 GT . 1 . . 0 0
S288C#1#chrI 50 >8489>8492 C A 60 . AC=1;AF=0.333333;AN=3;AT=>8489>8490>8492,>8489>8491>8492;NS=3;LV=1;PS=>98085>92806 GT . 1 . . 0 0
S288C#1#chrI 55 >8492>8494 C CCA 60 . AC=1;AF=0.333333;AN=3;AT=>8492>8494,>8492>8493>8494;NS=3;LV=1;PS=>98085>92806 GT . 0 . . 1 0
S288C#1#chrI 61 >8494>8496 C CA 60 . AC=2;AF=0.666667;AN=3;AT=>8494>8496,>8494>8495>8496;NS=3;LV=1;PS=>98085>92806 GT . 0 . . 1 1
Though the contig names are indeed different, as the sample/haplotype is stripped in the new version. Do you think this is causing downstream issues for PGGB leading to the incompatibility? I think the rationale for this change was that everyone seemed to be running a similar filter to use the VCF anyway, so it seemed simpler. Though I guess it could cause problems calling on multiple samples at once (but that does not seem to be the case here).
The other difference I see is that the new version has three extra calls:
chrI 127337 >105110>105113 T TCTT 60 . AC=0;AF=0;AN=5;AT=<105113<105110,<105113<105112<105111<105110;CONFLICT=YPS128;NS=5;LV=1;PS=>105109>105115 GT 0 0 0 0 0 .
chrIV 765615 >153496>153499 G A 60 . AC=0;AF=0;AN=2;AT=>153496>153498>153499,>153496>153497>153499;CONFLICT=DBVPG6765;NS=2;LV=1;PS=>153491>153715 GT . . . . 0 0
chrIV 766132 >153644>153649 AC TC 60 . AC=0;AF=0;AN=2;AT=>153644>153646>153647>153649,>153644>153645>153647>153649;CONFLICT=DBVPG6044,SK1;NS=2;LV=1;PS=>153491>153715 GT . . . 0 0 .
which seem to be sites for whom genotypes couldn't be resolved that were filtered out before but are now included.
Hi @glennhickey, new versions of vg deconstruct creates issues with PGGB. See https://github.com/vgteam/vg/issues/3807.
The problem is that, to be picky, the reference is called S288C#1#chrI in the FASTA/GFA, not chrI.
@AndreaGuarracino Thanks. I will point out that the header name / contig name mismatch in that issue is resolved (the header in 1.49.0 is ##contig=<ID=chrI,length=219929> etc).
So as far as I can tell, the only remaining issue is that the sample name is being stripped. I understand that this means your VCF contigs are different than the FASTA input contigs. But... it's also pretty standard to only put contig names in VCF's CONTIG field (outside of older versions of deconstruct, I've never seen a VCF with sample names embedded in the contig field anywhere else).
Anyway, I'll revert to the old logic of putting in the full name so PGGB can keep using deconstruct, and maybe add a flag to write the contig-only.