vg icon indicating copy to clipboard operation
vg copied to clipboard

vg deconstruct error

Open hcsun1 opened this issue 2 years ago • 12 comments

Hello,

when I use "vg deconstruct" to get the genotype variants of the graph created by pggb, vg got crushed. could you please tell me where I was wrong? Thank you. Here is the run command: d4330ae1ca5031b1fedc789378616a3a_

Here is the error message: a9748cf78c062db470d7284b1db7ee49_

hcsun1 avatar Jul 28 '23 07:07 hcsun1

Can you please try with the current vg release? If it still crashes, please share the input data if you can. Thanks.

glennhickey avatar Jul 28 '23 12:07 glennhickey

Hello, I tried downloading the latest version of vg, but my system is CentOS 7, and it does not have "apt-get". Could you please guide me on how to install the latest version of vg on CentOS 7?

hcsun1 avatar Jul 29 '23 11:07 hcsun1

We have not been able to use VG deconstruct for many versions now. It doesn't parse the pggb pan-sn naming correctly. There has been a regression.

The version installed by the pggb Dockerfile will work correctly.

On Sat, Jul 29, 2023, 13:56 hcsun1 @.***> wrote:

Hello, I tried downloading the latest version of vg, but my system is CentOS 7, and it does not have "apt-get". Could you please guide me on how to install the latest version of vg on CentOS 7?

— Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/4034#issuecomment-1656713825, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEL7HIN2CZ66ILGWZ7LXST263ANCNFSM6AAAAAA23DW5EE . You are receiving this because you are subscribed to this thread.Message ID: @.***>

ekg avatar Jul 29 '23 12:07 ekg

In addition to pggb, we also used minigraph to get the GFA file and do other steps by VG that isn't the latest version, and also got the VCF file without genotype variants. Is this mean that VG also can't work with minigraph? Or if the problem comes from VG, how could we get the latest version on CentOS?

hcsun1 avatar Jul 31 '23 07:07 hcsun1

@hcsun1 To install vg, click the big green download button here : https://github.com/vgteam/vg/releases/tag/v1.49.0

Also: vg deconstruct does not work with minigraph, try gfatools bubble instead.

@ekg Can you please share a PGGB gfa and deconstruct command line so I can reproduce and fix this?

glennhickey avatar Jul 31 '23 12:07 glennhickey

@glennhickey https://github.com/pangenome/pggb/blob/master/pggb#L620-L621

echo "[vg::deconstruct] making VCF with reference=$ref and delim=$delim"
( TEMPDIR=$(pwd) $timer -f "$fmt" vg deconstruct -P "$ref" \
       -H "$delim" -e -a -t $threads "$prefix_smoothed_output".final.gfa >"$vcf" ) 2> >(tee -a "$log_file")

ekg avatar Jul 31 '23 13:07 ekg

We're stuck on v1.40.0: https://github.com/pangenome/pggb/blob/master/Dockerfile#L97

wget https://github.com/vgteam/vg/releases/download/v1.40.0/vg ...

ekg avatar Jul 31 '23 13:07 ekg

@ekg Could you send a link to a GFA too so I can reproduce?

glennhickey avatar Jul 31 '23 13:07 glennhickey

@glennhickey sure http://hypervolu.me/~erik/yeast/cerevisiae.fa.gz.d1a145e.417fcdf.7493449.smooth.final.gfa.gz

ekg avatar Jul 31 '23 14:07 ekg

So I try with v.49.0. There's a warning about -H (vg now uses only pan-sn so # is default), and the results look reasonable

rm -f vg; wget -q https://github.com/vgteam/vg/releases/download/v1.49.0/vg
chmod +x vg
wget -q http://hypervolu.me/~erik/yeast/cerevisiae.fa.gz.d1a145e.417fcdf.7493449.smooth.final.gfa.gz
gzip -d cerevisiae.fa.gz.d1a145e.417fcdf.7493449.smooth.final.gfa.gz
./vg deconstruct cerevisiae.fa.gz.d1a145e.417fcdf.7493449.smooth.final.gfa -H "#" -P S288C -e -a -t 8 > decon.vcf
bcftools view -H decon.vcf | head 
chrI	7	>8472>8475	C	CA,CACACCCACA	60	.	AC=1,1;AF=0.5,0.5;AN=2;AT=>8472>8475,>8472>8473>8475,>8472>8473>8474>8475;NS=2;LV=1;PS=>98085>92806	GT	.	1	.	.	2	.
chrI	20	>8475>8478	ACA	A,AC	60	.	AC=1,1;AF=0.5,0.5;AN=2;AT=>8475>8476>8477>8478,>8475>8478,>8475>8476>8478;NS=2;LV=1;PS=>98085>92806	GT	.	1	.	.	2	.
chrI	29	>8479>8481	C	CCA	60	.	AC=2;AF=0.666667;AN=3;AT=>8479>8481,>8479>8480>8481;NS=3;LV=1;PS=>98085>92806	GT	.	1	.	.	1	0
chrI	35	>8481>8483	AC	A	60	.	AC=1;AF=0.333333;AN=3;AT=>8481>8482>8483,>8481>8483;NS=3;LV=1;PS=>98085>92806	GT	.	0	.	.	0	1
chrI	42	>8483>8485	CA	C	60	.	AC=2;AF=0.666667;AN=3;AT=>8483>8484>8485,>8483>8485;NS=3;LV=1;PS=>98085>92806	GT	.	0	.	.	1	1
chrI	44	>8485>8487	C	CA	60	.	AC=2;AF=0.666667;AN=3;AT=>8485>8487,>8485>8486>8487;NS=3;LV=1;PS=>98085>92806	GT	.	1	.	.	1	0
chrI	47	>8487>8489	CA	C	60	.	AC=1;AF=0.333333;AN=3;AT=>8487>8488>8489,>8487>8489;NS=3;LV=1;PS=>98085>92806	GT	.	1	.	.	0	0
chrI	50	>8489>8492	C	A	60	.	AC=1;AF=0.333333;AN=3;AT=>8489>8490>8492,>8489>8491>8492;NS=3;LV=1;PS=>98085>92806	GT	.	1	.	.	0	0
chrI	55	>8492>8494	C	CCA	60	.	AC=1;AF=0.333333;AN=3;AT=>8492>8494,>8492>8493>8494;NS=3;LV=1;PS=>98085>92806	GT	.	0	.	.	1	0
chrI	61	>8494>8496	C	CA	60	.	AC=2;AF=0.666667;AN=3;AT=>8494>8496,>8494>8495>8496;NS=3;LV=1;PS=>98085>92806	GT	.	0	.	.	1	1

Then again with v1.40.0

rm -f vg; wget -q https://github.com/vgteam/vg/releases/download/v1.40.0/vg
chmod +x vg
./vg deconstruct cerevisiae.fa.gz.d1a145e.417fcdf.7493449.smooth.final.gfa -H "#" -P S288C -e -a -t 8 > decon40.vcf
bcftools view -H decon40.vcf | head

The output seems simlar

S288C#1#chrI	7	>8472>8475	C	CACACCCACA,CA	60	.	AC=1,1;AF=0.5,0.5;AN=2;AT=>8472>8475,>8472>8473>8474>8475,>8472>8473>8475;NS=2;LV=1;PS=>98085>92806	GT	.	2	.	.	1	.
S288C#1#chrI	20	>8475>8478	ACA	AC,A	60	.	AC=1,1;AF=0.5,0.5;AN=2;AT=>8475>8476>8477>8478,>8475>8476>8478,>8475>8478;NS=2;LV=1;PS=>98085>92806	GT	.	2	.	.	1	.
S288C#1#chrI	29	>8479>8481	C	CCA	60	.	AC=2;AF=0.666667;AN=3;AT=>8479>8481,>8479>8480>8481;NS=3;LV=1;PS=>98085>92806	GT	.	1	.	.	1	0
S288C#1#chrI	35	>8481>8483	AC	A	60	.	AC=1;AF=0.333333;AN=3;AT=>8481>8482>8483,>8481>8483;NS=3;LV=1;PS=>98085>92806	GT	.	0	.	.	0	1
S288C#1#chrI	42	>8483>8485	CA	C	60	.	AC=2;AF=0.666667;AN=3;AT=>8483>8484>8485,>8483>8485;NS=3;LV=1;PS=>98085>92806	GT	.	0	.	.	1	1
S288C#1#chrI	44	>8485>8487	C	CA	60	.	AC=2;AF=0.666667;AN=3;AT=>8485>8487,>8485>8486>8487;NS=3;LV=1;PS=>98085>92806	GT	.	1	.	.	1	0
S288C#1#chrI	47	>8487>8489	CA	C	60	.	AC=1;AF=0.333333;AN=3;AT=>8487>8488>8489,>8487>8489;NS=3;LV=1;PS=>98085>92806	GT	.	1	.	.	0	0
S288C#1#chrI	50	>8489>8492	C	A	60	.	AC=1;AF=0.333333;AN=3;AT=>8489>8490>8492,>8489>8491>8492;NS=3;LV=1;PS=>98085>92806	GT	.	1	.	.	0	0
S288C#1#chrI	55	>8492>8494	C	CCA	60	.	AC=1;AF=0.333333;AN=3;AT=>8492>8494,>8492>8493>8494;NS=3;LV=1;PS=>98085>92806	GT	.	0	.	.	1	0
S288C#1#chrI	61	>8494>8496	C	CA	60	.	AC=2;AF=0.666667;AN=3;AT=>8494>8496,>8494>8495>8496;NS=3;LV=1;PS=>98085>92806	GT	.	0	.	.	1	1

Though the contig names are indeed different, as the sample/haplotype is stripped in the new version. Do you think this is causing downstream issues for PGGB leading to the incompatibility? I think the rationale for this change was that everyone seemed to be running a similar filter to use the VCF anyway, so it seemed simpler. Though I guess it could cause problems calling on multiple samples at once (but that does not seem to be the case here).

The other difference I see is that the new version has three extra calls:

chrI	127337	>105110>105113	T	TCTT	60	.	AC=0;AF=0;AN=5;AT=<105113<105110,<105113<105112<105111<105110;CONFLICT=YPS128;NS=5;LV=1;PS=>105109>105115	GT	0	0	0	0	0	.

chrIV	765615	>153496>153499	G	A	60	.	AC=0;AF=0;AN=2;AT=>153496>153498>153499,>153496>153497>153499;CONFLICT=DBVPG6765;NS=2;LV=1;PS=>153491>153715	GT	.	.	.	.	0	0

chrIV	766132	>153644>153649	AC	TC	60	.	AC=0;AF=0;AN=2;AT=>153644>153646>153647>153649,>153644>153645>153647>153649;CONFLICT=DBVPG6044,SK1;NS=2;LV=1;PS=>153491>153715	GT	.	.	.	0	0	.

which seem to be sites for whom genotypes couldn't be resolved that were filtered out before but are now included.

glennhickey avatar Jul 31 '23 14:07 glennhickey

Hi @glennhickey, new versions of vg deconstruct creates issues with PGGB. See https://github.com/vgteam/vg/issues/3807.

The problem is that, to be picky, the reference is called S288C#1#chrI in the FASTA/GFA, not chrI.

AndreaGuarracino avatar Jul 31 '23 14:07 AndreaGuarracino

@AndreaGuarracino Thanks. I will point out that the header name / contig name mismatch in that issue is resolved (the header in 1.49.0 is ##contig=<ID=chrI,length=219929> etc).

So as far as I can tell, the only remaining issue is that the sample name is being stripped. I understand that this means your VCF contigs are different than the FASTA input contigs. But... it's also pretty standard to only put contig names in VCF's CONTIG field (outside of older versions of deconstruct, I've never seen a VCF with sample names embedded in the contig field anywhere else).

Anyway, I'll revert to the old logic of putting in the full name so PGGB can keep using deconstruct, and maybe add a flag to write the contig-only.

glennhickey avatar Jul 31 '23 15:07 glennhickey