plink-ng icon indicating copy to clipboard operation
plink-ng copied to clipboard

Plink2 round-tripping vcf->pgen->vcf does not preserve contig names

Open cmnbroad opened this issue 1 year ago • 3 comments

In round tripping vcf- > pgen(pvar/psam) -> vcf, contig names are not preserved:

Original VCF (one contig named "chr1"):

##fileformat=VCFv4.2
##contig=<ID=chr1,length=248956422,assembly=38>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ERS4367804      ERS4367805      ERS4367803      ERS4367800      ERS4367801      ERS4367798      ERS4367799      ERS4367796      ERS4367797      ERS4367795
chr1    10146   .       AC      A       .       NO_HQ_GENOTYPES AC=3;AF=0.750;AN=4;AS_QUALapprox=0|224;AS_VQSLOD=.;AS_YNG=.;QUALapprox=137      GT:AD:GQ:RGQ    1/1:1,15:11:137 ./.     ./.     ./.     ./.     0/1:3,5:47:87   ./.     ./.     ./.     ./.

After converting to pgen (plink2 --vcf original.vcf --make-pgen --out round_trip), the generated .pvar has the contig name correctly preserved in contig header line, but the variants use "1" as the contig name:

##contig=<ID=chr1,length=248956422,assembly=38>
#CHROM  POS     ID      REF     ALT     FILTER
1       10146   .       AC      A       NO_HQ_GENOTYPES

(Ok, its a .pvar, not a .VCF, but its still confusing).

But on regenerating a .vcf from the pfile (command plink2 --pfile round_trip --export vcf --out round_trip), both the sequence dictionary and the variants have the altered contig names (despite the assembly property still referencing hg38):

##fileformat=VCFv4.3
##fileDate=20230328
##source=PLINKv2.00
##contig=<ID=1,length=248956422,assembly=38>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ERS4367804      ERS4367805      ERS4367803      ERS4367800      ERS4367801      ERS4367798      ERS4367799      ERS4367796      ERS4367797      ERS4367795
1       10146   .       AC      A       .       NO_HQ_GENOTYPES .       GT      1/1     ./.     ./.     ./.     ./.     0/1     ./.     ./.     ./.     ./.

This kind of change can cause lots of problems if the vcf is used further in downstream analysis code. The plink2 code that I'm running was built locally from commit a2e584ba, 3/11/23.

cmnbroad avatar Mar 28 '23 15:03 cmnbroad

plink's way of controlling this is the --output-chr flag.

chrchang avatar Mar 28 '23 19:03 chrchang

Right, thanks for the link. Perhaps there is some compelling reason to do otherwise, but my suggestion would be to have the default behavior be to preserve the scheme that is presented in the inputs.

cmnbroad avatar Mar 28 '23 19:03 cmnbroad

That creates more interoperability problems than it solves. It is reasonable for programs taking plink-formatted input to currently assume no 'chr' prefix in .bim/.pvar files.

I will look into adding an "--output-chr infer" mode.

chrchang avatar Mar 28 '23 19:03 chrchang