
Feature Request: Update VCF header contig lines during the liftover

Open jjfarrell opened this issue 4 years ago • 5 comments

Currently, CrossMap does not update the VCF header when lifting over a VCF file. It would be great if the old contig lines could be filtered out and replaced with the contigs of the new reference. Until the header is updated, bcftools sort and tabix won't work on the output. I have worked around this with the bcftools reheader tool, but this feature would streamline the overall process. One option would be a new parameter, for example --vcf-contig GRCh38.contigs.
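To make the request concrete, here is a minimal Python sketch of the header rewrite being asked for. It is not CrossMap code: the function names `contig_lines_from_fai` and `replace_contig_lines` are hypothetical, and it assumes the new reference's contigs are available as the text lines of a FASTA `.fai` index (name and length in the first two tab-separated columns).

```python
def contig_lines_from_fai(fai_lines):
    """Build ##contig header lines from the lines of a FASTA .fai index.

    Assumes the usual .fai layout: contig name in column 1, length in column 2.
    """
    contigs = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], fields[1]
        contigs.append(f"##contig=<ID={name},length={length}>")
    return contigs


def replace_contig_lines(header_lines, new_contigs):
    """Drop every old ##contig line and put the new ones in their place.

    The new lines go where the first old ##contig line was; if the header
    had none, they are placed just before the #CHROM column line.
    """
    result = []
    inserted = False
    for line in header_lines:
        if line.startswith("##contig="):
            if not inserted:
                result.extend(new_contigs)
                inserted = True
            continue  # old contig line is filtered out
        if line.startswith("#CHROM") and not inserted:
            result.extend(new_contigs)
            inserted = True
        result.append(line)
    return result


if __name__ == "__main__":
    old_header = [
        "##fileformat=VCFv4.2",
        "##contig=<ID=1,length=249250621>",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    fai = ["chr1\t248956422\t6\t60\t61\n"]
    for line in replace_contig_lines(old_header, contig_lines_from_fai(fai)):
        print(line)
```

A header produced this way could be written to a file and applied with bcftools reheader, or the same filtering could happen inside the liftover step itself.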

jjfarrell avatar Jul 24 '20 13:07 jjfarrell

CrossMap does update the header section. Did you use the most recent version?

liguowang avatar Jul 24 '20 16:07 liguowang

I just started using 4.3 and see that there are new contigs, but for some reason they are missing the chr prefix. The contigs are also out of order.

In the file aligned to GRCh37, there are these contigs:

##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>

In the file lifted over to GRCh38, the contig IDs are missing the chr prefix. For example, here is one section:

##contig=<ID=1,length=248956422,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=10,length=133797422,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=10_GL383545v1_alt,length=179254,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=10_GL383546v1_alt,length=309802,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=10_KI270824v1_alt,length=181496,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=10_KI270825v1_alt,length=188315,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=11,length=135086622,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=11_GL383547v1_alt,length=154407,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=11_JH159136v1_alt,length=200998,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=11_JH159137v1_alt,length=191409,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>

The HLA lines are fine, but they are listed before the chr contigs.

jjfarrell avatar Jul 24 '20 18:07 jjfarrell

I took a quick look at the code. It looks like withChr = True is set based on the input VCF. Since the input VCF, aligned to GRCh37, has no chr prefix, the withChr flag is False, and CrossMap then strips the chr from the beginning of the GRCh38 reference contig lines.

jjfarrell avatar Jul 24 '20 18:07 jjfarrell

You are right. The contig ID style is determined by the input VCF. I believe the contig IDs were sorted in alphabetical order.
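For illustration, a plain string sort places `10` before `2`, which matches the ordering seen in the lifted-over header. The sketch below shows one way to get a karyotype-style order instead. The function name `contig_sort_key` and the chosen ordering (numbered chromosomes, then X, Y, M/MT, then everything else alphabetically) are assumptions for this example, not what CrossMap does.

```python
def contig_sort_key(contig_id):
    """Sort key: 1-22 numerically, then X, Y, M/MT, then all other contigs."""
    # Ignore an optional "chr" prefix when classifying the contig.
    name = contig_id[3:] if contig_id.startswith("chr") else contig_id
    if name.isdigit():
        return (0, int(name), "")
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}
    if name in special:
        return (1, special[name], "")
    # Alt, decoy, and HLA contigs keep plain alphabetical order at the end.
    return (2, 0, contig_id)


if __name__ == "__main__":
    ids = ["1", "10", "10_GL383545v1_alt", "11", "2", "X", "MT"]
    print(sorted(ids))                       # alphabetical
    print(sorted(ids, key=contig_sort_key))  # karyotype-style
```

Another option would be to skip sorting entirely and keep the order in which the contigs appear in the target reference.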

liguowang avatar Jul 24 '20 22:07 liguowang

So this becomes a problem when the GRCh37 input and the GRCh38 reference are not consistent in their use of the chr prefix. Would testing the target reference for withChr, instead of the input VCF, fix this?
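A minimal sketch of that idea in Python, under stated assumptions: `detect_with_chr` and `apply_chr_style` are hypothetical names, not functions from the CrossMap source, and the list of target contig names is assumed to come from the target reference (for example its `.fai` index).

```python
def detect_with_chr(target_contigs):
    """Decide the chr style from the target reference, not from the input VCF."""
    return any(name.startswith("chr") for name in target_contigs)


def apply_chr_style(chrom, with_chr):
    """Add or strip the chr prefix so a chromosome ID matches the chosen style."""
    has_chr = chrom.startswith("chr")
    if with_chr and not has_chr:
        return "chr" + chrom
    if not with_chr and has_chr:
        return chrom[3:]
    return chrom


if __name__ == "__main__":
    # Names as they might appear in a chr-style GRCh38 reference.
    target = ["chr1", "chr10", "chr10_GL383545v1_alt"]
    with_chr = detect_with_chr(target)
    # A record from an input VCF without the chr prefix.
    print(apply_chr_style("1", with_chr))
```

With the flag taken from the target, the header contig lines could be copied from the reference unchanged, and only the CHROM column of each lifted record would need converting. The sketch does not handle naming differences beyond the prefix, such as MT versus chrM.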

jjfarrell avatar Jul 27 '20 00:07 jjfarrell