CrossMap
Feature Request: Update VCF header contig lines during the liftover
Currently, CrossMap does not update the VCF header when lifting over a VCF file. It would be great if the old contig lines could be filtered out and replaced with the new reference's contigs. bcftools sort and tabix won't work until the header is updated. I have worked around this with the bcftools reheader tool, but built-in support would streamline the overall process. One option would be a new parameter, for example --vcf-contig GRCh38.contigs.
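For illustration, the requested behavior (drop the old contig lines and put the target reference's contigs in their place) could look roughly like the Python sketch below. This is not CrossMap code. The function name `replace_contig_lines` and the `(name, length)` input format are assumptions made for this example; in practice the pairs might be read from the target reference's .fai index.

```python
def replace_contig_lines(header_lines, contigs):
    """Return a copy of header_lines with the ##contig lines replaced.

    header_lines: VCF header lines, without trailing newlines.
    contigs: (name, length) pairs for the target reference
             (hypothetical input, e.g. parsed from a .fai index).
    """
    new_contigs = [
        f"##contig=<ID={name},length={length}>" for name, length in contigs
    ]
    out = []
    inserted = False
    for line in header_lines:
        if line.startswith("##contig="):
            # Put the new contigs where the first old contig line was,
            # then drop every old contig line.
            if not inserted:
                out.extend(new_contigs)
                inserted = True
            continue
        if line.startswith("#CHROM") and not inserted:
            # Header had no contig lines: add them just before #CHROM.
            out.extend(new_contigs)
            inserted = True
        out.append(line)
    return out
```

The new lines go in the position of the old ones so that the rest of the header keeps its original order.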
CrossMap does update the header section. Did you use the most recent version?
I just started using 4.3 and see that there are new contigs, but they are missing the chr prefix for some reason. The contigs are also out of order.
In the GRCh37-aligned file, there are these contigs:
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
In the GRCh38 liftover file, the contig IDs are missing the chr prefix. For example, here is one section:
##contig=<ID=1,length=248956422,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=10,length=133797422,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=10_GL383545v1_alt,length=179254,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=10_GL383546v1_alt,length=309802,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=10_KI270824v1_alt,length=181496,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=10_KI270825v1_alt,length=188315,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=11,length=135086622,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=11_GL383547v1_alt,length=154407,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=11_JH159136v1_alt,length=200998,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
##contig=<ID=11_JH159137v1_alt,length=191409,assembly=GRCh38_full_analysis_set_plus_decoy_hla.fa>
The HLA lines are fine, but they are listed before the chr contigs.
I took a quick look at the code. It looks like withChr = True is set based on the input VCF. Since my GRCh37-aligned file has no chr prefix, the withChr flag is false, and CrossMap then strips the chr from the beginning of the GRCh38 reference contig names in the header.
You are right. The contig ID style is determined by the input VCF. I believe the contig IDs were sorted in alphabetical order.
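As a rough illustration of an alternative to alphabetical ordering, the sketch below sorts contig IDs so that the numbered chromosomes come first in numeric order, then X, Y, and M/MT, then everything else alphabetically. The function name `contig_sort_key` is hypothetical and this is not how CrossMap orders contigs. Keeping the target reference's own order would be another reasonable choice.

```python
def contig_sort_key(contig_id):
    """Sort key: 1..22 numerically, then X, Y, M/MT, then all other contigs."""
    # Compare on the name without any "chr" prefix.
    name = contig_id[3:] if contig_id.startswith("chr") else contig_id
    if name.isdigit():
        return (0, int(name), "")
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}
    if name in special:
        return (1, special[name], "")
    # Alt, decoy, and HLA contigs fall back to alphabetical order.
    return (2, 0, name)


ids = ["10", "1", "10_GL383545v1_alt", "2", "X", "MT", "Y"]
ordered = sorted(ids, key=contig_sort_key)
```

With this key, `10` sorts after `2`, and `10_GL383545v1_alt` sorts after the primary chromosomes.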
So this becomes a problem when the GRCh37 input and the GRCh38 reference are not consistent in their use of the chr prefix. Would testing the target reference for withChr, instead of the input VCF, fix this?
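A minimal sketch of that idea, assuming the contig names of the target reference are available as a list: derive the flag from the reference rather than from the input VCF, and strip chr only when the reference itself does not use it. The function names `uses_chr_prefix` and `header_contig_id` are hypothetical, not CrossMap's, and the contig names are illustrative.

```python
def uses_chr_prefix(contig_names):
    """True if any contig name in the target reference starts with 'chr'."""
    return any(name.startswith("chr") for name in contig_names)


def header_contig_id(ref_name, with_chr):
    """Contig ID to write to the output header for one reference contig."""
    if ref_name.startswith("chr") and not with_chr:
        return ref_name[3:]
    # Otherwise keep the reference's own name (this also leaves HLA names alone).
    return ref_name


# Illustrative names in the style of a GRCh38 reference with alt and HLA contigs.
target = ["chr1", "chr10", "chr10_GL383545v1_alt", "HLA-A*01:01:01:01"]
with_chr = uses_chr_prefix(target)
header_ids = [header_contig_id(name, with_chr) for name in target]
```

Under this approach the header names match the reference exactly whenever the reference uses chr. The CHROM column of the lifted records would need to follow the same style for bcftools sort and tabix to accept the file.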