CrossMap
CrossMap copied to clipboard
MAF liftover resulting in same reference and alternate allele
Hello,
Thanks for this useful utility. I have data in MAF format (NCBI build 37). I am trying to lift it over to hg38 using the following Crossmap (v0.6.6) command:
CrossMap maf b37ToHg38.over.chain \\
PAAD_atlas.tmp.maf \\
/work/genomes/Hsapiens/hg38/seq/hg38.fa hg38 \\
/explore/liftover_maf/PAAD_atlas.liftover.maf \\
--chromid l
Liftover file was downloaded from https://github.com/broadinstitute/gatk/blob/083aac832cb64515fd0456008bf847dd22f6c234/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain
The command runs successfully with following output:
2024-02-26 10:29:50 [INFO] Read the chain file "/g/data3/gx8/extras/liftover_chains/b37ToHg38.over.chain"
2024-02-26 10:29:51 [INFO] Lifting over ...
2024-02-26 10:33:58 [INFO] Total entries: 6630811
2024-02-26 10:33:58 [INFO] Failed to map: 1372
However, after inspecting the output I have realised that the Reference
and Tumor_Seq_Allele2
are both the same in the lifted over maf file. For example, the head of output looks like:
$ head PAAD_atlas.liftover.maf
#liftOver: Program=CrossMapv0.6.6, Time=February26,2024, ChainFile=/g/data3/gx8/extras/liftover_chains/b37ToHg38.over.chain, NewRefGenome=/g/data3/gx8/local/development/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa
Hugo_Symbol sample_id Hugo_Symbol NCBI_Build Chromosome Start_Position End_Position Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele2 Tumor_Sample_Barcode HGVSp_Short aa_mutation
1 Avner-primary_tissue_subset FAM231B hg38 chr1 16539492 16539492 Missense_Mutation SNP C C p010_tumor-52fccd-somatic.pcgr.vcf p.R143C NA
2 Avner-primary_tissue_subset ZMYM4 hg38 chr1 35389029 35389029 Nonsense_Mutation SNP G G p010_tumor-52fccd-somatic.pcgr.vcf p.E795* NA
3 Avner-primary_tissue_subset COL8A2 hg38 chr1 36099236 36099236 Nonsense_Mutation SNP G G p010_tumor-52fccd-somatic.pcgr.vcf p.R149* NA
4 Avner-primary_tissue_subset PTGER3 hg38 chr1 70953763 70953763 Missense_Mutation SNP T T p010_tumor-52fccd-somatic.pcgr.vcf p.Q368H NA
5 Avner-primary_tissue_subset C1orf52 hg38 chr1 85259561 85259561 Missense_Mutation SNP C C p010_tumor-52fccd-somatic.pcgr.vcf p.E25K NA
6 Avner-primary_tissue_subset AMY2A hg38 chr1 103617550 103617550 Missense_Mutation SNP T T p010_tumor-52fccd-somatic.pcgr.vcf p.V37D NA
7 Avner-primary_tissue_subset TNR hg38 chr1 175391305 175391305 Missense_Mutation SNP G G p010_tumor-52fccd-somatic.pcgr.vcf p.S497L NA
8 Avner-primary_tissue_subset LAMC2 hg38 chr1 183218424 183218424 Missense_Mutation SNP G G p010_tumor-52fccd-somatic.pcgr.vcf p.A147T NA
In comparison, the head of original (genome build 37) file is:
$ head PAAD_atlas.tmp.maf
Hugo_Symbol sample_id Hugo_Symbol NCBI_Build Chromosome Start_Position End_Position Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele2 Tumor_Sample_Barcode HGVSp_Short aa_mutation
1 Avner-primary_tissue_subset FAM231B 37 1 16865987 16865987 Missense_Mutation SNP C T p010_tumor-52fccd-somatic.pcgr.vcf p.R143C NA
2 Avner-primary_tissue_subset ZMYM4 37 1 35854630 35854630 Nonsense_Mutation SNP G T p010_tumor-52fccd-somatic.pcgr.vcf p.E795* NA
3 Avner-primary_tissue_subset COL8A2 37 1 36564837 36564837 Nonsense_Mutation SNP G A p010_tumor-52fccd-somatic.pcgr.vcf p.R149* NA
4 Avner-primary_tissue_subset PTGER3 37 1 71419446 71419446 Missense_Mutation SNP T G p010_tumor-52fccd-somatic.pcgr.vcf p.Q368H NA
5 Avner-primary_tissue_subset C1orf52 37 1 85725244 85725244 Missense_Mutation SNP C T p010_tumor-52fccd-somatic.pcgr.vcf p.E25K NA
6 Avner-primary_tissue_subset AMY2A 37 1 104160172 104160172 Missense_Mutation SNP T A p010_tumor-52fccd-somatic.pcgr.vcf p.V37D NA
7 Avner-primary_tissue_subset TNR 37 1 175360441 175360441 Missense_Mutation SNP G A p010_tumor-52fccd-somatic.pcgr.vcf p.S497L NA
8 Avner-primary_tissue_subset LAMC2 37 1 183187559 183187559 Missense_Mutation SNP G A p010_tumor-52fccd-somatic.pcgr.vcf p.A147T NA
9 Avner-primary_tissue_subset OBSCN 37 1 228434396 228434396 Missense_Mutation SNP G A p010_tumor-52fccd-somatic.pcgr.vcf p.A1401T NA
It seems the program is updating both reference and alternate alleles. Can you please help me debug the issue? Thanks.
Hello,
The "Reference_Allele" will be updated to the new reference genome (hg38 in your case) after liftover. I checked, all the reference alleles in your "PAAD_atlas.liftover.maf" file are indeed matched to the sequence of hg38/GRCh38.
This probably indicates the somatic variants in your "PAAD_atlas.tmp.maf" file are NOT real mutations.
Hope this helps.
Liguo
Thanks for the response, Liguo.
The "Reference_Allele" will be updated to the new reference genome (hg38 in your case) after liftover.
That part makes sense. What I am unsure is why the Tumor_Seq_Allele2
has been updated? For example for row one (Hugo_Symbol - FAM231B) - Tumor_Seq_Allele2
has been updated to C
as compared to T
in my original file. Should Tumor_Seq_Allele2 still be T
as it wouldn't have been updated as part of liftover?
I am finding it hard to understand how so many calls in my "PAAD_atlas.tmp.maf"
are not real.
I have done MAF analysis on my v37 data and the SNV numbers are as below:
Carrying the same analysis on liftover (hg38) MAF, I get few to none SNV calls:
Which suggests I am losing 98% of variants after liftover?
Oh, I know what happened!! It looks like a bug. I will update and release a new version soon. Thanks
Thanks for the response, Liguo.
The "Reference_Allele" will be updated to the new reference genome (hg38 in your case) after liftover.
That part makes sense. What I am unsure is why the
Tumor_Seq_Allele2
has been updated? For example for row one (Hugo_Symbol - FAM231B) -Tumor_Seq_Allele2
has been updated toC
as compared toT
in my original file. Should Tumor_Seq_Allele2 still beT
as it wouldn't have been updated as part of liftover?I am finding it hard to understand how so many calls in my
"PAAD_atlas.tmp.maf"
are not real.I have done MAF analysis on my v37 data and the SNV numbers are as below:
Carrying the same analysis on liftover (hg38) MAF, I get few to none SNV calls:
![]()
Which suggests I am losing 98% of variants after liftover?
After read the MAF specificaion again, I think the original code is still correct. According to https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/, the Ref_Allele is in the 11th column, while your input file has the Ref_Allele in the 10th column, please double check.
Many thanks @liguowang - updating order of columns in my original data has resolved the issue.
Can I ask another question - how is CrossMap handling liftover of insertion in MAF. For example an entry in v37 MAF was:
sample_id Hugo_Symbol NCBI_Build Chromosome Start_Position End_Position Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele2 Tumor_Sample_Barcode HGVSp_Short aa_mutation
KRAS-wt_Croagh_subset LINC00383 37 13 69791655 69791656 5'Flank INS - T MON3__PRJ180359_MON3-T-somatic.pcgr_acmg.grch37.vcf NA
After liftover to hg38 it becomes:
sample_id Hugo_Symbol NCBI_Build Chromosome Start_Position End_Position Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele2 Tumor_Sample_Barcode HGVSp_Short aa_mutation
KRAS-wt_Croagh_subset LINC00383 hg38 chr13 69217523 69217524 5'Flank INS AT T MON3__PRJ180359_MON3-T-somatic.pcgr_acmg.grch37.vcf NA
Can you please explain the rationale behind updating Reference_Allele
from -
to AT
in this case or in general changing from -
to actual nucleotides on that position for INS?
"-/T" is NOT the standard way to represent an insertion (at least according to VCF's specification; maybe MAF has its own rule, which I am not sure).
"AT/T" indicates an "A" was inserted into the REF genome.
"-/T" is NOT the standard way to represent an insertion (at least according to VCF's specification; maybe MAF has its own rule, which I am not sure).
Indeed MAF has this rule to represent insertion using -
.
From https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/:
11 - Reference_Allele | The plus strand reference allele at this position. Includes the deleted sequence for a deletion or "-" for an insertion
"AT/T" indicates an "A" was inserted into the REF genome.
According to VCF spec, this would indicate a deletion i.e. A
from the reference was deleted in the alternate allele.
Based on this, would it make sense to stick to MAF spec and indicate INSERTIONS using -
in the Reference_Allele
column in the output of CrossMap maf module?