CrossMap icon indicating copy to clipboard operation
CrossMap copied to clipboard

MAF liftover resulting in same reference and alternate allele

Open skanwal opened this issue 11 months ago • 7 comments

Hello,

Thanks for this useful utility. I have data in MAF format (NCBI build 37). I am trying to lift it over to hg38 using the following Crossmap (v0.6.6) command:

CrossMap maf b37ToHg38.over.chain \\ 
PAAD_atlas.tmp.maf \\
/work/genomes/Hsapiens/hg38/seq/hg38.fa hg38 \\
/explore/liftover_maf/PAAD_atlas.liftover.maf \\
--chromid l

Liftover file was downloaded from https://github.com/broadinstitute/gatk/blob/083aac832cb64515fd0456008bf847dd22f6c234/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain

The command runs successfully with following output:

2024-02-26 10:29:50 [INFO]  Read the chain file "/g/data3/gx8/extras/liftover_chains/b37ToHg38.over.chain"
2024-02-26 10:29:51 [INFO]  Lifting over ...
2024-02-26 10:33:58 [INFO]  Total entries: 6630811
2024-02-26 10:33:58 [INFO]  Failed to map: 1372

However, after inspecting the output I have realised that the Reference and Tumor_Seq_Allele2 are both the same in the lifted over maf file. For example, the head of output looks like:

$ head PAAD_atlas.liftover.maf
#liftOver: Program=CrossMapv0.6.6, Time=February26,2024, ChainFile=/g/data3/gx8/extras/liftover_chains/b37ToHg38.over.chain, NewRefGenome=/g/data3/gx8/local/development/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa
Hugo_Symbol	sample_id	Hugo_Symbol	NCBI_Build	Chromosome	Start_Position	End_Position	Variant_Classification	Variant_Type	Reference_Allele	Tumor_Seq_Allele2	Tumor_Sample_Barcode	HGVSp_Short	aa_mutation
1	Avner-primary_tissue_subset	FAM231B	hg38	chr1	16539492	16539492	Missense_Mutation	SNP	C	C	p010_tumor-52fccd-somatic.pcgr.vcf	p.R143C	NA
2	Avner-primary_tissue_subset	ZMYM4	hg38	chr1	35389029	35389029	Nonsense_Mutation	SNP	G	G	p010_tumor-52fccd-somatic.pcgr.vcf	p.E795*	NA
3	Avner-primary_tissue_subset	COL8A2	hg38	chr1	36099236	36099236	Nonsense_Mutation	SNP	G	G	p010_tumor-52fccd-somatic.pcgr.vcf	p.R149*	NA
4	Avner-primary_tissue_subset	PTGER3	hg38	chr1	70953763	70953763	Missense_Mutation	SNP	T	T	p010_tumor-52fccd-somatic.pcgr.vcf	p.Q368H	NA
5	Avner-primary_tissue_subset	C1orf52	hg38	chr1	85259561	85259561	Missense_Mutation	SNP	C	C	p010_tumor-52fccd-somatic.pcgr.vcf	p.E25K	NA
6	Avner-primary_tissue_subset	AMY2A	hg38	chr1	103617550	103617550	Missense_Mutation	SNP	T	T	p010_tumor-52fccd-somatic.pcgr.vcf	p.V37D	NA
7	Avner-primary_tissue_subset	TNR	hg38	chr1	175391305	175391305	Missense_Mutation	SNP	G	G	p010_tumor-52fccd-somatic.pcgr.vcf	p.S497L	NA
8	Avner-primary_tissue_subset	LAMC2	hg38	chr1	183218424	183218424	Missense_Mutation	SNP	G	G	p010_tumor-52fccd-somatic.pcgr.vcf	p.A147T	NA

In comparison, the head of original (genome build 37) file is:

$ head PAAD_atlas.tmp.maf
Hugo_Symbol	sample_id	Hugo_Symbol	NCBI_Build	Chromosome	Start_Position	End_Position	Variant_Classification	Variant_Type	Reference_Allele	Tumor_Seq_Allele2	Tumor_Sample_Barcode	HGVSp_Short	aa_mutation
1	Avner-primary_tissue_subset	FAM231B	37	1	16865987	16865987	Missense_Mutation	SNP	C	T	p010_tumor-52fccd-somatic.pcgr.vcf	p.R143C	NA
2	Avner-primary_tissue_subset	ZMYM4	37	1	35854630	35854630	Nonsense_Mutation	SNP	G	T	p010_tumor-52fccd-somatic.pcgr.vcf	p.E795*	NA
3	Avner-primary_tissue_subset	COL8A2	37	1	36564837	36564837	Nonsense_Mutation	SNP	G	A	p010_tumor-52fccd-somatic.pcgr.vcf	p.R149*	NA
4	Avner-primary_tissue_subset	PTGER3	37	1	71419446	71419446	Missense_Mutation	SNP	T	G	p010_tumor-52fccd-somatic.pcgr.vcf	p.Q368H	NA
5	Avner-primary_tissue_subset	C1orf52	37	1	85725244	85725244	Missense_Mutation	SNP	C	T	p010_tumor-52fccd-somatic.pcgr.vcf	p.E25K	NA
6	Avner-primary_tissue_subset	AMY2A	37	1	104160172	104160172	Missense_Mutation	SNP	T	A	p010_tumor-52fccd-somatic.pcgr.vcf	p.V37D	NA
7	Avner-primary_tissue_subset	TNR	37	1	175360441	175360441	Missense_Mutation	SNP	G	A	p010_tumor-52fccd-somatic.pcgr.vcf	p.S497L	NA
8	Avner-primary_tissue_subset	LAMC2	37	1	183187559	183187559	Missense_Mutation	SNP	G	A	p010_tumor-52fccd-somatic.pcgr.vcf	p.A147T	NA
9	Avner-primary_tissue_subset	OBSCN	37	1	228434396	228434396	Missense_Mutation	SNP	G	A	p010_tumor-52fccd-somatic.pcgr.vcf	p.A1401T	NA

It seems the program is updating both reference and alternate alleles. Can you please help me debug the issue? Thanks.

skanwal avatar Feb 26 '24 00:02 skanwal

Hello,

The "Reference_Allele" will be updated to the new reference genome (hg38 in your case) after liftover. I checked, all the reference alleles in your "PAAD_atlas.liftover.maf" file are indeed matched to the sequence of hg38/GRCh38.

This probably indicates the somatic variants in your "PAAD_atlas.tmp.maf" file are NOT real mutations.

Hope this helps.

Liguo

liguowang avatar Feb 26 '24 15:02 liguowang

Thanks for the response, Liguo.

The "Reference_Allele" will be updated to the new reference genome (hg38 in your case) after liftover.

That part makes sense. What I am unsure is why the Tumor_Seq_Allele2 has been updated? For example for row one (Hugo_Symbol - FAM231B) - Tumor_Seq_Allele2 has been updated to C as compared to T in my original file. Should Tumor_Seq_Allele2 still be T as it wouldn't have been updated as part of liftover?

I am finding it hard to understand how so many calls in my "PAAD_atlas.tmp.maf" are not real.

I have done MAF analysis on my v37 data and the SNV numbers are as below:

Screen Shot 2024-02-27 at 6 08 54 am

Carrying the same analysis on liftover (hg38) MAF, I get few to none SNV calls: Screen Shot 2024-02-27 at 6 12 57 am

Which suggests I am losing 98% of variants after liftover?

skanwal avatar Feb 26 '24 19:02 skanwal

Oh, I know what happened!! It looks like a bug. I will update and release a new version soon. Thanks

liguowang avatar Feb 26 '24 19:02 liguowang

Thanks for the response, Liguo.

The "Reference_Allele" will be updated to the new reference genome (hg38 in your case) after liftover.

That part makes sense. What I am unsure is why the Tumor_Seq_Allele2 has been updated? For example for row one (Hugo_Symbol - FAM231B) - Tumor_Seq_Allele2 has been updated to C as compared to T in my original file. Should Tumor_Seq_Allele2 still be T as it wouldn't have been updated as part of liftover?

I am finding it hard to understand how so many calls in my "PAAD_atlas.tmp.maf" are not real.

I have done MAF analysis on my v37 data and the SNV numbers are as below:

Screen Shot 2024-02-27 at 6 08 54 am Carrying the same analysis on liftover (hg38) MAF, I get few to none SNV calls: Screen Shot 2024-02-27 at 6 12 57 am

Which suggests I am losing 98% of variants after liftover?

After read the MAF specificaion again, I think the original code is still correct. According to https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/, the Ref_Allele is in the 11th column, while your input file has the Ref_Allele in the 10th column, please double check.

liguowang avatar Feb 26 '24 20:02 liguowang

Many thanks @liguowang - updating order of columns in my original data has resolved the issue.

Can I ask another question - how is CrossMap handling liftover of insertion in MAF. For example an entry in v37 MAF was:

sample_id	Hugo_Symbol	NCBI_Build	Chromosome	Start_Position	End_Position	Variant_Classification	Variant_Type	Reference_Allele	Tumor_Seq_Allele2	Tumor_Sample_Barcode	HGVSp_Short	aa_mutation
KRAS-wt_Croagh_subset	LINC00383	37	13	69791655	69791656	5'Flank	INS	-	T	MON3__PRJ180359_MON3-T-somatic.pcgr_acmg.grch37.vcf		NA

After liftover to hg38 it becomes:

sample_id	Hugo_Symbol	NCBI_Build	Chromosome	Start_Position	End_Position	Variant_Classification	Variant_Type	Reference_Allele	Tumor_Seq_Allele2	Tumor_Sample_Barcode	HGVSp_Short	aa_mutation
KRAS-wt_Croagh_subset	LINC00383	hg38	chr13	69217523	69217524	5'Flank	INS	AT	T	MON3__PRJ180359_MON3-T-somatic.pcgr_acmg.grch37.vcf		NA

Can you please explain the rationale behind updating Reference_Allele from - to AT in this case or in general changing from - to actual nucleotides on that position for INS?

skanwal avatar Feb 27 '24 02:02 skanwal

"-/T" is NOT the standard way to represent an insertion (at least according to VCF's specification; maybe MAF has its own rule, which I am not sure).

"AT/T" indicates an "A" was inserted into the REF genome.

liguowang avatar Feb 27 '24 18:02 liguowang

"-/T" is NOT the standard way to represent an insertion (at least according to VCF's specification; maybe MAF has its own rule, which I am not sure).

Indeed MAF has this rule to represent insertion using -. From https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/:

11 - Reference_Allele | The plus strand reference allele at this position. Includes the deleted sequence for a deletion or "-" for an insertion

"AT/T" indicates an "A" was inserted into the REF genome.

According to VCF spec, this would indicate a deletion i.e. A from the reference was deleted in the alternate allele.

Based on this, would it make sense to stick to MAF spec and indicate INSERTIONS using - in the Reference_Allele column in the output of CrossMap maf module?

skanwal avatar Feb 27 '24 23:02 skanwal