
Which types of mutation effects should be ignored?

Open dhimmel opened this issue 7 years ago • 5 comments

The PANCAN_mutation dataset (online doc) contains several types of mutations under the `effect` column. My processing of the dataset (notebook) yielded the following mutation effects and their frequencies (as counts and percentages):

| Effect | Count | Percent |
| --- | ---: | ---: |
| Missense_Mutation | 1,044,846 | 58.152% |
| Silent | 432,995 | 24.099% |
| Nonsense_Mutation | 81,092 | 4.513% |
| RNA | 71,493 | 3.979% |
| Frame_Shift_Del | 46,941 | 2.613% |
| Splice_Site | 43,262 | 2.408% |
| Frame_Shift_Ins | 22,546 | 1.255% |
| missense_variant | 20,241 | 1.127% |
| In_Frame_Del | 11,455 | 0.638% |
| synonymous_variant | 7,907 | 0.440% |
| Translation_Start_Site | 3,258 | 0.181% |
| In_Frame_Ins | 3,052 | 0.170% |
| stop_gained | 1,573 | 0.088% |
| 3_prime_UTR_variant | 1,420 | 0.079% |
| Nonstop_Mutation | 1,318 | 0.073% |
| exon_variant | 945 | 0.053% |
| EXON | 420 | 0.023% |
| 5_prime_UTR_variant | 395 | 0.022% |
| splice_acceptor_variant | 294 | 0.016% |
| splice_region_variant | 255 | 0.014% |
| 3'UTR | 211 | 0.012% |
| splice_donor_variant | 203 | 0.011% |
| Intron | 148 | 0.008% |
| 5_prime_UTR_premature_start_codon_gain_variant | 110 | 0.006% |
| NON_SYNONYMOUS_CODING | 95 | 0.005% |
| INTRAGENIC | 57 | 0.003% |
| UTR_3_PRIME | 38 | 0.002% |
| SYNONYMOUS_CODING | 36 | 0.002% |
| start_lost | 32 | 0.002% |
| 5'UTR | 28 | 0.002% |
| UTR_5_PRIME | 22 | 0.001% |
| stop_lost | 19 | 0.001% |
| IGR | 16 | 0.001% |
| stop_retained_variant | 7 | 0.000% |
| STOP_GAINED | 6 | 0.000% |
| initiator_codon_variant | 2 | 0.000% |
| SPLICE_SITE_ACCEPTOR | 2 | 0.000% |
| SYNONYMOUS_STOP | 1 | 0.000% |
| 5'Flank | 1 | 0.000% |
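For anyone who wants to reproduce a tally like the one above without opening the notebook, here is a minimal sketch using only the standard library. It is not the notebook's actual code: the function name `effect_frequencies` and the toy input are made up for illustration, and the real input would be the values of the `effect` column.

```python
from collections import Counter


def effect_frequencies(effects):
    """Return (effect, count, percent) rows, most frequent effect first."""
    counts = Counter(effects)
    total = sum(counts.values())
    return [(effect, n, 100 * n / total) for effect, n in counts.most_common()]


# Toy input standing in for the `effect` column (not the real data).
toy_effects = ["Missense_Mutation"] * 3 + ["Silent"]
for effect, n, pct in effect_frequencies(toy_effects):
    print(f"{effect}\t{n:,}\t{pct:.3f}%")
```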

Certain effects appear to be duplicates of one another (for example, 5_prime_UTR_variant, 5'UTR, and UTR_5_PRIME all seem to describe the same thing). If so, the effect names are poorly standardized. To improve on this, we can either create our own mapping or report the issue to the upstream creators, although upstream fixes usually take a long time.
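If we go the route of our own mapping, it could be as simple as a lookup table from each variant spelling to one canonical name. The sketch below is only an illustration: the 5' UTR group is the one named in this issue, the 3' UTR group is my assumption by analogy, and the choice of canonical names and the helper names `consolidate_effect` and `merge_counts` are hypothetical. Any real mapping would need review by someone who knows the annotation sources.

```python
from collections import Counter

# Variant spelling -> canonical name. Only the 5' UTR group comes from this
# issue; the 3' UTR group is an assumption made by analogy.
EFFECT_SYNONYMS = {
    "5'UTR": "5_prime_UTR_variant",
    "UTR_5_PRIME": "5_prime_UTR_variant",
    "3'UTR": "3_prime_UTR_variant",
    "UTR_3_PRIME": "3_prime_UTR_variant",
}


def consolidate_effect(effect):
    """Map an effect name to its canonical form; unknown names pass through."""
    return EFFECT_SYNONYMS.get(effect, effect)


def merge_counts(counts):
    """Sum the counts of effects that share a canonical name."""
    merged = Counter()
    for effect, n in counts.items():
        merged[consolidate_effect(effect)] += n
    return merged


# Counts for the three 5' UTR spellings, taken from the table in this issue.
utr5 = {"5_prime_UTR_variant": 395, "5'UTR": 28, "UTR_5_PRIME": 22}
print(merge_counts(utr5))
```

Passing unknown names through unchanged means the mapping can grow incrementally without dropping any rows.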

Either way, we'll have to decide which types of effects count as functionally relevant mutations. For example, a "Silent" mutation generally does not have a biological effect. We could also let users decide for themselves, but that adds complexity.
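One way to keep the "let users decide" option cheap is a default ignore set that callers can override. This is a rough sketch, not a proposal for the final list: `DEFAULT_IGNORED` contains only "Silent" because that is the one example given here, and `filter_mutations` and the row layout (dicts with an `effect` key) are hypothetical.

```python
# Effects dropped by default. Only "Silent" is listed because it is the one
# example given in this issue; the real list is still to be decided.
DEFAULT_IGNORED = frozenset({"Silent"})


def filter_mutations(rows, ignored=DEFAULT_IGNORED):
    """Keep rows whose effect is not in the ignored set."""
    return [row for row in rows if row["effect"] not in ignored]


# Toy rows for illustration only.
rows = [
    {"gene": "TP53", "effect": "Missense_Mutation"},
    {"gene": "TP53", "effect": "Silent"},
]
print(filter_mutations(rows))               # default: drops the Silent row
print(filter_mutations(rows, ignored=()))   # user override: keeps everything
```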

@clairemcleod, @mp8, @DCousminer, @gwaygenomics, @cgreene, @stephenshank: I thought you might have a better understanding of the biology here than I do. Can any of these categories be discarded as irrelevant to a tumor's function and classification? Are you interested in creating a consolidated set of effects with duplicates merged?

dhimmel, Jul 14 '16 22:07