cancer-data
Which types of mutation effects should be ignored?
The `PANCAN_mutation` dataset (online doc) contains several types of mutations under the `effect` column. My processing of the dataset (notebook) yielded the following mutation effects and frequencies (as counts and percentages):
Effect | Count | Percent |
---|---|---|
Missense_Mutation | 1,044,846 | 58.152% |
Silent | 432,995 | 24.099% |
Nonsense_Mutation | 81,092 | 4.513% |
RNA | 71,493 | 3.979% |
Frame_Shift_Del | 46,941 | 2.613% |
Splice_Site | 43,262 | 2.408% |
Frame_Shift_Ins | 22,546 | 1.255% |
missense_variant | 20,241 | 1.127% |
In_Frame_Del | 11,455 | 0.638% |
synonymous_variant | 7,907 | 0.440% |
Translation_Start_Site | 3,258 | 0.181% |
In_Frame_Ins | 3,052 | 0.170% |
stop_gained | 1,573 | 0.088% |
3_prime_UTR_variant | 1,420 | 0.079% |
Nonstop_Mutation | 1,318 | 0.073% |
exon_variant | 945 | 0.053% |
EXON | 420 | 0.023% |
5_prime_UTR_variant | 395 | 0.022% |
splice_acceptor_variant | 294 | 0.016% |
splice_region_variant | 255 | 0.014% |
3'UTR | 211 | 0.012% |
splice_donor_variant | 203 | 0.011% |
Intron | 148 | 0.008% |
5_prime_UTR_premature_start_codon_gain_variant | 110 | 0.006% |
NON_SYNONYMOUS_CODING | 95 | 0.005% |
INTRAGENIC | 57 | 0.003% |
UTR_3_PRIME | 38 | 0.002% |
SYNONYMOUS_CODING | 36 | 0.002% |
start_lost | 32 | 0.002% |
5'UTR | 28 | 0.002% |
UTR_5_PRIME | 22 | 0.001% |
stop_lost | 19 | 0.001% |
IGR | 16 | 0.001% |
stop_retained_variant | 7 | 0.000% |
STOP_GAINED | 6 | 0.000% |
initiator_codon_variant | 2 | 0.000% |
SPLICE_SITE_ACCEPTOR | 2 | 0.000% |
SYNONYMOUS_STOP | 1 | 0.000% |
5'Flank | 1 | 0.000% |
It appears that certain effects are duplicates, such as `5_prime_UTR_variant`, `5'UTR`, and `UTR_5_PRIME`, which, if true, points to poor standardization upstream. To improve the standardization, we can create our own mapping, or we can report the issue to the upstream creators (although these fixes usually take a long time).
Either way, we'll have to decide which types of effects to consider functionally relevant mutations. For example, a "Silent" mutation generally does not have a biological effect. We could also let users decide for themselves, but that adds complexity.
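To give a sense of the added complexity, here is a rough sketch of a filter with a default ignore list that users can override. The names `DEFAULT_IGNORED_EFFECTS` and `filter_mutations` are hypothetical, the example rows are made up for illustration, and the default contains only "Silent" because that is the one category singled out above. Which other effects belong on the list is exactly the open question.

```python
# Hypothetical default: ignore only "Silent" until the list is agreed on.
DEFAULT_IGNORED_EFFECTS = frozenset({"Silent"})


def filter_mutations(mutations, ignored_effects=DEFAULT_IGNORED_EFFECTS):
    """Return the mutations whose effect is not in the ignored set."""
    return [m for m in mutations if m["effect"] not in ignored_effects]


# Made-up toy rows; only the "effect" key mirrors the dataset's column name.
mutations = [
    {"sample": "S1", "gene": "TP53", "effect": "Missense_Mutation"},
    {"sample": "S1", "gene": "KRAS", "effect": "Silent"},
    {"sample": "S2", "gene": "TP53", "effect": "Intron"},
]

# Default behaviour: drop silent mutations only.
kept_default = filter_mutations(mutations)

# A user who disagrees can pass a different set of effects to ignore.
kept_custom = filter_mutations(mutations, ignored_effects={"Silent", "Intron"})
```

The default keeps the common case simple, while the optional argument leaves the decision to users who need a different definition.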
@clairemcleod, @mp8, @DCousminer, @gwaygenomics, @cgreene, @stephenshank: I thought you might have a better understanding of the biology here than I do. Can any of these categories be discarded as irrelevant to a tumor's function and classification? Are you interested in creating a consolidated set of effects with duplicates merged?