Exomiser
Possible changes to Genomiser filtering and viz
Current non-coding (nc) variant behaviour is:
1. Any variant that is found in the regulatory_regions db table of known regulatory regions (from FANTOM and the Ensembl regulatory build) AND has an effect of INTERGENIC_VARIANT or UPSTREAM_GENE_VARIANT gets its effect changed to REGULATORY_REGION_VARIANT. This is to stop it being removed in step 3 (not sure whether Jannovar ever assigns this effect itself - I don't think so)
2. Any variant with an effect of REGULATORY_REGION_VARIANT gets reassigned to the gene in the TAD with the best pheno score
3. The regulatoryFeature filter removes any variant that has an effect of INTERGENIC_VARIANT or UPSTREAM_GENE_VARIANT AND is >= 20 kb away from a gene
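To make the order of these steps easier to discuss, here is a minimal Python sketch of the current behaviour as described above. It is not Exomiser code: the `Variant` class, the function names and the region/TAD data shapes are all hypothetical, and only the effect names and the 20 kb cut-off come from this issue.

```python
from dataclasses import dataclass, replace

# Effects that are relabelled (step 1) and distance-filtered (step 3).
DISTANCE_FILTERED_EFFECTS = {"INTERGENIC_VARIANT", "UPSTREAM_GENE_VARIANT"}


@dataclass(frozen=True)
class Variant:  # hypothetical stand-in for a variant record
    chrom: str
    pos: int
    effect: str
    gene: str
    distance_to_gene: int  # distance in bp to the nearest gene


def in_regulatory_region(variant, regions):
    """regions: iterable of (chrom, start, end) tuples, inclusive coordinates."""
    return any(
        chrom == variant.chrom and start <= variant.pos <= end
        for chrom, start, end in regions
    )


def relabel_regulatory(variant, regions):
    """Step 1: relabel intergenic/upstream variants in known regulatory regions."""
    if variant.effect in DISTANCE_FILTERED_EFFECTS and in_regulatory_region(variant, regions):
        return replace(variant, effect="REGULATORY_REGION_VARIANT")
    return variant


def reassign_to_best_tad_gene(variant, tad_gene_scores):
    """Step 2: tad_gene_scores maps each gene in the variant's TAD to its pheno score."""
    if variant.effect == "REGULATORY_REGION_VARIANT" and tad_gene_scores:
        best_gene = max(tad_gene_scores, key=tad_gene_scores.get)
        return replace(variant, gene=best_gene)
    return variant


def passes_regulatory_feature_filter(variant, max_distance=20_000):
    """Step 3: fail intergenic/upstream variants that are >= 20 kb from a gene."""
    return not (
        variant.effect in DISTANCE_FILTERED_EFFECTS
        and variant.distance_to_gene >= max_distance
    )
```

Because step 1 changes the effect before step 3 runs, a variant in a FANTOM/Ensembl region survives the distance filter however far it is from a gene, which is the point made in step 1.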
Max's preferred behaviour
1. Reassign variants to the best gene in the TAD for most nc variant effects:
   CODING_TRANSCRIPT_INTRON_VARIANT, CONSERVED_INTERGENIC_VARIANT, CONSERVED_INTRON_VARIANT, DOWNSTREAM_GENE_VARIANT, INTERGENIC_REGION, INTERGENIC_VARIANT, INTRAGENIC_VARIANT, INTRON_VARIANT, NON_CODING_TRANSCRIPT_INTRON_VARIANT, REGULATORY_REGION_VARIANT, TF_BINDING_SITE_VARIANT, UPSTREAM_GENE_VARIANT
2. Don't filter variants based on being >= 20 kb from a gene and not in a FANTOM/Ensembl reg build feature, but rather remove variants with ReMM < 0.5 instead.
- I think we can achieve (2) with minimal code changes by introducing a pathogenicityScoreFilter or remmScoreFilter, so that users can optionally skip the regulatoryFeatureFilter and use a remmScore > 0.5 filter instead. This way, variants in FANTOM/Ensembl regulatory regions will still get flagged as REGULATORY_REGION_VARIANT for display purposes.
- Still need to decide whether to update the list of variant effects used for TAD gene reassignment, or maybe make it user-configurable?
- If we do it this way, with the old behaviour still possible, then we don't need to worry so much about repeating the whole simulated genomes benchmarking we did in the original paper, i.e. users could test both options on their own datasets and make a decision based on compute time, number of variants returned and identification of known diagnostic nc variants
- First attempt at running it with Max's suggestions on a WGS of 6,943,867 variants:
- Took 50 mins to run with 50 GB. Output results for 2,357 genes and 23,367 variants, compared to 4,521 genes and 43,884 variants when running it the usual way in 32 mins
- New top hit: the intronic variant 2 29538028 G C is now assigned to C2orf71 rather than ALK
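As a rough illustration of the optional filter proposed above, the sketch below lets the caller choose between the existing distance-based rule and a ReMM score cut-off. Everything here is hypothetical (the `ScoredVariant` class, the function names and the `use_remm` switch are not existing Exomiser API), and treating a score of exactly 0.5 as a pass is an assumption, since the discussion mentions both "< 0.5" and "> 0.5".

```python
from dataclasses import dataclass

DISTANCE_FILTERED_EFFECTS = {"INTERGENIC_VARIANT", "UPSTREAM_GENE_VARIANT"}


@dataclass(frozen=True)
class ScoredVariant:  # hypothetical stand-in for a variant record
    effect: str
    distance_to_gene: int  # distance in bp to the nearest gene
    remm_score: float      # ReMM score between 0 and 1


def regulatory_feature_filter(variant, max_distance=20_000):
    """Old behaviour: fail intergenic/upstream variants >= 20 kb from a gene."""
    return not (
        variant.effect in DISTANCE_FILTERED_EFFECTS
        and variant.distance_to_gene >= max_distance
    )


def remm_score_filter(variant, min_score=0.5):
    """Proposed behaviour: fail variants with a ReMM score below the cut-off.

    Assumption: a score exactly equal to min_score passes.
    """
    return variant.remm_score >= min_score


def filter_variants(variants, use_remm=False):
    """Apply exactly one of the two filters, chosen by the user."""
    keep = remm_score_filter if use_remm else regulatory_feature_filter
    return [v for v in variants if keep(v)]
```

With this shape, the relabelling to REGULATORY_REGION_VARIANT can stay untouched for display purposes, and users can run both settings on the same dataset to compare run time and the number of variants returned.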
Peter's preferred display behaviour
- More detail on what REGULATORY_REGION_VARIANT means and/or a better name, as the other types are regulatory variants as well. Could we link to a suitable external resource such as the Ensembl regulatory build? E.g. http://grch37.ensembl.org/Homo_sapiens/Location/View?r=6%3A7261639-7261639 where the regulatory build is a default track and shows that this variant, marked as regulatory_region_variant for RREB1, is predicted to be in a Promoter Flanking Region. Linking to http://grch37.ensembl.org/Homo_sapiens/Regulation/Summary?db=core;fdb=funcgen;r=6:7261639-7261639;rf=ENSR00000260522 would give more relevant detail, but I'm not sure how to automate it. We would have to store the ENSR ids in the db table
- Provide a more detailed breakdown and viz of the various UTR effects such as upstream ORFs, Kozak sequences etc
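On automating the Ensembl links mentioned in the first point: if the ENSR ids were stored alongside each region, both URL forms quoted above could be generated from the variant position. The sketch below simply reproduces the two example URLs from this issue; the function names are made up, and it assumes the GRCh37 site and that an ENSR id is available for the variant.

```python
from urllib.parse import quote

ENSEMBL_GRCH37 = "http://grch37.ensembl.org/Homo_sapiens"


def location_view_url(chrom, pos):
    """Link to the Location view, where the regulatory build is a default track."""
    region = f"{chrom}:{pos}-{pos}"
    # Percent-encode the colon, as in the example URL in this issue.
    return f"{ENSEMBL_GRCH37}/Location/View?r={quote(region, safe='')}"


def regulation_summary_url(chrom, pos, ensr_id):
    """Link to the Regulation summary for one regulatory feature (needs its ENSR id)."""
    return (
        f"{ENSEMBL_GRCH37}/Regulation/Summary"
        f"?db=core;fdb=funcgen;r={chrom}:{pos}-{pos};rf={ensr_id}"
    )


print(location_view_url("6", 7261639))
print(regulation_summary_url("6", 7261639, "ENSR00000260522"))
```

The first link needs nothing beyond chromosome and position, so it could be added without any db change; only the second depends on storing the ENSR ids.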
@visze @julesjacobsen @pnrobinson I combined the 3 previous issues discussing this into one new issue, as they are all inter-related, to simplify the discussion!