cbioportal
Rebased sv on master
Fix # (see https://help.github.com/en/articles/closing-issues-using-keywords)
Describe changes proposed in this pull request:
Permanently removes support for fusions. Fusions are now supported as "structural variants" inside the structural_variants table. Queries to the backend API no longer depend on the mutation table. *_fusion profiles are replaced with *_structural_variants profiles.
Migration
The Python migration script now has a check for this migration step (2.12.14): before migrating to the new SV model, we check for fusion mutation events and exit if any are present. We cannot make any assumptions about collapsing imported fusion events (e.g. TMPRSS2-EGFR was imported as both TMPRSS2-EGFR and EGFR-TMPRSS2; which one would we keep?), and there are other issues such as renaming descriptions/ids for fusion profiles to structural variant profiles.
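The guard described above can be sketched roughly as follows. This is an illustration only, not the code in the migration script: the `mutation_event` table, the `mutation_type` column, and the function names are assumptions, and an in-memory SQLite database stands in for the real MySQL connection so the snippet is self-contained.

```python
import sqlite3

# Assumed table and column names, for illustration only.
FUSION_COUNT_QUERY = (
    "SELECT COUNT(*) FROM mutation_event WHERE mutation_type = 'Fusion'"
)


class FusionEventsPresent(Exception):
    """Raised when fusion events are found and the migration must stop."""


def count_fusion_events(connection):
    """Return the number of fusion events still stored as mutations."""
    cursor = connection.cursor()
    cursor.execute(FUSION_COUNT_QUERY)
    return cursor.fetchone()[0]


def ensure_no_fusion_events(connection):
    """Refuse to continue the SV migration if any fusion events exist."""
    count = count_fusion_events(connection)
    if count > 0:
        raise FusionEventsPresent(
            "Found %d fusion event(s); remove or re-import them as "
            "structural variants before migrating." % count
        )


def demo_connection(mutation_types):
    """Build a throwaway in-memory database holding the given mutation types."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mutation_event ("
        "mutation_event_id INTEGER PRIMARY KEY, mutation_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO mutation_event (mutation_type) VALUES (?)",
        [(mutation_type,) for mutation_type in mutation_types],
    )
    return conn
```

A command-line script would typically catch `FusionEventsPresent`, print the message, and exit with a non-zero status rather than let the exception propagate.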
The SV table was updated to replace exon with region, region number, and contig. The only technically required fields are sample id, SV status, and a valid gene (either site 1 or site 2). The old code only checked site 1 and duplicated records (by flipping); we now validate a record by checking site 1 or site 2 and import it if a gene is found. SV status defaults to SOMATIC on import if no value is provided.
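As a rough sketch of the import rule just described (the importer itself is Java; this is not its code), the check could look like the following. The column names `Sample_Id`, `SV_Status`, `Site1_Hugo_Symbol`, and `Site2_Hugo_Symbol`, the `known_genes` set, and the function name are assumptions made for illustration.

```python
DEFAULT_SV_STATUS = "SOMATIC"


def normalize_sv_record(record, known_genes):
    """Return a cleaned copy of an SV record, or None if it should be skipped.

    record      -- dict mapping (assumed) column names to raw string values
    known_genes -- set of gene symbols the portal recognizes (hypothetical)
    """
    sample_id = (record.get("Sample_Id") or "").strip()
    if not sample_id:
        return None  # sample id is required

    site1_gene = (record.get("Site1_Hugo_Symbol") or "").strip()
    site2_gene = (record.get("Site2_Hugo_Symbol") or "").strip()
    # A record is valid if either site names a recognized gene.
    if site1_gene not in known_genes and site2_gene not in known_genes:
        return None

    cleaned = dict(record)
    cleaned["Sample_Id"] = sample_id
    # Default the status instead of rejecting the record.
    status = (record.get("SV_Status") or "").strip()
    cleaned["SV_Status"] = status or DEFAULT_SV_STATUS
    return cleaned
```

Note that the record is kept once, as written in the file; nothing is flipped or duplicated to cover the site 2 gene.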
Explicit Fixes
- Queries were not working at all: they used the wrong sample identifier (stable vs. internal) to query the structural variant table (e.g. using P-0000001 instead of 1234567).
- Queries were double counting records where the site1 and site2 genes were the same (e.g. a fusion of EGFR to EGFR (different location?)).
- The service was not returning alteration enrichments for structural variants; updated the code to check for a structural variant filter.
- Issues with level of specificity for profiled samples. Previously, there was a hack in place specifically for fusions to use case lists (and this was also because we were using the mutation profile for fusions on import). This is now fixed by requiring a gene panel matrix file (if you want to specify profiled samples) -- added support for WXS/WGS/NA so that this count can work for non-targeted sequencing studies.
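To illustrate the double-counting fix in the list above, here is a minimal sketch of counting SV records per gene so that a record whose two sites name the same gene contributes once. The actual fix lives in the backend queries; the function name and the `(site1_gene, site2_gene)` tuple layout are hypothetical.

```python
from collections import defaultdict


def count_sv_records_per_gene(records):
    """Count SV records per gene, once per record even if both sites match.

    records -- iterable of (site1_gene, site2_gene) tuples; either entry
               may be None when that site has no gene.
    """
    counts = defaultdict(int)
    for site1_gene, site2_gene in records:
        # The set collapses EGFR-EGFR into a single gene for this record.
        for gene in {site1_gene, site2_gene} - {None}:
            counts[gene] += 1
    return dict(counts)
```

For example, the records `("EGFR", "EGFR")` and `("TMPRSS2", "EGFR")` together count EGFR twice (once per record), not three times.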
Other Issues
- Existing issue where samples are double counted if they are present in the study but profiled under different panels for different genetic profiles. This behavior came up when SV had no code that specified a gene panel, so SV data was imported with NULL, unlike the same samples in the mutation profile. The behavior is still present, but should be slightly improved now that curators have agreed to add a structural_variants column to the gene_panel_matrix file (it could still come up if samples are specified to have different panels, but I think that behavior is what we want).
Checks
- [ ] Runs on heroku
- [ ] Has tests or has a separate issue that describes the types of test that should be created. If no test is included it should explicitly be mentioned in the PR why there is no test.
- [ ] The commit log is comprehensible. It follows 7 rules of great commit messages. For most PRs a single commit should suffice, in some cases multiple topical commits can be useful. During review it is ok to see tiny commits (e.g. Fix reviewer comments), but right before the code gets merged to master or rc branch, any such commits should be squashed since they are useless to the other developers. Definitely avoid merge commits, use rebase instead.
- [ ] Is this PR adding logic based on one or more clinical attributes? If yes, please make sure validation for this attribute is also present in the data validation / data loading layers (in backend repo) and documented in File-Formats Clinical data section!
Any screenshots or GIFs?
If this is a new visual feature please add a before/after screenshot or gif here with e.g. Giphy CAPTURE or Peek
Notify reviewers
Read our Pull request merging policy. It can help to figure out who worked on the file before you. Please use `git blame <filename>` to determine that and notify them either through Slack or by assigning them as a reviewer on the PR.
I have added a fix for the compile bug (a typo in ImportGenePanelProfileMap.java). Now I am going through and eliminating (12) code smells. However, I also found references to the Site1_Exon column in some test files and I'm trying to purge those. Plus, I think the unit tests for the import function are missing tests for the new fields in an SV file, so I'm adding those. validateData.py and core/src/test/scripts/unit_tests_validate_data.py were also not updated and still refer to exons.
Code included as of this date looks good. Still needed before merging:
- [ ] Updates to validateData.py and associated unit tests (and input data files such as core/src/test/scripts/test_data/data_structural_variants_exon_not_in_transcript.txt and core/src/test/scripts/test_data/data_structural_variants_missing_values.txt)
- [ ] doc updates (docs/File-Formats.md in particular)