BALSAMIC
feat: deduplicate with UMIs
This PR:
Blocked by:
- Update Sentieon: https://github.com/Clinical-Genomics/BALSAMIC/issues/1384
This change requires updating Sentieon, because the version currently in production, `sentieon-genomics-202010.02`, does not contain the consensus option for LocusCollector, which is the step prior to Dedup:
Current prod:
Latest Sentieon version:
Background: Updating Sentieon to this version would let us use UMIs directly in the dedup step. This could rescue a significant number of reads that a purely position-based approach wrongly discards as duplicates.
It might also serve as the basis for the MRD workflow, which needs to reach very low VAFs. That may not be possible with the "3,1,1" approach in the UMI workflow, yet the MRD workflow still requires some UMI error correction, which this solution offers.
For more info see user story: https://github.com/Clinical-Genomics/BALSAMIC/issues/1361
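
To illustrate the background above, here is a minimal toy sketch of why position-only deduplication can discard reads that UMI-aware deduplication keeps. It does not use Sentieon and is not how Sentieon implements Dedup; the read tuples, the `dedup` helper, and the exact-match UMI grouping (no error correction) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy reads: (chrom, start, strand, umi). Not Sentieon output.
reads = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGT"),  # same position and same UMI: a true PCR duplicate
    ("chr1", 100, "+", "TTGA"),  # same position, different UMI: a distinct molecule
    ("chr1", 250, "-", "ACGT"),
]


def dedup(reads, use_umi):
    """Keep one read per group; the group key optionally includes the UMI."""
    groups = defaultdict(list)
    for chrom, start, strand, umi in reads:
        key = (chrom, start, strand, umi) if use_umi else (chrom, start, strand)
        groups[key].append((chrom, start, strand, umi))
    # Keep the first read of each group as its representative.
    return [members[0] for members in groups.values()]


position_only = dedup(reads, use_umi=False)
umi_aware = dedup(reads, use_umi=True)

print(len(position_only), len(umi_aware))  # 2 3
```

In this toy input, the read with UMI `TTGA` is dropped by the position-only grouping but kept once the UMI is part of the key, which is the kind of read this PR aims to rescue.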
Issues to consider:
- [ ] Percent duplicate metrics:
After implementing this, we can no longer retrieve the % Duplicates and Optical Duplicates values from the ".metrics" file produced by dedup; both are reported as "0". How can this be fixed? I have notified Sentieon about this issue, and they say they will add more useful statistics to the report in the next version.
- [x] Collapsed singleton or pairs?
Are the final collapsed reads kept as pairs or as singletons (as in the other Sentieon UMI collapse tool)? They are singletons, and Sentieon says this is intentional and beneficial.
Listing changes
- Added: for new features.
- Changed: for changes in existing functionality.
- Deprecated: for soon-to-be removed features.
- Removed: for now removed features.
- Fixed: for any bug fixes.
- Security: in case of vulnerabilities.
Review and tests:
- [ ] Tests pass
- [ ] Code review
- [ ] New code is executed and covered by tests, and the tests approve