deepsomatic
Performance regression with a different HCC1395 Illumina library
Hi,
First of all, thank you for open sourcing DeepSomatic. Hope that it is going to be a useful tool for the community moving forward.
I am unable to reproduce the variant calling performance shown on the Illumina case study page in the docs when using a different library of the same HCC1395 T/N sample pair from the SEQC2 consortium. WGS_NS_T_1 (https://www.ncbi.nlm.nih.gov/sra/SRX4728475) and WGS_NS_N_1 (https://www.ncbi.nlm.nih.gov/sra/SRX4728425) were used as the T/N pair.
- You can find the whole-genome T/N (85x/70x) BAMs for WGS_NS_T_1 and WGS_NS_N_1 here: gs://lancet2-test-datasets/SEQC2/single_library
- We downsampled the BAMs to the same T/N coverage as in the case study to check whether coverage makes a difference. (It didn't.)
- Two runs of the case study data were evaluated with two different tools (RTG vcfeval and hap.py) to show that the choice of evaluation tool doesn't change the performance results.
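For anyone wanting to repeat the downsampling step, here is a minimal sketch of how the fraction of reads to keep can be derived from the mean coverages. The function name `downsample_fraction` is hypothetical, and the target coverage is an input you would take from the case study page; it is not stated in this issue. The resulting fraction could then be passed to a subsampling tool such as `samtools view -s`.

```python
def downsample_fraction(current_cov: float, target_cov: float) -> float:
    """Fraction of reads to keep so mean coverage drops from current_cov to target_cov.

    Hypothetical helper for illustration; assumes coverage scales linearly
    with the number of reads retained.
    """
    if current_cov <= 0 or target_cov <= 0:
        raise ValueError("coverage values must be positive")
    # Never upsample: keep everything if the BAM is already at or below target.
    return min(1.0, target_cov / current_cov)


if __name__ == "__main__":
    # 85x is the tumor coverage reported above; the target is a placeholder,
    # substitute the tumor coverage listed in the case study.
    placeholder_target = 42.5
    print(downsample_fraction(85.0, placeholder_target))
```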
Results from all the runs are summarized in this Google Sheet. Red cells highlight the significant loss in precision when using the different library, compared with the dataset provided in the case study example, which is shown in green. https://docs.google.com/spreadsheets/d/1ReOMR85lPvC_Y6xZCPiY1SOFRDmQ5NXTSMUL0rOaaNA/edit?usp=sharing
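To be explicit about the metrics being compared, this is a minimal sketch of the standard precision, recall, and F1 definitions computed from true-positive, false-positive, and false-negative counts. The function name `somatic_metrics` is hypothetical, and the counts in the usage line are made-up placeholders, not values from the sheet.

```python
def somatic_metrics(tp: int, fp: int, fn: int) -> dict:
    """Standard precision/recall/F1 from call counts (illustrative helper)."""
    # Guard against empty denominators so an empty call set yields 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Placeholder counts for illustration only.
    print(somatic_metrics(tp=90, fp=10, fn=30))
```

A drop in precision at similar recall means more false-positive calls relative to the truth set, which is the pattern flagged in red.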
- Do you have any thoughts or suggestions on why there is such a big difference in precision when using a different library of the same sample? Is this expected? If there is anything I am missing or doing incorrectly, please do let me know.
- It might be useful in general to document which specific library (among the SEQC2 consortium datasets - https://sites.google.com/view/seqc2/home/sequencing) is being used in the case study.