neusomatic
Testing pretrained model
Hello, I tested my own data and the dream_challenge data with all the pre-trained models that you provided. The results show that the recall is very low, less than 50%, and I do not know why. Could you provide the test data so that I can verify the models? Thank you.
@bingdiao-zbw The dream_challenge data should give you performance similar to what we presented in the paper. It may be that you are using the wrong alignments or parameters. Would you please explain the details of your dream_challenge test case, the alignment, and the preprocess/call parameters you use?
Thank you for your reply. The parameters are the same as mentioned in the run_test.sh. I found another problem when I run the preprocess.py. a lot of warnings appeared in the file scan.err. #Aligned read number: 146 Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag!
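One way to see whether the "no MD tag" warnings come from the input reads themselves is to count alignment records that lack an `MD:Z:` tag. The following is only a minimal sketch: the function name `count_missing_md` and the two example records are made up for illustration, and the thread does not say which step of the pipeline prints the warning.

```python
def count_missing_md(sam_lines):
    """Return (records, records_without_md) for an iterable of SAM text lines."""
    total = 0
    missing = 0
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        total += 1
        # Optional tags start at column 12 (index 11).
        if not any(tag.startswith("MD:Z:") for tag in fields[11:]):
            missing += 1
    return total, missing


# Hypothetical records: the first carries an MD tag, the second does not.
example = [
    "@HD\tVN:1.4\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\tMD:Z:5\tNM:i:0",
    "r2\t147\tchr1\t200\t60\t5M\t=\t100\t-105\tACGTA\tIIIII\tNM:i:0",
]
print(count_missing_md(example))
```

The same function could be fed the text output of `samtools view` for a small region. If many records turn out to lack the tag, `samtools calmd` is a standard way to regenerate MD tags from the reference.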
After running the call.py, The recall rate is less than 50% but the accuracy rate can reach more than 95%.
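To make the numbers in this report concrete, here is a small sketch of how precision, recall, and F1 relate. It assumes that "accuracy rate" above means precision, and the counts are invented purely for illustration; they are not taken from any run.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: few false calls, but more than half of the true variants missed.
precision, recall, f1 = precision_recall_f1(tp=480, fp=20, fn=520)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

With counts like these, high precision and low recall together still give a much lower F1 than the figure reported in the paper, which is why the low recall matters even when most calls are correct.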
@bingdiao-zbw In the paper, we used the following parameters for the Dream data: --scan_maf 0.01 --min_mapq 10 --snp_min_af 0.03 --snp_min_bq 20 --snp_min_ao 3 --ins_min_af 0.02 --del_min_af 0.02. As you can see in the paper, we get a 96% F1 score for SNVs. I suspect that your input bam file has some discrepancies. Can you share a small region of the bam with me, along with the header?
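For anyone scripting their runs, the parameters quoted above can be kept in one place and expanded into an argument list. This is just a sketch: the names `DREAM_PARAMS` and `build_args` are hypothetical, the thread does not say which script takes each flag, and the required input and output arguments are deliberately left out.

```python
# Flag values exactly as quoted in the reply above.
DREAM_PARAMS = {
    "--scan_maf": "0.01",
    "--min_mapq": "10",
    "--snp_min_af": "0.03",
    "--snp_min_bq": "20",
    "--snp_min_ao": "3",
    "--ins_min_af": "0.02",
    "--del_min_af": "0.02",
}


def build_args(params):
    """Flatten a {flag: value} mapping into a list usable with subprocess."""
    args = []
    for flag, value in params.items():
        args.extend([flag, str(value)])
    return args


# Print the flags as they would appear on a command line.
print(" ".join(build_args(DREAM_PARAMS)))
```

Comparing this printed string against the flags in your own run script is a quick way to spot a parameter that differs from the ones used for the Dream data.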
1 @HD VN:1.4 GO:none SO:coordinate
2 @SQ SN:chr1 LN:249250621 AS:hg19_random.nix
3 @SQ SN:chr2 LN:243199373 AS:hg19_random.nix
4 @SQ SN:chr3 LN:198022430 AS:hg19_random.nix
5 @SQ SN:chr4 LN:191154276 AS:hg19_random.nix
6 @SQ SN:chr5 LN:180915260 AS:hg19_random.nix
7 @SQ SN:chr6 LN:171115067 AS:hg19_random.nix
8 @SQ SN:chr7 LN:159138663 AS:hg19_random.nix
9 @SQ SN:chr8 LN:146364022 AS:hg19_random.nix
10 @SQ SN:chr9 LN:141213431 AS:hg19_random.nix
11 @SQ SN:chr10 LN:135534747 AS:hg19_random.nix
12 @SQ SN:chr11 LN:135006516 AS:hg19_random.nix
13 @SQ SN:chr12 LN:133851895 AS:hg19_random.nix
14 @SQ SN:chr13 LN:115169878 AS:hg19_random.nix
15 @SQ SN:chr14 LN:107349540 AS:hg19_random.nix
16 @SQ SN:chr15 LN:102531392 AS:hg19_random.nix
17 @SQ SN:chr16 LN:90354753 AS:hg19_random.nix
18 @SQ SN:chr17 LN:81195210 AS:hg19_random.nix
19 @SQ SN:chr18 LN:78077248 AS:hg19_random.nix
20 @SQ SN:chr19 LN:59128983 AS:hg19_random.nix
21 @SQ SN:chr20 LN:63025520 AS:hg19_random.nix
22 @SQ SN:chr21 LN:48129895 AS:hg19_random.nix
23 @SQ SN:chr22 LN:51304566 AS:hg19_random.nix
24 @SQ SN:chrX LN:155270560 AS:hg19_random.nix
25 @SQ SN:chrY LN:59373566 AS:hg19_random.nix
26 @SQ SN:chrM LN:16571 AS:hg19_random.nix
27 @SQ SN:chr1_gl000191_random LN:106433 AS:hg19_random.nix
28 @SQ SN:chr1_gl000192_random LN:547496 AS:hg19_random.nix
29 @SQ SN:chr4_ctg9_hap1 LN:590426 AS:hg19_random.nix
30 @SQ SN:chr4_gl000193_random LN:189789 AS:hg19_random.nix
31 @SQ SN:chr4_gl000194_random LN:191469 AS:hg19_random.nix
32 @SQ SN:chr6_apd_hap1 LN:4622290 AS:hg19_random.nix
33 @SQ SN:chr6_cox_hap2 LN:4795371 AS:hg19_random.nix
34 @SQ SN:chr6_dbb_hap3 LN:4610396 AS:hg19_random.nix
35 @SQ SN:chr6_mann_hap4 LN:4683263 AS:hg19_random.nix
36 @SQ SN:chr6_mcf_hap5 LN:4833398 AS:hg19_random.nix
37 @SQ SN:chr6_qbl_hap6 LN:4611984 AS:hg19_random.nix
38 @SQ SN:chr6_ssto_hap7 LN:4928567 AS:hg19_random.nix
39 @SQ SN:chr7_gl000195_random LN:182896 AS:hg19_random.nix
40 @SQ SN:chr8_gl000196_random LN:38914 AS:hg19_random.nix
41 @SQ SN:chr8_gl000197_random LN:37175 AS:hg19_random.nix
42 @SQ SN:chr9_gl000198_random LN:90085 AS:hg19_random.nix
43 @SQ SN:chr9_gl000199_random LN:169874 AS:hg19_random.nix
44 @SQ SN:chr9_gl000200_random LN:187035 AS:hg19_random.nix
45 @SQ SN:chr9_gl000201_random LN:36148 AS:hg19_random.nix
46 @SQ SN:chr11_gl000202_random LN:40103 AS:hg19_random.nix
47 @SQ SN:chr17_ctg5_hap1 LN:1680828 AS:hg19_random.nix
48 @SQ SN:chr17_gl000203_random LN:37498 AS:hg19_random.nix
49 @SQ SN:chr17_gl000204_random LN:81310 AS:hg19_random.nix
50 @SQ SN:chr17_gl000205_random LN:174588 AS:hg19_random.nix
51 @SQ SN:chr17_gl000206_random LN:41001 AS:hg19_random.nix
52 @SQ SN:chr18_gl000207_random LN:4262 AS:hg19_random.nix
53 @SQ SN:chr19_gl000208_random LN:92689 AS:hg19_random.nix
54 @SQ SN:chr19_gl000209_random LN:159169 AS:hg19_random.nix
55 @SQ SN:chr21_gl000210_random LN:27682 AS:hg19_random.nix
56 @SQ SN:chrUn_gl000211 LN:166566 AS:hg19_random.nix
57 @SQ SN:chrUn_gl000212 LN:186858 AS:hg19_random.nix
58 @SQ SN:chrUn_gl000213 LN:164239 AS:hg19_random.nix
59 @SQ SN:chrUn_gl000214 LN:137718 AS:hg19_random.nix
60 @SQ SN:chrUn_gl000215 LN:172545 AS:hg19_random.nix
61 @SQ SN:chrUn_gl000216 LN:172294 AS:hg19_random.nix
62 @SQ SN:chrUn_gl000217 LN:172149 AS:hg19_random.nix
63 @SQ SN:chrUn_gl000218 LN:161147 AS:hg19_random.nix
64 @SQ SN:chrUn_gl000219 LN:179198 AS:hg19_random.nix
65 @SQ SN:chrUn_gl000220 LN:161802 AS:hg19_random.nix
66 @SQ SN:chrUn_gl000221 LN:155397 AS:hg19_random.nix
67 @SQ SN:chrUn_gl000222 LN:186861 AS:hg19_random.nix
68 @SQ SN:chrUn_gl000223 LN:180455 AS:hg19_random.nix
69 @SQ SN:chrUn_gl000224 LN:179693 AS:hg19_random.nix
70 @SQ SN:chrUn_gl000225 LN:211173 AS:hg19_random.nix
71 @SQ SN:chrUn_gl000226 LN:15008 AS:hg19_random.nix
72 @SQ SN:chrUn_gl000227 LN:128374 AS:hg19_random.nix
73 @SQ SN:chrUn_gl000228 LN:129120 AS:hg19_random.nix
74 @SQ SN:chrUn_gl000229 LN:19913 AS:hg19_random.nix
75 @SQ SN:chrUn_gl000230 LN:43691 AS:hg19_random.nix
76 @SQ SN:chrUn_gl000231 LN:27386 AS:hg19_random.nix
77 @SQ SN:chrUn_gl000232 LN:40652 AS:hg19_random.nix
78 @SQ SN:chrUn_gl000233 LN:45941 AS:hg19_random.nix
79 @SQ SN:chrUn_gl000234 LN:40531 AS:hg19_random.nix
80 @SQ SN:chrUn_gl000235 LN:34474 AS:hg19_random.nix
81 @SQ SN:chrUn_gl000236 LN:41934 AS:hg19_random.nix
82 @SQ SN:chrUn_gl000237 LN:45867 AS:hg19_random.nix
83 @SQ SN:chrUn_gl000238 LN:39939 AS:hg19_random.nix
84 @SQ SN:chrUn_gl000239 LN:33824 AS:hg19_random.nix
85 @SQ SN:chrUn_gl000240 LN:41933 AS:hg19_random.nix
86 @SQ SN:chrUn_gl000241 LN:42152 AS:hg19_random.nix
87 @SQ SN:chrUn_gl000242 LN:43523 AS:hg19_random.nix
88 @SQ SN:chrUn_gl000243 LN:43341 AS:hg19_random.nix
89 @SQ SN:chrUn_gl000244 LN:39929 AS:hg19_random.nix
90 @SQ SN:chrUn_gl000245 LN:36651 AS:hg19_random.nix
91 @SQ SN:chrUn_gl000246 LN:38154 AS:hg19_random.nix
92 @SQ SN:chrUn_gl000247 LN:36422 AS:hg19_random.nix
93 @SQ SN:chrUn_gl000248 LN:39786 AS:hg19_random.nix
94 @SQ SN:chrUn_gl000249 LN:38502 AS:hg19_random.nix
95 @RG ID:06d50d01-3235-4301-9130-f4fb64a2bcba PL:illumina PU:71b5428a-1a61-4401-9acf-83ad923e2ff0 LB:bamsurgeon SM:synthetic.challenge.set5.tumour CN:BS
96 @RG ID:0df07e94-8a25-4bc1-946b-dedff511a245 PL:illumina PU:50a5fd6e-5b0e-42c9-aa3a-d5ab637e0319 LB:bamsurgeon SM:synthetic.challenge.set5.tumour CN:BS
97 @RG ID:16254985-5db9-41b2-a3c9-916d364d609c PL:illumina PU:147210f1-db1a-449e-bea1-dfa8ce711273 LB:bamsurgeon SM:synthetic.challenge.set5.tumour CN:BS
98 @RG ID:eb68d7e0-08db-4d13-bc0a-972a970a9fe4 PL:illumina PU:c6a37027-dbcb-4207-b512-85727608725e LB:bamsurgeon SM:synthetic.challenge.set5.tumour CN:BS
99 @PG ID:bamsurgeon PN:bamsurgeon
100 @PG ID:GATK IndelRealigner VN:nightly-2014-04-27-g64280d1 CL:knownAlleles=[(RodBinding name=knownAlleles source=/lustre/users/taewing/gatk_bundle/ucsc/Mills_and_1000G_gold_standard.indels.ucsc.vcf), (RodBinding name=knownAlleles2 source=/lustre/users/taewing/gatk_bundle/ucsc/1000G_phase1.indels.ucsc.vcf)] targetIntervals=/lustre/users/taewing/bams/dream/IS5/tumour.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
101 7cd4cde4-9441-4a47-9cbf-dd6792807cbe 163 chr1 10002 106 7M1D63M31S = 10064 125 AACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCATAACCACACCCCTAACCCAACCCCAACACC C@CFFFFFHHHBHIGBGHIGJGIG@EGEEGCHIDCH<?B@DBBFG=BEGIAD>).=E?>AEEA;>@@@?A############################### MC:Z:38S63M MD:Z:7^A63 RG:Z:eb68d7e0-08db-4d13-bc0a-972a970a9fe4 NM:i:1 MQ:i:106 AS:i:232
102 4d70b67b-ee38-4546-9deb-534c0146c009 99 chr1 10011 52 101M = 10032 122 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAC @@CFFFFDHGDHHIJJJJEEBFGIGCFHIIIGGGIHHAFHIIII@G@FGEHG)=FGDGEHJHDCEEEDFBCAACDC;ABDDCCABB?CCBDB<CCB<?BA@ MC:Z:101M MD:Z:52A48 RG:Z:0df07e94-8a25-4bc1-946b-dedff511a245 NM:i:1 MQ:i:52 AS:i:13
103 c0ecdaa9-ee5e-4851-adca-f3e0d3bc727e 163 chr1 10023 150 5M1I21M1I69M4S = 10066 142 CCTAAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACGCTAACGCTAACGCTAACGCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTA @B<DDFFAB?HDBGAFHIJGH>F?CCBEGHJ@F?GHGIJIDII9?:CFH;;;@CECEB1(9>=@==3>95?9<?<<??CBB(2???B88AB95?CCA#### MC:Z:2S99M MD:Z:42C5C5C5C34 RG:Z:16254985-5db9-41b2-a3c9-916d364d609c NM:i:6 MQ:i:150 AS:i:236
104 6b130ee3-8022-4c44-b559-78174a23651e 163 chr1 10026 40 2S99M = 10030 105 CCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACC @C@FDFFFHGHHHJIGIJFIIJJJJJIGIIIHIIGIJIGIHGGEHGHIIGGIJIJIIGEEHH;CDBEF9@CCDD=C@BDBDCDB?BAB@BA9CB2?AC<BB MC:Z:101M MD:Z:99 RG:Z:eb68d7e0-08db-4d13-bc0a-972a970a9fe4 NM:i:0 MQ:i:40 AS:i:30
105 33666c35-6a44-446e-ae26-2c9c42754a3c 99 chr1 10030 37 31M1I69M = 10036 106 CTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAAC CCCFFFFFHGHHHGIHIGIIIJJJIGIIJJJJIJJJJJAFHHGIDFIJJJDHHIJIHGEHHGFFFFFFEECCDDA=AADB??BBDBDCDDDBCCBB<?CDB MC:Z:25M1I75M MD:Z:100 RG:Z:06d50d01-3235-4301-9130-f4fb64a2bcba NM:i:1 MQ:i:37 AS:i:45
106 6b130ee3-8022-4c44-b559-78174a23651e 83 chr1 10030 40 101M = 10026 -105 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACC A?A>DDDAA9DDDAA9DDA??>DCABA>DDBBA>DEA?@7FHE?C@JIHDC=JIGCC=GGGDD9IHG?B:IIGFF?IIGFDGIHEHFIFFCD=FDEDB=@? MC:Z:2S99M MD:Z:101 RG:Z:eb68d7e0-08db-4d13-bc0a-972a970a9fe4 NM:i:0 MQ:i:40 AS:i:1
107 05d25dc8-788a-4651-bb72-2437e9f3819f 163 chr1 10032 30 1S14M2I21M1D54M1D9M = 10070 107 AAACCCTAACCCTAAAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTACCCTAACCC B@@FFFBEHHHHHI>FGIGHCHEHHG@GGIIGEEDGGGF?HEH>FHIIIHFHGGGIC;?>AEEFED=6@>BBA?CCBD9858=88??<?A<3<A84?<2<8 MC:Z:32S69M MD:Z:35^T54^A9 RG:Z:0df07e94-8a25-4bc1-946b-dedff511a245 NM:i:4 MQ:i:30 AS:i:171
Thank you for your reply. The above is a small region of the bam from dream_challenge set5.
@bingdiao-zbw Thanks for sharing this. First of all, this is a different dataset from the one we used in the paper: this one is Set5, whereas in the paper we used the Dream challenge datasets for stages 3 and 4. Regardless, I would not expect low recall for this dataset. What I suspect now is that the aligner used for this dataset is not "BWA-MEM" (maybe "NovoAlign" or "BWA"). That can cause problems, since we use some features that depend on the aligner. If that is the case, you need to convert the bam to fastq and align again using BWA-MEM.
Looking at the mapping qualities, it seems that this was not aligned with BWA-MEM. And if you downloaded the data from the Dream Challenge website, I can see that they used Novoalign v3.02.05 for the stage 5 data (https://www.synapse.org/#!Synapse:syn312572/wiki/62018).
So, you need to realign with BWA-MEM.
Thank you very much. I will try your suggestion.
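The mapping-quality observation can be turned into a quick check. BWA-MEM reports MAPQ values between 0 and 60, so records with larger values (other than 255, which the SAM format reserves for "not available") suggest a different aligner. The sketch below is only a heuristic under that assumption: the function name `out_of_range_mapqs` is made up, and the records are abbreviated placeholders that carry only the MAPQ values from the reads posted in this thread.

```python
BWA_MEM_MAX_MAPQ = 60   # BWA-MEM reports MAPQ in the range 0..60
MAPQ_UNAVAILABLE = 255  # SAM convention for "mapping quality not available"


def out_of_range_mapqs(sam_lines, max_mapq=BWA_MEM_MAX_MAPQ):
    """Return the MAPQ values that exceed max_mapq in an iterable of SAM text lines."""
    flagged = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # MAPQ is the fifth column
        mapq = int(fields[4])
        if mapq > max_mapq and mapq != MAPQ_UNAVAILABLE:
            flagged.append(mapq)
    return flagged


# Placeholder records using the MAPQ values of the seven reads shared above.
thread_mapqs = [106, 52, 150, 40, 37, 40, 30]
records = [
    f"read{i}\t0\tchr1\t{10000 + i}\t{q}\t*\t*\t0\t0\t*\t*"
    for i, q in enumerate(thread_mapqs)
]
print(out_of_range_mapqs(records))
```

Any value returned here lies outside the BWA-MEM range. The absence of such values would not prove that BWA-MEM was used, so the `@PG` header lines are also worth checking when they name the aligner.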
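A rough sketch of the bam-to-fastq and realignment steps follows. It only builds the shell command strings rather than running them; the helper name `realign_commands`, the file names, and the thread count are all placeholders. It assumes paired-end reads, `samtools` and `bwa` on the PATH, and a reference already indexed with `bwa index`.

```python
import shlex


def realign_commands(in_bam, ref_fasta, out_bam, threads=4, prefix="reads"):
    """Build shell commands to extract paired reads from a BAM and realign them with BWA-MEM."""
    r1 = shlex.quote(f"{prefix}_R1.fastq.gz")
    r2 = shlex.quote(f"{prefix}_R2.fastq.gz")
    bam_in = shlex.quote(in_bam)
    ref = shlex.quote(ref_fasta)
    bam_out = shlex.quote(out_bam)
    # Group mates together, then write read 1 and read 2 to separate files.
    to_fastq = (
        f"samtools collate -u -O {bam_in} | "
        f"samtools fastq -1 {r1} -2 {r2} -0 /dev/null -s /dev/null -n -"
    )
    # Align with BWA-MEM and coordinate-sort the result.
    align = (
        f"bwa mem -t {threads} {ref} {r1} {r2} | "
        f"samtools sort -@ {threads} -o {bam_out} -"
    )
    index = f"samtools index {bam_out}"
    return [to_fastq, align, index]


# Placeholder paths; substitute your own tumor/normal bams and reference.
for cmd in realign_commands("tumor.bam", "hg19.fa", "tumor.bwamem.bam"):
    print(cmd)
```

Both the tumor and the normal bam would need the same treatment. Note that this sketch drops the original read groups; `bwa mem -R` can add one back if downstream tools need it.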