neusomatic

Testing pretrained model

Open bingdiao-zbw opened this issue 4 years ago • 7 comments

Hello, I tested my own data and the dream_challenge data with all the pre-trained models that you provided. The results show that the recall is very low, less than 50%, and I do not know the reason. Can you provide me with the test data so that I can verify the model? Thank you.

bingdiao-zbw avatar Jun 09 '20 01:06 bingdiao-zbw

@bingdiao-zbw The dream_challenge data should give you performance similar to what we presented in the paper. It may be that you are using the wrong alignments or parameters. Would you please explain the details of your dream_challenge test case, the alignment, and the preprocess/call parameters you use?

msahraeian avatar Jun 09 '20 04:06 msahraeian

Thank you for your reply. The parameters are the same as those in run_test.sh. I found another problem when I ran preprocess.py: a lot of warnings appeared in the file scan.err.

#Aligned read number: 146
Warning: no MD tag!
Warning: no MD tag!
Warning: no MD tag!
(the same warning is repeated many more times)
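To see how many records actually lack the MD tag, one option is to scan the SAM text directly. The following is a minimal sketch, assuming tab-delimited SAM lines such as those produced by `samtools view -h`; the function names and the two sample records are hypothetical and are not part of neusomatic.

```python
def has_md_tag(sam_line: str) -> bool:
    """Return True if a SAM alignment line carries an MD:Z: optional field."""
    fields = sam_line.rstrip("\n").split("\t")
    # Columns 1-11 are mandatory in SAM; optional tags start at column 12.
    return any(field.startswith("MD:Z:") for field in fields[11:])


def count_missing_md(sam_lines):
    """Return (alignment_records, records_without_md), skipping header lines."""
    total = 0
    missing = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or @HD/@SQ/@RG/@PG header line
        total += 1
        if not has_md_tag(line):
            missing += 1
    return total, missing


# Hypothetical, shortened records used only to exercise the functions.
example = [
    "@HD\tVN:1.4\tSO:coordinate",
    "read1\t99\tchr1\t10011\t52\t4M\t=\t10032\t122\tACGT\tFFFF\tMD:Z:4\tNM:i:0",
    "read2\t83\tchr1\t10030\t40\t4M\t=\t10026\t-105\tACGT\tFFFF\tNM:i:0",
]
total, missing = count_missing_md(example)
print(f"{missing} of {total} records have no MD tag")
```

If many records turn out to lack the tag, `samtools calmd` can regenerate MD tags from the reference. Whether that is the right fix here is an assumption, since the thread does not confirm the cause of the warnings.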

After running call.py, the recall rate is less than 50%, but the accuracy rate can reach more than 95%.
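For reference, the metrics discussed in this thread can be computed from true-positive, false-positive, and false-negative counts as sketched below. This assumes that "accuracy rate" above means precision. The counts in the example are hypothetical and only illustrate how a caller can be precise while missing many true variants.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from call counts.

    tp: called variants that are in the truth set
    fp: called variants that are not in the truth set
    fn: truth-set variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: few false calls, but many truth variants missed.
precision, recall, f1 = precision_recall_f1(tp=450, fp=20, fn=550)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a low recall pulls F1 down even when precision is high.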

bingdiao-zbw avatar Jun 09 '20 05:06 bingdiao-zbw

@bingdiao-zbw In the paper, we used the following parameters for the Dream data: --scan_maf 0.01 --min_mapq 10 --snp_min_af 0.03 --snp_min_bq 20 --snp_min_ao 3 --ins_min_af 0.02 --del_min_af 0.02. As you may see in the paper, we get a 96% F1 score for SNVs. I suspect that your input BAM file has some discrepancies. Can you share with me a small region of the BAM, along with the header?
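To avoid retyping these values, they can be kept in one place and expanded into command-line arguments. This is a minimal sketch that assumes the flags are passed to preprocess.py, as the mention of run_test.sh suggests. The helper name is hypothetical, and the required input arguments (reference, BAM files, work directory) are deliberately left out and must come from your own setup.

```python
import shlex

# Values quoted in the comment above for the Dream data.
DREAM_PARAMS = {
    "--scan_maf": 0.01,
    "--min_mapq": 10,
    "--snp_min_af": 0.03,
    "--snp_min_bq": 20,
    "--snp_min_ao": 3,
    "--ins_min_af": 0.02,
    "--del_min_af": 0.02,
}


def build_preprocess_args(extra_args=()):
    """Return an argument list for preprocess.py (hypothetical helper).

    extra_args should hold the input/output options for your run;
    none are filled in here.
    """
    args = ["python", "preprocess.py", *extra_args]
    for flag, value in DREAM_PARAMS.items():
        args += [flag, str(value)]
    return args


print(shlex.join(build_preprocess_args()))
```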

msahraeian avatar Jun 15 '20 19:06 msahraeian

@HD     VN:1.4  GO:none SO:coordinate
@SQ     SN:chr1 LN:249250621    AS:hg19_random.nix
@SQ     SN:chr2 LN:243199373    AS:hg19_random.nix
@SQ     SN:chr3 LN:198022430    AS:hg19_random.nix
@SQ     SN:chr4 LN:191154276    AS:hg19_random.nix
@SQ     SN:chr5 LN:180915260    AS:hg19_random.nix
@SQ     SN:chr6 LN:171115067    AS:hg19_random.nix
@SQ     SN:chr7 LN:159138663    AS:hg19_random.nix
@SQ     SN:chr8 LN:146364022    AS:hg19_random.nix
@SQ     SN:chr9 LN:141213431    AS:hg19_random.nix
@SQ     SN:chr10        LN:135534747    AS:hg19_random.nix
@SQ     SN:chr11        LN:135006516    AS:hg19_random.nix
@SQ     SN:chr12        LN:133851895    AS:hg19_random.nix
@SQ     SN:chr13        LN:115169878    AS:hg19_random.nix
@SQ     SN:chr14        LN:107349540    AS:hg19_random.nix
@SQ     SN:chr15        LN:102531392    AS:hg19_random.nix
@SQ     SN:chr16        LN:90354753     AS:hg19_random.nix
@SQ     SN:chr17        LN:81195210     AS:hg19_random.nix
@SQ     SN:chr18        LN:78077248     AS:hg19_random.nix
@SQ     SN:chr19        LN:59128983     AS:hg19_random.nix
@SQ     SN:chr20        LN:63025520     AS:hg19_random.nix
@SQ     SN:chr21        LN:48129895     AS:hg19_random.nix
@SQ     SN:chr22        LN:51304566     AS:hg19_random.nix
@SQ     SN:chrX LN:155270560    AS:hg19_random.nix
@SQ     SN:chrY LN:59373566     AS:hg19_random.nix
@SQ     SN:chrM LN:16571        AS:hg19_random.nix
@SQ     SN:chr1_gl000191_random LN:106433       AS:hg19_random.nix
@SQ     SN:chr1_gl000192_random LN:547496       AS:hg19_random.nix
@SQ     SN:chr4_ctg9_hap1       LN:590426       AS:hg19_random.nix
@SQ     SN:chr4_gl000193_random LN:189789       AS:hg19_random.nix
@SQ     SN:chr4_gl000194_random LN:191469       AS:hg19_random.nix
@SQ     SN:chr6_apd_hap1        LN:4622290      AS:hg19_random.nix
@SQ     SN:chr6_cox_hap2        LN:4795371      AS:hg19_random.nix
@SQ     SN:chr6_dbb_hap3        LN:4610396      AS:hg19_random.nix
@SQ     SN:chr6_mann_hap4       LN:4683263      AS:hg19_random.nix
@SQ     SN:chr6_mcf_hap5        LN:4833398      AS:hg19_random.nix
@SQ     SN:chr6_qbl_hap6        LN:4611984      AS:hg19_random.nix
@SQ     SN:chr6_ssto_hap7       LN:4928567      AS:hg19_random.nix
@SQ     SN:chr7_gl000195_random LN:182896       AS:hg19_random.nix
@SQ     SN:chr8_gl000196_random LN:38914        AS:hg19_random.nix
@SQ     SN:chr8_gl000197_random LN:37175        AS:hg19_random.nix
@SQ     SN:chr9_gl000198_random LN:90085        AS:hg19_random.nix
@SQ     SN:chr9_gl000199_random LN:169874       AS:hg19_random.nix
@SQ     SN:chr9_gl000200_random LN:187035       AS:hg19_random.nix
@SQ     SN:chr9_gl000201_random LN:36148        AS:hg19_random.nix
@SQ     SN:chr11_gl000202_random        LN:40103        AS:hg19_random.nix
@SQ     SN:chr17_ctg5_hap1      LN:1680828      AS:hg19_random.nix
@SQ     SN:chr17_gl000203_random        LN:37498        AS:hg19_random.nix
@SQ     SN:chr17_gl000204_random        LN:81310        AS:hg19_random.nix
@SQ     SN:chr17_gl000205_random        LN:174588       AS:hg19_random.nix
@SQ     SN:chr17_gl000206_random        LN:41001        AS:hg19_random.nix
@SQ     SN:chr18_gl000207_random        LN:4262 AS:hg19_random.nix
@SQ     SN:chr19_gl000208_random        LN:92689        AS:hg19_random.nix
@SQ     SN:chr19_gl000209_random        LN:159169       AS:hg19_random.nix
@SQ     SN:chr21_gl000210_random        LN:27682        AS:hg19_random.nix
@SQ     SN:chrUn_gl000211       LN:166566       AS:hg19_random.nix
@SQ     SN:chrUn_gl000212       LN:186858       AS:hg19_random.nix
@SQ     SN:chrUn_gl000213       LN:164239       AS:hg19_random.nix
@SQ     SN:chrUn_gl000214       LN:137718       AS:hg19_random.nix
@SQ     SN:chrUn_gl000215       LN:172545       AS:hg19_random.nix
@SQ     SN:chrUn_gl000216       LN:172294       AS:hg19_random.nix
@SQ     SN:chrUn_gl000217       LN:172149       AS:hg19_random.nix
@SQ     SN:chrUn_gl000218       LN:161147       AS:hg19_random.nix
@SQ     SN:chrUn_gl000219       LN:179198       AS:hg19_random.nix
@SQ     SN:chrUn_gl000220       LN:161802       AS:hg19_random.nix
@SQ     SN:chrUn_gl000221       LN:155397       AS:hg19_random.nix
@SQ     SN:chrUn_gl000222       LN:186861       AS:hg19_random.nix
@SQ     SN:chrUn_gl000223       LN:180455       AS:hg19_random.nix
@SQ     SN:chrUn_gl000224       LN:179693       AS:hg19_random.nix
@SQ     SN:chrUn_gl000225       LN:211173       AS:hg19_random.nix
@SQ     SN:chrUn_gl000226       LN:15008        AS:hg19_random.nix
@SQ     SN:chrUn_gl000227       LN:128374       AS:hg19_random.nix
@SQ     SN:chrUn_gl000228       LN:129120       AS:hg19_random.nix
@SQ     SN:chrUn_gl000229       LN:19913        AS:hg19_random.nix
@SQ     SN:chrUn_gl000230       LN:43691        AS:hg19_random.nix
@SQ     SN:chrUn_gl000231       LN:27386        AS:hg19_random.nix
@SQ     SN:chrUn_gl000232       LN:40652        AS:hg19_random.nix
@SQ     SN:chrUn_gl000233       LN:45941        AS:hg19_random.nix
@SQ     SN:chrUn_gl000234       LN:40531        AS:hg19_random.nix
@SQ     SN:chrUn_gl000235       LN:34474        AS:hg19_random.nix
@SQ     SN:chrUn_gl000236       LN:41934        AS:hg19_random.nix
@SQ     SN:chrUn_gl000237       LN:45867        AS:hg19_random.nix
@SQ     SN:chrUn_gl000238       LN:39939        AS:hg19_random.nix
@SQ     SN:chrUn_gl000239       LN:33824        AS:hg19_random.nix
@SQ     SN:chrUn_gl000240       LN:41933        AS:hg19_random.nix
@SQ     SN:chrUn_gl000241       LN:42152        AS:hg19_random.nix
@SQ     SN:chrUn_gl000242       LN:43523        AS:hg19_random.nix
@SQ     SN:chrUn_gl000243       LN:43341        AS:hg19_random.nix
@SQ     SN:chrUn_gl000244       LN:39929        AS:hg19_random.nix
@SQ     SN:chrUn_gl000245       LN:36651        AS:hg19_random.nix
@SQ     SN:chrUn_gl000246       LN:38154        AS:hg19_random.nix
@SQ     SN:chrUn_gl000247       LN:36422        AS:hg19_random.nix
@SQ     SN:chrUn_gl000248       LN:39786        AS:hg19_random.nix
@SQ     SN:chrUn_gl000249       LN:38502        AS:hg19_random.nix
@RG     ID:06d50d01-3235-4301-9130-f4fb64a2bcba PL:illumina     PU:71b5428a-1a61-4401-9acf-83ad923e2ff0 LB:bamsurgeon   SM:synthetic.challenge.set5.tumour      CN:BS
@RG     ID:0df07e94-8a25-4bc1-946b-dedff511a245 PL:illumina     PU:50a5fd6e-5b0e-42c9-aa3a-d5ab637e0319 LB:bamsurgeon   SM:synthetic.challenge.set5.tumour      CN:BS
@RG     ID:16254985-5db9-41b2-a3c9-916d364d609c PL:illumina     PU:147210f1-db1a-449e-bea1-dfa8ce711273 LB:bamsurgeon   SM:synthetic.challenge.set5.tumour      CN:BS
@RG     ID:eb68d7e0-08db-4d13-bc0a-972a970a9fe4 PL:illumina     PU:c6a37027-dbcb-4207-b512-85727608725e LB:bamsurgeon   SM:synthetic.challenge.set5.tumour      CN:BS
@PG     ID:bamsurgeon   PN:bamsurgeon
@PG     ID:GATK IndelRealigner  VN:nightly-2014-04-27-g64280d1  CL:knownAlleles=[(RodBinding name=knownAlleles source=/lustre/users/taewing/gatk_bundle/ucsc/Mills_and_1000G_gold_standard.indels.ucsc.vcf), (RodBinding name=knownAlleles2 source=/lustre/users/taewing/gatk_bundle/ucsc/1000G_phase1.indels.ucsc.vcf)] targetIntervals=/lustre/users/taewing/bams/dream/IS5/tumour.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
7cd4cde4-9441-4a47-9cbf-dd6792807cbe    163     chr1    10002   106     7M1D63M31S      =       10064   125     AACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCATAACCACACCCCTAACCCAACCCCAACACC   C@CFFFFFHHHBHIGBGHIGJGIG@EGEEGCHIDCH<?B@DBBFG=BEGIAD>).=E?>AEEA;>@@@?A###############################   MC:Z:38S63M     MD:Z:7^A63      RG:Z:eb68d7e0-08db-4d13-bc0a-972a970a9fe4       NM:i:1  MQ:i:106        AS:i:232
4d70b67b-ee38-4546-9deb-534c0146c009    99      chr1    10011   52      101M    =       10032   122     CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAC   @@CFFFFDHGDHHIJJJJEEBFGIGCFHIIIGGGIHHAFHIIII@G@FGEHG)=FGDGEHJHDCEEEDFBCAACDC;ABDDCCABB?CCBDB<CCB<?BA@   MC:Z:101M       MD:Z:52A48      RG:Z:0df07e94-8a25-4bc1-946b-dedff511a245       NM:i:1  MQ:i:52 AS:i:13
c0ecdaa9-ee5e-4851-adca-f3e0d3bc727e    163     chr1    10023   150     5M1I21M1I69M4S  =       10066   142     CCTAAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACGCTAACGCTAACGCTAACGCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTA   @B<DDFFAB?HDBGAFHIJGH>F?CCBEGHJ@F?GHGIJIDII9?:CFH;;;@CECEB1(9>=@==3>95?9<?<<??CBB(2???B88AB95?CCA####   MC:Z:2S99M      MD:Z:42C5C5C5C34        RG:Z:16254985-5db9-41b2-a3c9-916d364d609c       NM:i:6  MQ:i:150        AS:i:236
6b130ee3-8022-4c44-b559-78174a23651e    163     chr1    10026   40      2S99M   =       10030   105     CCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACC   @C@FDFFFHGHHHJIGIJFIIJJJJJIGIIIHIIGIJIGIHGGEHGHIIGGIJIJIIGEEHH;CDBEF9@CCDD=C@BDBDCDB?BAB@BA9CB2?AC<BB   MC:Z:101M       MD:Z:99 RG:Z:eb68d7e0-08db-4d13-bc0a-972a970a9fe4       NM:i:0  MQ:i:40 AS:i:30
33666c35-6a44-446e-ae26-2c9c42754a3c    99      chr1    10030   37      31M1I69M        =       10036   106     CTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAAC   CCCFFFFFHGHHHGIHIGIIIJJJIGIIJJJJIJJJJJAFHHGIDFIJJJDHHIJIHGEHHGFFFFFFEECCDDA=AADB??BBDBDCDDDBCCBB<?CDB   MC:Z:25M1I75M   MD:Z:100        RG:Z:06d50d01-3235-4301-9130-f4fb64a2bcba       NM:i:1  MQ:i:37 AS:i:45
6b130ee3-8022-4c44-b559-78174a23651e    83      chr1    10030   40      101M    =       10026   -105    CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACC   A?A>DDDAA9DDDAA9DDA??>DCABA>DDBBA>DEA?@7FHE?C@JIHDC=JIGCC=GGGDD9IHG?B:IIGFF?IIGFDGIHEHFIFFCD=FDEDB=@?   MC:Z:2S99M      MD:Z:101        RG:Z:eb68d7e0-08db-4d13-bc0a-972a970a9fe4       NM:i:0  MQ:i:40 AS:i:1
05d25dc8-788a-4651-bb72-2437e9f3819f    163     chr1    10032   30      1S14M2I21M1D54M1D9M     =       10070   107     AAACCCTAACCCTAAAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTACCCTAACCC   B@@FFFBEHHHHHI>FGIGHCHEHHG@GGIIGEEDGGGF?HEH>FHIIIHFHGGGIC;?>AEEFED=6@>BBA?CCBD9858=88??<?A<3<A84?<2<8   MC:Z:32S69M     MD:Z:35^T54^A9  RG:Z:0df07e94-8a25-4bc1-946b-dedff511a245       NM:i:4  MQ:i:30 AS:i:171

bingdiao-zbw avatar Jun 16 '20 06:06 bingdiao-zbw

Thank you for your reply. The information above is a small region of the BAM from the dream_challenge Set5 data.

bingdiao-zbw avatar Jun 16 '20 06:06 bingdiao-zbw

@bingdiao-zbw Thanks for sharing this. First of all, this is a different dataset from the one we used in the paper. This one is Set5, but in the paper we used the Dream challenge datasets for stages 3 and 4. Regardless of this point, I wouldn't expect low recall for this dataset. What I suspect now is that the aligner used for this dataset is not "BWA-MEM" (maybe "NovoAlign" or "BWA"). That can cause problems, since we use some features that depend on the aligner. If that is the case, you need to convert the BAM to FASTQ and align again using BWA-MEM.

Looking at the mapping qualities, it seems that this was not aligned with BWA-MEM. And if you downloaded the data from the Dream Challenge website, I can see that they used Novoalign v3.02.05 for the stage 5 data (https://www.synapse.org/#!Synapse:syn312572/wiki/62018).
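The mapping-quality observation can be checked mechanically: BWA-MEM reports MAPQ values no higher than 60, while the excerpt above contains values such as 106 and 150. A minimal sketch follows, assuming tab-delimited SAM input. The function names are hypothetical, and the demo records are abbreviated stand-ins (placeholder read names, SEQ and QUAL) that carry only the MAPQ values seen in the excerpt.

```python
BWA_MEM_MAX_MAPQ = 60   # highest MAPQ that BWA-MEM assigns
MAPQ_UNAVAILABLE = 255  # SAM convention for "mapping quality not available"


def mapq_values(sam_lines):
    """Yield the MAPQ (column 5) of every alignment record."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank and header lines
        yield int(line.split("\t")[4])


def exceeds_bwa_mem_range(sam_lines):
    """Return True if any record has a MAPQ that BWA-MEM would not emit."""
    return any(
        q > BWA_MEM_MAX_MAPQ and q != MAPQ_UNAVAILABLE
        for q in mapq_values(sam_lines)
    )


# Abbreviated records: only POS, MAPQ and CIGAR follow the excerpt above.
demo = [
    "@HD\tVN:1.4\tSO:coordinate",
    "read1\t163\tchr1\t10002\t106\t7M1D63M31S\t=\t10064\t125\t*\t*",
    "read2\t99\tchr1\t10011\t52\t101M\t=\t10032\t122\t*\t*",
    "read3\t163\tchr1\t10023\t150\t5M1I21M1I69M4S\t=\t10066\t142\t*\t*",
]
print(sorted(set(mapq_values(demo))))
print("outside BWA-MEM range:", exceeds_bwa_mem_range(demo))
```

A `True` result only shows that the file was not produced by BWA-MEM as-is; it does not identify which aligner was used.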

So, you need to realign with BWA-MEM.
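One possible way to carry out the realignment is sketched below. It only builds the shell command strings and does not run them. It assumes `samtools` and `bwa` are installed, paired-end reads, and a reference already indexed with `bwa index`. The helper name, file names, and thread count are hypothetical, and read-group information from the original BAM is not carried over in this sketch.

```python
import shlex


def realign_commands(in_bam, reference, out_bam, threads=4):
    """Return shell commands that convert a BAM to FASTQ and realign it
    with BWA-MEM (hypothetical helper; commands are not executed here)."""
    fq1, fq2 = "reads_1.fq", "reads_2.fq"
    q = shlex.quote
    # Group read pairs together, then write them to two FASTQ files.
    to_fastq = (
        f"samtools collate -u -O {q(in_bam)} | "
        f"samtools fastq -1 {fq1} -2 {fq2} -0 /dev/null -s /dev/null -n -"
    )
    # Align with BWA-MEM and coordinate-sort the output.
    align = (
        f"bwa mem -t {threads} {q(reference)} {fq1} {fq2} | "
        f"samtools sort -@ {threads} -o {q(out_bam)} -"
    )
    index = f"samtools index {q(out_bam)}"
    return [to_fastq, align, index]


for cmd in realign_commands("tumor.bam", "hg19.fa", "tumor.bwamem.bam"):
    print(cmd)
```

The same steps would presumably need to be applied to both the tumor and the normal BAM before rerunning preprocess.py and call.py.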

msahraeian avatar Jun 18 '20 23:06 msahraeian

Thank you very much. I will try your suggestion.

bingdiao-zbw avatar Jun 19 '20 00:06 bingdiao-zbw