neusomatic icon indicating copy to clipboard operation
neusomatic copied to clipboard

failed at the last step

Open cbiitbian opened this issue 4 years ago • 23 comments

example sample ran well, but when run with a pair of real data set, neusomatic failed at the last step with error:

NeuSomatic stand-alone FAILED: Files ../example/work_standalone/NeuSomatic_standalone.vcf and ../NeuSomatic_standalone.vcf Are Different!

NeuSomatic ensemble FAILED: Files ../example/work_ensemble/NeuSomatic_ensemble.vcf and ../NeuSomatic_ensemble.vcf Are Different!

Please advise what could be wrong

cbiitbian avatar Nov 24 '19 04:11 cbiitbian

I see what is wrong here. The last step of comparison is for the testing sample to make sure testing process run successfully. Can I substitute the input files with my sample sets including normal bam, tumor bam, bed and reference genome and consider the result vcf at /example/work_standalone/NeuSomatic_standalone.vcf and /example/work_ensemble/NeuSomatic_ensemble.vcf are the variant results for my data?

cbiitbian avatar Nov 24 '19 05:11 cbiitbian

@cbiitbian happy to see your interest in NeuSomatic. As you pointed out the test script will only succeed if it is used on test data. For your real application, you should follow the instructions in README. The inputs/models/parameters to use may need modification depending on your data. Particularly, 1- I would recommend to use newer trained models (NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth for standalone and NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth for ensemble), more information here. Depending how different your data is from our training set you may be interested to train the network too. 2- For ensemble mode, you need to have somatic calls from five other tools (MuTect2, VarDict, Strelka2, SomaticSniper, and MuSE) for your samples, you can find instruction for that here 3- The reference/regions/minimum AFs can also be adjusted as explained https://github.com/bioinform/neusomatic#example-usage. As discussed there, for inference, you need to follow three steps (preprocess/call/postprocess).

Please let me know if you need more information

msahraeian avatar Nov 24 '19 08:11 msahraeian

Hi, Msahraeian Thanks for the quick response. I followed your instruction and successfully rerun a standalone test of chr1 of my sample data, but I have a couple of questions I’d like to get your help.

  1. I basically use the same code as the testing script, but substitute the ref genome, bed, tumor bam and normal bam, I use the NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth as pretrained model and ran the three steps preprocess/call/postprocess, is this the right way?
  2. Is “work_standalone/NeuSomatic_standalone.vcf” the final variant call result? The reason I am asking is because the number of variant (270) is almost triple that in the grand truth (95), which represent a lot of false positive even before a comparison. Maybe the parameters need to adjust or the pretrained model is not a good match?
  3. I saw in the paper, you use cll sample and got very good results. I al so have the sample and grand truth, I’d like to follow your protocol to do it, I also have the DREAM sets mentioned in your paper. Would you let me know the parameters and trained model you use so that I can re-run?
  4. If it is necessary, I’d like to learn bamsuegeon and produce my own trained model, so any information about how you did spike in would also be greatly appreciated.

Thanks very much! Xiaopeng Bian CBIIT/NCI

From: msahraeian [email protected] Sent: Sunday, November 24, 2019 3:17 AM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian happy to see your interest in NeuSomatic. As you pointed out the test script will only succeed if it is used on test data. For your real application, you should follow the instructions in README. The inputs/models/parameters to use may need modification depending on your data. Particularly, 1- I would recommend to use newer trained models (NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth for standalone and NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth for ensemble), more information herehttps://github.com/bioinform/neusomatic#trained-network-models. Depending how different your data is from our training set you may be interested to train the network too. 2- For ensemble mode, you need to have somatic calls from five other tools (MuTect2, VarDict, Strelka2, SomaticSniper, and MuSE) for your samples, you can find instruction for that herehttps://github.com/bioinform/neusomatic#ensemble-mode 3- The reference/regions/minimum AFs can also be adjusted as explained https://github.com/bioinform/neusomatic#example-usage. As discussed there, for inference, you need to follow three steps (preprocess/call/postprocess).

Please let me know if you need more information

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTN4QGZ4TQKYUSKNS6DQVIZ7ZA5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFAGCNA#issuecomment-557867316, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTOTSR7HFE3VTUF4NP3QVIZ7ZANCNFSM4JQ5LG4A.

cbiitbian avatar Nov 26 '19 04:11 cbiitbian

@cbiitbian

  1. Yes, it should be the right steps to go. You may also set the number of threads based on your resources. Another important note is that on the test script I used min AF of 5%. If it is too high for your sample, you can modify the parameters. For our new SEQC-II robustness analysis paper, we use the following setting: --scan_maf 0.01 --min_mapq 10 --snp_min_af 0.03 --snp_min_bq 15 --snp_min_ao 3 --ins_min_af 0.02 --del_min_af 0.02.
  2. Each variant call is assigned a quality score in the VCF. Based on this score we categorized the calls to three classes PASS (score>=0.7), LowQUAL (0.4<=score<0.7), and REJECT (score<0.4). This is the default limits we use. You can modify these limits in the postprocessing step using input parameters. You should then use the PASS calls for analysis.
  3. For real samples in NatureComm paper, we used DREAM challenge model NeuSomatic_v0.1.3_standalone_Dream3.pth to infer. There, we used 0.97 pass threshold for CLL1 and COLO-829 samples. For the DREAM challenge stage 3 we also used DREAM challenge model to infer with 0.7 pass threshold. For the SEQC-II robustness analysis paper, we used the default 0.7 pass threshold for whole analysis. With the new SEQC trained models, 0.7 pass threshold should give you similar accuracy on CLL1 and COLO samples.
  4. You can find the information on how to use BAMSurgeon to spike in mutations here.

Also note that the best performance is usually achieved in the ensemble mode.

msahraeian avatar Nov 26 '19 05:11 msahraeian

Hi, Another question, I am running WES (Whole Exome Sequencing) sample, do you have trained model for that or I can use WGS model? I saw you have a older model for WEX, “NeuSomatic_v0.1.0_standalone_WEX_100purity.pth”, is that a WES model I can use? Thanks. Xiaopeng

From: msahraeian [email protected] Sent: Tuesday, November 26, 2019 12:51 AM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian

  1. Yes, it should be the right steps to go. You may also set the number of threads based on your resources. Another important note is that on the test script I used min AF of 5%. If it is too high for your sample, you can modify the parameters. For our new SEQC-II robustness analysis paperhttps://doi.org/10.1101/667261, we use the following setting: --scan_maf 0.01 --min_mapq 10 --snp_min_af 0.03 --snp_min_bq 15 --snp_min_ao 3 --ins_min_af 0.02 --del_min_af 0.02.
  2. Each variant call is assigned a quality score in the VCF. Based on this score we categorized the calls to three classes PASS (score>=0.7), LowQUAL (0.4<=score<0.7), and REJECT (score<0.4). This is the default limits we use. You can modify these limits in the postprocessing step using input parameters. You should then use the PASS calls for analysis.
  3. For real samples in NatureComm paperhttps://doi.org/10.1038/s41467-019-09027-x, we used DREAM challenge model NeuSomatic_v0.1.3_standalone_Dream3.pth to infer. There, we used 0.97 pass threshold for CLL1 and COLO-829 samples. For the DREAM challenge stage 3 we also used DREAM challenge model to infer with 0.7 pass threshold. For the SEQC-II robustness analysis paperhttps://doi.org/10.1101/667261, we used the default 0.7 pass threshold for whole analysis. With the new SEQC trained modelshttps://github.com/bioinform/neusomatic#trained-network-models, 0.7 pass threshold should give you similar accuracy on CLL1 and COLO samples.
  4. You can find the information on how to use BAMSurgeon to spike in mutations herehttps://github.com/bioinform/neusomatic#creating-training-data.

Also note that the best performance is usually achieved in the ensemble mode.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTLIZH4U3IUGHOKONE3QVS2N7A5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFEZVWI#issuecomment-558471897, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTKKEVTCRAFPKO45QODQVS2N7ANCNFSM4JQ5LG4A.

cbiitbian avatar Nov 28 '19 06:11 cbiitbian

@cbiitbian The NeuSomatic_v0.1.0_standalone_WEX_100purity.pth is the model we used in original paper for WES. We have also applied our WGS trained model NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth on WES data in the SEQC paper and it seems to perform good.

msahraeian avatar Nov 28 '19 23:11 msahraeian

Thanks, I am trying both model based on your information. The process terminated due to error. I could not figure out how to fix it. I am attaching the error log for the run and a log file suggested by the process log: Please check error log at work_standalone/work_tumor/work.212/scan.err Could you please have a look and give me some advice on how to fix it? Thanks very much. Xiaopeng Bian

From: msahraeian [email protected] Sent: Thursday, November 28, 2019 6:53 PM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian The NeuSomatic_v0.1.0_standalone_WEX_100purity.pth is the model we used in original paper for WES. We have also applied our WGS trained model NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth on WES data in the SEQC paper and it seems to perform good.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTMRYANQS732LXTTSX3QWBKVTA5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFNSFPI#issuecomment-559620797, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTLQEZ6567BDYY3NX43QWBKVTANCNFSM4JQ5LG4A.

cbiitbian avatar Dec 01 '19 17:12 cbiitbian

@cbiitbian I cannot see the attachment. Would you please attach the scan.err file again?

msahraeian avatar Dec 01 '19 21:12 msahraeian

Here is all that in the scan.err file: terminate called after throwing an instance of 'std::ios_base::failure[abi:cxx11]' what(): basic_ios::clear: iostream error

From: msahraeian [email protected] Sent: Sunday, December 1, 2019 4:28 PM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian I cannot see the attachment. Would you please attach the scan.err file again?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTIHYO52AN3JA4WCBGDQWQT5TA5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFRWFBY#issuecomment-560161415, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTMA7FSSDCEB7IEMP5DQWQT5TANCNFSM4JQ5LG4A.

cbiitbian avatar Dec 02 '19 03:12 cbiitbian

@cbiitbian it is not clear to me. Can you make sure you alignment .bam files have index files .bam.bai with them. Also please make sure the reference file used to align reads matches the one you use here. If it didn't help, can you restrict your bam to a smaller region and send it to me (if the code fails on the restricted bam).

msahraeian avatar Dec 02 '19 08:12 msahraeian

Hi, I am testing what you suggested and will let you know. In the mean time I got another error for another sample, would you please help me out. ERROR 2019-12-03 05:39:01,890 run_scan_alignments (ForkPoolWorker-14) Please check error log at work_standalone/work_tumor/work.148/scan.err Thanks. Xiaopeng

From: msahraeian [email protected] Sent: Monday, December 2, 2019 3:01 AM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian it is not clear to me. Can you make sure you alignment .bam files have index files .bam.bai with them. Also please make sure the reference file used to align reads matches the one you use here. If it didn't help, can you restrict your bam to a smaller region and send it to me (if the code fails on the restricted bam).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTLRQHTNWMMDM2IDY7DQWS6EXA5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFSS2HQ#issuecomment-560278814, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTLSXIVW5YJIOGOXCJDQWS6EXANCNFSM4JQ5LG4A.

cbiitbian avatar Dec 03 '19 20:12 cbiitbian

Hi, I did the test you suggested, I used the problematic region bed to restricted the bam to a small region without change anything else, it worked well without any error. I’ll use the full bed file and try again, hope it is just a random error, but I do see this error in a lot of samples, though some samples ran successfully, they are all very similar samples. Thanks. Xiaopeng Bian

From: msahraeian [email protected] Sent: Monday, December 2, 2019 3:01 AM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian it is not clear to me. Can you make sure you alignment .bam files have index files .bam.bai with them. Also please make sure the reference file used to align reads matches the one you use here. If it didn't help, can you restrict your bam to a smaller region and send it to me (if the code fails on the restricted bam).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTLRQHTNWMMDM2IDY7DQWS6EXA5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFSS2HQ#issuecomment-560278814, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTLSXIVW5YJIOGOXCJDQWS6EXANCNFSM4JQ5LG4A.

cbiitbian avatar Dec 03 '19 23:12 cbiitbian

@cbiitbian if it is a public sample, it would be good if you could share with me the whole bam file and the command line parameters you use and I will take a look.

msahraeian avatar Dec 03 '19 23:12 msahraeian

I am sorry this is not a public data set. We are comparing performance of 5 variant caller, including Neusomatic Most of the samples ran successfully, however, the results have no overlap with the results of other 4 callers which have pretty good overlap among them. I am worried that I may not have run the process correctly? I am attaching the scripts I used, which was basically modified from the example script included in the package. Would you have a look and give some suggestions? I am attaching both WGS and WEX models scripts. Thanks very much! Xiaopeng Bian

From: msahraeian [email protected] Sent: Tuesday, December 3, 2019 6:26 PM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian if it is a public sample, it would be good if you could share with me the whole bam file and the command line parameters you use and I will take a look.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTPKU6XCEMWGJ3ILNMLQW3TJJA5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEF3FN6I#issuecomment-561403641, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTJCDSSYDSHLF7CFLUTQW3TJJANCNFSM4JQ5LG4A.

cbiitbian avatar Dec 06 '19 04:12 cbiitbian

@cbiitbian I think there should be sth wrong with how you run NeuSomatic. I cannot see your attachments. Would you please send them again. You can attach them to the Github message.

msahraeian avatar Dec 07 '19 08:12 msahraeian

Hi, When I compared with results from two other centers, there are good overlaps, so I think there maybe something wrong with the data from the other center. The only thing is that Neusomatic called more variants than other callers. I am attaching my run scripts and a VCFcomparator somparison result for your reference. My scripts were named with .sh, which maybe blocked by your system last time, so I renamed it to .txt, hope will get through. Let me know if you still cannot receive them. Thanks. Xiaopeng

From: msahraeian [email protected] Sent: Saturday, December 7, 2019 3:19 AM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian I think there should be sth wrong with how you run NeuSomatic. I cannot see your attachments. Would you please send them again. You can attach them to the Github message.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTLJTYJWJZCJ6NHBI4DQXNL7ZA5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGGBDCQ#issuecomment-562827658, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTONP6CHBAEM6ZLZUPTQXNL7ZANCNFSM4JQ5LG4A.

#!/bin/bash #module load neusomatic set -e tumor=$1 normal=$2 center=$3 refGenome=/data/nextgen/CIMAC/ref/hg38.fa bed=/data/nextgen/Xiaopeng/match/neusomatic/MATCH/region_hg38.bed tumor=/data/nextgen/Xiaopeng/match/neusomatic/MATCH/${center}/${tumor} normal=/data/nextgen/Xiaopeng/match/neusomatic/MATCH/${center}/${normal}

test_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"

neusomatic_dir="$( dirname ${test_dir} )"

test_dir=.. neusomatic_dir=../..

cd ${test_dir}

tum=$(echo "${tumor}" | cut -f 1 -d '.') mkdir -p ${tum} cd ${tum}

rm -rf work_standalone #Stand-alone NeuSomatic test

python ${neusomatic_dir}/neusomatic/python/preprocess.py \

preprocess.py
--mode call
--reference ${refGenome}
--region_bed ${bed}
--tumor_bam ${tumor}
--normal_bam ${normal}
--work work_standalone
--scan_maf 0.01
--min_mapq 10
--snp_min_af 0.03
--snp_min_bq 15
--snp_min_ao 3
--ins_min_af 0.02
--del_min_af 0.02
--num_threads 20
--scan_alignments_binary ${NEUSOMATIC_SCAN_ALIGNMENTS} # --scan_alignments_binary ${neusomatic_dir}/neusomatic/bin/scan_alignments

CUDA_VISIBLE_DEVICES= python ${neusomatic_dir}/neusomatic/python/call.py \

call.py
--candidates_tsv work_standalone/dataset//candidates.tsv
--reference ${refGenome}
--out work_standalone
--checkpoint /data/nextgen/Xiaopeng/match/neusomatic/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth
--num_threads 1
--batch_size 100

python ${neusomatic_dir}/neusomatic/python/postprocess.py \

postprocess.py
--reference ${refGenome}
--tumor_bam ${tumor}
--pred_vcf work_standalone/pred.vcf
--candidates_vcf work_standalone/work_tumor/filtered_candidates.vcf
--pass_threshold 0.7
--lowqual_threshold 0.4
--output_vcf work_standalone/NeuSomatic_standalone.vcf
--work work_standalone

:' #comment start rm -rf work_ensemble #Ensemble NeuSomatic test

python ${neusomatic_dir}/neusomatic/python/preprocess.py \

preprocess.py
--mode call
--reference ${refGenome}
--region_bed ${bed}
--tumor_bam ${tumor}
--normal_bam ${normal}
--work work_ensemble
--scan_maf 0.05
--min_mapq 10
--snp_min_af 0.05
--snp_min_bq 20
--snp_min_ao 10
--ins_min_af 0.05
--del_min_af 0.05
--num_threads 1
--ensemble_tsv ${test_dir}/ensemble.tsv
--scan_alignments_binary ${NEUSOMATIC_SCAN_ALIGNMENTS}

CUDA_VISIBLE_DEVICES= python ${neusomatic_dir}/neusomatic/python/call.py \

call.py
--candidates_tsv work_ensemble/dataset//candidates.tsv
--reference ${refGenome}
--out work_ensemble
--checkpoint ${neusomatic_dir}/neusomatic/models/NeuSomatic_v0.1.0_ensemble_Dream3_70purity.pth
--num_threads 1
--ensemble
--batch_size 100

python ${neusomatic_dir}/neusomatic/python/postprocess.py \

postprocess.py
--reference ${refGenome}
--tumor_bam ${tumor}
--pred_vcf work_ensemble/pred.vcf
--candidates_vcf work_ensemble/work_tumor/filtered_candidates.vcf
--ensemble_tsv ${test_dir}/ensemble.tsv
--output_vcf work_ensemble/NeuSomatic_ensemble.vcf
--work work_ensemble

cd ..

file1=${test_dir}/example/work_standalone/NeuSomatic_standalone.vcf file2=${test_dir}/NeuSomatic_standalone.vcf

cmp --silent $file1 $file2 && echo "### NeuSomatic stand-alone: SUCCESS! ###"
|| echo "### NeuSomatic stand-alone FAILED: Files ${file1} and ${file2} Are Different! ###"

file1=${test_dir}/example/work_ensemble/NeuSomatic_ensemble.vcf file2=${test_dir}/NeuSomatic_ensemble.vcf

cmp --silent $file1 $file2 && echo "### NeuSomatic ensemble: SUCCESS! ###"
|| echo "### NeuSomatic ensemble FAILED: Files ${file1} and ${file2} Are Different! ###" #comment end ' #!/bin/bash #module load neusomatic set -e tumor=$1 normal=$2 center=$3 refGenome=/data/nextgen/CIMAC/ref/hg38.fa bed=/data/nextgen/Xiaopeng/match/neusomatic/MATCH/region_hg38.bed tumor=/data/nextgen/Xiaopeng/match/neusomatic/MATCH/${center}/${tumor} normal=/data/nextgen/Xiaopeng/match/neusomatic/MATCH/${center}/${normal}

test_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"

neusomatic_dir="$( dirname ${test_dir} )"

test_dir=.. neusomatic_dir=../..

cd ${test_dir}

tum=$(echo "${tumor}" | cut -f 1 -d '.') mkdir -p ${tum}_wex cd ${tum}_wex

rm -rf work_standalone #Stand-alone NeuSomatic test

python ${neusomatic_dir}/neusomatic/python/preprocess.py \

preprocess.py
--mode call
--reference ${refGenome}
--region_bed ${bed}
--tumor_bam ${tumor}
--normal_bam ${normal}
--work work_standalone
--scan_maf 0.01
--min_mapq 10
--snp_min_af 0.03
--snp_min_bq 15
--snp_min_ao 3
--ins_min_af 0.02
--del_min_af 0.02
--num_threads 20
--scan_alignments_binary ${NEUSOMATIC_SCAN_ALIGNMENTS} # --scan_alignments_binary ${neusomatic_dir}/neusomatic/bin/scan_alignments

CUDA_VISIBLE_DEVICES= python ${neusomatic_dir}/neusomatic/python/call.py \

call.py
--candidates_tsv work_standalone/dataset//candidates.tsv
--reference ${refGenome}
--out work_standalone
--checkpoint /data/nextgen/Xiaopeng/match/neusomatic/neusomatic/models/NeuSomatic_v0.1.0_standalone_WEX_100purity.pth
--num_threads 1
--batch_size 100

python ${neusomatic_dir}/neusomatic/python/postprocess.py \

postprocess.py
--reference ${refGenome}
--tumor_bam ${tumor}
--pred_vcf work_standalone/pred.vcf
--candidates_vcf work_standalone/work_tumor/filtered_candidates.vcf
--pass_threshold 0.7
--lowqual_threshold 0.4
--output_vcf work_standalone/NeuSomatic_standalone.vcf
--work work_standalone

:' #comment start rm -rf work_ensemble #Ensemble NeuSomatic test

python ${neusomatic_dir}/neusomatic/python/preprocess.py \

preprocess.py
--mode call
--reference ${refGenome}
--region_bed ${bed}
--tumor_bam ${tumor}
--normal_bam ${normal}
--work work_ensemble
--scan_maf 0.05
--min_mapq 10
--snp_min_af 0.05
--snp_min_bq 20
--snp_min_ao 10
--ins_min_af 0.05
--del_min_af 0.05
--num_threads 1
--ensemble_tsv ${test_dir}/ensemble.tsv
--scan_alignments_binary ${NEUSOMATIC_SCAN_ALIGNMENTS}

CUDA_VISIBLE_DEVICES= python ${neusomatic_dir}/neusomatic/python/call.py \

call.py
--candidates_tsv work_ensemble/dataset//candidates.tsv
--reference ${refGenome}
--out work_ensemble
--checkpoint ${neusomatic_dir}/neusomatic/models/NeuSomatic_v0.1.0_ensemble_Dream3_70purity.pth
--num_threads 1
--ensemble
--batch_size 100

python ${neusomatic_dir}/neusomatic/python/postprocess.py \

postprocess.py
--reference ${refGenome}
--tumor_bam ${tumor}
--pred_vcf work_ensemble/pred.vcf
--candidates_vcf work_ensemble/work_tumor/filtered_candidates.vcf
--ensemble_tsv ${test_dir}/ensemble.tsv
--output_vcf work_ensemble/NeuSomatic_ensemble.vcf
--work work_ensemble

cd ..

file1=${test_dir}/example/work_standalone/NeuSomatic_standalone.vcf file2=${test_dir}/NeuSomatic_standalone.vcf

cmp --silent $file1 $file2 && echo "### NeuSomatic stand-alone: SUCCESS! ###"
|| echo "### NeuSomatic stand-alone FAILED: Files ${file1} and ${file2} Are Different! ###"

file1=${test_dir}/example/work_ensemble/NeuSomatic_ensemble.vcf file2=${test_dir}/NeuSomatic_ensemble.vcf

cmp --silent $file1 $file2 && echo "### NeuSomatic ensemble: SUCCESS! ###"
|| echo "### NeuSomatic ensemble FAILED: Files ${file1} and ${file2} Are Different! ###" #comment end ' VCF Comparator Settings:

R1-2-F.somatic.vcf Key vcf file region_hg38.bed Key interrogated regions file mda_R1-2-F_standalone_pass.vcf Test vcf file region_hg38.bed Test interrogated regions file mda_R1-2-F Save directory for parsed datasets TRUE Require matching alternate bases FALSE Require matching genotypes FALSE Use record VQSLOD score as ranking statistic FALSE Exclude non PASS or . records TRUE Compare SNPs, not non-SNP variants

3088286401 Interrogated bps in key 3088286401 Interrogated bps in test 3088286401 Interrogated bps in common 426 Key variants 426 Key variants in shared regions 0.537906137 Shared key variants Ti/Tv 2459 Test variants 2459 Test variants in shared regions 0.721988796 Shared test variants Ti/Tv

QUALThreshold NumMatchTest NumNonMatchTest FDR=nonMatchTest/(matchTest+nonMatchTest) decreasingFDR TPR=matchTest/totalKey FPR=nonMatchTest/totalKey PPV=matchTest/(matchTest+nonMatchTest) none 360 2099 0.853599 0.853599 0.8450704 4.92723 0.14640097 5.2394 360 2097 0.85347986 0.85347986 0.8450704 4.9225354 0.14652015 5.2485 360 2096 0.8534202 0.8534202 0.8450704 4.920188 0.1465798 5.2496 360 2095 0.8533605 0.8533605 0.8450704 4.9178405 0.14663951 5.2538 360 2094 0.85330075 0.85330075 0.8450704 4.915493 0.14669926 5.2569 360 2093 0.8532409 0.8532409 0.8450704 4.9131455 0.14675908 5.2675 360 2092 0.85318106 0.85318106 0.8450704 4.910798 0.14681892 5.2782 360 2091 0.85312116 0.85312116 0.8450704 4.9084506 0.14687882 5.2797 360 2090 0.8530612 0.8530612 0.8450704 4.906103 0.14693877 5.2822 360 2089 0.85300124 0.85300124 0.8450704 4.9037557 0.14699878 5.2983 360 2088 0.85294116 0.85294116 0.8450704 4.9014087 0.14705883 5.3032 360 2087 0.8528811 0.8528811 0.8450704 4.899061 0.14711893 5.3096 360 2086 0.85282093 0.85282093 0.8450704 4.8967137 0.14717907 5.3118 360 2085 0.85276073 0.85276073 0.8450704 4.8943663 0.14723927

cbiitbian avatar Dec 08 '19 04:12 cbiitbian

@cbiitbian are you using only PASS calls during evaluation? And if your sample is WEX you should intersect the final vcf with the exons bed.

msahraeian avatar Dec 09 '19 08:12 msahraeian

Hi, The Neusomatic output default to the same folder as the bam file folder. Can I redirect to another directory? Thanks. Xiaopeng

From: msahraeian [email protected] Sent: Monday, December 2, 2019 3:01 AM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian it is not clear to me. Can you make sure you alignment .bam files have index files .bam.bai with them. Also please make sure the reference file used to align reads matches the one you use here. If it didn't help, can you restrict your bam to a smaller region and send it to me (if the code fails on the restricted bam).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTLRQHTNWMMDM2IDY7DQWS6EXA5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFSS2HQ#issuecomment-560278814, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTLSXIVW5YJIOGOXCJDQWS6EXANCNFSM4JQ5LG4A.

cbiitbian avatar Dec 20 '19 19:12 cbiitbian

@cbiitbian you don't need to output to the same directory. For each step there is an input parameter that specifies the output location. for preprocess.py you can specify the work folder with --work, for call.py you can specify the output folder that with --out, and for postprocess.py you can specify the output vcf using --output_vcf and the work folder with --work.

msahraeian avatar Dec 20 '19 20:12 msahraeian

Hi, I ran Neusomatic on some Hapmap samples but the results still showed very low overlap with the “ground truth” they provided. These samples were aligned to hg38 genome, does that have to be the same with the trained model? I am very anxious to get some idea why the overlap is so low? Thanks. Xiaopeng

From: msahraeian [email protected] Sent: Friday, December 20, 2019 3:37 PM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian you don't need to output to the same directory. For each step there is an input parameter that specifies the output location. for preprocess.py you can specify the work folder with --work, for call.py you can specify the output folder that with --out, and for postprocess.py you can specify the output vcf using --output_vcf and the work folder with --work.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTLTCB55RKZ3EM6VHPLQZUUFRA5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHODGTI#issuecomment-568079181, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTLKOAQLOE4Q4Y4BAH3QZUUFRANCNFSM4JQ5LG4A.

cbiitbian avatar Dec 24 '19 02:12 cbiitbian

For hapmap samples, which trained model do you suggest I should use, any particular attention, like the parameters, I should pay to? Thanks. Xiaopeng

From: msahraeian [email protected] Sent: Friday, December 20, 2019 3:37 PM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian you don't need to output to the same directory. For each step there is an input parameter that specifies the output location. for preprocess.py you can specify the work folder with --work, for call.py you can specify the output folder that with --out, and for postprocess.py you can specify the output vcf using --output_vcf and the work folder with --work.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTLTCB55RKZ3EM6VHPLQZUUFRA5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHODGTI#issuecomment-568079181, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTLKOAQLOE4Q4Y4BAH3QZUUFRANCNFSM4JQ5LG4A.

cbiitbian avatar Dec 24 '19 02:12 cbiitbian

@cbiitbian I think there is something wrong here. Are you using HapMap samples as tumor-normal pairs? As HapMap data is public, it would be great if you could share with me the ID of samples you are using as well as your neusomatic run scripts and I can make sure you are using the right settings.

msahraeian avatar Dec 25 '19 07:12 msahraeian

Hi, Sorry, I don’t have the hapmap id of the files because the data were re-processed and renamed. But I found that the “ground truth” provided contains a lot of duplicate after annotation, so I removed the duplicate and re-compared them, the overlap got much better but still low and the false positive is quite high. I attached my run script and one of the vcfcomparator results FYI. Maybe can manipulate the parameters to get better results? Your suggestion would be highly appreciated. Thanks. Xiaopeng

From: msahraeian [email protected] Sent: Wednesday, December 25, 2019 2:48 AM To: bioinform/neusomatic [email protected] Cc: Bian, Xiaopeng (NIH/NCI) [E] [email protected]; Mention [email protected] Subject: Re: [bioinform/neusomatic] failed at the last step (#53)

@cbiitbianhttps://github.com/cbiitbian I think there is something wrong here. Are you using HapMap samples as tumor-normal pairs? As HapMap data is public, it would be great if you could share with me the ID of samples you are using as well as your neusomatic run scripts and I can make sure you are using the right settings.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bioinform/neusomatic/issues/53?email_source=notifications&email_token=ABRXDTJOM5L2Z6E4FN7THHDQ2MF4JA5CNFSM4JQ5LG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHUBVKA#issuecomment-568859304, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABRXDTNSSSS73RL2MAQ7WJDQ2MF4JANCNFSM4JQ5LG4A.

cbiitbian avatar Dec 27 '19 06:12 cbiitbian