Error: unable to open file or unable to determine types for file synthetic.vcf
Hi,
I am trying to run NeuSomatic in ensemble mode, but got stuck after the SomaticSeq.Wrapper.sh step, at the "preprocess.py --mode train" step. I get the following error: Error: unable to open file or unable to determine types for file synthetic.vcf
The synthetic.vcf is generated by processing the Ensemble.s*.tsv files produced by SomaticSeq.Wrapper.sh, following the details in your repository (my understanding of them, of course):
cat <(cat Ensemble.s*.tsv |grep CHROM|head -1) <(cat Ensemble.s*.tsv |grep -v CHROM) | sed "s/nan/0/g" > ensemble_ann1.tsv
python preprocess.py --mode train --reference $GENFILE.fa --region_bed $INTERVALFILE --tumor_bam $syntheticTumor.bam --normal_bam $syntheticNormal.bam --work WORK --truth_vcf synthetic.vcf --ensemble_tsv ensemble_ann.tsv --min_mapq 10 --num_threads 20 --scan_alignments_binary $HOME/bin/neusomatic/neusomatic/bin/scan_alignments
A few lines from the synthetic.vcf are below:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPIKEIN
chr1 1503492 . G A 100 PASS SOMATIC;VAF=0.2;DPR=9.66666666667 GT 0/1
chr1 3752754 . G A 100 PASS SOMATIC;VAF=0.307692307692;DPR=90.0 GT 0/1
chr1 3763621 . C A 100 PASS SOMATIC;VAF=0.222222222222;DPR=17.6666666667 GT 0/1
chr1 6152482 . T A 100 PASS SOMATIC;VAF=0.127868852459;DPR=304.666666667 GT 0/1
chr1 6199629 . G C 100 PASS SOMATIC;VAF=0.21978021978;DPR=181.333333333 GT 0/1
Can you please help me understand my mistake?
Thanks Gianfilippo
@gianfilippo happy to see your interest in NeuSomatic.
I think there is some confusion here. The synthetic.vcf that is needed for training and the ensemble TSV file that is generated by SomaticSeq.Wrapper.sh are two separate things.
- If you don't have training data with a known truth VCF: as noted here, to generate training data you should follow the BAMSurgeon workflow, which gives you synthetic BAMs and a truth VCF file (in fact, it gives a synthetic_snvs.vcf and a synthetic_indels.leftAlign.vcf; you need to combine them into a single VCF, see the sketch after this list). These training data can then be used in NeuSomatic.
- Regardless of how you get your data, you can run NeuSomatic in standalone or ensemble mode (in training or calling phases). Details can be found here. The process you mentioned will give you ensemble_ann.tsv, which can be used in ensemble mode. So this TSV file is only the output of other callers for your data and should not be used as the truth VCF.
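One common way to combine the two BAMSurgeon truth VCFs into a single file is with bgzip/tabix/bcftools; the sketch below assumes the default BAMSurgeon output names and that these tools are on your PATH:
# sketch: compress and index the two truth VCFs, then concatenate them
bgzip -c synthetic_snvs.vcf > synthetic_snvs.vcf.gz
bgzip -c synthetic_indels.leftAlign.vcf > synthetic_indels.leftAlign.vcf.gz
tabix -p vcf synthetic_snvs.vcf.gz
tabix -p vcf synthetic_indels.leftAlign.vcf.gz
bcftools concat -a -O z -o synthetic.vcf.gz synthetic_snvs.vcf.gz synthetic_indels.leftAlign.vcf.gz
tabix -p vcf synthetic.vcf.gz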
In addition, from the few lines of the VCF you have provided, I see that the VCF is in the wrong format. You need a header line like ##fileformat=VCFv4.2 as the first line of the VCF to specify the format.
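For reference, a minimal truth VCF could start like this (a sketch only; the contig lines should match your own reference, and the record is taken from your example above):
##fileformat=VCFv4.2
##contig=<ID=chr1,length=248956422>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPIKEIN
chr1 1503492 . G A 100 PASS SOMATIC;VAF=0.2;DPR=9.66666666667 GT 0/1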
Thanks, you are right, I clearly got confused... I fixed it. The VCF is OK; I only sent you a few lines. Now I reran it, but got another error at the preprocess stage. Below I am reporting the relevant part of the error message. It is failing at find_records. What else am I doing wrong? Just in case, here is how I joined the SNVs and indels:
bcftools concat -a -o synthetic.vcf.gz -O z synthetic_snvs.vcf.gz synthetic_indels.leftAlign.vcf.gz
Thanks for your help!
INFO 2020-04-04 21:25:27,955 find_records (ForkPoolWorker-54) N_none: 536
INFO 2020-04-04 21:25:27,956 find_records (ForkPoolWorker-55) N_none: 531
INFO 2020-04-04 21:25:27,964 find_records (ForkPoolWorker-46) N_none: 531
ERROR 2020-04-04 21:25:27,974 find_records (ForkPoolWorker-51) Traceback (most recent call last):
File "neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
mt2, eqs2 = push_lr(fasta_file, mt, 2)
File "neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError
ERROR 2020-04-04 21:25:27,974 find_records (ForkPoolWorker-51)
INFO 2020-04-04 21:25:27,988 find_records (ForkPoolWorker-44) N_none: 484
INFO 2020-04-04 21:25:27,997 find_records (ForkPoolWorker-60) N_none: 497
ERROR 2020-04-04 21:25:28,009 main Traceback (most recent call last):
File "neusomatic/neusomatic/python/preprocess.py", line 435, in
ERROR 2020-04-04 21:25:28,010 main Aborting!
ERROR 2020-04-04 21:25:28,010 main preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, filter_duplicate=False, first_do_without_qual=False, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, max_dp=100000, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='train', normal_bam='syntheticNormal.bam', num_threads=20, reference='hg38.fa', region_bed='SeqCap_EZ_Exome_v3_hg38_primary_targets.sorted.bed', restart=False, scan_alignments_binary='neusomatic/neusomatic/bin/scan_alignments', scan_maf=0.01, scan_window_size=2000, snp_min_af=0.05, snp_min_ao=3, snp_min_bq=10, truth_vcf='synthetic.vcf.gz', tsv_batch_size=50000, tumor_bam='syntheticTumor.bam', work='Sample_Test')
Traceback (most recent call last):
File "neusomatic/neusomatic/python/preprocess.py", line 441, in
@gianfilippo would you please make sure that in the truth.vcf file all the bases in the REF and ALT fields are capitalized? Also, please make sure you are using the same reference file you used for alignment.
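A quick way to spot such records could be something like the following sketch (it assumes the truth VCF is the synthetic.vcf.gz you passed to preprocess.py):
# list records whose REF or ALT field contains lowercase bases
zcat synthetic.vcf.gz | awk '$1 !~ /^#/ && ($4 ~ /[acgtn]/ || $5 ~ /[acgtn]/)'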
Hi, I generated the truth.vcf with BAMSurgeon and the reference file is the same.
@gianfilippo I added more logging for failed asserts in a new branch called logging_for_asserts. Would you please check out this branch as follows, rerun your code, and send me your log?
cd neusomatic
git pull
git checkout logging_for_asserts
Hi, thanks for the help! I can now see from the log that your guess was correct: there are some bases in lower case. Below is part of the error. I am not sure this is described in BAMSurgeon, but from the following open issue (https://github.com/adamewing/bamsurgeon/issues/137), it seems that "the bases added for the length resolution are always at the end of the read and are in lower case." If this is the case, should I filter out any line in the VCF with a lower-case base, or would you retain them?
INFO 2020-04-06 12:37:31,556 find_records (ForkPoolWorker-51) N_none: 546
INFO 2020-04-06 12:37:31,557 find_records (ForkPoolWorker-59) N_none: 506
INFO 2020-04-06 12:37:31,561 find_records (ForkPoolWorker-50) N_none: 535
ERROR 2020-04-06 12:37:31,563 push_lr
ERROR 2020-04-06 12:37:31,563 push_lr Failed at ['chr1', 36091182, 'a', 'C']
ERROR 2020-04-06 12:37:31,563 push_lr Assertion Failed: A==a
@gianfilippo I think we should assume those bases are correct synthetic truth variants. So, we should keep them to have valid training.
I fixed this in the code, so all bases will be capitalized internally. Please pull the logging_for_asserts branch again and retry.
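Assuming the same clone and branch as before, that would be something like:
cd neusomatic
git checkout logging_for_asserts
git pull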
The preprocess step now works, but the train step fails with the error below. What do you think?
INFO 2020-04-06 17:24:37,805 make_weights_for_balanced_classes count length classes: [(0, 3846), (1, 1753), (2, 52), (3, 121)]
INFO 2020-04-06 17:24:37,805 make_weights_for_balanced_classes weight length classes: [(0, 0.08341995841995842), (1, 0.17407311157311156), (2, 0.24774774774774774), (3, 0.24475918225918225)]
INFO 2020-04-06 17:24:37,805 train_neusomatic weights_type:[0.24311331 0.24250693 8.34199584 0.18095981], weights_length:[8.34199584 0.17407311 0.24774775 0.24475918]
INFO 2020-04-06 17:24:37,807 train_neusomatic Number of candidater per epoch: 5769
ERROR 2020-04-06 17:24:38,365 main Traceback (most recent call last):
File "neusomatic/neusomatic/python/train.py", line 576, in
@gianfilippo sounds good.
To train or call in the ensemble mode, you should use the --ensemble argument in train.py and call.py.
I am already using --ensemble (see the command line below).
python $NEUSOMATIC/train.py --candidates_tsv $DIR/dataset//candidates.tsv --out $DIR --num_threads $Ncores --batch_size 100 --ensemble
@gianfilippo did you also use --ensemble_tsv when you ran preprocess.py?
yes
Below is the preprocess call:
python $NEUSOMATIC/preprocess.py --mode train --reference hg38.fa --region_bed $INTERVALFILE --tumor_bam syntheticTumor.bam --normal_bam syntheticNormal.bam --work $DIR --ensemble_tsv ensemble_ann.tsv --truth_vcf synthetic.vcf.gz --min_mapq 10 --num_threads 20 --scan_alignments_binary $NEUSOMATIC/neusomatic/bin/scan_alignments
@gianfilippo Can you share with me a few lines from ensemble_ann.tsv and a few lines from one of the candidates.tsv files?
The zipped files (a few lines each) are attached. For candidates.tsv I attached the lines from the work.0 dir. I had a typo in the command line, now corrected.
@gianfilippo The candidates.tsv file does not look like one that was produced in ensemble mode. Can you rerun preprocess.py in ensemble mode in a new (fresh) output folder?
it is working now, going through the iterations. I will let you know if it completes. Thanks!
The network completed training. Below are the last few output lines:
INFO 2020-04-07 01:44:34,805 train_neusomatic epoch: 999, iter: 57925, lr: 0.0001, loss: 0.20255
INFO 2020-04-07 01:44:39,472 train_neusomatic epoch: 999, iter: 57939, lr: 0.0001, loss: 0.22420
INFO 2020-04-07 01:44:45,971 train_neusomatic epoch: 1000, iter: 57955, lr: 0.0001, loss: 0.19262
INFO 2020-04-07 01:44:50,754 train_neusomatic epoch: 1000, iter: 57969, lr: 0.0001, loss: 0.20993
INFO 2020-04-07 01:44:55,448 train_neusomatic epoch: 1000, iter: 57983, lr: 0.0001, loss: 0.19704
INFO 2020-04-07 01:45:00,229 train_neusomatic epoch: 1000, iter: 57997, lr: 0.0001, loss: 0.21646
INFO 2020-04-07 01:45:01,207 train_neusomatic Finished Training
INFO 2020-04-07 01:45:01,212 train_neusomatic Total Epochs: 1000
INFO 2020-04-07 01:45:01,212 train_neusomatic Total Epochs: 1000
INFO 2020-04-07 01:45:01,212 train_neusomatic Training is Done.
@gianfilippo from the loss, it seems that the training has not converged as expected. Can you send me the training log?
Hi, the file is attached. I think there may be something wrong in my synthetic.bam and vcfs. dsq-task.TRAIN-9242965_0-c17n02.txt
@gianfilippo
- One thing I can observe is that you don't have enough spiked variants in your tumor sample (~2000). We normally expect >50K spiked mutations (for the SEQC model we used 2M spikes across 20 WGS samples). This can be in one sample or across multiple samples. If you have WES data, you can spike with a frequency of 1-2 mutations per kbp.
- To monitor the progress of the training, it is good to use a few of the candidate TSV files as --validation_candidates_tsv, which shows performance over epochs (see the sketch after this list).
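For example, something along these lines (a sketch only; the shard paths are illustrative and should be adjusted to your own dataset layout):
# hold out one candidate TSV shard for validation, train on the rest
python $NEUSOMATIC/train.py --candidates_tsv $DIR/dataset/work.[0-8]/candidates.tsv \
    --validation_candidates_tsv $DIR/dataset/work.9/candidates.tsv \
    --out $DIR --num_threads $Ncores --batch_size 100 --ensemble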
Thanks for the suggestion. I have WES. I used a modified version of the somaticseq/bamsurgeon script; the modifications simply allow the script to run on the Slurm Workload Manager. I requested 100,000 SNVs, 20,000 indels, and 1,000 SVs, and ended up with 3179 SNVs and 609 indels.
@gianfilippo I think that is unexpected for somaticseq/bamsurgeon. There should be something wrong over there. Can you share your run script for bamsurgeon?
Hi, the script is attached. I ran it on 4 threads and the attached file works on thread 1. The others are the same. I also included the mergeFiles script. BamSimulator.2020-04-10_18-19-06_702757150.cmd.txt mergeFiles.2020-04-10_18-19-06_702757150.cmd.txt
Can you try the following somaticseq command to see what's in the BAM files at your SNV/indel positions?
If you have the latest SomaticSeq v3.4.0 installed, somatic_vcf2tsv.py is in your path. Otherwise, you can point directly to PATH/TO/somaticseq/somaticseq/somatic_vcf2tsv.py for the following command:
somatic_vcf2tsv.py -myvcf Sample_G1700T_012/synthetic_snvs.vcf -truth Sample_G1700T_012/synthetic_snvs.vcf -nbam Sample_G1700T_012/syntheticNormal.bam -tbam Sample_G1700T_012/syntheticTumor.bam -ref hg38.fa -dedup -mincaller 0 -outfile OUTPUT.SNV.tsv
You may also add caller vcf files as input above. All of the following are optional, e.g.,
-mutect mutect.vcf -strelka strelka.snv.vcf -vardict vardict.vcf -muse muse.vcf ......
And do one for indels as well. Basically, for every position in synthetic_snvs.vcf, it will extract features from the BAM files at that coordinate.
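The indel run would look essentially the same, e.g. (a sketch; only the input VCF and the output file name change from the SNV command above):
somatic_vcf2tsv.py -myvcf Sample_G1700T_012/synthetic_indels.leftAlign.vcf -truth Sample_G1700T_012/synthetic_indels.leftAlign.vcf -nbam Sample_G1700T_012/syntheticNormal.bam -tbam Sample_G1700T_012/syntheticTumor.bam -ref hg38.fa -dedup -mincaller 0 -outfile OUTPUT.INDEL.tsv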
@gianfilippo Also, can you make sure you use the --selector argument when you run BamSimulator_multiThreads.sh and use your WES .bed file there? I suspect you are spiking 100K SNVs into WGS instead of WES, so you get ~3K SNVs.
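For example, the relevant addition would be something like this (a sketch; the other arguments of your existing BamSimulator call stay as they are, and the .bed path shown is the one from your preprocess command):
# keep your existing BamSimulator arguments and add the WES target bed
BamSimulator_multiThreads.sh \
    [your existing arguments unchanged] \
    --selector SeqCap_EZ_Exome_v3_hg38_primary_targets.sorted.bed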
Hi, thanks for the suggestions. The output from somatic_vcf2tsv.py for both SNVs and indels, also using the caller VCF files, is attached. You are right, I guess I did not notice the --selector argument and had not used it. I will rerun it. OUTPUT.INDEL.PLUS.tsv.txt OUTPUT.SNV.PLUS.tsv.txt
Looking at the two tsv.txt files, it's not obvious to me why the four callers are all "negative" for all the variant positions. Columns #6, 7, 10, and 14 are MuTect(2), VarScan2, VarDict, and Strelka, and they are 0 all the way through. Columns #90 and #91 are forward and reverse variant-supporting reads, and there do seem to be variant-supporting reads in these BAM files, so it seems the synthetic VCF file and the BAM files are good. I wonder why the four callers don't detect any of those variants.
If you think the synthetic VCF file and the BAM files are OK, there must be something wrong in the way I set up the script using the callers. Likely some stupid mistake, at this point. I will let you know. Thanks for your help!
Hi, it seems that now, using "--selector", BamSimulator_multiThreads.sh does not complete and I get core dumps and errors. I tried on 4 and 20 threads. Now I am rerunning it on a single thread just to see if I still get the same errors.