
Error: unable to open file or unable to determine types for file synthetic.vcf

Open gianfilippo opened this issue 4 years ago • 31 comments

Hi,

I am trying to run NeuSomatic in ensemble mode, but got stuck after the SomaticSeq.Wrapper.sh step, at the "preprocess.py --mode train" step. I get the following error: Error: unable to open file or unable to determine types for file synthetic.vcf

The synthetic.vcf is generated by processing the Ensemble.s*.tsv files produced by SomaticSeq.Wrapper.sh, following the details in your repository (my understanding of them, of course):

cat <(cat Ensemble.s*.tsv |grep CHROM|head -1) <(cat Ensemble.s*.tsv |grep -v CHROM) | sed "s/nan/0/g" > ensemble_ann1.tsv

python preprocess.py --mode train --reference $GENFILE.fa --region_bed $INTERVALFILE --tumor_bam $syntheticTumor.bam --normal_bam $syntheticNormal.bam --work WORK --truth_vcf synthetic.vcf --ensemble_tsv ensemble_ann.tsv --min_mapq 10 --num_threads 20 --scan_alignments_binary $HOME/bin/neusomatic/neusomatic/bin/scan_alignments

A few lines from the synthetic.vcf are below:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPIKEIN
chr1 1503492 . G A 100 PASS SOMATIC;VAF=0.2;DPR=9.66666666667 GT 0/1
chr1 3752754 . G A 100 PASS SOMATIC;VAF=0.307692307692;DPR=90.0 GT 0/1
chr1 3763621 . C A 100 PASS SOMATIC;VAF=0.222222222222;DPR=17.6666666667 GT 0/1
chr1 6152482 . T A 100 PASS SOMATIC;VAF=0.127868852459;DPR=304.666666667 GT 0/1
chr1 6199629 . G C 100 PASS SOMATIC;VAF=0.21978021978;DPR=181.333333333 GT 0/1

Can you please help me understand my mistake?

Thanks Gianfilippo

gianfilippo avatar Apr 04 '20 01:04 gianfilippo

@gianfilippo happy to see your interest in NeuSomatic. I think there is some confusion here: the synthetic.vcf that is needed for training and the ensemble TSV file that is generated by SomaticSeq.Wrapper.sh are two separate things.

  1. If you don't have training data with a known truth VCF: as noted here, to generate training data you should follow the BAMSurgeon workflow, which gives you synthetic BAMs and a truth VCF file (in fact it gives a synthetic_snvs.vcf and a synthetic_indels.leftAlign.vcf; you need to combine them into a single VCF, as sketched after this list). This training data can then be used in NeuSomatic.

  2. Regardless of how you get your data, you can run NeuSomatic in standalone or ensemble modes (in training or calling phases). Details can be found here. The process you mentioned will give you ensemble_ann.tsv, which can be used in ensemble mode. So, this TSV file is only the output of other callers for your data and should not be used as the truth VCF.
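Regarding the truth VCF in item 1, one possible way to combine the two BAMSurgeon VCFs into a single file is with bcftools (a sketch, assuming the files are bgzipped and tabix-indexed first):

bgzip synthetic_snvs.vcf && tabix -p vcf synthetic_snvs.vcf.gz
bgzip synthetic_indels.leftAlign.vcf && tabix -p vcf synthetic_indels.leftAlign.vcf.gz
bcftools concat -a -O z -o synthetic.vcf.gz synthetic_snvs.vcf.gz synthetic_indels.leftAlign.vcf.gz
tabix -p vcf synthetic.vcf.gz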

In addition, from the few lines of the VCF you have provided, I see that the VCF is in the wrong format. You need a header line like ##fileformat=VCFv4.2 in the first line of the VCF that specifies the format.
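For illustration, a minimal truth VCF along these lines (a sketch; the GT FORMAT definition and the SPIKEIN sample column name are just examples, and the data line is taken from your snippet above) would start like:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPIKEIN
chr1 1503492 . G A 100 PASS SOMATIC;VAF=0.2;DPR=9.66666666667 GT 0/1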

msahraeian avatar Apr 04 '20 21:04 msahraeian

Thanks, you are right, I clearly got confused... I fixed it. The VCF is OK; I had only sent you a few lines. Now I reran it, but got another error at the preprocess stage. Below I am reporting the relevant part of the error message. It is failing at find_records. What else am I doing wrong? Just in case, here is how I joined the SNVs and indels:

bcftools concat -a -o synthetic.vcf.gz -O z synthetic_snvs.vcf.gz synthetic_indels.leftAlign.vcf.gz

Thanks for your help!

INFO 2020-04-04 21:25:27,955 find_records (ForkPoolWorker-54) N_none: 536
INFO 2020-04-04 21:25:27,956 find_records (ForkPoolWorker-55) N_none: 531
INFO 2020-04-04 21:25:27,964 find_records (ForkPoolWorker-46) N_none: 531
ERROR 2020-04-04 21:25:27,974 find_records (ForkPoolWorker-51) Traceback (most recent call last):
  File "neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-04-04 21:25:27,974 find_records (ForkPoolWorker-51)
INFO 2020-04-04 21:25:27,988 find_records (ForkPoolWorker-44) N_none: 484
INFO 2020-04-04 21:25:27,997 find_records (ForkPoolWorker-60) N_none: 497
ERROR 2020-04-04 21:25:28,009 main Traceback (most recent call last):
  File "neusomatic/neusomatic/python/preprocess.py", line 435, in
    args.scan_alignments_binary)
  File "neusomatic/neusomatic/python/preprocess.py", line 335, in preprocess
    ensemble_beds[i] if ensemble_tsv else None, tsv_batch_size)
  File "neusomatic/neusomatic/python/preprocess.py", line 129, in generate_dataset_region
    tsv_batch_size)
  File "neusomatic/neusomatic/python/generate_dataset.py", line 1461, in generate_dataset
    raise Exception("find_records failed!")
Exception: find_records failed!

ERROR 2020-04-04 21:25:28,010 main Aborting!
ERROR 2020-04-04 21:25:28,010 main preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, filter_duplicate=False, first_do_without_qual=False, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, max_dp=100000, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='train', normal_bam='syntheticNormal.bam', num_threads=20, reference='hg38.fa', region_bed='SeqCap_EZ_Exome_v3_hg38_primary_targets.sorted.bed', restart=False, scan_alignments_binary='neusomatic/neusomatic/bin/scan_alignments', scan_maf=0.01, scan_window_size=2000, snp_min_af=0.05, snp_min_ao=3, snp_min_bq=10, truth_vcf='synthetic.vcf.gz', tsv_batch_size=50000, tumor_bam='syntheticTumor.bam', work='Sample_Test')
Traceback (most recent call last):
  File "neusomatic/neusomatic/python/preprocess.py", line 441, in
    raise e
  File "neusomatic/neusomatic/python/preprocess.py", line 435, in
    args.scan_alignments_binary)
  File "neusomatic/neusomatic/python/preprocess.py", line 335, in preprocess
    ensemble_beds[i] if ensemble_tsv else None, tsv_batch_size)
  File "neusomatic/neusomatic/python/preprocess.py", line 129, in generate_dataset_region
    tsv_batch_size)
  File "neusomatic/neusomatic/python/generate_dataset.py", line 1461, in generate_dataset
    raise Exception("find_records failed!")
Exception: find_records failed!

gianfilippo avatar Apr 05 '20 01:04 gianfilippo

@gianfilippo would you please make sure that in the truth.vcf file all the bases in the REF and ALT fields are capitalized. Also, please make sure you are using the same reference file you used for alignment.
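As a quick check, a one-liner along these lines lists records whose REF or ALT contains lowercase bases (a sketch, assuming an uncompressed VCF; pipe through zcat first for a .vcf.gz):

awk '!/^#/ && ($4 ~ /[acgtn]/ || $5 ~ /[acgtn]/)' synthetic.vcf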

msahraeian avatar Apr 05 '20 04:04 msahraeian

Hi, I generated the truth.vcf with BAMSurgeon and the reference file is the same.

gianfilippo avatar Apr 05 '20 04:04 gianfilippo

@gianfilippo I added more logging for failed asserts in a new branch called logging_for_asserts. Would you please check out this branch as follows, rerun your code, and send me your log?

cd neusomatic
git pull
git checkout logging_for_asserts

msahraeian avatar Apr 05 '20 08:04 msahraeian

Hi, thanks for the help! I can now see from the file that your guess was correct: I found some bases in lowercase. Below is the error (part of it). I am not sure this is described in BamSurgeon, but from the following open issue (https://github.com/adamewing/bamsurgeon/issues/137), it seems that "the bases added for the length resolution are always at the end of the read and are in lower case." If this is the case, should I filter out any line in the VCF with a lowercase base, or would you retain them?

INFO 2020-04-06 12:37:31,556 find_records (ForkPoolWorker-51) N_none: 546
INFO 2020-04-06 12:37:31,557 find_records (ForkPoolWorker-59) N_none: 506
INFO 2020-04-06 12:37:31,561 find_records (ForkPoolWorker-50) N_none: 535
ERROR 2020-04-06 12:37:31,563 push_lr
ERROR 2020-04-06 12:37:31,563 push_lr Failed at ['chr1', 36091182, 'a', 'C']
ERROR 2020-04-06 12:37:31,563 push_lr Assertion Failed: A==a

gianfilippo avatar Apr 06 '20 17:04 gianfilippo

@gianfilippo I think we should assume those bases are correct synthetic truth variants, so we should keep them to have valid training data. I fixed this in the code, so all bases will be capitalized internally. Please pull the logging_for_asserts branch again and retry.
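If one preferred to normalize the VCF itself instead of relying on the in-code fix, a hedged sketch with awk (the output file name is just an example) could be:

awk 'BEGIN{FS=OFS="\t"} /^#/ {print; next} {$4=toupper($4); $5=toupper($5); print}' synthetic.vcf > synthetic.uppercase.vcf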

msahraeian avatar Apr 06 '20 18:04 msahraeian

The preprocess step now works, but the train step fails with the error below. What do you think?

INFO 2020-04-06 17:24:37,805 make_weights_for_balanced_classes count length classes: [(0, 3846), (1, 1753), (2, 52), (3, 121)]
INFO 2020-04-06 17:24:37,805 make_weights_for_balanced_classes weight length classes: [(0, 0.08341995841995842), (1, 0.17407311157311156), (2, 0.24774774774774774), (3, 0.24475918225918225)]
INFO 2020-04-06 17:24:37,805 train_neusomatic weights_type:[0.24311331 0.24250693 8.34199584 0.18095981], weights_length:[8.34199584 0.17407311 0.24774775 0.24475918]
INFO 2020-04-06 17:24:37,807 train_neusomatic Number of candidater per epoch: 5769
ERROR 2020-04-06 17:24:38,365 main Traceback (most recent call last):
  File "neusomatic/neusomatic/python/train.py", line 576, in
    use_cuda)
  File "neusomatic/neusomatic/python/train.py", line 436, in train_neusomatic
    outputs, _ = net(inputs)
  File ".local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/gpfs/ycga/project/coppola/gc223/bin/neusomatic/neusomatic/python/network.py", line 67, in forward
    x = self.pool1(F.relu(self.bn1(self.conv1(x))))
  File ".local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File ".local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File ".local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size 64 119 1 3, expected input[100, 26, 5, 32] to have 119 channels, but got 26 channels instead

gianfilippo avatar Apr 06 '20 21:04 gianfilippo

@gianfilippo sounds good. To train or call in the ensemble mode you should use the --ensemble argument in train.py and call.py (the ensemble features add extra input channels, which is why the network expected 119 channels while your prepared data had only 26).

msahraeian avatar Apr 06 '20 22:04 msahraeian

I am already using --ensemble (see the command line below).

python $NEUSOMATIC/train.py --candidates_tsv $DIR/dataset//candidates.tsv --out $DIR --num_threads $Ncores --batch_size 100 --ensemble

gianfilippo avatar Apr 06 '20 22:04 gianfilippo

@gianfilippo did you also use --ensemble_tsv when you ran preprocess.py?

msahraeian avatar Apr 06 '20 22:04 msahraeian

yes

gianfilippo avatar Apr 06 '20 22:04 gianfilippo

Below is the preprocess call:

python $NEUSOMATIC/preprocess.py --mode train --reference hg38.fa --region_bed $INTERVALFILE --tumor_bam syntheticTumor.bam --normal_bam syntheticNormal.bam --work $DIR --ensemble_tsv ensemble_ann.tsv --truth_vcf synthetic.vcf.gz --min_mapq 10 --num_threads 20 --scan_alignments_binary $NEUSOMATIC/neusomatic/bin/scan_alignments

gianfilippo avatar Apr 06 '20 22:04 gianfilippo

@gianfilippo Can you share with me a few lines from ensemble_ann.tsv and a few lines from one of the candidates.tsv files?

msahraeian avatar Apr 06 '20 22:04 msahraeian

The zipped files (a few lines each) are attached. For the candidates.tsv I attached the lines from the work.0 dir. I had a typo in the command line, now corrected.

Archive.zip

gianfilippo avatar Apr 06 '20 22:04 gianfilippo

@gianfilippo The candidates.tsv file does not look like one that was produced in ensemble mode. Can you rerun preprocess.py in ensemble mode in a new (fresh) output folder?
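For clarity, the rerun only needs a fresh --work directory while keeping --ensemble_tsv, e.g. (a sketch reusing the flags from the command above; $NEW_DIR is a hypothetical empty directory):

python $NEUSOMATIC/preprocess.py --mode train --reference hg38.fa --region_bed $INTERVALFILE --tumor_bam syntheticTumor.bam --normal_bam syntheticNormal.bam --work $NEW_DIR --ensemble_tsv ensemble_ann.tsv --truth_vcf synthetic.vcf.gz --min_mapq 10 --num_threads 20 --scan_alignments_binary $NEUSOMATIC/neusomatic/bin/scan_alignments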

msahraeian avatar Apr 06 '20 23:04 msahraeian

it is working now, going through the iterations. I will let you know if it completes. Thanks!

gianfilippo avatar Apr 07 '20 00:04 gianfilippo

The network completed training. Below are the last few output lines:

INFO 2020-04-07 01:44:34,805 train_neusomatic epoch: 999, iter: 57925, lr: 0.0001, loss: 0.20255
INFO 2020-04-07 01:44:39,472 train_neusomatic epoch: 999, iter: 57939, lr: 0.0001, loss: 0.22420
INFO 2020-04-07 01:44:45,971 train_neusomatic epoch: 1000, iter: 57955, lr: 0.0001, loss: 0.19262
INFO 2020-04-07 01:44:50,754 train_neusomatic epoch: 1000, iter: 57969, lr: 0.0001, loss: 0.20993
INFO 2020-04-07 01:44:55,448 train_neusomatic epoch: 1000, iter: 57983, lr: 0.0001, loss: 0.19704
INFO 2020-04-07 01:45:00,229 train_neusomatic epoch: 1000, iter: 57997, lr: 0.0001, loss: 0.21646
INFO 2020-04-07 01:45:01,207 train_neusomatic Finished Training
INFO 2020-04-07 01:45:01,212 train_neusomatic Total Epochs: 1000
INFO 2020-04-07 01:45:01,212 train_neusomatic Total Epochs: 1000
INFO 2020-04-07 01:45:01,212 train_neusomatic Training is Done.

gianfilippo avatar Apr 08 '20 03:04 gianfilippo

@gianfilippo from the loss, it seems that the training has not converged as expected. Can you send me the training log?

msahraeian avatar Apr 08 '20 20:04 msahraeian

Hi, the file is attached. I think there may be something wrong in my synthetic BAMs and VCFs.

dsq-task.TRAIN-9242965_0-c17n02.txt

gianfilippo avatar Apr 10 '20 05:04 gianfilippo

@gianfilippo

  • One thing I can observe is that you don't have enough spiked variants in your tumor sample (~2000). We normally expect >50K spiked mutations (for the SEQC model we used 2M spikes across 20 WGS samples). These can be in one sample or across multiple samples. If you have WES data you can spike at a frequency of 1-2 mutations per Kbp (e.g., for a ~60 Mbp exome target, that is roughly 60K-120K spiked mutations).

  • To monitor the progress of the training, it is good to hold out a few of the candidate TSV files as --validation_candidates_tsv, which reports performance over epochs (see the sketch after this list).
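For example, a hedged sketch of such a training invocation (paths and globs are illustrative; it holds out the candidates from one work.* subfolder for validation):

python $NEUSOMATIC/train.py --candidates_tsv $DIR/dataset/work.[0-8]/candidates*.tsv --validation_candidates_tsv $DIR/dataset/work.9/candidates*.tsv --out $DIR --num_threads $Ncores --batch_size 100 --ensemble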

msahraeian avatar Apr 10 '20 06:04 msahraeian

Thanks for the suggestion. I have WES. I used a modified version of the somaticseq/bamsurgeon script; the modifications simply allow the script to run under the Slurm Workload Manager. I requested 100,000 SNVs, 20,000 indels, and 1,000 SVs, and ended up with 3,179 SNVs and 609 indels.

gianfilippo avatar Apr 10 '20 14:04 gianfilippo

@gianfilippo I think that is unexpected for somaticseq/bamsurgeon. There must be something wrong there. Can you share your run script for bamsurgeon?

msahraeian avatar Apr 10 '20 17:04 msahraeian

Hi, the script is attached. I ran it on 4 threads and the attached file is the one for thread 1; the others are the same. I also included the mergeFiles script.

BamSimulator.2020-04-10_18-19-06_702757150.cmd.txt
mergeFiles.2020-04-10_18-19-06_702757150.cmd.txt

gianfilippo avatar Apr 12 '20 04:04 gianfilippo

Can you try the following somaticseq command, to see what's in the bam files for your snv/indel positions?

If you have the latest SomaticSeq v3.4.0 installed, somatic_vcf2tsv.py is in your path. Otherwise, you can directly point to PATH/TO/somaticseq/somaticseq/somatic_vcf2tsv.py for the following command:

somatic_vcf2tsv.py -myvcf Sample_G1700T_012/synthetic_snvs.vcf -truth Sample_G1700T_012/synthetic_snvs.vcf -nbam Sample_G1700T_012/syntheticNormal.bam -tbam Sample_G1700T_012/syntheticTumor.bam -ref hg38.fa -dedup -mincaller 0 -outfile OUTPUT.SNV.tsv

You may also add caller vcf files as input above. All of the following are optional, e.g.,

-mutect mutect.vcf -strelka strelka.snv.vcf -vardict vardict.vcf -muse muse.vcf ......

And do one for indels as well. Basically, for every position in synthetic_snvs.vcf, it'll extract features from the BAM files at that coordinate.

litaifang avatar Apr 16 '20 22:04 litaifang

@gianfilippo Also, can you make sure you use the --selector argument when you run BamSimulator_multiThreads.sh, and use your WES .bed file there. I suspect you are spiking 100K SNVs across the whole genome (WGS) instead of the WES regions, so you end up with only ~3K SNVs.

msahraeian avatar Apr 16 '20 23:04 msahraeian

Hi, thanks for the suggestions. The output from somatic_vcf2tsv.py for both SNVs and indels, also using the callers' VCF files, is attached. You are right, I guess I did not notice the --selector argument and have not used it. I will rerun it.

OUTPUT.INDEL.PLUS.tsv.txt
OUTPUT.SNV.PLUS.tsv.txt

gianfilippo avatar Apr 17 '20 03:04 gianfilippo

Looking at the two tsv.txt files, it's not obvious to me why the four callers are all "negative" for all the variant positions. Columns #6, 7, 10, and 14 are MuTect(2), VarScan2, VarDict, and Strelka, and they're 0 all the way through. Columns #90 and #91 are forward and reverse variant-supporting reads, and there seem to be variant-supporting reads in these BAM files, so it seems the synthetic VCF file and the BAM files are good. I wonder why the four callers don't detect any of those variants.

litaifang avatar Apr 17 '20 03:04 litaifang

If you think the synthetic VCF file and the BAM files are OK, there must be something wrong in the way I set up the script that runs the callers. Likely some stupid mistake at this point. I will let you know. Thanks for your help!

gianfilippo avatar Apr 17 '20 03:04 gianfilippo

Hi, it seems that now, using "--selector", BamSimulator_multiThreads.sh does not complete and I get core dumps and errors. I tried on 4 and 20 threads. Now I am rerunning it on a single thread just to see if I still get the same errors.

gianfilippo avatar Apr 20 '20 01:04 gianfilippo