BayesTyper
error bayestyper cluster
Dear,
I am trying to run BayesTyper. I ran KMC and bayesTyperTools without any problem, but when I run bayesTyper cluster the following error appears:
bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.4.1_static/BayesTyper-1.4.1/src/bayesTyper/VariantFileParser.cpp:307: void VariantFileParser::parseVariants(ProducerConsumerQueue<std::vector<std::unordered_map<unsigned int, VariantCluster*>>>*, uint, const Chromosomes&): Assertion `prev_position < cur_position' failed. /scratch/slurm/job5138623/slurm_script: line 14: 328998 Aborted
I ran the program like this: ./bayesTyper cluster -v insertions_filtered.vcf -s insilico3.tsv -g GRCh37_canon.fa -d GRCh37_decoy.fa -p 8
The VCF comes from the Pamir variant caller only, which detects only insertions; I did not combine it with the output of any other variant caller. The insilico3.tsv file contains this tab-delimited information: insilico3 M path/insilico3
So I don't know where the problem is. Do you have any idea why I get this error? Could you help me?
Thanks a lot for your time.
Jordi
Hi Jordi,
Thank you for writing.
This error occurs when the input VCF is either not sorted or contains multiple variants at the same position. The VCF input for BayesTyper needs to be sorted, and variants at the same position need to be represented as a single multi-allelic record. You can solve this by running bayesTyperTools combine on the VCF file before bayesTyper cluster. Alternatively, you can run a combination of bcftools norm -m +any and bcftools sort.
I will change this assert to a more informative error message in the next release.
Best,
Jonas
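For readers following along, the two conditions named above (records out of order, or several records at one position) can be checked before running the tools. The following is only a minimal Python sketch under stated assumptions: the function name find_order_problems is hypothetical and not part of BayesTyper, the input is a list of VCF text lines, and only the first two whitespace-separated columns (CHROM, POS) are inspected. It does not detect a contig that reappears later in the file.

```python
def find_order_problems(vcf_lines):
    """Return (unsorted, duplicated) lists of (contig, position) pairs.

    A record is reported as unsorted when its position is smaller than the
    position of the previous record on the same contig, and as duplicated
    when it shares the previous record's position.
    """
    unsorted, duplicated = [], []
    prev_contig, prev_pos = None, None
    for line in vcf_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        contig, pos = line.split(None, 2)[:2]
        pos = int(pos)
        if contig == prev_contig:
            if pos < prev_pos:
                unsorted.append((contig, pos))
            elif pos == prev_pos:
                duplicated.append((contig, pos))
        prev_contig, prev_pos = contig, pos
    return unsorted, duplicated
```

Running it on a file would look like find_order_problems(open("insertions_filtered.vcf")); any non-empty result suggests the file needs sorting or merging into multi-allelic records first.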
Hi Jonas, thanks for your reply.
I tried to use bayesTyperTools combine on my VCF. This is my command: bayesTyperTools combine -v Pamir:insertions_filtered.vcf -o insilico3 -z
When I use this, the program gives me this error: ERROR: Variants need to be sorted by position; "100542677" is before "10060499" on contig "10"
When I look in my VCF I find this:
#CHROM POS ID REF ALT QUAL FILTER INFO
10 100542677 . T <INS> 97.208244 PASS SVTYPE=INS;SVLEN=5;END=100542681;Cluster=493129;Support=1;SEQ=ATTTT;FLSUP=4;FRSUP=3;FSUP=7
10 10060499 . G <INS> 95.444649 FAIL SVTYPE=INS;SVLEN=34;END=10060532;Cluster=467594;Support=1;SEQ=CATGTGTTTGTTGGCCATAAGTATGTCTTTTTTT;FLSUP=0;FRSUP=0;FSUP=0
10 100611872 . T <INS> 96.920555 PASS SVTYPE=INS;SVLEN=7;END=100611878;Cluster=493143;Support=1;SEQ=TGTGTGG;FLSUP=3;FRSUP=4;FSUP=7
As you can see, it is not sorted, but I assumed that combine would do that. So I tried to sort the VCF with bcftools after norm, but I cannot, because it exits with this error:
[W::vcf_parse] FILTER 'FAIL' is not defined in the header
[W::vcf_parse] INFO 'FRSUP' is not defined in the header, assuming Type=String
[E::bcf_write] Unchecked error (2), exiting
So I cannot sort my VCF with bcftools and I don't know why. For this reason I sorted it manually, and when I try to combine the sorted VCF like this:
bayesTyper_v1.4.1_linux_x86_64_old/bin/bayesTyperTools combine -v Pamir:sorted_insertion.vcf -o insilico3 -z
the result contains 0 alleles.
This is the output:
[13/03/2019 20:26:46] Number of included alternative alleles (excluding id and missing):
- Pamir: 0
[13/03/2019 20:26:46] Number of alternative alleles in the combined set: 0
So I thought the problem came from using only one VCF, so I also used the consensus prior VCF for the GRCh37 decoy reference genome. The command is this:
bayesTyperTools combine -v Pamir:insilico.vcf,prior:SNP_dbSNP150common_SV_1000g_dbSNP150all_GDK_GoNL_GTEx_GRCh37.vcf.gz -o insilico3 -z
and I obtain this error:
[13/03/2019 20:23:35] You are using BayesTyperTools (v1.4.1)
[13/03/2019 20:23:35] Running BayesTyperTools (v1.4.1) combine on 2 files ...
bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.4.1_static/BayesTyper-1.4.1/src/bayesTyperTools/Combine.cpp:301: uint Combine::addVariant(Variant*, std::map<unsigned int, Variant*>*, const string&, bool): Assertion `contig_variants_it.first->second->ref().seq().substr(0, min_ref_length) == cur_var->ref().seq().substr(0, min_ref_length)' failed. Aborted (core dumped)
Could this be because I analyse only insertions and this prior is for SNPs? Why does combine fail when I use it on my VCF alone? Can I solve the problem?
Thanks for your help and time again.
Jordi
Hi Jordi,
combine only solves the multi-allelic problem and does not sort the variants. Sorry that this was not clear from my previous message.
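Since sorting has to happen separately, here is a minimal Python sketch of a numeric position sort for cases where bcftools sort is not usable. The three records quoted earlier in the thread appear to be ordered as text rather than as numbers ("100542677" sorts before "10060499" as a string), which a plain text sort would not fix. The function name sort_vcf_lines is hypothetical. The sketch assumes contigs should follow the order of the ##contig header lines (falling back to first appearance); whether BayesTyper requires a particular contig order is not stated in this thread.

```python
import re


def sort_vcf_lines(vcf_lines):
    """Return header lines followed by records sorted by contig, then POS.

    Contig order is taken from the ##contig header lines; contigs missing
    from the header are ranked by first appearance in the records.
    Positions are compared as integers, not as text.
    """
    header = [l for l in vcf_lines if l.startswith("#")]
    records = [l for l in vcf_lines if l.strip() and not l.startswith("#")]

    contig_rank = {}
    for line in header:
        match = re.match(r"##contig=<ID=([^,>]+)", line)
        if match:
            contig_rank.setdefault(match.group(1), len(contig_rank))
    for line in records:
        contig_rank.setdefault(line.split(None, 1)[0], len(contig_rank))

    def sort_key(line):
        contig, pos = line.split(None, 2)[:2]
        return (contig_rank[contig], int(pos))

    # sorted() is stable, so records at the same position keep their order.
    return header + sorted(records, key=sort_key)
```

Records that share a position stay adjacent after this sort, so they can still be merged into multi-allelic records afterwards.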
Regarding the other errors, would it be possible to upload a piece of the VCF (including the header)? The lines that you copied from the file only contain 7 columns (either REF or ALT seems to be missing). If that is also the case in the VCF file, that would explain the errors.
Also, I am not sure what the output from Pamir is, but it seems like it is using the symbolic <INS> allele to represent insertions. bayesTyper cluster cannot parse these symbolic alleles. You can convert symbolic alleles to sequence using bayesTyperTools convertAllele, but <INS> is currently not supported by the tool. Support for these, when the sequence is available in either a FASTA file or as an INFO attribute, has been on my to-do list for a long time. If you need it, I can try to have support implemented in the coming release, which is planned for next week.
Hope it helps.
Best,
Jonas
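To illustrate what "converting a symbolic allele to sequence" means when the inserted sequence is stored in an INFO attribute, here is a minimal Python sketch. It is not the convertAllele implementation. The function name expand_symbolic_insertion is hypothetical, records are assumed to be tab-separated, the attribute name SEQ is taken from the Pamir records quoted in this thread, and the sketch assumes that REF is the base immediately before the inserted sequence, so that the explicit ALT allele is REF followed by SEQ.

```python
def expand_symbolic_insertion(record, seq_key="SEQ"):
    """Replace a symbolic <INS> ALT allele with an explicit sequence.

    Assumes a tab-separated VCF record whose INFO column carries the
    inserted sequence under `seq_key`, and that REF is the base directly
    before the insertion, so the new ALT is REF + inserted sequence.
    Records without <INS> or without the attribute are returned unchanged.
    """
    line = record.rstrip("\n")
    fields = line.split("\t")
    if len(fields) < 8 or fields[4] != "<INS>":
        return line
    info = dict(
        item.split("=", 1) for item in fields[7].split(";") if "=" in item
    )
    inserted = info.get(seq_key)
    if not inserted:
        return line
    fields[4] = fields[3] + inserted
    return "\t".join(fields)
```

Under these assumptions, a record with REF A and SEQ=TAAAG would end up with ALT ATAAAG. An <INS> record without any sequence attribute cannot be expanded this way, because the inserted bases are simply not in the file.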
Hi Jonas,
thanks a lot for your help, This is a piece of Pamir output: ##fileformat=VCFv4.2 ##FILTER=<ID=PASS,Description="All filters passed"> ##reference=genome.fa ##source=Pamir ##ALT=<ID=<INS>,Type=String,Description="Insertion of novel sequence"> ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> ##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between REF and ALT alleles"> ##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate of this variant"> ##INFO=<ID=Cluster,Number=1,Type=Integer,Description="ID of the cluster the variant is extracted from"> ##INFO=<ID=Support,Number=1,Type=Integer,Description="Number of reads/contigs supporting the contig"> ##INFO=<ID=SEQ,Number=1,Type=String,Description="Variant sequence"> ##INFO=<ID=FLSUP,Number=1,Type=Integer,Description="Number of left supporting reads in filtering"> ##INFO=<ID=FLRSUP,Number=1,Type=Integer,Description="Number of right supporting reads in filtering"> ##INFO=<ID=FSUP,Number=1,Type=Integer,Description="Number of total supporting reads in filtering"> ##contig=<ID=1,length=249250621> ##contig=<ID=2,length=243199373> ##contig=<ID=3,length=198022430> ##contig=<ID=4,length=191154276> ##contig=<ID=5,length=180915260> ##contig=<ID=6,length=171115067> ##contig=<ID=7,length=159138663> ##contig=<ID=8,length=146364022> ##contig=<ID=9,length=141213431> ##contig=<ID=10,length=135534747> ##contig=<ID=11,length=135006516> ##contig=<ID=12,length=133851895> ##contig=<ID=13,length=115169878> ##contig=<ID=14,length=107349540> ##contig=<ID=15,length=102531392> ##contig=<ID=16,length=90354753> ##contig=<ID=17,length=81195210> ##contig=<ID=18,length=78077248> ##contig=<ID=19,length=59128983> ##contig=<ID=20,length=63025520> ##contig=<ID=21,length=48129895> ##contig=<ID=22,length=51304566> ##contig=<ID=X,length=155270560> ##contig=<ID=Y,length=59373566> ##contig=<ID=MT,length=16569> ##contig=<ID=GL000207.1,length=4262> ##contig=<ID=GL000226.1,length=15008> 
##contig=<ID=GL000229.1,length=19913> ##contig=<ID=GL000231.1,length=27386> ##contig=<ID=GL000210.1,length=27682> ##contig=<ID=GL000239.1,length=33824> ##contig=<ID=GL000235.1,length=34474> ##contig=<ID=GL000201.1,length=36148> ##contig=<ID=GL000247.1,length=36422> ##contig=<ID=GL000245.1,length=36651> ##contig=<ID=GL000197.1,length=37175> ##contig=<ID=GL000203.1,length=37498> ##contig=<ID=GL000246.1,length=38154> ##contig=<ID=GL000249.1,length=38502> ##contig=<ID=GL000196.1,length=38914> ##contig=<ID=GL000248.1,length=39786> ##contig=<ID=GL000244.1,length=39929> ##contig=<ID=GL000238.1,length=39939> ##contig=<ID=GL000202.1,length=40103> ##contig=<ID=GL000234.1,length=40531> ##contig=<ID=GL000232.1,length=40652> ##contig=<ID=GL000206.1,length=41001> ##contig=<ID=GL000240.1,length=41933> ##contig=<ID=GL000236.1,length=41934> ##contig=<ID=GL000241.1,length=42152> ##contig=<ID=GL000243.1,length=43341> ##contig=<ID=GL000242.1,length=43523> ##contig=<ID=GL000230.1,length=43691> ##contig=<ID=GL000237.1,length=45867> ##contig=<ID=GL000233.1,length=45941> ##contig=<ID=GL000204.1,length=81310> ##contig=<ID=GL000198.1,length=90085> ##contig=<ID=GL000208.1,length=92689> ##contig=<ID=GL000191.1,length=106433> ##contig=<ID=GL000227.1,length=128374> ##contig=<ID=GL000228.1,length=129120> ##contig=<ID=GL000214.1,length=137718> ##contig=<ID=GL000221.1,length=155397> ##contig=<ID=GL000209.1,length=159169> ##contig=<ID=GL000218.1,length=161147> ##contig=<ID=GL000220.1,length=161802> ##contig=<ID=GL000213.1,length=164239> ##contig=<ID=GL000211.1,length=166566> ##contig=<ID=GL000199.1,length=169874> ##contig=<ID=GL000217.1,length=172149> ##contig=<ID=GL000216.1,length=172294> ##contig=<ID=GL000215.1,length=172545> ##contig=<ID=GL000205.1,length=174588> ##contig=<ID=GL000219.1,length=179198> ##contig=<ID=GL000224.1,length=179693> ##contig=<ID=GL000223.1,length=180455> ##contig=<ID=GL000195.1,length=182896> ##contig=<ID=GL000212.1,length=186858> ##contig=<ID=GL000222.1,length=186861> 
##contig=<ID=GL000200.1,length=187035> ##contig=<ID=GL000193.1,length=189789> ##contig=<ID=GL000194.1,length=191469> ##contig=<ID=GL000225.1,length=211173> ##contig=<ID=GL000192.1,length=547496> ##contig=<ID=NC_007605,length=171823> ##contig=<ID=hs37d5,length=35477943> #CHROM POS ID REF ALT QUAL FILTER INFO 1 99009 . T <INS> 95.362411 FAIL SVTYPE=INS;SVLEN=37;END=99045;Cluster=28;Support=1;SEQ=GTATTTATTTATTTATTTGTTTATTTACTTATTTACC;FLSUP=0;FRSUP=0;FSUP=0 1 99035 . T <INS> 96.360939 FAIL SVTYPE=INS;SVLEN=13;END=99047;Cluster=28;Support=1;SEQ=CCTTATTTACCTT;FLSUP=0;FRSUP=1;FSUP=1 1 99036 . T <INS> 96.360939 FAIL SVTYPE=INS;SVLEN=13;END=99048;Cluster=28;Support=1;SEQ=ATTATTTACCTTA;FLSUP=0;FRSUP=1;FSUP=1 1 99100 . T <INS> 95.336441 FAIL SVTYPE=INS;SVLEN=38;END=99137;Cluster=28;Support=1;SEQ=CCTTTCTCTTTTTTTCTCTTTTCTTTCTTTTTGTTTTT;FLSUP=0;FRSUP=2;FSUP=2 1 170387 . C <INS> 95.920555 FAIL SVTYPE=INS;SVLEN=7;END=170393;Cluster=77;Support=1;SEQ=GCCCGTC;FLSUP=0;FRSUP=0;FSUP=0 1 170387 . C <INS> 95.920555 FAIL SVTYPE=INS;SVLEN=7;END=170393;Cluster=77;Support=1;SEQ=GCCTGTC;FLSUP=0;FRSUP=0;FSUP=0 1 232510 . A <INS> 96.431458 PASS SVTYPE=INS;SVLEN=64;END=232573;Cluster=87;Support=2;SEQ=TTTTTTGACGAATTCTGTGTAAGTACAAAAAAGACGTAAAATAAAACTTTATTTAAAACACTTT;FLSUP=3;FRSUP=1;FSUP=4 1 245366 . T <INS> 94.741905 FAIL SVTYPE=INS;SVLEN=25;END=245390;Cluster=95;Support=1;SEQ=ATTTCCTTATTTACCTTATTTATTT;FLSUP=0;FRSUP=1;FSUP=1 1 255925 . T <INS> 96.004265 PASS SVTYPE=INS;SVLEN=19;END=255943;Cluster=102;Support=1;SEQ=ATGCGTGTGTGTGTGGTCT;FLSUP=4;FRSUP=1;FSUP=5 1 255925 . T <INS> 96.004265 PASS SVTYPE=INS;SVLEN=19;END=255943;Cluster=102;Support=1;SEQ=ATGTGTGGGTGTGTGGTCT;FLSUP=4;FRSUP=1;FSUP=5 1 339913 . T <INS> 96.406067 PASS SVTYPE=INS;SVLEN=17;END=339929;Cluster=126;Support=4;SEQ=TGTAATACATATATGTA;FLSUP=2;FRSUP=3;FSUP=5 1 415261 . G <INS> 94.781128 PASS SVTYPE=INS;SVLEN=24;END=415284;Cluster=168;Support=1;SEQ=AACAACAACAAAAAAAAAAAAACG;FLSUP=1;FRSUP=1;FSUP=2 1 575206 . 
T <INS> 94.781128 PASS SVTYPE=INS;SVLEN=24;END=575229;Cluster=212;Support=1;SEQ=CGTTTTTTTTTTTTTGTTGTTGTT;FLSUP=1;FRSUP=1;FSUP=2 1 775257 . A <INS> 96.208244 PASS SVTYPE=INS;SVLEN=5;END=775261;Cluster=323;Support=1;SEQ=TAAAG;FLSUP=4;FRSUP=5;FSUP=9 1 811692 . C <INS> 95.955475 FAIL SVTYPE=INS;SVLEN=20;END=811711;Cluster=328;Support=1;SEQ=TCACAGGCGCCCACACTCCC;FLSUP=0;FRSUP=1;FSUP=1 1 874951 . C <INS> 95.360939 PASS SVTYPE=INS;SVLEN=13;END=874963;Cluster=351;Support=1;SEQ=TCCCTGGAGGACC;FLSUP=10;FRSUP=8;FSUP=18 1 874951 . C <INS> 96.360939 PASS SVTYPE=INS;SVLEN=13;END=874963;Cluster=351;Support=1;SEQ=TACCTGGAGGACC;FLSUP=9;FRSUP=8;FSUP=17 1 900007 . G <INS> 95.392281 PASS SVTYPE=INS;SVLEN=40;END=900046;Cluster=358;Support=2;SEQ=TCCGCGCGTCAGCAGTGGGGCTGTGCTGCGGGGAAGGGGG;FLSUP=2;FRSUP=6;FSUP=8 1 919989 . T <INS> 97.455109 FAIL SVTYPE=INS;SVLEN=9;END=919997;Cluster=366;Support=6;SEQ=CTTCTTTAT;FLSUP=1;FRSUP=0;FSUP=1 1 919995 . T <INS> 97.455109 FAIL SVTYPE=INS;SVLEN=5;END=919999;Cluster=366;Support=6;SEQ=ATTTC;FLSUP=0;FRSUP=1;FSUP=1 1 945862 . T <INS> 95.503494 PASS SVTYPE=INS;SVLEN=32;END=945893;Cluster=377;Support=1;SEQ=GTATTTATTTATTTGAATCTTATTTAAATATT;FLSUP=3;FRSUP=1;FSUP=4 1 991808 . G <INS> 95.602104 FAIL SVTYPE=INS;SVLEN=10;END=991817;Cluster=392;Support=1;SEQ=TGTGGGTGGG;FLSUP=0;FRSUP=1;FSUP=1 1 1069484 . A <INS> 94.667793 FAIL SVTYPE=INS;SVLEN=27;END=1069510;Cluster=421;Support=1;SEQ=GCAAGACTCCGTCTCAAAAAAAAAAAA;FLSUP=2;FRSUP=0;FSUP=2 1 1532001 . G <INS> 95.238800 FAIL SVTYPE=INS;SVLEN=42;END=1532042;Cluster=532;Support=1;SEQ=GGAGACAGAGACAGACAGAGAGAGGGAAAGAGGCAGAGACAT;FLSUP=1;FRSUP=0;FSUP=1 1 1594841 . C <INS> 97.208244 PASS SVTYPE=INS;SVLEN=5;END=1594845;Cluster=553;Support=1;SEQ=TCTCG;FLSUP=3;FRSUP=2;FSUP=5 1 1727191 . A <INS> 95.697418 PASS SVTYPE=INS;SVLEN=9;END=1727199;Cluster=594;Support=1;SEQ=TCAAAAAAA;FLSUP=6;FRSUP=6;FSUP=12 1 1823094 . G <INS> 96.208244 PASS SVTYPE=INS;SVLEN=5;END=1823098;Cluster=631;Support=1;SEQ=CAGAG;FLSUP=7;FRSUP=7;FSUP=14
As you can see, I have 8 columns, but you are right about <INS>; I will try convertAllele so that I can use cluster. In my experience, variant callers for large structural variants normally report them as symbolic alleles such as <INS> and <DEL>, as in this example, so I would recommend that BayesTyper be able to work with these symbolic alleles for structural variants; for small indels or SNPs there is no such problem. I will try to run convertAllele and then cluster!
Thanks again for your help Jonas.
Jordi
Sorry, GitHub removed the < INS > column from my post; here is the example again:
1 99009 . T < INS > 95.362411 FAIL SVTYPE=INS;SVLEN=37;END=99045;Cluster=28;Support=1;SEQ=GTATTTATTTATTTATTTGTTTATTTACTTATTTACC;FLSUP=0;FRSUP=0;FSUP=0 1 99035 . T < INS > 96.360939 FAIL SVTYPE=INS;SVLEN=13;END=99047;Cluster=28;Support=1;SEQ=CCTTATTTACCTT;FLSUP=0;FRSUP=1;FSUP=1 1 99036 . T < INS > 96.360939 FAIL SVTYPE=INS;SVLEN=13;END=99048;Cluster=28;Support=1;SEQ=ATTATTTACCTTA;FLSUP=0;FRSUP=1;FSUP=1 1 99100 . T < INS > 95.336441 FAIL SVTYPE=INS;SVLEN=38;END=99137;Cluster=28;Support=1;SEQ=CCTTTCTCTTTTTTTCTCTTTTCTTTCTTTTTGTTTTT;FLSUP=0;FRSUP=2;FSUP=2 1 170387 . C < INS > 95.920555 FAIL SVTYPE=INS;SVLEN=7;END=170393;Cluster=77;Support=1;SEQ=GCCCGTC;FLSUP=0;FRSUP=0;FSUP=0 1 170387 . C < INS > 95.920555 FAIL SVTYPE=INS;SVLEN=7;END=170393;Cluster=77;Support=1;SEQ=GCCTGTC;FLSUP=0;FRSUP=0;FSUP=0 1 232510 . A < INS > 96.431458 PASS SVTYPE=INS;SVLEN=64;END=232573;Cluster=87;Support=2;SEQ=TTTTTTGACGAATTCTGTGTAAGTACAAAAAAGACGTAAAATAAAACTTTATTTAAAACACTTT;FLSUP=3;FRSUP=1;FSUP=4 1 245366 . T < INS > 94.741905 FAIL SVTYPE=INS;SVLEN=25;END=245390;Cluster=95;Support=1;SEQ=ATTTCCTTATTTACCTTATTTATTT;FLSUP=0;FRSUP=1;FSUP=1 1 255925 . T < INS > 96.004265 PASS SVTYPE=INS;SVLEN=19;END=255943;Cluster=102;Support=1;SEQ=ATGCGTGTGTGTGTGGTCT;FLSUP=4;FRSUP=1;FSUP=5 1 255925 . T < INS > 96.004265 PASS SVTYPE=INS;SVLEN=19;END=255943;Cluster=102;Support=1;SEQ=ATGTGTGGGTGTGTGGTCT;FLSUP=4;FRSUP=1;FSUP=5 1 339913 . T < INS > 96.406067 PASS SVTYPE=INS;SVLEN=17;END=339929;Cluster=126;Support=4;SEQ=TGTAATACATATATGTA;FLSUP=2;FRSUP=3;FSUP=5 1 415261 . G < INS > 94.781128 PASS SVTYPE=INS;SVLEN=24;END=415284;Cluster=168;Support=1;SEQ=AACAACAACAAAAAAAAAAAAACG;FLSUP=1;FRSUP=1;FSUP=2 1 575206 . T < INS > 94.781128 PASS SVTYPE=INS;SVLEN=24;END=575229;Cluster=212;Support=1;SEQ=CGTTTTTTTTTTTTTGTTGTTGTT;FLSUP=1;FRSUP=1;FSUP=2 1 775257 . A < INS > 96.208244 PASS SVTYPE=INS;SVLEN=5;END=775261;Cluster=323;Support=1;SEQ=TAAAG;FLSUP=4;FRSUP=5;FSUP=9 1 811692 . 
C < INS > 95.955475 FAIL SVTYPE=INS;SVLEN=20;END=811711;Cluster=328;Support=1;SEQ=TCACAGGCGCCCACACTCCC;FLSUP=0;FRSUP=1;FSUP=1 1 874951 . C < INS > 95.360939 PASS SVTYPE=INS;SVLEN=13;END=874963;Cluster=351;Support=1;SEQ=TCCCTGGAGGACC;FLSUP=10;FRSUP=8;FSUP=18 1 874951 . C < INS > 96.360939 PASS SVTYPE=INS;SVLEN=13;END=874963;Cluster=351;Support=1;SEQ=TACCTGGAGGACC;FLSUP=9;FRSUP=8;FSUP=17 1 900007 . G < INS > 95.392281 PASS SVTYPE=INS;SVLEN=40;END=900046;Cluster=358;Support=2;SEQ=TCCGCGCGTCAGCAGTGGGGCTGTGCTGCGGGGAAGGGGG;FLSUP=2;FRSUP=6;FSUP=8 1 919989 . T < INS > 97.455109 FAIL SVTYPE=INS;SVLEN=9;END=919997;Cluster=366;Support=6;SEQ=CTTCTTTAT;FLSUP=1;FRSUP=0;FSUP=1 1 919995 . T < INS > 97.455109 FAIL SVTYPE=INS;SVLEN=5;END=919999;Cluster=366;Support=6;SEQ=ATTTC;FLSUP=0;FRSUP=1;FSUP=1
Hi Jonas,
I reviewed your previous message and saw that bayesTyperTools convertAllele does not support the < INS > allele. It would be fantastic if this could be supported by next week, because I have found that the programs that detect large de novo insertions cannot genotype these events, so I was planning to use BayesTyper to genotype de novo insertions. If the < INS > symbol is not supported, I cannot use it, and I will have a problem genotyping these events.
If it is possible to solve this, I would appreciate it a lot.
Thanks for everything,
Jordi
Hi Jordi,
I will look into it later in the week and will let you know when the new release is available.
Best,
Jonas
Thanks a lot Jonas, I will wait!
Jordi
Hi Jordi,
Sorry for the silence.
Unfortunately, I did not have time to finish the release last week, and at the beginning of this week I was really busy with a hackathon. I am back to working on the release now and hope to have it ready by the end of this week.
Best,
Jonas
OK Jonas,
thank you very much for your help. We will use BayesTyper to genotype large insertions, and also the output of a program named whamg, which does not genotype; we want to use BayesTyper to genotype all kinds of variants.
So thanks, and I will let you know how genotyping structural variants from different programs goes.
Jordi
Hi Jordi,
The new release is now available (v1.5). Let me know if you run into any problems.
Best,
Jonas
Hi Jonas,
Will this release now allow us to genotype large structural variants, and does it solve the problem when the ALT column of the VCF contains <INS>?
I will download it and check whether it runs!
Thanks a lot
Jordi
Hi Jonas,
Two questions. First, to run BayesTyper when the ALT column contains < INS >, < DEL >, etc.: I first have to use bayesTyperTools convertAllele on the VCF to change < INS > into a sequence, right? And after that I have to run bayesTyper cluster, right? Could you tell me which steps I have to follow?
Secondly, I tried to use convertAllele on a VCF file obtained from whamg. This is the VCF:
##fileformat=VCFv4.2 ##source=WHAM-GRAPHENING:v1.7.0-311-g4e8c ##reference=/gpfs/projects/bsc05/jordivalls/GCAT/human_ref_PANCANCER//genome.fa ##INFO=<ID=A,Number=1,Type=Integer,Description="Total pieces of evidence"> ##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants"> ##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants"> ##INFO=<ID=CF,Number=1,Type=Float,Description="Fraction of reads in graph that cluster with SVTYPE pattern"> ##INFO=<ID=CW,Number=5,Type=Float,Description="SVTYPE weight 0-1; DEL,DUP,INV,INS,BND"> ##INFO=<ID=D,Number=1,Type=Integer,Description="Number of reads supporting a deletion"> ##INFO=<ID=DI,Number=1,Type=Float,Description="Average distance of mates to breakpoint"> ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record"> ##INFO=<ID=EV,Number=1,Type=Integer,Description="Number everted mate-pairs"> ##INFO=<ID=I,Number=1,Type=Integer,Description="Number of reads supporting an insertion"> ##INFO=<ID=SR,Number=1,Type=Integer,Description="Number of split-reads supporing SV"> ##INFO=<ID=SS,Number=1,Type=Integer,Description="Number of split-reads supporing SV"> ##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles"> ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> ##INFO=<ID=T,Number=1,Type=Integer,Description="Number of reads supporting a BND"> ##INFO=<ID=TAGS,Number=.,Type=String,Description="SM tags with breakpoint support"> ##INFO=<ID=TF,Number=1,Type=Integer,Description="Number of reads mapped too far"> ##INFO=<ID=U,Number=1,Type=Integer,Description="Number of reads supporting a duplication"> ##INFO=<ID=V,Number=1,Type=Integer,Description="Number of reads supporting an inversion"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> 
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Per sample SV support"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT insil 1 3082624 . A <DUP> . PASS A=41;CIEND=-10,10;CIPOS=-10,10;CF=0.615385;CW=0.0731707,0.707317,0,0.219512,0;D=3;DI=0;END=99820588;EV=14;I=9;SR=15;SS=0;SVLEN=96737964;SVTYPE=DUP;T=0;TAGS=insil;TF=0;U=29;V=0 GT:DP:SP .:.:46 1 5445724 . T <DUP> . PASS A=13;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0.538462,0,0.461538,0;D=0;DI=0;END=5446020;EV=0;I=6;SR=7;SS=0;SVLEN=296;SVTYPE=DUP;T=0;TAGS=insil;TF=0;U=7;V=0 GT:DP:SP .:.:16 1 6432548 . G <INS> . PASS A=8;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0,0,1,0;D=0;DI=0;END=6432548;EV=0;I=8;SR=0;SS=0;SVLEN=300;SVTYPE=INS;T=0;TAGS=insil;TF=0;U=0;V=0 GT:DP:SP .:.:9 1 6432549 . C <INS> . PASS A=9;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0,0,1,0;D=0;DI=0;END=6432549;EV=0;I=9;SR=0;SS=0;SVLEN=360;SVTYPE=INS;T=0;TAGS=insil;TF=0;U=0;V=0 GT:DP:SP .:.:10 1 6625347 . A <INS> . PASS A=10;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0,0,1,0;D=0;DI=0;END=6625347;EV=0;I=10;SR=0;SS=0;SVLEN=276;SVTYPE=INS;T=0;TAGS=insil;TF=0;U=0;V=0 GT:DP:SP .:.:11 1 6625348 . A <INS> . PASS A=10;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0,0,1,0;D=0;DI=0;END=6625348;EV=0;I=10;SR=0;SS=0;SVLEN=48;SVTYPE=INS;T=0;TAGS=insil;TF=0;U=0;V=0 GT:DP:SP .:.:11 1 7412663 . A <DEL> . PASS A=42;CIEND=-10,10;CIPOS=-10,10;CF=1;CW=0.619048,0.285714,0,0.0952381,0;D=26;DI=383.714;END=7412996;EV=0;I=4;SR=12;SS=0;SVLEN=-333;SVTYPE=DEL;T=0;TAGS=insil;TF=14;U=12;V=0 GT:DP:SP .:.:37 1 7531556 . C <DEL> . PASS A=28;CIEND=-10,10;CIPOS=-10,10;CF=1;CW=0.607143,0.285714,0,0.107143,0;D=17;DI=385.75;END=7531648;EV=0;I=3;SR=8;SS=0;SVLEN=-92;SVTYPE=DEL;T=0;TAGS=insil;TF=9;U=8;V=0 GT:DP:SP .:.:26
I ran convertAllele like this:
bayesTyperTools convertAllele -v insilico3.vcf -g genome.fa -o insilico3_allele_change
and I get the following error:
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error
/scratch/slurm/job5344245/slurm_script: line 14: 277736 Aborted bayesTyperTools convertAllele -v insilico3.vcf -g genome.fa -o insilico3_allele_change
Tue 2 Apr 17:13:43 CEST 2019
[02/04/2019 17:13:43] You are using BayesTyperTools (v1.5 )
[02/04/2019 17:13:43] Running BayesTyperTools (v1.5 ) convertAllele ...
[02/04/2019 17:14:01] Parsed 86 chromosome(s)
[02/04/2019 17:14:01] Parsed 0 alternative allele sequence(s)
[02/04/2019 17:14:01] Parsed 0 mobile element insertion sequence(s)
So I don't know where the problem is.
Thanks for your help
Jordi
Sorry again, for some reason the ALT column was deleted from the previous VCF file. Here it is again:
##fileformat=VCFv4.2 ##source=WHAM-GRAPHENING:v1.7.0-311-g4e8c ##reference=/gpfs/projects/bsc05/jordivalls/GCAT/human_ref_PANCANCER//genome.fa ##INFO=<ID=A,Number=1,Type=Integer,Description="Total pieces of evidence"> ##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants"> ##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants"> ##INFO=<ID=CF,Number=1,Type=Float,Description="Fraction of reads in graph that cluster with SVTYPE pattern"> ##INFO=<ID=CW,Number=5,Type=Float,Description="SVTYPE weight 0-1; DEL,DUP,INV,INS,BND"> ##INFO=<ID=D,Number=1,Type=Integer,Description="Number of reads supporting a deletion"> ##INFO=<ID=DI,Number=1,Type=Float,Description="Average distance of mates to breakpoint"> ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record"> ##INFO=<ID=EV,Number=1,Type=Integer,Description="Number everted mate-pairs"> ##INFO=<ID=I,Number=1,Type=Integer,Description="Number of reads supporting an insertion"> ##INFO=<ID=SR,Number=1,Type=Integer,Description="Number of split-reads supporing SV"> ##INFO=<ID=SS,Number=1,Type=Integer,Description="Number of split-reads supporing SV"> ##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles"> ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> ##INFO=<ID=T,Number=1,Type=Integer,Description="Number of reads supporting a BND"> ##INFO=<ID=TAGS,Number=.,Type=String,Description="SM tags with breakpoint support"> ##INFO=<ID=TF,Number=1,Type=Integer,Description="Number of reads mapped too far"> ##INFO=<ID=U,Number=1,Type=Integer,Description="Number of reads supporting a duplication"> ##INFO=<ID=V,Number=1,Type=Integer,Description="Number of reads supporting an inversion"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> 
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Per sample SV support"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT insil 1 3082624 . A < DUP > . PASS A=41;CIEND=-10,10;CIPOS=-10,10;CF=0.615385;CW=0.0731707,0.707317,0,0.219512,0;D=3;DI=0;END=99820588;EV=14;I=9;SR=15;SS=0;SVLEN=96737964;SVTYPE=DUP;T=0;TAGS=insil;TF=0;U=29;V=0 GT:DP:SP .:.:46 1 5445724 . T < DUP > . PASS A=13;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0.538462,0,0.461538,0;D=0;DI=0;END=5446020;EV=0;I=6;SR=7;SS=0;SVLEN=296;SVTYPE=DUP;T=0;TAGS=insil;TF=0;U=7;V=0 GT:DP:SP .:.:16 1 6432548 . G < INS > . PASS A=8;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0,0,1,0;D=0;DI=0;END=6432548;EV=0;I=8;SR=0;SS=0;SVLEN=300;SVTYPE=INS;T=0;TAGS=insil;TF=0;U=0;V=0 GT:DP:SP .:.:9 1 6432549 . C < INS > . PASS A=9;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0,0,1,0;D=0;DI=0;END=6432549;EV=0;I=9;SR=0;SS=0;SVLEN=360;SVTYPE=INS;T=0;TAGS=insil;TF=0;U=0;V=0 GT:DP:SP .:.:10 1 6625347 . A < INS > . PASS A=10;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0,0,1,0;D=0;DI=0;END=6625347;EV=0;I=10;SR=0;SS=0;SVLEN=276;SVTYPE=INS;T=0;TAGS=insil;TF=0;U=0;V=0 GT:DP:SP .:.:11 1 6625348 . A < INS > . PASS A=10;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0,0,1,0;D=0;DI=0;END=6625348;EV=0;I=10;SR=0;SS=0;SVLEN=48;SVTYPE=INS;T=0;TAGS=insil;TF=0;U=0;V=0 GT:DP:SP .:.:11 1 7412663 . A < DEL > . PASS A=42;CIEND=-10,10;CIPOS=-10,10;CF=1;CW=0.619048,0.285714,0,0.0952381,0;D=26;DI=383.714;END=7412996;EV=0;I=4;SR=12;SS=0;SVLEN=-333;SVTYPE=DEL;T=0;TAGS=insil;TF=14;U=12;V=0 GT:DP:SP .:.:37
Hi Jordi,
Would it be possible for you to share the vcf file from whamg that results in the regex error?
Also, are you compiling BayesTyper yourself or using the release binary? If you are compiling yourself what version of gcc are you using?
Best,
Jonas
Hi Jonas,
Of course I can send you the VCF from wham; please tell me your email address. Yes, I downloaded the source code and we compiled it ourselves. Which is the better option: using your binary or compiling it ourselves? The program was compiled with icc; I am now trying to compile it with gcc.
I downloaded the release binary from GitHub and ran the tool again:
bayesTyperTools convertAllele -v insilico3.vcf -g genome.fa -o insilico3_allele_change
bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/vcf++/Attribute.cpp:177: Attribute::DetailedDescriptor::DetailedDescriptor(const std::vector<std::pair<std::__cxx11::basic_string
I just remembered that the ALT column now contains several symbolic alleles (< INS >, < DEL >, < DUP >, ...), not only INS. I don't know if this is also a problem.
Thanks a lot for your help
Hi Jonas,
I have now tried to run convertAllele with gcc loaded, and the same problem that I showed above appears. The gcc and boost versions used to compile BayesTyper are:
gcc 7.2.0 and boost 1.66
Thanks again for your help
Jordi
Hi Jordi,
Sorry for the silence, I have been on vacation for the last couple of days. You can send the file to me at [email protected]
All the symbolic alleles (ALT) you mention are supported so they should not pose a problem. Also, I assume using the binary or compiling using gcc fixed the regex error issue?
Thanks,
Jonas
Hi Jonas,
OK, no problem! Yes, it is fixed, but the error I mentioned above appears:
bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/vcf++/Attribute.cpp:177: Attribute::DetailedDescriptor::DetailedDescriptor(const std::vector<std::pair<std::__cxx11::basic_string, std::__cxx11::basic_string > >&): Assertion `false' failed. /scratch/slurm/job5366334/slurm_script: line 14: 238173 Aborted
Thank you for sending the file, Jordi.
I found the issue; it is a limitation of our VCF parser. Currently it cannot parse fields/attributes with a "number" of 5 or higher. It therefore crashes on your VCF, since it contains the following:
##INFO=<ID=CW,Number=5,Type=Float,Description="SVTYPE weight 0-1; DEL,DUP,INV,INS,BND">
Unfortunately, I won't have time to fix this right now; however, you can work around it quickly by removing the above line from the header.
Also, I noticed that the <INS> symbolic alleles in the VCF do not contain any field/attribute with the inserted sequence or any reference to a FASTA file with the sequence. convertAllele is therefore not able to convert these, since it does not know what the sequence is. The other symbolic alleles <DEL>, <INV> and <DUP> are no problem since they do not contain any new sequence.
Cheers,
Jonas
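The header workaround described above can be scripted. Below is a minimal Python sketch with hypothetical helper names (large_number_info_ids and drop_info_header are not part of BayesTyper). It assumes header lines follow the layout seen in this thread, with ID first and Number second, and it only looks at ##INFO lines; the limit of 5 comes from the description of the parser limitation above.

```python
import re


def large_number_info_ids(vcf_lines, limit=5):
    """Return IDs of ##INFO header lines whose Number is >= limit.

    Non-numeric Number values such as '.' or 'A' are ignored.
    """
    ids = []
    for line in vcf_lines:
        match = re.match(r"##INFO=<ID=([^,]+),Number=(\d+)", line)
        if match and int(match.group(2)) >= limit:
            ids.append(match.group(1))
    return ids


def drop_info_header(vcf_lines, info_id):
    """Return the lines without the ##INFO header line for `info_id`.

    Only the header definition is removed; records are left untouched.
    """
    prefix = "##INFO=<ID=" + info_id + ","
    return [line for line in vcf_lines if not line.startswith(prefix)]
```

For the whamg header shown in this thread, the first helper would be expected to report only CW, which is the line named above.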
Hi Jonas,
First of all, thanks for your help; it is really useful. Finally, I only want to clarify some points:
1. If I delete the line you mention from the wham VCF and run convertAllele, it will replace <INV>, for example, with a sequence, right? But do I have to filter out the <INS> variants before running convertAllele? This is the command I will use to run convertAllele:
bayesTyperTools convertAllele -v insilico3.vcf -g genome.fa -o insilico3_allele_change
2. Could you indicate which steps we have to follow to genotype the structural variants? For example: 1) run convertAllele (sort if required), 2) run the KMC pipeline, 3) run makeBloom, 4) run bayesTyper cluster, 5) run bayesTyper genotype.
And will we then obtain a VCF with genotypes for each structural variant except INS?
3. So it is not possible to genotype insertions, specifically the insertions from wham, nor <BND>? With which programs could we use BayesTyper to genotype de novo insertions? Pindel? Manta? Do you need the sequence to genotype de novo insertions? Is there no way to, for example, count the reads in order to determine the genotype? Why is the sequence important?
Thanks again for your help,
Jordi
Hi Jonas, I ran convertAllele on the VCF from wham after removing the line ##INFO=<ID=CW,Number=5,Type=Float,Description="SVTYPE weight 0-1; DEL,DUP,INV,INS,BND">
with this command: bayesTyperTools convertAllele -v insilico3.vcf -g genome.fa -o insilico3_allele_change
and the error persists:
bayesTyperTools: /apps/BAYESTYPER/SRC/BayesTyper-1.5/src/vcf++/Attribute.cpp:177: Attribute::DetailedDescriptor::DetailedDescriptor(const std::vector<std::pair<std::__cxx11::basic_string
Do I have to remove the < INS > variants?
Thanks
That is strange. It worked fine for me when I tried it on the VCF file you sent me (insilico3_tumor.vcf) after I removed that line from the header. Is it the same file that you are using?
Regarding the questions in the previous post:
- You do not need to filter any variants beforehand. bayesTyperTools convertAllele will filter all variants that are not supported.
- The pipeline that you mention looks correct. If you have multiple VCF files, you will also need to run bayesTyperTools combine. Try to look at the README; all the steps are explained in more detail there.
- BayesTyper needs to have the sequences in order to count the k-mers in the insertions. The k-mers in the insertion are compared to the k-mers in the raw reads when estimating the genotypes. BayesTyper does not use the mapped reads. Regarding alternative methods, we have used manta before to predict insertions followed by genotyping with BayesTyper.
Cheers,
Jonas
Hi Jonas,
Sorry for my silence, I have been working on other projects. Now I have tried again to run BayesTyper with the vcf from the whamg variant caller (the file that I sent you, insilico_tumor.vcf).
I removed the line ##INFO=<ID=CW,Number=5,Type=Float,Description="SVTYPE weight 0-1; DEL,DUP,INV,INS,BND"> and ran convertAllele like this: bayesTyperTools convertAllele -v /gpfs/scratch/bsc05/bsc05832/GCAT/Benchmarking/wham/insilico_3/insilico3_tumor.vcf -g /gpfs/projects/bsc05/jordivalls/GCAT/human_ref_PANCANCER/genome.fa -o insilico3_allele_change
and the error is this:
bayesTyperTools: /apps/BAYESTYPER/SRC/BayesTyper-1.5/src/vcf++/Attribute.cpp:177: Attribute::DetailedDescriptor::DetailedDescriptor(const std::vector<std::pair<std::__cxx11::basic_string
To solve the problem with the convertAllele step, I would appreciate it if you could post here the command that you use to run bayesTyperTools convertAllele. Also, the versions I used to compile BayesTyper are:
gcc/7.2.0 boost/1.66.0_py3 cmake/3.12.0
I also tried to run BayesTyper with the compiled version that you provide, but the result is the same...
Thanks, and I will reinstall BayesTyper with the versions that you use to compile the tool.
Thanks again for your help
Jordi
Sorry Jonas, I finally got it working! It was a mistake in my path... sorry for taking your time. Now I'm able to run the convertAllele step.
Now I have a vcf with really long strings, which I cannot post here... Is that correct?
thanks again for your interest.
Jordi
Hi Jonas,
I think there may be a bug in the convertAllele step.
I obtained the vcf by running the command:
"bayesTyperTools convertAllele -v insilico3_tumor.vcf -z 1 -g /gpfs/projects/bsc05/jordivalls/GCAT/human_ref_PANCANCER/genome.fa -o insilico3_allele_change"
After this step, I ran the KMC pipeline without any problem.
I generated the tsv file with this information: insilico3 M /path/to/prefix/KMC/insilico3
And finally I ran the cluster command: bayesTyper cluster -v insilico_allele_change.vcf -s insilico3.tsv -g GRCh37_canon.fa -d GRCh37_decoy.fa -p 8
After 1 minute it aborts with this error:
ERROR: Variants on the same position need to be multi-allelic; multiple variants observed on position "2662744" on contig "12"
bayesTyper: /apps/BAYESTYPER/SRC/BayesTyper-1.5/src/bayesTyper/VariantFileParser.cpp:852: void VariantFileParser::clusterVariants(VariantCluster::Variant&, uint, std::set
/scratch/slurm/job5915751/slurm_script: line 15: 79106 Aborted bayesTyper cluster -v /gpfs/projects/bsc05/jordivalls/GCAT/Bayestyper/insilico_allele_change.vcf -s /gpfs/projects/bsc05/jordivalls/GCAT/Bayestyper/insilico.tsv -g /gpfs/projects/bsc05/jordivalls/apps/BayesTyper/ref_genomes/bayestyper_GRCh37_bundle_v1.3/GRCh37_canon.fa -d /gpfs/projects/bsc05/jordivalls/apps/BayesTyper/ref_genomes/bayestyper_GRCh37_bundle_v1.3/GRCh37_decoy.fa -p 8
I was surprised by the error about multiple variants on position "2662744" on contig "12", because in the original file that I sent you by email (insilico3_tumor.vcf) I found these lines:
12 2662744 . G <INV> . PASS A=98;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0.0833333,0.541667,0.375,0;D=0;DI=2;END=2663227;EV=0;I=36;SR=35;SS=25;SVLEN=483;SVTYPE=INV;T=0;TAGS=insil;TF=0;U=8;V=52 GT:DP:SP .:.:88
12 2662745 . T <DUP> . PASS A=16;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0.4375,0.375,0.1875,0;D=0;DI=0;END=2663226;EV=0;I=3;SR=7;SS=6;SVLEN=481;SVTYPE=DUP;T=0;TAGS=insil;TF=6;U=7;V=6 GT:DP:SP .:.:23
This inversion and duplication are separated by one base pair. But after running bayesTyperTools convertAllele I found this:
12 2662745 . TGCAGTCAGCCGTATTTTTGGCAGACTCTGAAGCCTGAGAGCACTAAAAAGAGACAGCGGCGTGTCCCAGGGTGGGGCTTAGCCATGCAGGCATGGCAGAAGGCCCAGGCGGTGTGGAGTCTGTGTGGTATCGCGGGCAACATTGCCATGACTGTAGTCAAGCCTGGTGTTAAAGGGTGAGAGTGATGGGGAGGATCCAGTGAGGACAAATGTTGGGTCCTCTGAAAGAGCAGGAGGGATCATAGAAAGGGAGGCTCTGGCCAGAGTCATGGAGCATCTGAGCTCAGCACCCAGGAGTGGAAGCAATTCTATGCCGTGCAAGGGCAGGATGTGGCCCTCCCAATAATAAGAGCTGTGGGGGAGAGGACGTCCTGGAGTTAAGGATGCTGAGTCAGACGCATTGTCAATGTGGATGATGACTGCAGTGCCATCTGACTTATATTAAAGGCGAATTTGGGACAATTCAGGGAAATACTATTTTAC GTAAAATAGTATTTCCCTGAATTGTCCCAAATTCGCCTTTAATATAAGTCAGATGGCACTGCAGTCATCATCCACATTGACAATGCGTCTGACTCAGCATCCTTAACTCCAGGACGTCCTCTCCCCCACAGCTCTTATTATTGGGAGGGCCACATCCTGCCCTTGCACGGCATAGAATTGCTTCCACTCCTGGGTGCTGAGCTCAGATGCTCCATGACTCTGGCCAGAGCCTCCCTTTCTATGATCCCTCCTGCTCTTTCAGAGGACCCAACATTTGTCCTCACTGGATCCTCCCCATCACTCTCACCCTTTAACACCAGGCTTGACTACAGTCATGGCAATGTTGCCCGCGATACCACACAGACTCCACACCGCCTGGGCCTTCTGCCATGCCTGCATGGCTAAGCCCCACCCTGGGACACGCCGCTGTCTCTTTTTAGTGCTCTCAGGCTTCAGAGTCTGCCAAAAATACGGCTGACTGCA . PASS A=98;CF=0;CIEND=-10,10;CIPOS=-10,10;D=0;DI=2;END=2663227;EV=0;I=36;SR=35;SS=25;SVLEN=483;SVTYPE=INV;T=0;TAGS=insil;TF=0;U=8;V=52 GT:DP:SP .:.:88 12 2662745 . 
TGCAGTCAGCCGTATTTTTGGCAGACTCTGAAGCCTGAGAGCACTAAAAAGAGACAGCGGCGTGTCCCAGGGTGGGGCTTAGCCATGCAGGCATGGCAGAAGGCCCAGGCGGTGTGGAGTCTGTGTGGTATCGCGGGCAACATTGCCATGACTGTAGTCAAGCCTGGTGTTAAAGGGTGAGAGTGATGGGGAGGATCCAGTGAGGACAAATGTTGGGTCCTCTGAAAGAGCAGGAGGGATCATAGAAAGGGAGGCTCTGGCCAGAGTCATGGAGCATCTGAGCTCAGCACCCAGGAGTGGAAGCAATTCTATGCCGTGCAAGGGCAGGATGTGGCCCTCCCAATAATAAGAGCTGTGGGGGAGAGGACGTCCTGGAGTTAAGGATGCTGAGTCAGACGCATTGTCAATGTGGATGATGACTGCAGTGCCATCTGACTTATATTAAAGGCGAATTTGGGACAATTCAGGGAAATACTATTTTA TGCAGTCAGCCGTATTTTTGGCAGACTCTGAAGCCTGAGAGCACTAAAAAGAGACAGCGGCGTGTCCCAGGGTGGGGCTTAGCCATGCAGGCATGGCAGAAGGCCCAGGCGGTGTGGAGTCTGTGTGGTATCGCGGGCAACATTGCCATGACTGTAGTCAAGCCTGGTGTTAAAGGGTGAGAGTGATGGGGAGGATCCAGTGAGGACAAATGTTGGGTCCTCTGAAAGAGCAGGAGGGATCATAGAAAGGGAGGCTCTGGCCAGAGTCATGGAGCATCTGAGCTCAGCACCCAGGAGTGGAAGCAATTCTATGCCGTGCAAGGGCAGGATGTGGCCCTCCCAATAATAAGAGCTGTGGGGGAGAGGACGTCCTGGAGTTAAGGATGCTGAGTCAGACGCATTGTCAATGTGGATGATGACTGCAGTGCCATCTGACTTATATTAAAGGCGAATTTGGGACAATTCAGGGAAATACTATTTTAGCAGTCAGCCGTATTTTTGGCAGACTCTGAAGCCTGAGAGCACTAAAAAGAGACAGCGGCGTGTCCCAGGGTGGGGCTTAGCCATGCAGGCATGGCAGAAGGCCCAGGCGGTGTGGAGTCTGTGTGGTATCGCGGGCAACATTGCCATGACTGTAGTCAAGCCTGGTGTTAAAGGGTGAGAGTGATGGGGAGGATCCAGTGAGGACAAATGTTGGGTCCTCTGAAAGAGCAGGAGGGATCATAGAAAGGGAGGCTCTGGCCAGAGTCATGGAGCATCTGAGCTCAGCACCCAGGAGTGGAAGCAATTCTATGCCGTGCAAGGGCAGGATGTGGCCCTCCCAATAATAAGAGCTGTGGGGGAGAGGACGTCCTGGAGTTAAGGATGCTGAGTCAGACGCATTGTCAATGTGGATGATGACTGCAGTGCCATCTGACTTATATTAAAGGCGAATTTGGGACAATTCAGGGAAATACTATTTTA . PASS A=16;CF=0;CIEND=-10,10;CIPOS=-10,10;D=0;DI=0;END=2663226;EV=0;I=3;SR=7;SS=6;SVLEN=481;SVTYPE=DUP;T=0;TAGS=insil;TF=6;U=7;V=6 GT:DP:SP .:.:23
As you can see, both records are at exactly the same position, but it does not correspond to the position in the error message. Moreover, I think convertAllele repeats the duplication position and removes the inversion position, because the INFO of the first record corresponds to the inversion and the INFO of the second one corresponds to the duplication.
So I think that convertAllele has a bug, because it repeats this position. Otherwise, why can't I run the cluster step? Could you try it?
Thanks again for your time
Jordi
Hi Jonas,
Sorry for insisting, but have you had a look at my previous post?
Thanks
Jordi
Hi Jordi,
Sorry for the silence. I have been traveling and on vacation for the last couple of weeks.
Yes, the really long strings are correct. convertAllele converts symbolic alternative alleles (e.g. <DEL>) to sequence. Therefore large variants will result in long sequences in the vcf.
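As an illustration of what converting a symbolic allele to sequence means, the sketch below expands an <INV> record into explicit REF and ALT strings. It follows the convention visible in the converted record posted earlier in this thread, where the record moves one base to the right and ALT is the reverse complement of REF. This is a sketch under those assumptions, not the convertAllele code; the function names and the toy reference are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def expand_inversion(reference, pos, end):
    """Expand a symbolic <INV> given 1-based POS and END.

    Returns (new_pos, ref_allele, alt_allele). The anchor base at POS is
    dropped, so the explicit record starts at POS + 1 (an assumption based
    on the converted record shown in this thread).
    """
    # 1-based bases POS+1..END correspond to the 0-based slice [pos:end].
    ref_allele = reference[pos:end]
    return pos + 1, ref_allele, reverse_complement(ref_allele)


# Toy contig (hypothetical), with an inversion annotated at POS=3, END=7.
toy_reference = "GGATCCGTAAC"
print(expand_inversion(toy_reference, pos=3, end=7))
```

The explicit alleles grow with END - POS, which is why a 483 bp inversion produces REF and ALT strings of several hundred bases each.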
The reason for the error when running cluster is that the variants are on the same position, but are recorded as two different variants. The input for cluster needs to be in multi-allelic format. You can change your vcf file to this by running bayesTyperTools combine on your file. Alternatively you can also use bcftools: bcftools norm -m +any.
The reason for the difference in positions between the error message and the vcf is that the vcf is 1-based and the positions internally in bayesTyper are 0-based.
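To find the offending records before running cluster, a small check along these lines may help. It is a minimal sketch, not part of BayesTyper: it groups vcf records by contig and position, then reports every position that holds more than one record in both the 1-based vcf coordinate and the 0-based coordinate that appears in the error message. The function name and the toy records are assumptions for illustration.

```python
from collections import defaultdict


def find_shared_positions(vcf_lines):
    """Group records by (contig, 1-based POS); keep groups with >1 record."""
    seen = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        contig, pos = line.split("\t")[:2]
        seen[(contig, int(pos))].append(line)
    return {key: recs for key, recs in seen.items() if len(recs) > 1}


# Toy records (hypothetical alleles; contig and position taken from the thread).
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "12\t2662745\t.\tTG\tCA\t.\tPASS\tSVTYPE=INV",
    "12\t2662745\t.\tT\tTGT\t.\tPASS\tSVTYPE=DUP",
    "12\t2700000\t.\tA\tC\t.\tPASS\t.",
]
for (contig, pos), recs in find_shared_positions(records).items():
    # vcf POS is 1-based; subtract 1 to get the 0-based internal coordinate.
    print(f"contig {contig}: {len(recs)} records at POS {pos} (0-based {pos - 1})")
```

Any position reported here would need to be merged into a single multi-allelic record, for example with the combine or bcftools norm commands mentioned above.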
Best,
Jonas
Hi Jonas,
I hope the vacation went well! I just want to mention that I only have one caller run on one sample, so for the combine step I have to use this vcf (after the convertAllele and bcftools norm steps) together with the prior vcf. First I applied convertAllele, which finished successfully:
bayesTyperTools convertAllele -v insilico3_tumor.vcf -z 1 -g /gpfs/projects/bsc05/jordivalls/GCAT/human_ref_PANCANCER/genome.fa -o insilico3_allele_change
Then I applied bgzip and tabix, followed by bcftools norm:
bgzip insilico3_allele_change.vcf
tabix -p vcf insilico3_allele_change.vcf.gz
bcftools norm -m +any insilico3_allele_change.vcf.gz | gzip > insilico3_allele_change.vcf.gz_norm.vcf.gz
And finally I applied combine:
bayesTyperTools combine -v wham:/gpfs/projects/bsc05/jordivalls/GCAT/Bayestyper/insilico3_allele_change.vcf.gz_norm.vcf.gz,prior:/gpfs/projects/bsc05/jordivalls/apps/BayesTyper/ref_genomes/bayestyper_GRCh37_bundle_v1.3/SNP_dbSNP150common_SV_1000g_dbSNP150all_GDK_GoNL_GTEx_GRCh37.vcf.gz -o insilico3 -z
The following error appears:
bayesTyperTools: /apps/BAYESTYPER/SRC/BayesTyper-1.5/src/vcf++/Contig.cpp:38: Contig::Contig(const std::vector<std::__cxx11::basic_string
I have two questions:
1- When I ran the cluster module of bayesTyper with the bcftools norm output file, it seems to work. Why does it work?
2- When I try to run the combine module of bayesTyperTools without the file obtained from bcftools norm,
the command is this:
bayesTyperTools combine -v wham:/gpfs/projects/bsc05/jordivalls/GCAT/Bayestyper/insilico3_allele_change.vcf.gz,prior:/gpfs/projects/bsc05/jordivalls/apps/BayesTyper/ref_genomes/bayestyper_GRCh37_bundle_v1.3/SNP_dbSNP150common_SV_1000g_dbSNP150all_GDK_GoNL_GTEx_GRCh37.vcf.gz -o insilico3 -z
the error obtained is this:
bayesTyperTools: /apps/BAYESTYPER/SRC/BayesTyper-1.5/src/vcf++/Contig.cpp:38: Contig::Contig(const std::vector<std::__cxx11::basic_string
I don't know what this error is... Am I doing something wrong? Why am I not able to run the combine module? Is it because it needs more vcfs?
Thanks for your help again
Jordi
Hi Jonas, I can finally run genotype without any problem!
I have to mention 2 things that I think are important in the final vcf output from BayesTyper...
The final output does not contain the SVTYPE, so after running this tool I don't know whether a variant is a duplication, an inversion, etc., whereas wham gives this information. It would be nice if this information, as well as the SVLEN and END position, appeared in future releases. Here is an example:
bayestyper output:
1 5445724 . T TCCTGTGCCCTCACATGGTCGTCTCTCTGTGTGTCTGTGTCCTAATCTCCTCTTCTCATAAGGACAACAGTCCTATTGAGTTAGGGCCCACCCTAATGACCTCAATTTGATGTACTGACCTCTTTAAAGATCCTATCTCCATACACAGTCACATCCTCAGGCACAGGGGTCTAGGACTTTAATACATAAATTTGGAGGGCACACAATTTGGTCCATAACAGTGGGTGACAGACCCCTGACACCTGCTTGACAGATAGCTCCCAACAGCCGACCACAATGTCTCCTTTTGGAGGGGGT 99 PASS AC=1;AF=0.5;AN=2;ACP=1,1;VCS=1;VCR=1:5445724-5445724;VCGS=1;VCGR=1:5445724-5445724;HC=2;ACO=. GT:GQ:GPP:APP:NAK:FAK:MAC:SAF 0/1:99:0,1,0:1,1:5.15,34.95:1,1:5.2306,4.55946:0,0
Wham output: 1 5445724 . T <DUP> . PASS A=13;CIEND=-10,10;CIPOS=-10,10;CF=0;CW=0,0.538462,0,0.461538,0;D=0;DI=0;END=5446020;EV=0;I=6;SR=7;SS=0;SVLEN=296;SVTYPE=DUP;T=0;TAGS=insil;TF=0;U=7;V=0 GT:DP:SP .:.:16
Not all programs give this information, but it is really useful for some projects...
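One possible workaround in the meantime is to copy these fields over from the caller's vcf by matching contig and position. The Python sketch below does that for SVTYPE, SVLEN and END. It is only an illustration under several assumptions: all function names are hypothetical, records are matched on an exact (contig, POS) pair (which will miss variants whose position shifted during allele conversion, and keeps only the last caller record when several share a position), and the matching ##INFO header lines would still have to be added for the result to be a valid vcf.

```python
def parse_info(info_field):
    """Turn 'A=1;SVTYPE=DUP;FLAG' into a dict; flags map to True."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value if value else True
    return info


def build_sv_lookup(caller_lines, keys=("SVTYPE", "SVLEN", "END")):
    """Map (contig, POS) to the selected INFO fields of the caller's vcf."""
    lookup = {}
    for line in caller_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        info = parse_info(fields[7])
        lookup[(fields[0], int(fields[1]))] = {k: info[k] for k in keys if k in info}
    return lookup


def annotate(genotyped_line, lookup):
    """Append the looked-up fields to the INFO column of a genotyped record."""
    fields = genotyped_line.rstrip("\n").split("\t")
    extra = lookup.get((fields[0], int(fields[1])))
    if extra:
        added = ";".join(f"{k}={v}" for k, v in extra.items())
        fields[7] = added if fields[7] == "." else fields[7] + ";" + added
    return "\t".join(fields)


# Toy records (shortened, hypothetical versions of the two lines above).
wham_lines = ["1\t5445724\t.\tT\t<DUP>\t.\tPASS\tA=13;END=5446020;SVLEN=296;SVTYPE=DUP"]
genotyped = "1\t5445724\t.\tT\tTCCTG\t99\tPASS\tAC=1;AF=0.5;AN=2\tGT\t0/1"
print(annotate(genotyped, build_sv_lookup(wham_lines)))
```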
Thanks for your help
Jordi