
lsort does not merge multiple samples run with lumpy

Open Jordi-V opened this issue 5 years ago • 3 comments

Dear all,

I ran all my samples through lumpy; the lumpy developers recommended that I use the -P option so that I could run svtools lsort afterwards.

All 808 samples ran without problems, but when I try to use svtools lsort, it doesn't merge anything. The command I use is: svtools lsort -f all_vcfs_lumpygz | bgzip -c > GCAT_all_samples_lumpy_sort.vcf.gz

The sample VCFs are compressed with gzip, and the file listing all my samples looks like this: path/sample.vcf.gz path/sample2.vcf.gz path/sample3.vcf.gz ...

The output file contains the header with all samples, but no SV records at all. Is there a bug or a problem with lsort? How can I merge the SVs that are shared between samples? All my samples are from the same population, but they are not related.

Thanks for your help and time

Jordi

Jordi-V avatar Jan 07 '19 16:01 Jordi-V

This is the header I get from svtools lsort; there are no calls after it:

##fileformat=VCFv4.2
##source=LUMPY
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=STRANDS,Number=.,Type=String,Description="Strand orientation of the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--)">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS95,Number=2,Type=Integer,Description="Confidence interval (95%) around POS for imprecise variants">
##INFO=<ID=CIEND95,Number=2,Type=Integer,Description="Confidence interval (95%) around END for imprecise variants">
##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakends">
##INFO=<ID=EVENT,Number=1,Type=String,Description="ID of event associated to breakend">
##INFO=<ID=SECONDARY,Number=0,Type=Flag,Description="Secondary breakend in a multi-line variants">
##INFO=<ID=SU,Number=.,Type=Integer,Description="Number of pieces of evidence supporting the variant across all samples">
##INFO=<ID=PE,Number=.,Type=Integer,Description="Number of paired-end reads supporting the variant across all samples">
##INFO=<ID=SR,Number=.,Type=Integer,Description="Number of split reads supporting the variant across all samples">
##INFO=<ID=BD,Number=.,Type=Integer,Description="Amount of BED evidence supporting the variant across all samples">
##INFO=<ID=EV,Number=.,Type=String,Description="Type of LUMPY evidence contributing to the variant call">
##INFO=<ID=PRPOS,Number=.,Type=String,Description="LUMPY probability curve of the POS breakend">
##INFO=<ID=PREND,Number=.,Type=String,Description="LUMPY probability curve of the END breakend">
##INFO=<ID=SNAME,Number=.,Type=String,Description="Source sample name">
##INFO=<ID=ALG,Number=1,Type=String,Description="Evidence PDF aggregation algorithm">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=DUP:TANDEM,Description="Tandem duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SU,Number=1,Type=Integer,Description="Number of pieces of evidence supporting the variant">
##FORMAT=<ID=PE,Number=1,Type=Integer,Description="Number of paired-end reads supporting the variant">
##FORMAT=<ID=SR,Number=1,Type=Integer,Description="Number of split reads supporting the variant">
##FORMAT=<ID=BD,Number=1,Type=Integer,Description="Amount of BED evidence supporting the variant">
##SAMPLE=<ID=CWGS837>
##SAMPLE=<ID=CWGS838>
##SAMPLE=<ID=CWGS839>
##SAMPLE=<ID=CWGS840>
##SAMPLE=<ID=CWGS842>
##SAMPLE=<ID=CWGS843>
##SAMPLE=<ID=CWGS844>
##SAMPLE=<ID=CWGS845>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT VARIOUS

Jordi-V avatar Jan 08 '19 09:01 Jordi-V
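
A quick way to confirm that an output file really holds a header and no calls is to count its non-header lines. The following is only a minimal sketch: the function name `count_vcf_records` is made up for illustration, and it assumes the file was written by gzip or bgzip (Python's `gzip` module reads both).

```python
import gzip


def count_vcf_records(path):
    """Count data lines (non-empty, not starting with '#') in a gzip/bgzip VCF."""
    n = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Header and meta lines start with '#'; everything else is a record.
            if line.strip() and not line.startswith("#"):
                n += 1
    return n
```

Calling `count_vcf_records("GCAT_all_samples_lumpy_sort.vcf.gz")` on the file produced by the command above would return 0 if lsort wrote only the header.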

If your calls are not genotyped, you will need to use the -r option to lsort. (By default, calls with missing genotypes are excluded).

Haley

abelhj avatar Jan 08 '19 13:01 abelhj
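
To see which records the default behaviour described above would drop, one can check each record's genotype columns before running lsort. The sketch below is not svtools code: `has_called_variant` is a hypothetical helper, and it assumes that "excluded" covers records in which every sample's GT is missing (`./.`) or homozygous reference. It also assumes tab-separated VCF lines.

```python
def has_called_variant(vcf_line):
    """Return True if at least one sample has a non-missing, non-reference GT.

    Assumption (not taken from svtools itself): records where every GT is
    './.' or '0/0' are the ones lsort leaves out unless -r is given.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False  # no FORMAT and sample columns present
    fmt = fields[8].split(":")
    if "GT" not in fmt:
        return False
    gt_idx = fmt.index("GT")
    for sample in fields[9:]:
        parts = sample.split(":")
        gt = parts[gt_idx] if gt_idx < len(parts) else "."
        # Treat phased and unphased genotypes the same way.
        alleles = gt.replace("|", "/").split("/")
        if any(a not in (".", "0", "") for a in alleles):
            return True
    return False
```

Ungenotyped lumpy calls, whose sample column starts with `./.`, make this function return False, which fits the advice to pass -r to lsort.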

Thanks a lot! Now it works!

One more question: how can I tell whether calls from different samples were merged when they share the same position and type? I ran lmerge, and in the example below only one sample appears in the merged record, even though two different samples have the same call. These are the calls obtained from lumpyexpress for sample1 and sample2.

sample1:

1 829170 1 N . . SVTYPE=DEL;STRANDS=+-:4;SVLEN=-35;END=829205;CIPOS=-9,8;CIEND=-9,8;CIPOS95=0,0;CIEND95=0,0;SU=4;PE=0;SR=4;PRPOS=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;PREND=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13 GT:SU:PE:SR ./.:4:0:4

sample2:

1 829170 2 N . . SVTYPE=DEL;STRANDS=+-:4;SVLEN=-35;END=829205;CIPOS=-9,8;CIEND=-9,8;CIPOS95=0,0;CIEND95=0,0;SU=4;PE=0;SR=4;PRPOS=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;PREND=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13 GT:SU:PE:SR ./.:4:0:4

after merge:

1 829170 3548 N 295.44 . SVTYPE=DEL;SVLEN=-35;END=829205;STRANDS=+-:4;CIPOS=-9,8;CIEND=-9,8;CIPOS95=0,0;CIEND95=0,0;SU=4;PE=0;SR=4;PRPOS=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;PREND=3.78597e-15,1.50722e-13,6.00036e-12,2.38879e-10,9.50993e-09,3.78597e-07,1.50722e-05,0.000600036,0.0238879,0.950993,0.0238879,0.000600036,1.50722e-05,3.78597e-07,9.50993e-09,2.38879e-10,6.00036e-12,1.50722e-13;SNAME=sample1:2;ALG=PROD

As you can see, only one sample appears in SNAME. How does the merge work? Could you post a small example? Also, the header does not end with a column line listing all the samples: only the columns up to INFO appear, and there is no FORMAT column. The header is below:

##fileformat=VCFv4.2
##reference=
##source=LUMPY
##SAMPLE=<ID=sample1>
##SAMPLE=<ID=sample2>
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=STRANDS,Number=.,Type=String,Description="Strand orientation of the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--)">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS95,Number=2,Type=Integer,Description="Confidence interval (95%) around POS for imprecise variants">
##INFO=<ID=CIEND95,Number=2,Type=Integer,Description="Confidence interval (95%) around END for imprecise variants">
##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakends">
##INFO=<ID=EVENT,Number=1,Type=String,Description="ID of event associated to breakend">
##INFO=<ID=SECONDARY,Number=0,Type=Flag,Description="Secondary breakend in a multi-line variants">
##INFO=<ID=SU,Number=.,Type=Integer,Description="Number of pieces of evidence supporting the variant across all samples">
##INFO=<ID=PE,Number=.,Type=Integer,Description="Number of paired-end reads supporting the variant across all samples">
##INFO=<ID=SR,Number=.,Type=Integer,Description="Number of split reads supporting the variant across all samples">
##INFO=<ID=BD,Number=.,Type=Integer,Description="Amount of BED evidence supporting the variant across all samples">
##INFO=<ID=EV,Number=.,Type=String,Description="Type of LUMPY evidence contributing to the variant call">
##INFO=<ID=PRPOS,Number=.,Type=String,Description="LUMPY probability curve of the POS breakend">
##INFO=<ID=PREND,Number=.,Type=String,Description="LUMPY probability curve of the END breakend">
##INFO=<ID=SNAME,Number=.,Type=String,Description="Source sample name">
##INFO=<ID=ALG,Number=1,Type=String,Description="Evidence PDF aggregation algorithm">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=DUP:TANDEM,Description="Tandem duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SU,Number=1,Type=Integer,Description="Number of pieces of evidence supporting the variant">
##FORMAT=<ID=PE,Number=1,Type=Integer,Description="Number of paired-end reads supporting the variant">
##FORMAT=<ID=SR,Number=1,Type=Integer,Description="Number of split reads supporting the variant">
##FORMAT=<ID=BD,Number=1,Type=Integer,Description="Amount of BED evidence supporting the variant">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##FORMAT=<ID=SQ,Number=1,Type=Float,Description="Phred-scaled probability that this site is variant (non-reference in this sample">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of reference observations">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of alternate observations">
##FORMAT=<ID=RS,Number=1,Type=Integer,Description="Reference allele split-read observation count, with partial observations recorded fractionally">
##FORMAT=<ID=AS,Number=A,Type=Integer,Description="Alternate allele split-read observation count, with partial observations recorded fractionally">
##FORMAT=<ID=ASC,Number=A,Type=Integer,Description="Alternate allele clipped-read observation count, with partial observations recorded fractionally">
##FORMAT=<ID=RP,Number=1,Type=Integer,Description="Reference allele paired-end observation count, with partial observations recorded fractionally">
##FORMAT=<ID=AP,Number=A,Type=Integer,Description="Alternate allele paired-end observation count, with partial observations recorded fractionally">
##FORMAT=<ID=AB,Number=A,Type=Float,Description="Allele balance, fraction of observations from alternate allele, QA/(QR+QA)">
#CHROM POS ID REF ALT QUAL FILTER INFO

Is this result correct? Do I have to run svtyper in order to get the FORMAT column with all samples in the header?

Thanks so much for your help!

Jordi

Jordi-V avatar Jan 08 '19 15:01 Jordi-V
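
One way to inspect which source calls ended up in a merged record is to pull the SNAME entries out of its INFO column. This is a rough sketch only: `sname_entries` is a hypothetical helper, and it assumes SNAME holds comma-separated `sample:variant_id` items, which is a guess based on the `SNAME=sample1:2` value in the merged record above and not a documented format.

```python
def sname_entries(info):
    """Return the SNAME items of an INFO string as (sample, variant_id) pairs.

    Assumption: SNAME is a comma-separated list of 'sample:variant_id' items.
    An item without a colon is returned with an empty variant_id.
    """
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if key == "SNAME" and sep:
            pairs = []
            for entry in value.split(","):
                if ":" in entry:
                    # Split on the last colon so sample names may contain ':'.
                    sample, variant_id = entry.rsplit(":", 1)
                else:
                    sample, variant_id = entry, ""
                pairs.append((sample, variant_id))
            return pairs
    return []  # no SNAME key in this INFO string
```

Under that assumption, a record merged from two input calls would yield two pairs, while the record shown in this thread yields a single pair.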