svim
svim copied to clipboard
[BUG] Unsorted positions on sequence
Bug description
Running SVIM on some long read PacBio bam files, I get an error from tabix, that the resulting sequences are unsorted.
[E::hts_idx_push] Unsorted positions on sequence #1: 2345914 followed by 2345913
tbx_index_build failed: MtSinai_PacBio/variants.vcf.gz
How to reproduce:
Running SVIM on MtSinai PacBio data
BAM: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/PacBio_minimap2_bam/HG002_PacBio_GRCh38.bam
Reference genome (mentioned in readme): GRCh38/hg38 - ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
SVIM call:
svim alignment --sample HG002 --partition_max_distance 1000 --cluster_max_distance 0.5 --min_sv_size 40 --segment_gap_tolerance 20 --segment_overlap_tolerance 20 --read_names --max_sv_size 1000000 MtSinai_PacBio/ ../data/long_reads/HG002_PacBio_GRCh38.bam ../data/reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna &>> 2022_01_20_svim_output.MtSinai_PacBio.log
bgzip call:
bgzip -c MtSinai_PacBio/variants.vcf > MtSinai_PacBio/variants.vcf.gz
tabix call:
tabix -p vcf MtSinai_PacBio/variants.vcf.gz
resulting error:
[E::hts_idx_push] Unsorted positions on sequence #1: 2345914 followed by 2345913
tbx_index_build failed: MtSinai_PacBio/variants.vcf.gz
Grep positions in vcf
$ less MtSinai_PacBio/variants.vcf | grep -n -sw 2345914
5386:chr1 2345914 svim.BND.216 N ]chr1:113504903]N 1 PASS SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=m141223_012647_42163R_c100748432550000001823169907081526_s1_p0/57386/7145_37078 GT:DP:AD ./.:.:.,.
277551:chr1 113504903 svim.BND.14706 N N[chr1:2345914[ 1 PASS SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=m141223_012647_42163R_c100748432550000001823169907081526_s1_p0/57386/7145_37078 GT:DP:AD ./.:.:.,.
2545447:chr6 5537590 svim.INS.2345914 C CGGAAGGCTCCTCTTCGATGGGTTGGGAGATGGGGTGTCCGGAACCTGGGATCACTTACACTCTCGTAGTTGGAGATGTCCA 1 PASSSVTYPE=INS;END=5537590;SVLEN=81;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=m150212_124930_42177R_c100777662550000001823160908051502_s1_p0/46250/4456_23453 GT:DP:AD ./.:.:.,.
$ less MtSinai_PacBio/variants.vcf | grep -n -sw 2345913
5387:chr1 2345913 svim.INS.4708 G GGCGGATGATGTATATATGATGGCTTGATGGTCGTTCGCTGCTGTGCTAGGCACAGCATAGAGGCTCAGCATCGAGGCTATATCTCA 1 PASS SVTYPE=INS;END=2345913;SVLEN=86;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=m150210_124418_42163R_c100779642550000001823165208251543_s1_p0/129717/19664_36011 GT:DP:AD ./.:.:.,.
2545446:chr6 5537511 svim.INS.2345913 C CCTCATACCCCTCCCCTCCCCCCCCCTCCCCATCTCTGAGTTGGAGATGCTCCCGCCCGGGCCTCCTCGTCGAGTTGGAGAAGCTGTACCCGCGG 1 PASS SVTYPE=INS;END=5537511;SVLEN=94;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=m150130_022735_42156_c100779802550000001823165208251520_s1_p0/101304/0_3889 GT:DP:AD ./.:.:.,.
-n: with line numbers, -sw: only whole words
For another dataset:
$ svim alignment --sample HG002 --partition_max_distance 1000 --cluster_max_distance 0.5 --min_sv_size 40 --segment_gap_tolerance 20 --segment_overlap_tolerance 20 --read_names --max_sv_size 1000000 10X_Genomics/ ../data/long_reads/NA24385_phas5ed_possorted_bam.bam ../data/reference/hg19.reordered.fa &>> 2022_01_20_svim_output.10X_Genomics.log
$ bgzip -c 10X_Genomics/variants.vcf > 10X_Genomics/variants.vcf.gz
$ tabix -p vcf 10X_Genomics/variants.vcf.gz
[E::hts_idx_push] Unsorted positions on sequence #1: 1453683 followed by 1453682
tbx_index_build failed: 10X_Genomics/variants.vcf.gz
And grep gives me:
$ less 10X_Genomics/variants.vcf | grep -n -sw 1453683
1403:chr1 1453683 svim.BND.671 N N]chr2:195764355] 1 PASS SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=HISEQ-002:393:C7A29ANXX:5:1107:4573:15814 GT:DP:AD ./.:.:.,.
1286653:chr2 195764355 svim.BND.725457 N N]chr1:1453683] 1 PASS SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=HISEQ-002:393:C7A29ANXX:5:1107:4573:15814 GT:DP:AD ./.:.:.,.
2566115:chr4 180906971 svim.BND.1453683 N ]chr13:78951248]N 1 PASS SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=HISEQ-002:393:C7A29ANXX:4:1209:12410:91014 GT:DP:AD ./.:.:.,.
5882736:chr12 58346928 svim.INV.1453683 AGT...ACT 0 PASS SVTYPE=INV;END=58524222;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=HISEQ-002:393:C7A29ANXX:5:2213:9562:27964 GT:DP:AD ./.:.:.,.
$ less 10X_Genomics/variants.vcf | grep -n -sw 1453682
1404:chr1 1453682 svim.DEL.80 ACA...ACA A 1 PASS SVTYPE=DEL;END=1454098;SVLEN=-416;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=HISEQ-002:393:C7A29ANXX:2:1308:14440:38213 GT:DP:AD ./.:.:.,.
2566114:chr4 180906772 svim.BND.1453682 N N]chr2:105418649] 1 PASS SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.;READS=HISEQ-002:393:C7A29ANXX:4:1206:7043:15405 GT:DP:AD ./.:.:.,.
5882734:chr12 58346909 svim.INV.1453682 GGG...CCC 0 PASS SVTYPE=INV;END=58347496;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=HISEQ-002:393:C7A29ANXX:4:1316:12304:42436 GT:DP:AD ./.:.:.,.
I tried to sort the file with picard:
$ picard SortVcf -I MtSinai_PacBio/variants.vcf -O MtSinai_PacBio_variants.vcf -Xms1g -Xmx100g --TMP_DIR ../tmp/picard/ &>> 2022-01-26_picard_output.MtSinai_PacBio.log
but this results in an error aswell:
Runtime.totalMemory()=5727846400
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 819567: unparsable vcf record with allele TGA...CCT, for input source: file:///.../MtSinai_PacBio/variants.vcf
at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:887)
at htsjdk.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:678)
at htsjdk.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:640)
at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:443)
at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:384)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:328)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:48)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:375)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:354)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:315)
at picard.vcf.SortVcf.sortInputs(SortVcf.java:164)
at picard.vcf.SortVcf.doWork(SortVcf.java:98)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
here is the variant:
$ sed '819567q;d' MtSinai_PacBio/variants.vcf
chr2 108832254 svim.DEL.17471 TGA...CCT T 1 PASS SVTYPE=DEL;END=109522633;SVLEN=-690379;SUPPORT=1;STD_SPAN=.;STD_POS=.;READS=m150131_015113_42163R_c100780292550000001823166508251570_s1_p0/103911/7208_10342 GT:DP:AD ./.:.:.,.
This could be another problem, as picard runs through for the 10X Genomics dataset.