htsjdk
htsjdk copied to clipboard
TribbleIndexedFeatureReader throws "TribbleException: there aren't enough columns .." on large files unless there's a tabix index
Verify
This is the same Issue as https://github.com/broadinstitute/gatk/issues/4761
I got the error below when running picard liftover on the bravo-dbsnp-all.vcf.gz (GRCh38 => GRCh37).
I'm reposting this because, even though the user closed https://github.com/broadinstitute/gatk/issues/4761 after they saw that bgzipping + tabixing the input vcf avoids the error, I think there are 2 issues which may not have been fixed yet(?)
- if TribbleIndexedFeatureReader stops in the middle of a line, it causes a cryptic parsing error that's hard to connect to lack of tabix index
- in the rare event that it stops at the end of a line, it would presumably silently continue and result in truncated output that users may not notice, and that's even harder to debug.
Exception in thread "main" htsjdk.tribble.TribbleException: Line 998904: there aren't enough columns for line 1 4723990 TOPMed_freeze_5?chr1:4,723,9 (we expected 9 tokens, and saw 3 ), for input source: file:///Users/weisburd/code/bravo/bravo-dbsnp-all.removed_chr_prefix.vcf.gz
at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:281)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:262)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:64)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:365)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:346)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:307)
at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:324)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
Subject of the issue
The full logs are:
++ i=
++ java -Xmx50G -jar picard.jar LiftoverVcf R=hg19.fa C=hg38ToHg19.over.chain.remove_chr_prefix.gz I=bravo-dbsnp-all.removed_chr_prefix.vcf.gz O=bravo-dbsnp-all.removed_chr_prefix.liftunder_grch19.vcf.gz REJECT=bravo-dbsnp-all.removed_chr_prefix.rejected.vcf.gz TMP_DIR=./tmp VALIDATION_STRINGENCY=LENIENT WARN_ON_MISSING_CONTIG=true WRITE_ORIGINAL_POSITION=true
++ tee .log
09:33:54.465 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/weisburd/code/bravo/picard.jar!/com/intel/gkl/native/libgkl_compression.dylib
[Thu Jul 19 09:33:54 EDT 2018] LiftoverVcf INPUT=bravo-dbsnp-all.removed_chr_prefix.vcf.gz OUTPUT=bravo-dbsnp-all.removed_chr_prefix.liftunder_grch19.vcf.gz CHAIN=hg38ToHg19.over.chain.remove_chr_prefix.gz REJECT=bravo-dbsnp-all.removed_chr_prefix.rejected.vcf.gz WARN_ON_MISSING_CONTIG=true WRITE_ORIGINAL_POSITION=true TMP_DIR=[./tmp] VALIDATION_STRINGENCY=LENIENT REFERENCE_SEQUENCE=hg19.fa LOG_FAILED_INTERVALS=true LIFTOVER_MIN_MATCH=1.0 ALLOW_MISSING_FIELDS_IN_HEADER=false TAGS_TO_REVERSE=[AF] TAGS_TO_DROP=[MAX_AF] VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Jul 19 09:33:54 EDT 2018] Executing as weisburd@wm4d1-03b on Mac OS X 10.12.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12; Deflater: Intel; Inflater: Intel; Picard version: 2.17.4-1-gb63d73e-SNAPSHOT
INFO 2018-07-19 09:33:54 LiftoverVcf Loading up the target reference genome.
INFO 2018-07-19 09:34:04 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180127-180133 failed to match chain 816 because intersection length 6 < minMatchSize 7.0 (0.85714287 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180221-180227 failed to match chain 816 because intersection length 2 < minMatchSize 7.0 (0.2857143 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180221-180227 failed to match chain 977 because intersection length 5 < minMatchSize 7.0 (0.71428573 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180801-180808 failed to match chain 23749811 because intersection length 3 < minMatchSize 8.0 (0.375 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180801-180808 failed to match chain 2410 because intersection length 3 < minMatchSize 8.0 (0.375 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180802-180808 failed to match chain 23749811 because intersection length 3 < minMatchSize 7.0 (0.42857143 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180802-180808 failed to match chain 2410 because intersection length 2 < minMatchSize 7.0 (0.2857143 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180803-180804 failed to match chain 2410 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180803-180808 failed to match chain 23749811 because intersection length 3 < minMatchSize 6.0 (0.5 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180803-180808 failed to match chain 2410 because intersection length 1 < minMatchSize 6.0 (0.16666667 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180803-180809 failed to match chain 23749811 because intersection length 4 < minMatchSize 7.0 (0.5714286 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180803-180809 failed to match chain 2410 because intersection length 1 < minMatchSize 7.0 (0.14285715 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180804-180806 failed to match chain 23749811 because intersection length 1 < minMatchSize 3.0 (0.33333334 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180863-180869 failed to match chain 7666873 because intersection length 1 < minMatchSize 7.0 (0.14285715 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180863-180869 failed to match chain 23749811 because intersection length 6 < minMatchSize 7.0 (0.85714287 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180863-180875 failed to match chain 7666873 because intersection length 7 < minMatchSize 13.0 (0.53846157 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180863-180875 failed to match chain 23749811 because intersection length 6 < minMatchSize 13.0 (0.46153846 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180868-180869 failed to match chain 7666873 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180868-180869 failed to match chain 23749811 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180868-180870 failed to match chain 7666873 because intersection length 2 < minMatchSize 3.0 (0.6666667 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180868-180870 failed to match chain 23749811 because intersection length 1 < minMatchSize 3.0 (0.33333334 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180868-180874 failed to match chain 7666873 because intersection length 6 < minMatchSize 7.0 (0.85714287 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180868-180874 failed to match chain 23749811 because intersection length 1 < minMatchSize 7.0 (0.14285715 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180868-180880 failed to match chain 7666873 because intersection length 12 < minMatchSize 13.0 (0.9230769 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180868-180880 failed to match chain 23749811 because intersection length 1 < minMatchSize 13.0 (0.07692308 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180993-181004 failed to match chain 2576 because intersection length 9 < minMatchSize 12.0 (0.75 < 1.0)
INFO 2018-07-19 09:34:06 LiftOver Interval 1:180993-181004 failed to match chain 2410
...
...
...
INFO 2018-07-19 09:34:13 LiftOver Interval 1:1663380-1663401 failed to match chain 2 because intersection length 1 < minMatchSize 22.0 (0.045454547 < 1.0)
INFO 2018-07-19 09:34:13 LiftOver Interval 1:1663399-1663409 failed to match chain 2 because intersection length 7 < minMatchSize 11.0 (0.6363636 < 1.0)
INFO 2018-07-19 09:34:22 LiftOver Interval 1:2767665-2767666 failed to match chain 4680 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO 2018-07-19 09:34:22 LiftOver Interval 1:2778770-2778774 failed to match chain 2 because intersection length 3 < minMatchSize 5.0 (0.6 < 1.0)
INFO 2018-07-19 09:34:22 LiftOver Interval 1:2778772-2778776 failed to match chain 2 because intersection length 1 < minMatchSize 5.0 (0.2 < 1.0)
INFO 2018-07-19 09:34:23 LiftOver Interval 1:2970389-2970390 failed to match chain 2 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
[Thu Jul 19 09:34:30 EDT 2018] picard.vcf.LiftoverVcf done. Elapsed time: 0.60 minutes.
Runtime.totalMemory()=6373244928
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.tribble.TribbleException: Line 998904: there aren't enough columns for line 1 4723990 TOPMed_freeze_5?chr1:4,723,9 (we expected 9 tokens, and saw 3 ), for input source: file:///Users/weisburd/code/bravo/bravo-dbsnp-all.removed_chr_prefix.vcf.gz
at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:281)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:262)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:64)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:365)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:346)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:307)
at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:324)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
Your environment
$ java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
$ java -Xmx50G -jar picard.jar LiftoverVcf --version
2.17.4-1-gb63d73e-SNAPSHOT
happens on latest MacOS and CentOS
Steps to reproduce
Tell us how to reproduce this issue. If possible, include a short code snippet to demonstrate the problem.
Run
java -Xmx50G -jar picard.jar LiftoverVcf \
R=hg19.fa \
C=hg38ToHg19.over.chain.remove_chr_prefix.gz \
I=bravo-dbsnp-all.removed_chr_prefix.vcf.gz \
O=bravo-dbsnp-all.removed_chr_prefix.liftunder_grch19.vcf.gz \
REJECT=bravo-dbsnp-all.removed_chr_prefix.rejected.vcf.gz \
TMP_DIR=./tmp \
VALIDATION_STRINGENCY=LENIENT \
WARN_ON_MISSING_CONTIG=true \
WRITE_ORIGINAL_POSITION=true |& tee ${i}.log
on the vcf from https://bravo.sph.umich.edu/freeze5/hg38/download
Actual behaviour
I haven't looked through the htsjdk code, but it looks like if there's no .tbi index, the TribbleReader will return lines until it hits some threshold. It then stops & returns incomplete data. I double checked the input .vcf with a python script and confirmed there's nothing unusual about the line it errored on - the file itself isn't corrupt/truncated.
Generating a tabix index prevents the error.
I believe that this is a picard issue and should be reported there. The gatk issue that you mention is closed because it was already fixed with the update of htsjdk (version >= 2.14.2 have the fix).
You are using a snapshot of picard (concretely, version 2.17.4-1-gb63d73e-SNAPSHOT, as reported in your logs) and the htsjdk version was updated with 2.17.8. Updating picard should fix this problem. I recommend to use the latest version (2.18.9 at this moment).
If the problem persists after update it, please report it here.
Has there been any update for the issue ? I have recently tried using picard (Liftover) to convert hg38ToHg19 and was facing similar issues.