htsjdk icon indicating copy to clipboard operation
htsjdk copied to clipboard

TribbleIndexedFeatureReader throws "TribbleException: there aren't enough columns .." on large files unless there's a tabix index

Open bw2 opened this issue 6 years ago • 2 comments

Verify

This is the same Issue as https://github.com/broadinstitute/gatk/issues/4761

I got the error below when running picard liftover on the bravo-dbsnp-all.vcf.gz (GRCh38 => GRCh37).

I'm reposting this because, even though the user closed https://github.com/broadinstitute/gatk/issues/4761 after they saw that bgzipping + tabixing the input vcf avoids the error, I think there are 2 issues which may not have been fixed yet(?)

  1. if TribbleIndexedFeatureReader stops in the middle of a line, it causes a cryptic parsing error that's hard to connect to lack of tabix index
  2. in the rare event that it stops at the end of a line, it would presumably silently continue and result in truncated output that users may not notice, and that's even harder to debug.
Exception in thread "main" htsjdk.tribble.TribbleException: Line 998904: there aren't enough columns for line 1	4723990	TOPMed_freeze_5?chr1:4,723,9 (we expected 9 tokens, and saw 3 ), for input source: file:///Users/weisburd/code/bravo/bravo-dbsnp-all.removed_chr_prefix.vcf.gz
	at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:281)
	at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:262)
	at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:64)
	at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
	at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:365)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:346)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:307)
	at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:324)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

Subject of the issue

The full logs are:

++ i=
++ java -Xmx50G -jar picard.jar LiftoverVcf R=hg19.fa C=hg38ToHg19.over.chain.remove_chr_prefix.gz I=bravo-dbsnp-all.removed_chr_prefix.vcf.gz O=bravo-dbsnp-all.removed_chr_prefix.liftunder_grch19.vcf.gz REJECT=bravo-dbsnp-all.removed_chr_prefix.rejected.vcf.gz TMP_DIR=./tmp VALIDATION_STRINGENCY=LENIENT WARN_ON_MISSING_CONTIG=true WRITE_ORIGINAL_POSITION=true
++ tee .log
09:33:54.465 INFO  NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/weisburd/code/bravo/picard.jar!/com/intel/gkl/native/libgkl_compression.dylib
[Thu Jul 19 09:33:54 EDT 2018] LiftoverVcf INPUT=bravo-dbsnp-all.removed_chr_prefix.vcf.gz OUTPUT=bravo-dbsnp-all.removed_chr_prefix.liftunder_grch19.vcf.gz CHAIN=hg38ToHg19.over.chain.remove_chr_prefix.gz REJECT=bravo-dbsnp-all.removed_chr_prefix.rejected.vcf.gz WARN_ON_MISSING_CONTIG=true WRITE_ORIGINAL_POSITION=true TMP_DIR=[./tmp] VALIDATION_STRINGENCY=LENIENT REFERENCE_SEQUENCE=hg19.fa    LOG_FAILED_INTERVALS=true LIFTOVER_MIN_MATCH=1.0 ALLOW_MISSING_FIELDS_IN_HEADER=false TAGS_TO_REVERSE=[AF] TAGS_TO_DROP=[MAX_AF] VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Jul 19 09:33:54 EDT 2018] Executing as weisburd@wm4d1-03b on Mac OS X 10.12.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12; Deflater: Intel; Inflater: Intel; Picard version: 2.17.4-1-gb63d73e-SNAPSHOT
INFO	2018-07-19 09:33:54	LiftoverVcf	Loading up the target reference genome.
INFO	2018-07-19 09:34:04	LiftoverVcf	Lifting variants over and sorting (not yet writing the output file.)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180127-180133 failed to match chain 816 because intersection length 6 < minMatchSize 7.0 (0.85714287 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180221-180227 failed to match chain 816 because intersection length 2 < minMatchSize 7.0 (0.2857143 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180221-180227 failed to match chain 977 because intersection length 5 < minMatchSize 7.0 (0.71428573 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180801-180808 failed to match chain 23749811 because intersection length 3 < minMatchSize 8.0 (0.375 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180801-180808 failed to match chain 2410 because intersection length 3 < minMatchSize 8.0 (0.375 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180802-180808 failed to match chain 23749811 because intersection length 3 < minMatchSize 7.0 (0.42857143 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180802-180808 failed to match chain 2410 because intersection length 2 < minMatchSize 7.0 (0.2857143 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180803-180804 failed to match chain 2410 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180803-180808 failed to match chain 23749811 because intersection length 3 < minMatchSize 6.0 (0.5 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180803-180808 failed to match chain 2410 because intersection length 1 < minMatchSize 6.0 (0.16666667 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180803-180809 failed to match chain 23749811 because intersection length 4 < minMatchSize 7.0 (0.5714286 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180803-180809 failed to match chain 2410 because intersection length 1 < minMatchSize 7.0 (0.14285715 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180804-180806 failed to match chain 23749811 because intersection length 1 < minMatchSize 3.0 (0.33333334 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180863-180869 failed to match chain 7666873 because intersection length 1 < minMatchSize 7.0 (0.14285715 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180863-180869 failed to match chain 23749811 because intersection length 6 < minMatchSize 7.0 (0.85714287 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180863-180875 failed to match chain 7666873 because intersection length 7 < minMatchSize 13.0 (0.53846157 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180863-180875 failed to match chain 23749811 because intersection length 6 < minMatchSize 13.0 (0.46153846 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180869 failed to match chain 7666873 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180869 failed to match chain 23749811 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180870 failed to match chain 7666873 because intersection length 2 < minMatchSize 3.0 (0.6666667 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180870 failed to match chain 23749811 because intersection length 1 < minMatchSize 3.0 (0.33333334 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180874 failed to match chain 7666873 because intersection length 6 < minMatchSize 7.0 (0.85714287 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180874 failed to match chain 23749811 because intersection length 1 < minMatchSize 7.0 (0.14285715 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180880 failed to match chain 7666873 because intersection length 12 < minMatchSize 13.0 (0.9230769 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180880 failed to match chain 23749811 because intersection length 1 < minMatchSize 13.0 (0.07692308 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180993-181004 failed to match chain 2576 because intersection length 9 < minMatchSize 12.0 (0.75 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180993-181004 failed to match chain 2410 
...
...
...
INFO	2018-07-19 09:34:13	LiftOver	Interval 1:1663380-1663401 failed to match chain 2 because intersection length 1 < minMatchSize 22.0 (0.045454547 < 1.0)
INFO	2018-07-19 09:34:13	LiftOver	Interval 1:1663399-1663409 failed to match chain 2 because intersection length 7 < minMatchSize 11.0 (0.6363636 < 1.0)
INFO	2018-07-19 09:34:22	LiftOver	Interval 1:2767665-2767666 failed to match chain 4680 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO	2018-07-19 09:34:22	LiftOver	Interval 1:2778770-2778774 failed to match chain 2 because intersection length 3 < minMatchSize 5.0 (0.6 < 1.0)
INFO	2018-07-19 09:34:22	LiftOver	Interval 1:2778772-2778776 failed to match chain 2 because intersection length 1 < minMatchSize 5.0 (0.2 < 1.0)


INFO	2018-07-19 09:34:23	LiftOver	Interval 1:2970389-2970390 failed to match chain 2 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
[Thu Jul 19 09:34:30 EDT 2018] picard.vcf.LiftoverVcf done. Elapsed time: 0.60 minutes.
Runtime.totalMemory()=6373244928
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.tribble.TribbleException: Line 998904: there aren't enough columns for line 1	4723990	TOPMed_freeze_5?chr1:4,723,9 (we expected 9 tokens, and saw 3 ), for input source: file:///Users/weisburd/code/bravo/bravo-dbsnp-all.removed_chr_prefix.vcf.gz
	at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:281)
	at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:262)
	at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:64)
	at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
	at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:365)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:346)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:307)
	at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:324)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

Your environment

 $ java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

$ java  -Xmx50G -jar picard.jar LiftoverVcf --version
2.17.4-1-gb63d73e-SNAPSHOT

happens on latest MacOS and CentOS

Steps to reproduce

Tell us how to reproduce this issue. If possible, include a short code snippet to demonstrate the problem.

Run

java  -Xmx50G -jar picard.jar LiftoverVcf \
    R=hg19.fa \
    C=hg38ToHg19.over.chain.remove_chr_prefix.gz \
    I=bravo-dbsnp-all.removed_chr_prefix.vcf.gz \
    O=bravo-dbsnp-all.removed_chr_prefix.liftunder_grch19.vcf.gz \
    REJECT=bravo-dbsnp-all.removed_chr_prefix.rejected.vcf.gz \
    TMP_DIR=./tmp \
    VALIDATION_STRINGENCY=LENIENT \
    WARN_ON_MISSING_CONTIG=true \
    WRITE_ORIGINAL_POSITION=true |& tee ${i}.log

on the vcf from https://bravo.sph.umich.edu/freeze5/hg38/download

Actual behaviour

I haven't looked through the htsjdk code, but it looks like if there's no .tbi index, the TribbleReader will return lines until it hits some threshold. It then stops & returns incomplete data. I double checked the input .vcf with a python script and confirmed there's nothing unusual about the line it errored on - the file itself isn't corrupt/truncated.

Generating a tabix index prevents the error.

bw2 avatar Jul 19 '18 14:07 bw2

I believe that this is a picard issue and should be reported there. The gatk issue that you mention is closed because it was already fixed with the update of htsjdk (version >= 2.14.2 have the fix).

You are using a snapshot of picard (concretely, version 2.17.4-1-gb63d73e-SNAPSHOT, as reported in your logs) and the htsjdk version was updated with 2.17.8. Updating picard should fix this problem. I recommend to use the latest version (2.18.9 at this moment).

If the problem persists after update it, please report it here.

magicDGS avatar Jul 19 '18 15:07 magicDGS

Has there been any update for the issue ? I have recently tried using picard (Liftover) to convert hg38ToHg19 and was facing similar issues.

kliarist avatar Jan 03 '22 07:01 kliarist