htsjdk How to read bcf file ?

Dear,

Firstly thanks for your amazing work on this library.

I try to read a bcffile through VCFFileReader but I have this error:

Unable to parse header with error: Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:, for input source: file:///somewhere/build/resources/test/dataset1.bcf

This file is valid and I am able to read it with pysam in python

$ bcftools  view -h /somewhere/build/resources/test/dataset1.bcf | head | grep fileformat
##fileformat=VCFv4.1

I use htsjdk: 2.23.0 java: 8+

I need to increase the speed process due to slowness of python and is GIL limitation which do not allow the use of multithreading to read the file through a random access way

Thanks for your help

Feb 08 '21 21:02 bioinfornatics

@bioinfornatics Oh - can you run od -c -N 30 <file> and include the results here. You may be hitting this embarrassing bug which we've never fixed https://github.com/samtools/htsjdk/issues/946.

Feb 08 '21 22:02 cmnbroad

@cmnbroad thanks I was trying to get the magic number as it is done into BCFVersion.java with hexdump -C but I had any idea to do next :-)

So the result of od -c -N 30

0000000 037 213  \b 004  \0  \0  \0  \0  \0 377 006  \0   B   C 002  \0
0000020 250 003 255 224 313 213 023   I 034 307 333   L 226 245

Feb 08 '21 22:02 bioinfornatics

Yes, that looks like a gzip header. Sadly, htsjdk lacks support for gzipped BCF, as mentioned above.

Feb 09 '21 13:02 cmnbroad

@bioinfornatics you can try to use a VCFIteratorBuilder instead of a VCFFileReader . As far as I remember, the GZ compression was first tested. (But you cannot do random-access)

Feb 09 '21 14:02 lindenb

thanks @lindenb and @cmnbroad

Arf … I though using the index file (or/and the gzip block) to read the bcf file in // by region.

Feb 09 '21 14:02 bioinfornatics

htsjdk htsjdk copied to clipboard

How to read bcf file ?

htsjdk
htsjdk copied to clipboard