htsjdk
htsjdk copied to clipboard
How to read bcf file ?
Dear,
Firstly thanks for your amazing work on this library.
I try to read a bcffile through VCFFileReader
but I have this error:
Unable to parse header with error: Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:, for input source: file:///somewhere/build/resources/test/dataset1.bcf
This file is valid and I am able to read it with pysam
in python
$ bcftools view -h /somewhere/build/resources/test/dataset1.bcf | head | grep fileformat
##fileformat=VCFv4.1
I use htsjdk: 2.23.0
java: 8+
I need to increase the speed process due to slowness of python and is GIL limitation which do not allow the use of multithreading to read the file through a random access way
Thanks for your help
@bioinfornatics Oh - can you run od -c -N 30 <file>
and include the results here. You may be hitting this embarrassing bug which we've never fixed https://github.com/samtools/htsjdk/issues/946.
@cmnbroad thanks I was trying to get the magic number as it is done into BCFVersion.java with hexdump -C
but I had any idea to do next :-)
So the result of od -c -N 30
0000000 037 213 \b 004 \0 \0 \0 \0 \0 377 006 \0 B C 002 \0
0000020 250 003 255 224 313 213 023 I 034 307 333 L 226 245
Yes, that looks like a gzip header. Sadly, htsjdk lacks support for gzipped BCF, as mentioned above.
@bioinfornatics you can try to use a VCFIteratorBuilder instead of a VCFFileReader . As far as I remember, the GZ compression was first tested. (But you cannot do random-access)
thanks @lindenb and @cmnbroad
Arf … I though using the index file (or/and the gzip block) to read the bcf file in // by region.