
CSI file is BGZF compressed but this is not mentioned in the CSIv1 spec

Open bguo068 opened this issue 10 months ago • 2 comments

I used bcftools 1.19 to index a BCF file and tried to parse the CSI index file according to the spec https://github.com/samtools/hts-specs/blob/26347448cadff3cf40982d60fe2a97f20d2543ea/CSIv1.tex#L20C28-L20C33. It did not work as expected. After running hexdump -C on the .csi file, I realized it is not a plain binary file as described in the CSIv1 spec:

00000000  1f 8b 08 04 00 00 00 00  00 ff 06 00 42 43 02 00  |............BC..|
00000010  46 00 73 0e f6 64 e4 63  60 60 60 66 80 00 01 20  |F.s..d.c```f... |
00000020  66 02 62 4f 20 e6 11 86  88 31 22 b1 19 18 0a 0d  |f.bO ....1".....|
...
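The leading bytes of this dump match a BGZF block header: the gzip magic 1f 8b, compression method 08, flag 04 (FEXTRA set), and an extra subfield with identifier "BC" and length 2 starting at offset 12. A minimal Python sketch of that check follows; looks_like_bgzf is a hypothetical helper name used for illustration, not part of htslib or any other library, and it only inspects the header bytes it is given.

```python
import struct


def looks_like_bgzf(header: bytes) -> bool:
    """Return True if `header` starts like a BGZF block (gzip + 'BC' subfield).

    Hypothetical helper for illustration; it only looks at the bytes passed in.
    """
    if len(header) < 12:
        return False
    # gzip magic (1f 8b), deflate method (08), and the FEXTRA flag bit (0x04)
    if header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False
    # XLEN: total length of the gzip extra field, little-endian uint16 at offset 10
    (xlen,) = struct.unpack_from("<H", header, 10)
    pos, end = 12, 12 + xlen
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2
    while pos + 4 <= min(end, len(header)):
        si1, si2, slen = struct.unpack_from("<BBH", header, pos)
        if (si1, si2, slen) == (0x42, 0x43, 2):
            return True
        pos += 4 + slen
    return False


# First 18 bytes of the hexdump above
first_bytes = bytes.fromhex("1f8b08040000000000ff0600424302004600")
print(looks_like_bgzf(first_bytes))  # True
```

A reader could run this on the first few bytes of a .csi file to decide whether to decompress before parsing.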

But it seems consistent with the spec after decompressing it with bgzip -cd test.bcf.csi | hexdump -C:

00000000  43 53 49 01 0e 00 00 00  03 00 00 00 00 00 00 00  |CSI.............|
00000010  10 00 00 00 02 00 00 00  49 00 00 00 09 18 00 00  |........I.......|
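For anyone else parsing these files, here is a rough Python sketch of reading the fixed header fields after decompression. It assumes the CSIv1 layout of magic, min_shift, depth, l_aux, aux and n_ref as little-endian 32-bit integers, and it assumes Python's gzip module can decode the BGZF stream as a series of concatenated gzip members. The function names parse_csi_header and read_csi_header are made up for this example.

```python
import gzip
import struct


def parse_csi_header(data: bytes) -> dict:
    """Parse the leading CSIv1 fields from already-decompressed bytes."""
    if data[:4] != b"CSI\x01":
        raise ValueError("not a CSI v1 index (bad magic)")
    # Three little-endian int32 values follow the 4-byte magic
    min_shift, depth, l_aux = struct.unpack_from("<iii", data, 4)
    # Auxiliary data is kept as opaque bytes; its layout depends on the indexed format
    aux = data[16:16 + l_aux]
    (n_ref,) = struct.unpack_from("<i", data, 16 + l_aux)
    return {"min_shift": min_shift, "depth": depth,
            "l_aux": l_aux, "aux": aux, "n_ref": n_ref}


def read_csi_header(path: str) -> dict:
    """Decompress a .csi file (assumed BGZF) and parse its header."""
    with gzip.open(path, "rb") as fh:
        return parse_csi_header(fh.read())


# The 32 decompressed bytes shown in the hexdump above
sample = bytes.fromhex(
    "435349010e00000003000000000000001000000002000000"
    "4900000009180000"
)
print(parse_csi_header(sample))
# {'min_shift': 14, 'depth': 3, 'l_aux': 0, 'aux': b'', 'n_ref': 16}
```

With the dump above this gives min_shift 14, depth 3, no auxiliary data, and 16 reference sequences, which is what the spec's layout predicts.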

Could we add a sentence to the spec to point this out for future readers? Or is it not part of the spec?

bguo068 Apr 13 '24 21:04

While I agree adding this would be beneficial, it's the least problematic bit about the spec!

It certainly would be good if the original authors could add more about it. One thing that confused me a lot is the "Auxiliary data", which changes format depending on the thing being indexed. (IIRC it's tabix data for VCF and some BAI-related format for BCF). I assume it's meant to be generic, but it also makes it largely unparseable without custom knowledge.

Ping @lh3 @pd3: is there any more information on CSI somewhere else? It looks like it arrived with this commit and subsequent commits. This appears to be where the original minimal spec documentation came from too.

jkbonfield Apr 15 '24 08:04

See also #70, a long-standing issue noting this:

It is clear from examination of .csi files that they are stored as BGZF (why?), although this is not mentioned and is at odds with the current behaviour of BAI.

zaeleus Apr 15 '24 14:04