hts-specs
CSI file is BGZF compressed but this is not mentioned in the CSIv1 spec
I used bcftools 1.19 to index a BCF file and tried to parse the CSI index file according to the spec https://github.com/samtools/hts-specs/blob/26347448cadff3cf40982d60fe2a97f20d2543ea/CSIv1.tex#L20C28-L20C33. It did not work as expected. After running hexdump -C on the csi file, I realized it is not a plain binary file as described in the CSIv1 spec:
00000000 1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 |............BC..|
00000010 46 00 73 0e f6 64 e4 63 60 60 60 66 80 00 01 20 |F.s..d.c```f... |
00000020 66 02 62 4f 20 e6 11 86 88 31 22 b1 19 18 0a 0d |f.bO ....1".....|
...
But it seems consistent with the spec after decompressing it with bgzip -cd test.bcf.csi | hexdump -C:
00000000 43 53 49 01 0e 00 00 00 03 00 00 00 00 00 00 00 |CSI.............|
00000010 10 00 00 00 02 00 00 00 49 00 00 00 09 18 00 00 |........I.......|
Could we add a sentence to the spec to point this out for future readers? Or is it not part of the spec?
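For anyone hitting the same thing, here is a rough Python sketch of what worked for reading the header: check for the gzip magic, decompress, then parse the fixed fields. It assumes the field order from the CSIv1 spec (magic, min_shift, depth, l_aux, aux, n_ref, all little-endian), and it relies on Python's gzip module accepting the concatenated gzip members that BGZF produces. The function names are my own, and the `is_bgzf` check is only a heuristic that assumes the `BC` subfield comes first in the gzip extra field, as it does in the dump above.

```python
import gzip
import struct

GZIP_MAGIC = b"\x1f\x8b"


def is_bgzf(raw: bytes) -> bool:
    """Heuristic check: gzip member with FEXTRA set and a leading 'BC' subfield."""
    return (
        len(raw) >= 18
        and raw[:2] == GZIP_MAGIC
        and raw[3] & 0x04 != 0      # FLG.FEXTRA bit
        and raw[12:14] == b"BC"     # assumes 'BC' is the first extra subfield
    )


def decompress_if_needed(raw: bytes) -> bytes:
    """Return the payload, decompressing when the data starts with gzip magic."""
    if raw[:2] == GZIP_MAGIC:
        # gzip.decompress reads all concatenated members, which covers BGZF blocks.
        return gzip.decompress(raw)
    return raw


def parse_csi_header(data: bytes) -> dict:
    """Parse the fixed CSI header fields from already-decompressed bytes."""
    if len(data) < 16:
        raise ValueError("too short for a CSI header")
    magic, min_shift, depth, l_aux = struct.unpack_from("<4siii", data, 0)
    if magic != b"CSI\x01":
        raise ValueError(f"bad magic {magic!r}; is the file still compressed?")
    aux_end = 16 + l_aux
    if len(data) < aux_end + 4:
        raise ValueError("truncated before n_ref")
    (n_ref,) = struct.unpack_from("<i", data, aux_end)
    return {
        "min_shift": min_shift,
        "depth": depth,
        "l_aux": l_aux,
        "aux": data[16:aux_end],    # left opaque; its layout depends on the indexed format
        "n_ref": n_ref,
    }


def read_csi_header(path: str) -> dict:
    """Read a .csi file from disk and return its parsed header."""
    with open(path, "rb") as fh:
        return parse_csi_header(decompress_if_needed(fh.read()))
```

Calling `read_csi_header("test.bcf.csi")` on my file should report min_shift=14, depth=3, l_aux=0 and n_ref=16, which matches the decompressed bytes above.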
While I agree adding this would be beneficial, it's the least problematic bit about the spec!
It certainly would be good if the original authors could add more about it. One thing that confused me a lot is the "Auxiliary data", which changes format depending on the thing being indexed. (IIRC it's tabix data for VCF and some BAI-related format for BCF). I assume it's meant to be generic, but it also makes it largely unparseable without custom knowledge.
Ping @lh3 @pd3: is there any more information on CSI somewhere else? It looks like it arrived with this commit and subsequent commits. This appears to be where the original minimal spec documentation came from too.
See also #70, a long-standing issue noting this:
It is clear from examination of .csi files that they are stored as BGZF (why?), although this is not mentioned and is at odds with the current behaviour of BAI.