htslib
htslib copied to clipboard
Multiple strings in FORMAT fields
This is a variation/duplicate of https://github.com/samtools/htslib/issues/664, which I could not re-open, hence please excuse to start a new issue here.
The above issue handles the case of setting variable length (.
) strings when writing them. Here, I want to discuss reading/parsing them. As far as I tested this, a fixed number of string values per sample are parsed without taking the Number
or any comma in the per-sample values into account, and simply return a sequence of chars for each sample. The documentation of bcf_get_format_string
does not really help to explain this, but that seems to be the behaviour.
Take for example this:
##fileformat=VCFv4.3
##contig=<ID=1,length=1>
##FORMAT=<ID=XX,Number=2,Type=String,Description="Test">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3
1 1 . A T . PASS XX Hello beautiful,nice World
Calling bcf_get_format_string
on this line returns 15, which is the length of beautiful,nice
plus one NULL char. Furthermore, the char** dst
that the data is written to has three elements, one for each of the samples.
If this was behaving similar to the the int and float functions, I instead would expect that char** dst
has six values:
Hello <missing> beautiful nice World <missing>
and that the function returns 6 (the number of values). The ndst
param could still capture the total length of bytes allocated in dst
, so that subsequent calls can properly re-use the memory.
Furthermore, and now it gets confusing, the result changes when we add another FORMAT into the mix, here GT:
##fileformat=VCFv4.3
##contig=<ID=1,length=1>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=XX,Number=2,Type=String,Description="Test">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3
1 1 . A T . PASS GT:XX 0|0:Hello 1|0:beautiful,nice 1/1:World
Now, with htslib 1.10.2, we get 16 chars per sample, that is, the longest one now ends with 2 NULL chars. That is not a big deal, but still seems like a bug.
Here is some test code (in C++, force of habit): format_string.cpp.
So now my questions are:
- Am I correct in the assumption that (as of now) I have to take care of the splitting for the string format per sample myself?
- Is that something that might be considered for a future version?
- Is that last part about the two NULL chars a bug that needs to be fixed?
Thanks and all the best Lucas
PS: To be honest, the documentation and the code comments of at least the vcf.h
header are lacking detail and explanations in many places, and are inconsistent in others. I don't want to complain too much, because generally, the library does a great job. But I had to dive deep into the data structures and code in order to get some simple tasks done... Maybe this is something that could be improved in the future ;-)
Thanks for the very detailed report.
As mentioned in #664, FORMAT strings are stored as a character array, of size number_samples * length(longest_string)
. Shorter strings get padded with NUL bytes. For tags where Number > 1
the values are stored separated by commas and HTSlib makes no attempt to split them up. bcf_get_format_string()
sets a pointer to the start of each per-sample string, so you do have to look for the commas and split the string up yourself. We can't change bcf_get_format_string()
as any code using it will expect per-sample values. Returning a pointer to each individual value would need a new API function.
The two NUL characters do look like a minor bug, but actually in vcf_parse_format()
. It's including the colon when finding the length of the string, which means it allocates one byte more than needed for each sample. It means it's using more space than necessary to store the data, but otherwise it should be fairly harmless.
Thanks for the detailled answer!
For tags where Number > 1 the values are stored separated by commas and HTSlib makes no attempt to split them up.
Thanks for confirming this. Okay, I'll implement my wrapper code accordingly.
Returning a pointer to each individual value would need a new API function.
That makes sense. I won't need that as of now (or could implement it in my wrapper directly), but maybe this is a nice feature to add at some point.
... but otherwise it should be fairly harmless.
Right, I thought so. Still wanted to report it ;-)
Thanks again, and feel free to close this issue as you see fit.