htslib icon indicating copy to clipboard operation
htslib copied to clipboard

Multiple strings in FORMAT fields

Open lczech opened this issue 3 years ago • 2 comments

This is a variation/duplicate of https://github.com/samtools/htslib/issues/664, which I could not re-open, hence please excuse to start a new issue here.

The above issue handles the case of setting variable length (.) strings when writing them. Here, I want to discuss reading/parsing them. As far as I tested this, a fixed number of string values per sample are parsed without taking the Number or any comma in the per-sample values into account, and simply return a sequence of chars for each sample. The documentation of bcf_get_format_string does not really help to explain this, but that seems to be the behaviour.

Take for example this:

##fileformat=VCFv4.3
##contig=<ID=1,length=1>
##FORMAT=<ID=XX,Number=2,Type=String,Description="Test">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1	S2	S3
1	1	.	A	T	.	PASS		XX	Hello	beautiful,nice	World

Calling bcf_get_format_string on this line returns 15, which is the length of beautiful,nice plus one NULL char. Furthermore, the char** dst that the data is written to has three elements, one for each of the samples.

If this was behaving similar to the the int and float functions, I instead would expect that char** dst has six values:

Hello <missing> beautiful nice World <missing>

and that the function returns 6 (the number of values). The ndst param could still capture the total length of bytes allocated in dst, so that subsequent calls can properly re-use the memory.

Furthermore, and now it gets confusing, the result changes when we add another FORMAT into the mix, here GT:

##fileformat=VCFv4.3
##contig=<ID=1,length=1>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=XX,Number=2,Type=String,Description="Test">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1	S2	S3
1	1	.	A	T	.	PASS		GT:XX	0|0:Hello	1|0:beautiful,nice	1/1:World

Now, with htslib 1.10.2, we get 16 chars per sample, that is, the longest one now ends with 2 NULL chars. That is not a big deal, but still seems like a bug.

Here is some test code (in C++, force of habit): format_string.cpp.

So now my questions are:

  • Am I correct in the assumption that (as of now) I have to take care of the splitting for the string format per sample myself?
  • Is that something that might be considered for a future version?
  • Is that last part about the two NULL chars a bug that needs to be fixed?

Thanks and all the best Lucas

PS: To be honest, the documentation and the code comments of at least the vcf.h header are lacking detail and explanations in many places, and are inconsistent in others. I don't want to complain too much, because generally, the library does a great job. But I had to dive deep into the data structures and code in order to get some simple tasks done... Maybe this is something that could be improved in the future ;-)

lczech avatar Jul 24 '20 02:07 lczech

Thanks for the very detailed report.

As mentioned in #664, FORMAT strings are stored as a character array, of size number_samples * length(longest_string). Shorter strings get padded with NUL bytes. For tags where Number > 1 the values are stored separated by commas and HTSlib makes no attempt to split them up. bcf_get_format_string() sets a pointer to the start of each per-sample string, so you do have to look for the commas and split the string up yourself. We can't change bcf_get_format_string() as any code using it will expect per-sample values. Returning a pointer to each individual value would need a new API function.

The two NUL characters do look like a minor bug, but actually in vcf_parse_format(). It's including the colon when finding the length of the string, which means it allocates one byte more than needed for each sample. It means it's using more space than necessary to store the data, but otherwise it should be fairly harmless.

daviesrob avatar Jul 24 '20 17:07 daviesrob

Thanks for the detailled answer!

For tags where Number > 1 the values are stored separated by commas and HTSlib makes no attempt to split them up.

Thanks for confirming this. Okay, I'll implement my wrapper code accordingly.

Returning a pointer to each individual value would need a new API function.

That makes sense. I won't need that as of now (or could implement it in my wrapper directly), but maybe this is a nice feature to add at some point.

... but otherwise it should be fairly harmless.

Right, I thought so. Still wanted to report it ;-)

Thanks again, and feel free to close this issue as you see fit.

lczech avatar Jul 28 '20 03:07 lczech