htslib bgzip text compression mode

bgzip text compression mode

Open mlin opened this issue 2 years ago • 15 comments

Compressing with -n/--text promotes alignment of BGZF blocks with the uncompressed text lines. BGZF blocks start at the beginning of an input line and end after some subsequent newline (except when the block's first line overflows the BGZF block size).

This ensures it's possible to specify byte ranges of a BGZF file that decompress into complete text records -- useful for parallel processing and "slicing" from remote servers.

Inspiration: bam_write1() uses the same trick to align BGZF blocks with BAM records.

Jan 02 '22 11:01 mlin

I like this change. Although I expect it was accidental, being forced down hopen vs open in order to get kgetline functionality, it's worth noting in the commit message that this also has the side effect of supporting URIs as the input filename to bgzip. A worthy addition in its own right.

There is a bug however when given data that doesn't end on a newline. A newline is added. Eg:

$ echo -n foo | ./bgzip | zcat |hd
00000000  66 6f 6f                                          |foo|
00000003

$ echo -n foo | ./bgzip -n | zcat |hd
00000000  66 6f 6f 0a                                       |foo.|
00000004

I think the kputc('\n', &line); statement needs an eof check (eg if (!f_src->at_eof)).

Would you mind also updating the man page? (I can do this instead if you prefer as it's a pretty minor addition.) It would need documenting the heuristics handling special header symbols used too, as this surprised me when reading the code. I can see why you want to include it though.

Similarly a simple test adding to test/test.pl.

Edit: also maybe a warning about binary data. If it has a nul character then it'll be truncated, due to the way kgetline works. So -n on a binary file will silently trash the data. That's not ideal, albeit inappropriate usage. At least the man page should be explicit that -n behaviour is undefined on non-text documents. Ideally it'd work in all cases, but that's probably a challenging operation unless we replace kgetline with something different.

Jan 05 '22 12:01 jkbonfield

@jkbonfield Thanks, great feedback and great point about the last line. I worry that if (!f_src->at_eof) does not distinguish the two cases though. kgetline() consumes any newline and trims it afterwards, so I think at_eof will indicate regardless. Maybe at EOF I need to htell() and test if it's off-by-one from the total bytes I got from kgetline() plus number of lines. Yikes...

It looks to me like kgetline2() handles the nul character case, but otherwise it has the same issue at EOF. (Its purpose had been mysterious to me until your comment -- thank you!) And the memory usage could become erratic if the binary data happens not to use newline characters for long stretches.

Acknowledge the need for man & tests -- I'm first going to noodle on these technical issues though.

Jan 06 '22 09:01 mlin

If something claiming to be a text file does not end in a newline, is it really a bug to emit one? The two representations are surely equivalent for a text file. (This is certainly the attitude of C compilers and Unix, which don't really believe in text files without a final newline character — cf POSIX 3.403 Text File and 3.206 Line. Windows users may have a different view.)

What is the mnemonic for using -n for this? Did you consider using -T?

Jan 06 '22 12:01 jmarshall

That's one possible solution.

I am a bit uneasy given I could imagine us getting bug reports for files not round-tripping correctly. Given as I understand it the intention is simply to permit BGZF block boundaries to occur after newline characters, IMO it would be preferable if that's the only impact it had.

However we could just document it as only working on valid text files, ending in a newline and containing no binary data (UTF8 permitted, but not nuls).

Jan 06 '22 14:01 jkbonfield

I've factored a kgetline3() out of kgetline2() where the former includes the line terminator, if present. kgetline2() now just calls this kgetline3(), then trims the terminator as before.
Using kgetline3() instead of kgetline() should address both the end-of-file and inline-null problems.
Added a comment to kgetline() mentioning the inline-null caveat.
-T would be okay but I'm not a fan of unrelated case-sensitive short flags (nightmares about vg's CLI!). -n was meant to evoke \n but I admit that's cheesy. I've now tried -a seeing that GNU gzip has an -a flag, which does something different but at least vaguely related. But then maybe it should understand the corresponding long form --ascii too?

WDYT @jkbonfield @jmarshall ? (I've not forgotten about man & tests.)

Jan 08 '22 11:01 mlin

Upon further reflection, James is of course correct that not preserving the characters exactly would be a recipe for bug reports and trouble. So the better approach for this is to aim to flush blocks after newlines, and always produce identical contents even given non-text binary data.

At that point, this is just a compression knob (the extra flushes reduce the compression slightly) rather than a major mode of operation. So instead of having a --text command line option at all, it could take advantage of using hFILE by using hts_detect_format to select this textual approach or the pure binary approach automatically as appropriate.

Applying this text mode automatically would mean that many more .vcf.gz etc files will have this technique used than would if it needed to be specified by users and scripts on the command line — which is a good thing. (We might want to have a --binary option to always just use the pure binary approach. Sadly the good short option letters for that are also already taken…)

Jan 08 '22 12:01 jmarshall

And the memory usage could become erratic if the binary data happens not to use newline characters for long stretches.

Related to this, I do wonder what the implementation would look like if, instead of using any form of getline, it used the same 64K buffer as the pure binary approach and used strrchr() to search backwards from the end for \n characters. There's no need to process every line separately… (though setting off the header lines would be a bit harder…)

Jan 08 '22 13:01 jmarshall

I've been experimenting with adding explicit flush calls, as the simplest option (to-do: headers). It's trivial and seems minimal of CPU cost.

diff --git a/sam.c b/sam.c
index 393a3b2..23da760 100644
--- a/sam.c
+++ b/sam.c
@@ -4348,6 +4348,8 @@ int sam_write1(htsFile *fp, const sam_hdr_t *h, const bam1_t *b)
             if (sam_format1(h, b, &fp->line) < 0) return -1;
             kputc('\n', &fp->line);
             if (fp->is_bgzf) {
+                if (bgzf_flush_try(fp->fp.bgzf, fp->line.l) < 0)
+                    return -1;
                 if ( bgzf_write(fp->fp.bgzf, fp->line.s, fp->line.l) != fp->line.l ) return -1;
             } else {
                 if ( hwrite(fp->fp.hfile, fp->line.s, fp->line.l) != fp->line.l ) return -1;
diff --git a/vcf.c b/vcf.c
index 0d59be6..3d0643e 100644
--- a/vcf.c
+++ b/vcf.c
@@ -3275,10 +3275,13 @@ int vcf_write(htsFile *fp, const bcf_hdr_t *h, bcf1_t *v)
     fp->line.l = 0;
     if (vcf_format1(h, v, &fp->line) != 0)
         return -1;
-    if ( fp->format.compression!=no_compression )
+    if ( fp->format.compression!=no_compression ) {
+        if (bgzf_flush_try(fp->fp.bgzf, fp->line.l) < 0)
+            return -1;
         ret = bgzf_write(fp->fp.bgzf, fp->line.s, fp->line.l);
-    else
+    } else {
         ret = hwrite(fp->fp.hfile, fp->line.s, fp->line.l);
+    }
 
     if (fp->idx) {
         int tid;

I have a hack of bgzf and a bgzip debug binary that uses it to report the first 5 and last 5 chars in each block. Sure enough it's very noticable:

@ [samtools.../htslib]; bgzip -d < /tmp/a0.vcf.gz 2>&1 > /dev/null  | head
Block ##fil ... 	CCAC
Block A	222 ... .	DP=
Block 32;VD ... ;AN=2
Block ;DP4= ... ,18,2
Block 8;MQ= ... =0.25
Block 1315; ... S=0;M
Block Q0F=0 ... QBZ=0
Block ;MQSB ... :193,
Block 0,255 ... 18;FS
Block =0;MQ ... 3,16;

@ [samtools.../htslib]; bgzip -d < /tmp/a1.vcf.gz 2>&1 > /dev/null  | head
Block ##fil ... ,255

Block 1	568 ... 23,0

Block 1	852 ... ,255

Block 1	985 ... 96,0

Block 1	129 ... ,215


-rw-r--r-- 1 jkb team117 2958781 Jan 20 15:54 /tmp/a0.vcf.gz
-rw-r--r-- 1 jkb team117 2958560 Jan 20 16:15 /tmp/a1.vcf.gz

-rw-r--r-- 1 jkb team117 11148963 Jan 20 15:53 /tmp/a0.sam.gz
-rw-r--r-- 1 jkb team117 11152924 Jan 20 15:51 /tmp/a1.sam.gz

We could likely do the same with fasta/fastq too. Next up though is seeing whether we know on output files whether they're text or not. In htslib's usage via an htsFile this is doable. fp->format indicates it's a vcf, although it also supports generics such as text_format. So we could detect there.

However it won't solve this particular PR which I think is a separate issue. The bgzip utility doesn't use htsFile. It's driving bgzf natively. Of course it could be wrapped up to use htsFile, but firstly I don't know if it then handles unrecognised formats (eg any random data stream), and even if so there aren't any format-agnostic hts_write functions which can do the necessary flush logic. So I think we need two changes here. I'll look at the format specific side of things.

(That said, bgzip should probably still use htsFile for input, even if it's then ignoring the higher level structure components and is driving fp->fp.bgzf at the low level again, as it's a convenient way of querying the input format.)

Jan 20 '22 16:01 jkbonfield

Looking at this again, I see one minor glitch due to the option rename:

diff --git a/bgzip.c b/bgzip.c
index c7d61b5..0d51c4a 100644
--- a/bgzip.c
+++ b/bgzip.c
@@ -173,7 +173,7 @@ int main(int argc, char **argv)
         case '@': threads = atoi(optarg); break;
         case 't': test = 1; compress = 0; reindex = 0; break;
         case 'k': keep = 1; break;
-        case 'n': text = 1; break;
+        case 'a': text = 1; break;
         case 1:

With that change it appears to work, although I want to do more stress testing to look for corner cases.

Listening to @jmarshall's comments above I did try replacing the kstring code with something to use hread direct and a strrchr equivalent: strrchr doesn't have an end limit and it scans forward all the way so it's inefficient, so I use a proper reverse mechanism.

This replacement does work, and it's much less overhead:

bgzip -l1 -c ecoli60.sam

          cycles        inst            bytes                                                       
no -a     14781612438   24837545274     152298241                                                   
with -a   15327935908   26137789859     152270394                                                   
          +3.7%         +5.2%                                                                       
-a new    14882182612   24875449321     152267214                                                   
          +0.7%         +0.2%

At -l0 the old code was a full 92% more instructions, reduced to just +2%.

However, there are other factors to consider:

My replacement doesn't yet support headers. As previously noted It'd need to do line-by-line reading forward to find the first non-header line and then reverse searching for the last newline for data components, so it's basically doubling the complexity with two different implementations present.
I only have \n and not \n\r detection. It gets a little messy when the \n and \r may span a block boundary, as then maybe we need to find the previous \n\r instead or treat it as not-fitting). Lots of boundary cases which kgetline already solves. (Edit: I'm being dumb - it's \r\n, so works fine as is)
When we use the default compression level, the cycle difference was under 0.1% with Mike's code, so trying to optimise this is probably overkill.

That said, I suspect there are basic optimisations that can be done to the PR as-is to improve the overhead which may be worth doing, such as the !strchr("@#", *line.s).

Aug 04 '22 15:08 jkbonfield

Thanks @jkbonfield, I've not had a chance to revisit this in some time, but really appreciate your looking into it

Aug 05 '22 02:08 mlin

I've squashed and rebased it off current develop (including the one char fix for -a vs -n), added a minimal test (just duplicated the round-trip but with -a) and updated the man page.

This can be seen in my copy of your PR at https://github.com/jkbonfield/htslib/tree/bgzip-text-mode-1

If you're happy with this squashing and the extra commit then I can force to your branch so this PR updates. (It'll likely miss the upcoming release, but we should be able to then finish reviewing and merge early in next release cycle.)

Aug 05 '22 14:08 jkbonfield

@jkbonfield thanks, I reset my branch to yours

Aug 06 '22 08:08 mlin

Re option letters, see also the comments in https://github.com/samtools/htslib/pull/1369#issuecomment-1007975702 re detecting text/binary and using this text mode automatically as appropriate (probably with --binary to override it explicitly). Using text mode automatically would mean more text files would get this treatment.

Aug 08 '22 06:08 jmarshall

The other advantage of the strrchr-equivalent version is that you don't have to invent yet another getline variant. Headers can be incorporated by a mode that scans forward for newlines while lines start with the header character. Then the mode switches to the strrchr mode when it sees a non-header character at the start of a line.

Aug 08 '22 06:08 jmarshall

Good point on automatic detection.

As for strrchr, that itself is a bad function as it needs a null terminated string and it scans forward first before scanning backwards, as it's just a plain dumb API! However rolling your own is easy (as I did). I'm well aware of how we could do headers, but as I mentioned it's basically doubling up the work as we have a scan-forward variant followed by a scan-backward variant, and some trickery in the middle to switch. Doable and I have something part way already, but debateable if it's really worth the effort and extra code to maintain. I'll explore some more though to see just how messy it ends up looking.

Aug 08 '22 13:08 jkbonfield

htslib htslib copied to clipboard

bgzip text compression mode

htslib
htslib copied to clipboard