
GzipDecoder read_lines terminated early, but fixes on deleting an empty line?!

Open mfajer opened this issue 3 years ago • 2 comments

I am seeing a weird interaction between async_compression::tokio::bufread::GzipDecoder and tokio::io::AsyncBufReadExt. When I use AsyncBufReadExt to count the lines in an uncompressed file I get 413484 lines, but when I wrap a BufReader around a GzipDecoder on a gzipped version of the same file I only get 65654 lines. I can make the problem go away by removing an empty line somewhere before the divergence point, after which both files report 413483 lines. This makes me think there is an edge case in the various buffers that causes line reading through the GzipDecoder to terminate early, and that any small change (such as removing that one empty line) happens to get things working again. I can't share the files, but I would be happy to diagnose further if anyone has suggestions.

EDIT: This error does not occur if I use the synchronous flate2 decompression by the way, so it is something specific to the tokio/async_compression interactions.

mfajer, Jul 14 '22 20:07

First thing I would try is to read_to_end and check that the lengths match. It seems unlikely that it's an interaction with the outer BufReader, more likely to be the gzip decoder getting an early EOF.

One possibility is that the compressed file consists of multiple concatenated sections (gzip members). Some decompressors will automatically read these sections and concatenate their output, but for async-compression you must use multiple_members to enable this behaviour. (I'm not sure if there's an easy way to check whether a file has multiple sections or not; the gzip CLI doesn't seem to have any way to see them.)

Nemo157, Jul 14 '22 23:07

You were exactly right! Using read_to_end on the gzipped file read only about a quarter of the expected bytes. Turning on multiple_members resolved both the read_to_end and the read_line discrepancies. Is it worth considering having this enabled by default, if that seems to be the default for other decompressors? Or perhaps making the option more visible somewhere in the docs? If you would prefer the second, I can make a merge request. Thanks again for your incisive and prompt assistance!

mfajer, Jul 15 '22 16:07