GZip.jl

eachline() reports extra line in GZip file but not in unzipped file

Open slundberg opened this issue 10 years ago • 10 comments

I found an issue where eachline() returns an extra empty line "" after the end of a gz file I was reading. The file ends in a single newline and has 171 lines in total. Reading the uncompressed file works fine but, as the output below shows, reading from the GZip stream produces a spurious blank line.

This only happens for this file (thousands of other such files worked fine) and if I change the file more than just a character or two the bug goes away. Unfortunately this is medical data so I can't attach the file, but see the output below (using the current version of GZip):

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.0-rc4 (2014-08-15 04:01 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-redhat-linux
> using GZip
> open(f->length(readlines(f)), "/tmp/orig")
171
> GZip.open(fout->write(fout, open(readall, "/tmp/orig")), "/tmp/orig.gz", "w");
> GZip.open(f->length(readlines(f)), "/tmp/orig.gz")
172
> GZip.open(f->readlines(f), "/tmp/orig.gz")[end]
""
> open(fout->write(fout, GZip.open(readall, "/tmp/orig.gz")), "/tmp/orig2", "w");
> open(f->length(readlines(f)), "/tmp/orig2")
171

slundberg avatar Sep 05 '14 19:09 slundberg

@slundberg, thanks for the report.

Can you give the zlib version you're using? You can get it with GZip.zlib_version.

Also, can you try installing the Zlib package and running the same test using Zlib.reader(open("/tmp/orig"))?

kmsquire avatar Sep 05 '14 20:09 kmsquire

Sorry, that's Zlib.Reader(open("/tmp/orig")).

kmsquire avatar Sep 05 '14 20:09 kmsquire

Same issue with Zlib:

> GZip.zlib_version
"1.2.3"
> length(readlines(Zlib.Reader(open("/tmp/orig.gz"))))
172

I should also note that when I compress the file using gzip from the command line and then read it, everything is fine (at least for this file), so it only happens during a full read-write cycle.

slundberg avatar Sep 05 '14 20:09 slundberg

One further update: a read-write-read cycle with Zlib works, but I don't know whether that is only because I may have chosen a different compression level than the one GZip uses by default.

> f = open("/tmp/orig.gz", "w")
> zf = Zlib.Writer(f, 9)
> write(zf, open(readall, "/tmp/orig"))
> length(readlines(Zlib.Reader(open("/tmp/orig.gz"))))
171

slundberg avatar Sep 05 '14 21:09 slundberg

I was just going to suggest doing that. :-) At least that gives you a workaround.

You can set the compression level for gzip by appending the number to the file mode, e.g.,

f = GZip.open("/tmp/orig.gz", "w9");

Can you try that? Also, is it possible for you to try with a later version of zlib?

kmsquire avatar Sep 05 '14 21:09 kmsquire

Matching compression levels at 6 creates the issue with GZip but not Zlib.

It also looks like it could be the zlib version. I can't change that on the server very easily, but on my MacBook with zlib 1.2.5 I don't see the issue.

Perhaps I can get a newer zlib sometime soon on the server and see if that resolves it there as well. For now I can just check for empty lines.
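
A minimal sketch of that empty-line check, assuming the lines are read with their terminators retained (as in the transcript above, where the extra entry is "" rather than "\n"; on recent Julia versions that would mean readlines(io; keep=true)). The helper name drop_spurious_trailing_line is hypothetical and not part of GZip.jl:

```julia
# Hypothetical helper, not part of GZip.jl.
# Assumption: every genuine line still carries its "\n", so a final entry
# that is exactly "" cannot be a real line and is safe to drop.
function drop_spurious_trailing_line(lines::AbstractVector{<:AbstractString})
    if !isempty(lines) && isempty(lines[end])
        return lines[1:end-1]   # copy without the spurious last entry
    end
    return lines                # nothing to remove
end

# Example with the shape reported in this issue: real lines plus a stray "".
lines = ["first\n", "second\n", ""]
cleaned = drop_spurious_trailing_line(lines)
println(length(cleaned))
```

If the lines have already had their terminators stripped, a genuine blank last line would look identical to the spurious one, so this check would not be safe in that case.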

slundberg avatar Sep 05 '14 21:09 slundberg

GZip calls gzwrite, and Zlib doesn't, so that probably explains the difference. If you increase the buffer size for the write in Zlib, it might even be faster, if that matters. In the past, I've thought about merging those packages, since they're somewhat redundant, but I doubt I'll get to it anytime soon.

The zlib changelog shows a few fixes in gzwrite after version 1.2.3, so perhaps one of them fixed the issue.

Unfortunately, I'm not sure how we could detect this issue in GZip.jl, especially without a test example. If you have any thoughts, let me know.

kmsquire avatar Sep 05 '14 21:09 kmsquire

Thanks for being responsive on this! I ran a bunch of random tests and found a random file that had the same error after about 20k tries.

https://www.dropbox.com/s/qppddaryvgcmenl/test?dl=0

Perhaps it will give you the same error. If not it might be restricted to the zlib version I have.

slundberg avatar Sep 05 '14 22:09 slundberg

I also ran this script on my MacBook and found the same issue in a different random file, so I don't think the version is the issue. Perhaps you can run this and see if you find one on your setup? (You may need to increase the number of runs past 10k.)

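using GZip
using StatsBase

# Fragments that are sampled (with replacement) and joined to build each random line.
words = ["aasdf", "dfs", "ds", "q", " ", " ", "s", "t", "b", "e", "hdffda sdf", "sdf", "xjkd", "df:0.1"]

for i in 1:10000
    # Write a random number of random lines through GZip.
    f = GZip.open("/tmp/test.gz", "w")
    numLines = sample(100:800)
    for j in 1:numLines
        println(f, join(sample(words, sample(10:300)), ""))
    end
    close(f)

    # Read the file back and compare the line count with what was written.
    foundLines = GZip.open(f->length(readlines(f)), "/tmp/test.gz")
    if foundLines != numLines
        println("Found example! $foundLines $numLines")
        break
    end
end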

slundberg avatar Sep 05 '14 23:09 slundberg

Same problem here with GZip.zlib_version "1.2.7" and Julia version 1.1.0. Any gz file on my CentOS machine yields an extra line when read with GZip.jl, but not when read as an unzipped plain-text file. Any suggestions? Thanks a lot!

realzhang avatar Feb 01 '19 10:02 realzhang