GZip.jl
eachline() reports extra line in GZip file but not in unzipped file
I found an issue where eachline() was returning an extra empty line "" after the end of a gz file I was reading. The file ends in a single newline and has 171 lines in total. Reading the uncompressed file works fine, but as the output below shows, reading from the GZip stream produces a spurious blank line.
This only happens for this file (thousands of other such files worked fine) and if I change the file more than just a character or two the bug goes away. Unfortunately this is medical data so I can't attach the file, but see the output below (using the current version of GZip):
Julia Version 0.3.0-rc4 (2014-08-15 04:01 UTC), x86_64-redhat-linux
> using GZip
> open(f->length(readlines(f)), "/tmp/orig")
171
> GZip.open(fout->write(fout, open(readall, "/tmp/orig")), "/tmp/orig.gz", "w");
> GZip.open(f->length(readlines(f)), "/tmp/orig.gz")
172
> GZip.open(f->readlines(f), "/tmp/orig.gz")[end]
""
> open(fout->write(fout, GZip.open(readall, "/tmp/orig.gz")), "/tmp/orig2", "w");
> open(f->length(readlines(f)), "/tmp/orig2")
171
@slundberg, thanks for the report.
Can you give the zlib version you're using? You can get it with GZip.zlib_version.
Also, can you try installing the Zlib package and running the same test using Zlib.reader(open("/tmp/orig"))?
Sorry, that's Zlib.Reader(open("/tmp/orig")).
Same issue with Zlib:
> GZip.zlib_version
"1.2.3"
> length(readlines(Zlib.Reader(open("/tmp/orig.gz"))))
172
I should also note that when I compress the file using gzip from the command line and then read it, everything is fine (at least for this file), so it only happens during a full read-write cycle.
One further update: a read-write-read cycle with Zlib works, but I don't know if that's just because I may have chosen a different compression level than GZip uses by default.
> f = open("/tmp/orig.gz", "w")
> zf = Zlib.Writer(f, 9)
> write(zf, open(readall, "/tmp/orig"))
> length(readlines(Zlib.Reader(open("/tmp/orig.gz"))))
171
I was just going to suggest doing that. :-) At least that gives you a workaround.
You can set the compression level for gzip by appending the number to the file mode, e.g.,
f = GZip.open("/tmp/orig.gz", "w9");
Can you try that? Also, is it possible for you to try with a later version of zlib?
Matching compression levels at 6 creates the issue with GZip but not Zlib.
It also looks like it could be the zlib version. I can't change that on the server very easily but on my macbook with zlib 1.2.5 I don't see the issue.
Perhaps I can get a newer zlib sometime soon on the server and see if that resolves it there as well. For now I can just check for empty lines.
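The "check for empty lines" workaround could look roughly like the sketch below. It assumes Julia 1.x syntax, and drop_trailing_empty is a hypothetical helper name, not something GZip.jl provides. It works on any vector of lines, so it can be applied to the result of readlines regardless of which package produced it.

```julia
# Hypothetical helper (not part of GZip.jl or Zlib): remove a single
# spurious empty entry from the end of a vector of lines.
function drop_trailing_empty(lines::AbstractVector{<:AbstractString})
    if !isempty(lines) && isempty(lines[end])
        return lines[1:end-1]
    end
    return lines
end

# Example with newline characters kept, as in the session above.
cleaned = drop_trailing_empty(["first\n", "second\n", ""])
println(length(cleaned))
```

If the newline characters are kept on each line, a genuine blank line is "\n" rather than "", so only the spurious entry is removed. If the newlines have been stripped, a real blank last line would be dropped as well.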
GZip calls gzwrite, and Zlib doesn't, so that probably explains the difference. If you increase the buffer size for the write in Zlib, it might even be faster, if that matters. In the past, I've thought about merging those packages, since they're somewhat redundant, but I doubt I'll get to it anytime soon.
The zlib changelog shows a few fixes in gzwrite after version 1.2.3, so perhaps one of them fixed the issue.
Unfortunately, I'm not sure how we could detect this issue in GZip.jl, especially without a test example. If you have any thoughts, let me know.
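One possible way to detect the problem after the fact, sketched here under assumptions rather than as anything GZip.jl does: compute how many lines the original data should contain and compare that with what readlines returns from the stream. expected_line_count is a hypothetical helper, the code assumes Julia 1.x, and an IOBuffer stands in for the compressed stream so the sketch runs without any package installed.

```julia
# Hypothetical helper: the number of lines a string should yield, i.e. the
# number of newline characters, plus one if the data is non-empty and does
# not end in a newline.
function expected_line_count(data::AbstractString)
    n = count(c -> c == '\n', data)
    if !isempty(data) && !endswith(data, "\n")
        n += 1
    end
    return n
end

data = "first\nsecond\nthird\n"
# Stand-in for the gz stream; a real check would read the lines back
# from the compressed file instead.
lines = readlines(IOBuffer(data))
if length(lines) != expected_line_count(data)
    println("line count mismatch: ", length(lines), " vs ", expected_line_count(data))
end
```

A check like this only helps when the uncompressed data is still available for comparison, for example inside a round-trip test.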
Thanks for being responsive on this! I ran a bunch of random tests and found a random file that had the same error after about 20k tries.
https://www.dropbox.com/s/qppddaryvgcmenl/test?dl=0
Perhaps it will give you the same error. If not it might be restricted to the zlib version I have.
I also ran this script on my macbook and found the same issue in a different random file, so I don't think the version is the issue. Perhaps you can run this and see if you find one on your setup? (you may need to increase the number of runs past 10k)
using GZip
using StatsBase
for i in 1:10000
    f = GZip.open("/tmp/test.gz", "w")
    numLines = sample(100:800)
    for j in 1:numLines
        println(f, join(sample(["aasdf", "dfs", "ds", "q", " ", " ", "s", "t", "b", "e", "hdffda sdf", "sdf", "xjkd", "df:0.1"], sample(10:300)), ""))
    end
    close(f)
    foundLines = GZip.open(f->length(readlines(f)), "/tmp/test.gz")
    if foundLines != numLines
        println("Found example! $foundLines $numLines")
        break
    end
end
Same problem with GZip.zlib_version "1.2.7" and Julia Version 1.1.0: any gz file on my CentOS machine yields an extra line when read with GZip.jl, but not when read as an unzipped plain text file. Any suggestions? Thanks a lot!