DataflowJavaSDK
Dataflow uses incorrect full file size with GS file using Content-Encoding: gzip
To reproduce:
- Upload a simple file (10000 sequential numbers, one per line) to Google Cloud Storage, specifying gzip compression:
  gsutil cp -Z numbers.txt gs://<bucket>/numbers.txt
- Execute a simple Dataflow pipeline that just reads, then writes these numbers:
p.apply(TextIO.Read.from("gs://<bucket>/numbers.txt"))
.apply(TextIO.Write.to("gs://<bucket>/out").withSuffix(".txt"));
Expected: either all 10000 numbers written, or alternatively gibberish (the raw compressed data).
Actual: only a subset of the numbers is written (1-4664). It looks like the decompressed file is read as if its size were that of the file before decompression.
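For completeness, here is roughly what a self-contained version of the failing pipeline looks like (the options/main scaffolding is just illustrative, assuming the Dataflow Java SDK 1.x package names):

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class GzipTruncationRepro {
  public static void main(String[] args) {
    // Runner, project, and staging location are passed on the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // The .txt extension means TextIO treats the file as uncompressed.
    p.apply(TextIO.Read.from("gs://<bucket>/numbers.txt"))
     .apply(TextIO.Write.to("gs://<bucket>/out").withSuffix(".txt"));

    p.run();
  }
}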
Specifying the GZIP compression type explicitly works as expected (all 10000 numbers written):
p.apply(TextIO.Read.from("gs://<bucket>/numbers.txt")
.withCompressionType(CompressionType.GZIP))
.apply(TextIO.Write.to("gs://<bucket>/out").withSuffix(".txt"));
Thanks @rfevang for this detailed report, much appreciated.
My understanding is that this is a fundamental limitation of GCS's handling of Content-Encoding: gzip objects.
1. TextIO.Read uses the file extension to determine whether a file is compressed, and .txt says it is not.
2. We stat the file, and GCS gives us the compressed size.
3. We use the GCS client libraries to download the file, and they serve us the uncompressed bytes (they transparently decompress them, and we have no way to disable this).
4. We trust the file size from step 2 and read only that many bytes of the decompressed stream -- see the sketch after this list.
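To make the failure mode concrete, here is a purely local sketch of steps 2-4 (no GCS involved; java.util.zip stands in for the transparent decompression the client libraries do, and the exact byte counts are only illustrative):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SizeMismatchDemo {
  public static void main(String[] args) throws Exception {
    // Build "numbers.txt" in memory: 10000 sequential numbers, one per line.
    StringBuilder sb = new StringBuilder();
    for (int i = 1; i <= 10000; i++) {
      sb.append(i).append('\n');
    }
    byte[] uncompressed = sb.toString().getBytes(StandardCharsets.UTF_8);

    // Compress it, as 'gsutil cp -Z' does before uploading.
    ByteArrayOutputStream compressedOut = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(compressedOut)) {
      gzip.write(uncompressed);
    }
    byte[] compressed = compressedOut.toByteArray();
    System.out.println("uncompressed size = " + uncompressed.length);
    System.out.println("compressed size   = " + compressed.length);

    // Step 2: the stat reports the compressed size.
    long advertisedSize = compressed.length;
    // Step 3: the stream we actually get back has already been decompressed.
    InputStream decompressed = new GZIPInputStream(new ByteArrayInputStream(compressed));

    // Step 4: trust the advertised size and read only that many bytes.
    byte[] prefix = new byte[(int) advertisedSize];
    int read = 0;
    while (read < prefix.length) {
      int n = decompressed.read(prefix, read, prefix.length - read);
      if (n < 0) break;
      read += n;
    }

    // Only a prefix of the lines survives, much like the 1-4664 seen above.
    String truncated = new String(prefix, 0, read, StandardCharsets.UTF_8);
    long lines = truncated.chars().filter(c -> c == '\n').count();
    System.out.println("complete lines read = " + lines + " of 10000");
  }
}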
So I think this is working as intended -- you simply should not use that mode with GCS unless you force the GZIP compression type. You arrived at exactly the right solution.
I think the issue is in step 3, and here is what should be happening: if the bytes were not transparently decompressed, we would get the right number of compressed bytes. TextIO would then serve garbage, the user would notice, and they would set the GZIP compression type to force decompression properly.
It's a bit nasty to have it work this way, though. Depending on the type of file you're processing, it can look as if everything worked correctly.
Considering that Cloud Storage encourages you to upload text files this way (and TextIO only supports text files), I really think that if proper decompression isn't possible, TextIO needs to either throw an exception or yield the compressed data. Is there no way to look at the Content-Encoding ahead of time and do the right thing?
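For illustration, something like the following is what I have in mind, using the GCS JSON API client just as an example (I don't know which client TextIO uses internally, so treat the details as a sketch):

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.storage.Storage;
import com.google.api.services.storage.model.StorageObject;
import java.util.Collections;

public class ContentEncodingCheck {
  public static void main(String[] args) throws Exception {
    GoogleCredential credential = GoogleCredential.getApplicationDefault();
    if (credential.createScopedRequired()) {
      credential = credential.createScoped(
          Collections.singleton("https://www.googleapis.com/auth/devstorage.read_only"));
    }
    Storage storage = new Storage.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            credential)
        .setApplicationName("content-encoding-check")
        .build();

    StorageObject object = storage.objects().get("<bucket>", "numbers.txt").execute();
    // For objects uploaded with 'gsutil cp -Z', contentEncoding is "gzip"
    // and the reported size is the size of the compressed data.
    System.out.println("contentEncoding = " + object.getContentEncoding());
    System.out.println("size (as stored) = " + object.getSize());
  }
}

If the answer comes back "gzip", the read could fail fast or switch to the GZIP path instead of trusting that size.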
Also, how does TextIO get the non-decompressed bytes when the GZIP compression type is specified manually? It would need to fetch the data in a different way somehow, so that it doesn't try to decompress bytes that have already been decompressed, right?
I think you're right, @rfevang. I'll try to follow up. Right now, Google Cloud Storage just looks like a filesystem that lies to us about the file size, but perhaps we can catch this in a different way or push for an upstream behavior change.
Leaving this bug open to track.
I just hit this issue. Thank goodness I found this bug report, because I might never have figured it out myself. (Thanks @rfevang!)
Is there anything that can be done? Perhaps at least add an item for this in Troubleshooting Your Pipeline (https://cloud.google.com/dataflow/docs/guides/troubleshooting-your-pipeline#Errors), or is this error too uncommon for that?