
Dataflow jobs using the SDK for Java 1.6.0 and reading compressed files from TextIO with compression mode set may be subject to data loss.

Open dhalperi opened this issue 9 years ago • 5 comments

We have identified an issue in which Dataflow jobs that read from TextIO with the compression type set to GZIP or BZIP2 may lose data during processing.

Specifically, the issue affects TextIO reads configured with either:

  • TextIO.Read.from(...).withCompressionType(CompressionType.GZIP) or
  • TextIO.Read.from(...).withCompressionType(CompressionType.BZIP2)

This issue is silent: you will not see any error messages or other visible symptoms. It occurs when all of the following hold: you use the Dataflow SDK for Java 1.6.0, you read compressed files, and you set the compression mode with withCompressionType to either GZIP or BZIP2.

Current known workarounds:

  • Recommended option: Use AUTO mode instead of GZIP or BZIP2 mode.

    Use withCompressionType(CompressionType.AUTO) with the TextIO source, or leave the compression type unset (AUTO is the default). NOTE: compressed files must have a .gz or .bz2 extension (case-insensitive) for this to work.

  • Switch to version 1.5.1 of the Dataflow SDK for Java. If you are using Maven, this can be done by specifying version 1.5.1 in your pom.xml.
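Before switching a pipeline to AUTO mode, it may help to confirm that every input file name carries an extension AUTO mode can act on. The following is a minimal sketch of such a check. It uses only the Java standard library, and the class name, method names, and example paths are hypothetical, not part of the Dataflow SDK. It mirrors only the rule stated in the note above (.gz or .bz2, case-insensitive) and does not reproduce the SDK's own matching logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the Dataflow SDK.
class CompressionExtensionCheck {

    // Returns true when the file name ends in .gz or .bz2, ignoring case,
    // which is the condition the note above gives for AUTO mode.
    static boolean hasRecognizedExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".gz") || lower.endsWith(".bz2");
    }

    // Returns the file names that lack a recognized extension, in input order.
    // These are the files to rename before relying on AUTO mode.
    static List<String> withoutRecognizedExtension(List<String> fileNames) {
        List<String> missing = new ArrayList<>();
        for (String name : fileNames) {
            if (!hasRecognizedExtension(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Example paths are made up for illustration.
        List<String> inputs = Arrays.asList(
                "gs://example-bucket/input/part-0001.gz",
                "gs://example-bucket/input/part-0002.BZ2",
                "gs://example-bucket/input/part-0003.dat");
        System.out.println("Files AUTO mode would not treat as compressed: "
                + withoutRecognizedExtension(inputs));
    }
}
```

If the returned list is empty, the file names satisfy the extension requirement in the note. Any names that are returned would need to be renamed, or handled with the 1.5.1 workaround instead.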

We are actively working to resolve this and will update this issue with all developments.

dhalperi avatar Aug 05 '16 02:08 dhalperi

Hi Dan.

How can we identify which pipelines, if any, have lost data? We have many pipelines reading GZIP files from GCS.

Graham

polleyg avatar Aug 05 '16 05:08 polleyg

Hi Graham,

The vast majority of customers will not be affected, because the default TextIO.Read.from("filepattern") will automatically notice .gz files and decompress them.

Affected jobs are only those using version 1.6.0 and manually calling withCompressionType(CompressionType.GZIP) or withCompressionType(CompressionType.BZIP2).

If you use the Cloud Console, you can inspect the Display Data of the TextIO.Read to see the compression mode.

An example of a TextIO.Read that is affected (Compression Mode is GZIP): image

An example of a normal TextIO.Read that is not affected (AUTO mode shows up as DecompressAccordingToFilename): image

Our support team is tracking affected jobs submitted to the Cloud Dataflow service and has already reached out to affected customers.

The DirectPipelineRunner also exhibits this bug locally.

Dan

dhalperi avatar Aug 05 '16 18:08 dhalperi

Thanks Dan. As recommended, we're going to roll out a hotfix with the compression type set to AUTO for now.

polleyg avatar Aug 08 '16 00:08 polleyg

Cloud Dataflow SDK for Java 1.6.1 has been released with a fix for this issue.

See Downloads for instructions on how to obtain and install the Cloud Dataflow SDK for Java.

dhalperi avatar Aug 10 '16 01:08 dhalperi

Awesome stuff Dan. Great turnaround time.

polleyg avatar Aug 11 '16 01:08 polleyg