Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header
We have gzipped text files in Google Cloud Storage that have the following metadata headers set:
Content-Encoding: gzip
Content-Type: application/octet-stream
Trying to read these with apache_beam.io.ReadFromText yields the following error:
ERROR:root:Exception while fetching 341565 bytes from position 0 of gs://...-c72fa25a-5d8a-4801-a0b4-54b58c4723ce.gz:
Cannot have start index greater than total size
Traceback (most recent call last):
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
line 585, in _fetch_to_queue
value = func(*args)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
line 610, in _get_segment
downloader.GetRange(start, end)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py",
line 477, in GetRange
progress, end_byte = self.__NormalizeStartEnd(start, end)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py",
line 340, in __NormalizeStartEnd
'Cannot have start index greater than total size')
TransferInvalidError: Cannot have start index greater than total size
WARNING:root:Task failed: Traceback (most recent call last):
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/executor.py",
line 300, in __call__
result = evaluator.finish_bundle()
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py",
line 206, in finish_bundle
bundles = _read_values_to_bundles(reader)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py",
line 196, in _read_values_to_bundles
read_result = [GlobalWindows.windowed_value(e) for e in reader]
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/concat_source.py",
line 79, in read
range_tracker.sub_range_tracker(source_ix)):
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
line 155, in read_records
read_buffer)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
line 245, in _read_record
sep_bounds = self._find_separator_bounds(file_to_read, read_buffer)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
line 190, in _find_separator_bounds
file_to_read, read_buffer, current_pos + 1):
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
line 212, in _try_to_ensure_num_bytes_in_buffer
read_data = file_to_read.read(self._buffer_size)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py",
line 460, in read
self._fetch_to_internal_buffer(num_bytes)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py",
line 420, in _fetch_to_internal_buffer
buf = self._file.read(self._read_size)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
line 472, in read
return self._read_inner(size=size, readline=False)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
line 516, in _read_inner
self._fetch_next_if_buffer_exhausted()
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
line 577, in _fetch_next_if_buffer_exhausted
raise exn
TransferInvalidError: Cannot have start index greater than total size
After removing the Content-Encoding header the read works fine.
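For reference, a minimal sketch of that workaround, assuming the google-cloud-storage client is available; the bucket and object names are hypothetical:

```python
from google.cloud import storage

# Clear the Content-Encoding metadata on an existing object so that GCS stops
# applying decompressive transcoding when the object is read.
blob = storage.Client().bucket("my-bucket").blob("path/to/file.gz")
blob.content_encoding = None  # drop the gzip Content-Encoding header
blob.patch()                  # push the metadata change to GCS
```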
Imported from Jira BEAM-1874. Original Jira may contain additional context. Reported by: smphhh.
Is there an update on this? It looks like it has been an issue for years, and while there is a workaround, it's not very satisfying and we don't want to set the content-encoding to the wrong value on GCS.
Bringing over some context from https://cloud.google.com/storage/docs/transcoding, it seems like there are the following consistent situations:
1. GCS transcodes and Beam works with this transparently.
   Content-encoding: gzip
   Content-type: X
   Beam's IO reads it expecting contents to be X. I believe the problem is that GCS serves metadata that results in wrong splits.
2. GCS does not transcode because the metadata is set to not transcode (current recommendation).
   Content-encoding: <empty>
   Content-type: gzip
   Beam's IO reads it, and the user specifies gzip or it is autodetected by the IO.
3. GCS does not transcode because the Beam IO requests no transcoding.
   Content-encoding: gzip
   Content-type: X
   Beam's IO passes the header Accept-Encoding: gzip.
I believe 2 is the only one that works today. I am not sure if 1 is possible. I do think that 3 should be able to work, but needs some implementation.
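To make option 3 concrete, here is a hedged, non-Beam illustration of the idea using the google-cloud-storage client: fetch the stored (still compressed) bytes without decompressive transcoding and decompress locally. The bucket and object names are hypothetical.

```python
import gzip

from google.cloud import storage

blob = storage.Client().bucket("my-bucket").blob("data/part-000.gz")
# raw_download=True asks for the stored bytes as-is, so the size GCS reports
# matches the bytes actually returned and the caller decompresses itself.
raw = blob.download_as_bytes(raw_download=True)
text = gzip.decompress(raw).decode("utf-8")
```

Beam's GCS IO would need to do something equivalent internally for option 3 to work.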
Guys this is a major issue.
This is still an issue with 2.43.0. Does anyone have a workaround that does not require changing metadata in GCS, and isn't "use the Java SDK"?
The way to fix this is to just use the Python GCS library and not use the GCS client in Beam; this assumes you can, and that it's not some internal usage by Beam. Also, unlike the Beam implementation, the official GCS client is thread safe, and it looks like it has been moved off httplib2.
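A hedged sketch of that workaround, reading the object with the official google-cloud-storage client inside a DoFn instead of Beam's built-in GCS IO; all bucket and object names here are hypothetical:

```python
import gzip
import io

import apache_beam as beam


class ReadGzippedGcsFile(beam.DoFn):
    """Yields text lines from a gzipped GCS object, ignoring Content-Encoding."""

    def process(self, gcs_path):
        # Import inside process() so the dependency is resolved on the workers.
        from google.cloud import storage

        bucket_name, _, object_name = gcs_path[len("gs://"):].partition("/")
        blob = storage.Client().bucket(bucket_name).blob(object_name)
        # raw_download=True fetches the stored bytes without decompressive
        # transcoding, regardless of the Content-Encoding metadata.
        data = blob.download_as_bytes(raw_download=True)
        for line in gzip.GzipFile(fileobj=io.BytesIO(data)):
            yield line.decode("utf-8").rstrip("\n")


with beam.Pipeline() as p:
    lines = (
        p
        | beam.Create(["gs://my-bucket/data/part-000.gz"])
        | beam.ParDo(ReadGzippedGcsFile())
    )
```

Note this reads each object in a single worker and does not split within files, which matches Beam's existing behavior for gzip-compressed input anyway.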
Thanks for the updates. Seems like the thing that would make this "just work", at some cost on the Dataflow side but saving bandwidth, would be option 3. This should be a fairly easy thing for someone to do as a first issue without knowing Beam too much.
- You can upload the object to GCS with the Content-Type set to indicate compression and NO Content-Encoding at all, according to best practices:
  Content-encoding: <empty>
  Content-type: application/gzip
  In this case the only thing immediately known about the object is that it is gzip-compressed, with no information regarding the underlying object type. Moreover, the object is not eligible for decompressive transcoding. Reference: https://cloud.google.com/storage/docs/transcoding
Beam's ReadFromText with compression_type=CompressionTypes.GZIP works fine with the above option:
p | "Read GCS File" >> beam.io.ReadFromText(file_pattern=file_path, compression_type=CompressionTypes.GZIP, skip_header_lines=int(skip_header))
Ways to compress the file:
- Implicitly, by specifying gsutil cp -Z <filename> <bucket>
- Explicitly, by compressing the file first with gzip <filename> and loading it to GCS
For more details on which combinations work, please see the table below:
Hi @kennknowles @sqlboy ,
The option that works correctly so far is as below:
- Do an explicit compression of the file with gzip
- Upload the file to GCS with the correct content type (application/gzip):
  gsutil -h "Content-Type:application/gzip" cp sample.csv.gz gs://gcp-sandbox-1-359004/scn4/
- Content-Encoding will not be set
gcloud storage objects describe gs://gcp-sandbox-1-359004/scn4/sample.csv.gz
bucket: gcp-sandbox-1-359004
contentType: application/gzip
crc32c: v1lNUQ==
etag: CLnDx+CIif0CEAE=
generation: '1675967308358073'
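A hedged programmatic equivalent of that upload, using the google-cloud-storage client; the bucket and object names are taken from the example above and may differ in your setup:

```python
from google.cloud import storage

bucket = storage.Client().bucket("gcp-sandbox-1-359004")
blob = bucket.blob("scn4/sample.csv.gz")
# Set only the content type; content_encoding stays unset, so GCS will not
# attempt decompressive transcoding and Beam can read the object as gzip.
blob.upload_from_filename("sample.csv.gz", content_type="application/gzip")
```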
The only caveat here is that the user will not get the benefit of transcoding: when they attempt to download from the bucket, they will get a .gz file.
While we explore this caveat with the client, we wanted to check if Option 1 mentioned in the comment (https://github.com/apache/beam/issues/18390#issuecomment-1179313964) can be fixed.
This option would give the best of both worlds: Dataflow would be able to read a compressed file, and the user could still take advantage of transcoding.
Please let me know if there is any alternate suggestion.
.take-issue
@BjornPrime is working on fixing #25676, which might fix this issue as well.
Having encountered this while migrating the GCS client, I do not believe the migration will resolve this issue on its own. It seems to be related to how GCSFileSystem handles compressed files.
I haven't thought about this in a while, but is there a problem with always passing Accept-Encoding: gzip?
I am encountering a similar issue when uploading my SQL files from GitHub via CI. Not sure if this issue has been fixed yet. I tried setting the parameter headers: |- content-type: application/octet-stream, but it didn't make any change in the error.
same as https://github.com/apache/beam/issues/31040
cc @shunping