Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header
We have gzipped text files in Google Cloud Storage that have the following metadata headers set:
Content-Encoding: gzip
Content-Type: application/octet-stream
Trying to read these with apache_beam.io.ReadFromText yields the following error:
ERROR:root:Exception while fetching 341565 bytes from position 0 of gs://...-c72fa25a-5d8a-4801-a0b4-54b58c4723ce.gz:
Cannot have start index greater than total size
Traceback (most recent call last):
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
line 585, in _fetch_to_queue
value = func(*args)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
line 610, in _get_segment
downloader.GetRange(start, end)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py",
line 477, in GetRange
progress, end_byte = self.__NormalizeStartEnd(start, end)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py",
line 340, in __NormalizeStartEnd
'Cannot have start index greater than total size')
TransferInvalidError: Cannot have start index greater than total size
WARNING:root:Task failed: Traceback (most recent call last):
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/executor.py",
line 300, in __call__
result = evaluator.finish_bundle()
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py",
line 206, in finish_bundle
bundles = _read_values_to_bundles(reader)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py",
line 196, in _read_values_to_bundles
read_result = [GlobalWindows.windowed_value(e) for e in reader]
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/concat_source.py",
line 79, in read
range_tracker.sub_range_tracker(source_ix)):
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
line 155, in read_records
read_buffer)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
line 245, in _read_record
sep_bounds = self._find_separator_bounds(file_to_read, read_buffer)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
line 190, in _find_separator_bounds
file_to_read, read_buffer, current_pos + 1):
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
line 212, in _try_to_ensure_num_bytes_in_buffer
read_data = file_to_read.read(self._buffer_size)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py",
line 460, in read
self._fetch_to_internal_buffer(num_bytes)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py",
line 420, in _fetch_to_internal_buffer
buf = self._file.read(self._read_size)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
line 472, in read
return self._read_inner(size=size, readline=False)
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
line 516, in _read_inner
self._fetch_next_if_buffer_exhausted()
File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
line 577, in _fetch_next_if_buffer_exhausted
raise exn
TransferInvalidError: Cannot have start index greater than total size
After removing the Content-Encoding header the read works fine.
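For reference, a minimal sketch of that workaround, assuming the google-cloud-storage client is available; the bucket and object names are hypothetical:

```python
from google.cloud import storage

# Clear the Content-Encoding metadata on an existing object so that GCS stops
# applying decompressive transcoding when the object is read.
blob = storage.Client().bucket("my-bucket").blob("path/to/file.gz")
blob.content_encoding = None  # drop the gzip Content-Encoding header
blob.patch()                  # push the metadata change to GCS
```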
Imported from Jira BEAM-1874. Original Jira may contain additional context. Reported by: smphhh.
Is there an update on this? It looks like it has been an issue for years, and while there is a workaround, it's not very satisfying and we don't want to set the content-encoding to the wrong value on GCS.
Bringing over some context from https://cloud.google.com/storage/docs/transcoding, it seems like there are the following consistent situations:
1. GCS transcodes and Beam works with this transparently.
   Content-encoding: gzip
   Content-type: X
   Beam's IO reads it expecting contents to be X. I believe the problem is that GCS serves metadata that results in wrong splits.
2. GCS does not transcode because the metadata is set to not transcode (current recommendation).
   Content-encoding: <empty>
   Content-type: gzip
   Beam's IO reads it, and the user specifies gzip or it is autodetected by the IO.
3. GCS does not transcode because the Beam IO requests no transcoding.
   Content-encoding: gzip
   Content-type: X
   Beam's IO passes the header Accept-Encoding: gzip.
I believe 2 is the only one that works today. I am not sure if 1 is possible. I do think that 3 should be able to work, but needs some implementation.
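To make option 3 concrete, here is a hedged, non-Beam illustration of the idea using the google-cloud-storage client: fetch the stored (still compressed) bytes without decompressive transcoding and decompress locally. The bucket and object names are hypothetical.

```python
import gzip

from google.cloud import storage

blob = storage.Client().bucket("my-bucket").blob("data/part-000.gz")
# raw_download=True asks for the stored bytes as-is, so the size GCS reports
# matches the bytes actually returned and the caller decompresses itself.
raw = blob.download_as_bytes(raw_download=True)
text = gzip.decompress(raw).decode("utf-8")
```

Beam's GCS IO would need to do something equivalent internally for option 3 to work.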
Guys this is a major issue.
This is still an issue with 2.43.0. Does anyone have a workaround that does not require changing metadata in GCS, and isn't "use the Java SDK"?
The way to fix this is to just use the Python GCS library and not use the GCS client in Beam; this assumes you can, and that it's not some internal usage by Beam. Also, unlike the Beam implementation, the official GCS client is thread safe, and it looks like it has been moved off httplib2.
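A hedged sketch of that workaround, reading the object with the official google-cloud-storage client inside a DoFn instead of Beam's built-in GCS IO; all bucket and object names here are hypothetical:

```python
import gzip
import io

import apache_beam as beam


class ReadGzippedGcsFile(beam.DoFn):
    """Yields text lines from a gzipped GCS object, ignoring Content-Encoding."""

    def process(self, gcs_path):
        # Import inside process() so the dependency is resolved on the workers.
        from google.cloud import storage

        bucket_name, _, object_name = gcs_path[len("gs://"):].partition("/")
        blob = storage.Client().bucket(bucket_name).blob(object_name)
        # raw_download=True fetches the stored bytes without decompressive
        # transcoding, regardless of the Content-Encoding metadata.
        data = blob.download_as_bytes(raw_download=True)
        for line in gzip.GzipFile(fileobj=io.BytesIO(data)):
            yield line.decode("utf-8").rstrip("\n")


with beam.Pipeline() as p:
    lines = (
        p
        | beam.Create(["gs://my-bucket/data/part-000.gz"])
        | beam.ParDo(ReadGzippedGcsFile())
    )
```

Note this reads each object in a single worker and does not split within files, which matches Beam's existing behavior for gzip-compressed input anyway.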
Thanks for the updates. Seems like the thing that would make this "just work", at some cost on the Dataflow side but saving bandwidth, would be option 3. This should be a fairly easy thing for someone to do as a first issue without knowing Beam too much.
- You can upload the object to GCS with the Content-Type set to indicate compression and NO Content-Encoding at all, according to best practices:
  Content-encoding: <empty>
  Content-type: application/gzip
  In this case the only thing immediately known about the object is that it is gzip-compressed, with no information regarding the underlying object type. Moreover, the object is not eligible for decompressive transcoding. Reference: https://cloud.google.com/storage/docs/transcoding
Beam's ReadFromText with compression_type=CompressionTypes.GZIP works fine with the above option:
p | "Read GCS File" >> beam.io.ReadFromText(file_pattern=file_path, compression_type=CompressionTypes.GZIP, skip_header_lines=int(skip_header))
Ways to compress the file:
- Implicitly, by specifying gsutil cp -Z <filename> <bucket>
- Explicitly, by compressing the file first with gzip <filename> and loading it to GCS
For more details on which combinations work, please see the table below:
Hi @kennknowles @sqlboy ,
The option that works correctly so far is as below:
- Do an explicit compression of the file with gzip
- Upload the file to GCS with the correct content type (application/gzip):
  gsutil -h "Content-Type:application/gzip" cp sample.csv.gz gs://gcp-sandbox-1-359004/scn4/
- Content-Encoding will not be set
gcloud storage objects describe gs://gcp-sandbox-1-359004/scn4/sample.csv.gz
bucket: gcp-sandbox-1-359004
contentType: application/gzip
crc32c: v1lNUQ==
etag: CLnDx+CIif0CEAE=
generation: '1675967308358073'
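A hedged programmatic equivalent of that upload, using the google-cloud-storage client; the bucket and object names are taken from the example above and may differ in your setup:

```python
from google.cloud import storage

bucket = storage.Client().bucket("gcp-sandbox-1-359004")
blob = bucket.blob("scn4/sample.csv.gz")
# Set only the content type; content_encoding stays unset, so GCS will not
# attempt decompressive transcoding and Beam can read the object as gzip.
blob.upload_from_filename("sample.csv.gz", content_type="application/gzip")
```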
The only caveat here is that the user will not get the benefit of transcoding: when they attempt to download from the bucket, they will get a .gz file.
While we explore this caveat with the client, we wanted to check if Option 1 mentioned in the comment (https://github.com/apache/beam/issues/18390#issuecomment-1179313964) can be fixed.
This option would give the best of both worlds: Dataflow would be able to read a compressed file, and the user could still take advantage of transcoding.
Please let me know if there is any alternate suggestion.
.take-issue
@BjornPrime is working on fixing #25676, which might fix this issue as well.
Having encountered this while migrating the GCS client, I do not believe the migration will resolve this issue on its own. It seems to be related to how GCSFileSystem handles compressed files.
I haven't thought about this in a while, but is there a problem with always passing Accept-Encoding: gzip?
I am encountering a similar issue when uploading my SQL files from GitHub via CI. Not sure if this issue has been fixed yet. I tried setting the parameter headers: |- content-type: application/octet-stream, but it didn't make any change in the error.
same as https://github.com/apache/beam/issues/31040
cc @shunping