python-snappy

How to handle snappy files generated by Trino?

Open jfNasciment0 opened this issue 1 year ago • 1 comments

Hello,

Since the 0.7.1 release I can't decompress CSV files generated by Trino. I think the issue is related to Hadoop_snappy. Does anyone know how it can be fixed?

import io

from snappy import snappy_formats

csv_file = 'csv_67dba65a.snappy'

def read_file(file_path):
    # Returns a new binary file handle on every call
    return open(file_path, 'rb')

# Detect the format from the file header
decompress_func, read_chunk = snappy_formats.get_decompress_function(
    'auto',
    read_file(csv_file)
)
decompressed_stream = io.BytesIO()
# Decompress the data
decompress_func(
    read_file(csv_file),
    decompressed_stream,
    start_chunk=read_chunk
)
decompressed_stream.seek(0)

print(f"Compressed file: {read_file(csv_file).read()}")
print(f"DeCompressed file: {decompressed_stream.read()}")

This code has different outputs based on the version:

  • 0.7.0

  Compressed file: b'\x00\x00\x00\x04\x00\x00\x00\x06\x04\x0c"a"\n'
  DeCompressed file: b'"a"\n"a"\n'

  • 0.7.1

  .venv/lib/python3.12/site-packages/snappy/snappy_formats.py", line 64, in get_decompress_function
      decompress_func, read_chunk = guess_format_by_header(fin)

  .venv/lib/python3.12/site-packages/snappy/snappy_formats.py", line 59, in guess_format_by_header
      raise UncompressError("Can't detect archive format")
  snappy.snappy.UncompressError: Can't detect archive format

jfNasciment0, Mar 14 '24 10:03

Since you are the second person to ask, we might be able to re-implement this, at least for compress/decompress (as opposed to streaming).

martindurant, Mar 15 '24 17:03
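Until something like that lands, a possible workaround is to unpack the Hadoop framing by hand and pass each chunk to a raw snappy decompressor. The sketch below is not part of python-snappy. It assumes the usual Hadoop block layout, which matches the bytes shown in the report: a 4-byte big-endian uncompressed block length, followed by one or more chunks that each start with a 4-byte big-endian compressed length. The function name `hadoop_snappy_decompress` and its `decompress_block` parameter are hypothetical and exist only for illustration.

```python
import struct


def hadoop_snappy_decompress(data, decompress_block):
    """Decompress Hadoop-framed snappy bytes held fully in memory.

    Assumed layout (not taken from python-snappy itself):
      [4-byte BE uncompressed block length]
        [4-byte BE compressed chunk length][raw snappy chunk] ...
    `decompress_block` is any callable mapping one raw snappy chunk to bytes.
    """
    out = bytearray()
    pos = 0
    while pos < len(data):
        # Total uncompressed size expected from the chunks of this block
        (remaining,) = struct.unpack_from(">I", data, pos)
        pos += 4
        while remaining > 0:
            (chunk_size,) = struct.unpack_from(">I", data, pos)
            pos += 4
            chunk = data[pos:pos + chunk_size]
            if len(chunk) != chunk_size:
                raise ValueError("truncated chunk")
            pos += chunk_size
            piece = decompress_block(chunk)
            remaining -= len(piece)
            if remaining < 0:
                raise ValueError("chunk larger than declared block length")
            out += piece
    return bytes(out)
```

With python-snappy installed, a call along the lines of `hadoop_snappy_decompress(open(csv_file, 'rb').read(), snappy.uncompress)` should work as long as the file follows the layout assumed above. This reads the whole file into memory, so it only suits files that fit comfortably in RAM.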