
Gzipped CSV files cannot always be read anymore.

Open ghuls opened this issue 3 years ago • 13 comments

What language are you using?

Python.

Have you tried latest version of polars?

  • [yes]

What version of polars are you using?

0.13.51 and latest git version.

Latest working polars version is 0.13.34. 0.13.35 and later are broken.

The commit for 0.13.35 updated libz-sys from 1.1.5 to 1.1.6 and broke reading gzipped CSV files in some cases: https://github.com/pola-rs/polars/commit/ed931910ff18867879ec0f5343a373c3a976b991

I compiled the latest git version of polars with:

```diff
diff --git a/py-polars/Cargo.lock b/py-polars/Cargo.lock
index 939e751c81..f04404f8b3 100644
--- a/py-polars/Cargo.lock
+++ b/py-polars/Cargo.lock
@@ -453,11 +453,13 @@ checksum = "7360491ce676a36bf9bb3c56c1aa791658183a54d2744120f27285738d90465a"
 
 [[package]]
 name = "flate2"
-version = "1.0.24"
+version = "1.0.22"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "f82b0f4c27ad9f8bfd1f3208d882da2b09c301bc1c828fd3a00d0216d2fbbff6"
+checksum = "1e6988e897c1c9c485f43b47a529cef42fde0547f9d8d41a7062518f1d8fc53f"
 dependencies = [
+ "cfg-if",
  "crc32fast",
+ "libc",
  "libz-sys",
  "miniz_oxide",
 ]
@@ -842,9 +844,9 @@ dependencies = [
 
 [[package]]
 name = "libz-sys"
-version = "1.1.8"
+version = "1.1.5"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "9702761c3935f8cc2f101793272e202c72b99da8f4224a19ddcf1279a6450bbf"
+checksum = "6f35facd4a5673cb5a48822be2be1d4236c1c99cb4113cab7061ac720d5bf859"
 dependencies = [
  "cc",
  "cmake",
@@ -936,11 +938,12 @@ dependencies = [
 
 [[package]]
 name = "miniz_oxide"
-version = "0.5.3"
+version = "0.4.4"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "6f5c75688da582b8ffc1f1799e9db273f32133c49e048f614d22ec3256773ccc"
+checksum = "a92518e98c078586bc6c934028adcca4c92a53d6a958196de835170a01d84e4b"
 dependencies = [
  "adler",
+ "autocfg",
 ]
 
 [[package]]
```

With this patch, the gzipped file that previously failed can be read again.

What operating system are you using polars on?

CentOS 7.

What language version are you using?

Python 3.10.

Describe your bug.

Gzipped CSV files (in this case containing multiple gzip streams) cannot always be read. With polars 0.13.34, or with polars patched to use libz-sys 1.1.5, reading works.

What are the steps to reproduce the behavior?

```python
In [1]: import polars as pl

In [2]: df = pl.read_csv('atac_fragments.head40000000.tsv.gz', skip_rows=52, has_headers=False,  sep="\t", use_pyarrow=False)
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
<ipython-input-2-0d225fac1a3f> in <module>
----> 1 df = pl.read_csv('atac_fragments.head40000000.tsv.gz', skip_rows=52, has_headers=False,  sep="\t", use_pyarrow=False)

~/software/polars/py-polars/polars/io.py in read_csv(file, has_header, columns, new_columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_count_name, row_count_offset, sample_size, **kwargs)
    420 
    421     with _prepare_file_arg(file, **storage_options) as data:
--> 422         df = DataFrame._read_csv(
    423             file=data,
    424             has_header=has_header,

~/software/polars/py-polars/polars/internals/frame.py in _read_csv(cls, file, has_header, columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, sample_size)
    584         projection, columns = handle_projection_columns(columns)
    585 
--> 586         self._df = PyDataFrame.read_csv(
    587             file,
    588             infer_schema_length,

ComputeError: invalid utf8 data in csv
```
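As a workaround on the affected versions, decompressing outside of polars and handing `read_csv` the already-decompressed bytes works. A minimal sketch, assuming the same file as above and that the argument names match your polars version (Python's `gzip` module concatenates multiple gzip members transparently):

```python
import gzip
import io

import polars as pl

# Decompress the multi-stream gzip file outside of polars.
with gzip.open("atac_fragments.head40000000.tsv.gz", "rb") as fh:
    data = fh.read()

# Hand polars plain, already-decompressed bytes.
df = pl.read_csv(
    io.BytesIO(data),
    skip_rows=52,
    has_header=False,
    sep="\t",
)
```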

ghuls avatar Jul 04 '22 21:07 ghuls

It seems like it is still not fixed in all cases in 0.14.11.

Now with a different file.

```python
File ~/software/anaconda3/envs/pycistopic/lib/python3.10/site-packages/polars/internals/dataframe/frame.py:608, in DataFrame._read_csv(cls, file, has_header, columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char)
    601         raise ValueError(
    602             "cannot use glob patterns and integer based projection as `columns`"
    603             " argument; Use columns: List[str]"
    604         )
    606 projection, columns = handle_projection_columns(columns)
--> 608 self._df = PyDataFrame.read_csv(
    609     file,
    610     infer_schema_length,
    611     batch_size,
    612     has_header,
    613     ignore_errors,
    614     n_rows,
    615     skip_rows,
    616     projection,
    617     sep,
    618     rechunk,
    619     columns,
    620     encoding,
    621     n_threads,
    622     path,
    623     dtype_list,
    624     dtype_slice,
    625     low_memory,
    626     comment_char,
    627     quote_char,
    628     processed_null_values,
    629     parse_dates,
    630     skip_rows_after_header,
    631     _prepare_row_count_args(row_count_name, row_count_offset),
    632     sample_size=sample_size,
    633     eol_char=eol_char,
    634 )
    635 return self

ComputeError: Could not parse `�d1-2W"��Ø7~��������}f|��˰,�xE��[���ɥz���{/��v�Ǝ9��N^c�-�W6�);���5g�|-
                                                                                                     zoc�1�a�%�ls]ۈ6i;j�c��/�u���Y�k\���Ow��'9N��} as dtype Int32 at column 2.
The current offset in the file is 316 bytes.

Consider specifying the correct dtype, increasing
the number of records used to infer the schema,
running the parser with `ignore_parser_errors=true`
or  adding `�d1-2W"��Ø7~��������}f|��˰,�xE��[���ɥz���{/��v�Ǝ9��N^c�-�W6�);���5g�|-
                                                                                  zoc�1�a�%�ls]ۈ6i;j�c��/�u���Y�k\���Ow��'9N��} to the `null_values` list.

```

ghuls avatar Sep 16 '22 09:09 ghuls

So we must go back to the other libz impl?

ritchie46 avatar Sep 16 '22 10:09 ritchie46

I am not sure.

I can read the file if I pass `n_rows` with a big value.

So the `n_rows` branch seems to work all the time: https://github.com/pola-rs/polars/blob/master/polars/polars-io/src/csv/utils.rs#L487

Not sure why the `None` branch doesn't work with zlib-ng, but works with other zlib implementations.
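For reference, the call that works looks roughly like this (a sketch; `n_rows` just needs to be at least the real number of rows):

```python
import polars as pl

# Passing an n_rows upper bound forces the n_rows branch linked above,
# which decompresses the whole gzipped file correctly here.
df = pl.read_csv(
    "atac_fragments.head40000000.tsv.gz",
    skip_rows=52,
    has_header=False,
    sep="\t",
    n_rows=1_000_000_000,  # any value >= the actual row count
)
```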

ghuls avatar Sep 16 '22 10:09 ghuls

Upstream: https://github.com/rust-lang/libz-sys/issues/104

ghuls avatar Jan 14 '23 14:01 ghuls

Same issue.

```
polars/internals/io.py", line 107, in _prepare_file_arg
    return BytesIO(file.read_bytes().decode(encoding_str).encode("utf8"))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'cp932' codec can't decode byte 0x8b in position 1: illegal multibyte sequence
```

stalkerg avatar Jan 31 '23 15:01 stalkerg

> Same issue.
>
> ```
> polars/internals/io.py", line 107, in _prepare_file_arg
>     return BytesIO(file.read_bytes().decode(encoding_str).encode("utf8"))
>                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> UnicodeDecodeError: 'cp932' codec can't decode byte 0x8b in position 1: illegal multibyte sequence
> ```

Are you sure your compressed file is encoded as cp932?

```bash
gzip -cd file > file_test
# What is the output of file:
file file_test
```

ghuls avatar Jan 31 '23 16:01 ghuls

@ghuls yes, it opens fine if uncompressed, or if I use `with gzip.open(path) as fp:`.
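Roughly like this (a sketch; the path is a placeholder for my actual cp932-encoded file):

```python
import gzip
import io

import polars as pl

# Decompress first, then decode the cp932 text ourselves and pass
# plain UTF-8 bytes to polars.
with gzip.open("data.csv.gz", "rb") as fp:  # placeholder path
    text = fp.read().decode("cp932")

df = pl.read_csv(io.BytesIO(text.encode("utf8")))
```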

stalkerg avatar Feb 01 '23 00:02 stalkerg

@stalkerg Could you share the file?

ghuls avatar Feb 01 '23 09:02 ghuls

@ghuls I can't, but I can try to make a sample.

stalkerg avatar Feb 01 '23 13:02 stalkerg

I have an example file that does not parse with `decompress` or `decompress-fast`, but parses just fine if gunzipped first. I have tried compressing with native (macOS) gzip and with flate2, and I have tried parsing with both `decompress` and `decompress-fast`. With `decompress-fast`, I get the error `invalid utf8 data in csv`. With `decompress`, I get an error that appears to indicate that a line is malformed or a newline character is missing.

Error: ComputeError(Owned("Could not parse `402023-03-12T07:42:04.346613+00:00` as dtype Int64 at column 8.\nThe current offset in the file is 3785687038 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nenabling the `ignore_errors` flag, or adding\n`402023-03-12T07:42:04.346613+00:00` to the `null_values` list."))

The file is about 786M compressed with default gzip and 5.8G uncompressed. The error with `decompress` occurs near the 3.5G offset. So far in my experience, other, smaller files (de)compressed with the same mechanism have been parsed successfully.

I am using the Rust interface directly with polars 0.27.2; the error is hit with `CsvReader::from_path(&path)?.finish()?`.

trueb2 avatar Mar 13 '23 17:03 trueb2

It seems to be a schema mismatch. Did you try ignoring errors or setting all dtypes to Utf8?
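In the Python API that would look roughly like the sketch below; `data.csv.gz` is a placeholder, and the Rust reader exposes equivalent options:

```python
import polars as pl

# Option 1: skip values that fail to parse instead of raising.
df = pl.read_csv("data.csv.gz", ignore_errors=True)

# Option 2: disable schema inference so every column is read as Utf8
# (infer_schema_length=0 reads all columns as strings).
df_utf8 = pl.read_csv("data.csv.gz", infer_schema_length=0)
```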

ritchie46 avatar Mar 13 '23 18:03 ritchie46

I had discounted the schema as the actual problem, but the suggestion is spot on. At some point during the gunzipping and gzipping, the file became corrupted. I get a similar loading error with pandas, and no error with either polars or pandas after regenerating the file and repeating the process, using just the `decompress` feature.

When the file was gzipped from the CLI, i.e. `gzip data.csv`, the data was loaded successfully by both pandas and polars. When the file was compressed with the polars CSV writer, the CSV was corrupted.

This would be a separate issue.

Error: ComputeError(Owned("Could not parse `-` as dtype Int64 at column 8.\nThe current offset in the file is 4314125977 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nenabling the `ignore_errors` flag, or adding\n`-` to the `null_values` list."))
log::info!("Writing gzipped CSV file: {:?}", gz_path);
let gz_file = std::fs::File::create(&gz_path)?;
let gz = flate2::write::GzEncoder::new(gz_file, flate2::Compression::default());
let mut csv_writer = CsvWriter::new(gz);
let mut clone_df = df.clone();
csv_writer.finish(&mut clone_df)?;

Skimming through the data, I see invalid CSV like:

```
,,,,,,,712
,,,,,,,714
,,,,,,,,,,,399
,,,,,,,388
2023-03-10T14:40:20.145045+00:00,47,87,1,0,-746,-4176,624
,,,,,,,399
```

trueb2 avatar Mar 14 '23 17:03 trueb2

Something seems to be wrong with the buffers here.

stalkerg avatar Mar 28 '23 04:03 stalkerg

I am not sure if the issue is fixed, but without a reproducible example, this issue cannot be addressed. I am closing this. If a reproducible example is found, please open a new issue for this.

stinodego avatar Jan 16 '24 13:01 stinodego