Gzipped CSV files cannot always be read anymore.
What language are you using?
Python.
Have you tried latest version of polars?
Yes.
What version of polars are you using?
0.13.51 and latest git version.
Latest working polars version is 0.13.34. 0.13.35 and later are broken.
The commit for 0.13.35 updated libz-sys from 1.1.5 to 1.1.6 and broke reading gzipped CSV files in some cases: https://github.com/pola-rs/polars/commit/ed931910ff18867879ec0f5343a373c3a976b991
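For context, the files involved here contain multiple concatenated gzip members, which the gzip format explicitly allows. A minimal sketch of such a file, using Python's stdlib as a stand-in for the real data:

```python
import gzip

# A multi-member gzip file is just two (or more) complete gzip streams
# concatenated back to back; tools like bgzip and chunked writers produce
# these routinely.
data = gzip.compress(b"chr1\t10\t20\n") + gzip.compress(b"chr1\t30\t40\n")

# A conforming reader must decompress every member, not just the first:
assert gzip.decompress(data) == b"chr1\t10\t20\nchr1\t30\t40\n"
```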
I compiled the latest git version of polars with:
```
diff --git a/py-polars/Cargo.lock b/py-polars/Cargo.lock
index 939e751c81..f04404f8b3 100644
--- a/py-polars/Cargo.lock
+++ b/py-polars/Cargo.lock
@@ -453,11 +453,13 @@ checksum = "7360491ce676a36bf9bb3c56c1aa791658183a54d2744120f27285738d90465a"
[[package]]
name = "flate2"
-version = "1.0.24"
+version = "1.0.22"
source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "f82b0f4c27ad9f8bfd1f3208d882da2b09c301bc1c828fd3a00d0216d2fbbff6"
+checksum = "1e6988e897c1c9c485f43b47a529cef42fde0547f9d8d41a7062518f1d8fc53f"
dependencies = [
+ "cfg-if",
"crc32fast",
+ "libc",
"libz-sys",
"miniz_oxide",
]
@@ -842,9 +844,9 @@ dependencies = [
[[package]]
name = "libz-sys"
-version = "1.1.8"
+version = "1.1.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "9702761c3935f8cc2f101793272e202c72b99da8f4224a19ddcf1279a6450bbf"
+checksum = "6f35facd4a5673cb5a48822be2be1d4236c1c99cb4113cab7061ac720d5bf859"
dependencies = [
"cc",
"cmake",
@@ -936,11 +938,12 @@ dependencies = [
[[package]]
name = "miniz_oxide"
-version = "0.5.3"
+version = "0.4.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "6f5c75688da582b8ffc1f1799e9db273f32133c49e048f614d22ec3256773ccc"
+checksum = "a92518e98c078586bc6c934028adcca4c92a53d6a958196de835170a01d84e4b"
dependencies = [
"adler",
+ "autocfg",
]
[[package]]
```
And the gzipped file that didn't work can be read again.
What operating system are you using polars on?
CentOS 7.
What language version are you using?
Python 3.10.
Describe your bug.
Gzipped CSV files (with multiple gzip streams in this case) cannot always be read. With polars 0.13.34, or with polars patched to use libz-sys 1.1.5, it works.
What are the steps to reproduce the behavior?
In [1]: import polars as pl
In [2]: df = pl.read_csv('atac_fragments.head40000000.tsv.gz', skip_rows=52, has_headers=False, sep="\t", use_pyarrow=False)
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
<ipython-input-2-0d225fac1a3f> in <module>
----> 1 df = pl.read_csv('atac_fragments.head40000000.tsv.gz', skip_rows=52, has_headers=False, sep="\t", use_pyarrow=False)
~/software/polars/py-polars/polars/io.py in read_csv(file, has_header, columns, new_columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_count_name, row_count_offset, sample_size, **kwargs)
420
421 with _prepare_file_arg(file, **storage_options) as data:
--> 422 df = DataFrame._read_csv(
423 file=data,
424 has_header=has_header,
~/software/polars/py-polars/polars/internals/frame.py in _read_csv(cls, file, has_header, columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, sample_size)
584 projection, columns = handle_projection_columns(columns)
585
--> 586 self._df = PyDataFrame.read_csv(
587 file,
588 infer_schema_length,
ComputeError: invalid utf8 data in csv
It seems like it is still not fixed in all cases in 0.14.11.
Now with a different file.
File ~/software/anaconda3/envs/pycistopic/lib/python3.10/site-packages/polars/internals/dataframe/frame.py:608, in DataFrame._read_csv(cls, file, has_header, columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char)
601 raise ValueError(
602 "cannot use glob patterns and integer based projection as `columns`"
603 " argument; Use columns: List[str]"
604 )
606 projection, columns = handle_projection_columns(columns)
--> 608 self._df = PyDataFrame.read_csv(
609 file,
610 infer_schema_length,
611 batch_size,
612 has_header,
613 ignore_errors,
614 n_rows,
615 skip_rows,
616 projection,
617 sep,
618 rechunk,
619 columns,
620 encoding,
621 n_threads,
622 path,
623 dtype_list,
624 dtype_slice,
625 low_memory,
626 comment_char,
627 quote_char,
628 processed_null_values,
629 parse_dates,
630 skip_rows_after_header,
631 _prepare_row_count_args(row_count_name, row_count_offset),
632 sample_size=sample_size,
633 eol_char=eol_char,
634 )
635 return self
ComputeError: Could not parse `�d1-2W"��Ø7~��������}f|��˰,�xE��[���ɥz���{/��v�Ǝ9��N^c�-�W6�);���5g�|-
zoc�1�a�%�ls]ۈ6i;j�c��/�u���Y�k\���Ow��'9N��} as dtype Int32 at column 2.
The current offset in the file is 316 bytes.
Consider specifying the correct dtype, increasing
the number of records used to infer the schema,
running the parser with `ignore_parser_errors=true`
or adding `�d1-2W"��Ø7~��������}f|��˰,�xE��[���ɥz���{/��v�Ǝ9��N^c�-�W6�);���5g�|-
zoc�1�a�%�ls]ۈ6i;j�c��/�u���Y�k\���Ow��'9N��} to the `null_values` list.
So we must go back to the other libz impl?
I am not sure.
I can read the file if I pass a big value for `n_rows`.
So the `n_rows` branch seems to work all the time: https://github.com/pola-rs/polars/blob/master/polars/polars-io/src/csv/utils.rs#L487
Not sure why the `None` branch doesn't work with zlib-ng, but works with other zlib implementations.
Upstream: https://github.com/rust-lang/libz-sys/issues/104
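One plausible explanation (illustrative only, not polars' actual code): a single-shot inflate stops at the first gzip member boundary, while a correct reader must loop over members. Python's `zlib` makes the difference visible:

```python
import gzip
import zlib

data = gzip.compress(b"x\n") + gzip.compress(b"y\n")

# A single decompress stops at the end of the first gzip member;
# the second member is left untouched in unused_data.
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
assert d.decompress(data) == b"x\n"
assert d.unused_data != b""

# A conforming reader loops, starting a fresh stream on the leftovers.
out, rest = b"", data
while rest:
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    out += d.decompress(rest)
    rest = d.unused_data
assert out == b"x\ny\n"
```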
Same issue.
polars/internals/io.py", line 107, in _prepare_file_arg
return BytesIO(file.read_bytes().decode(encoding_str).encode("utf8"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'cp932' codec can't decode byte 0x8b in position 1: illegal multibyte sequence
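Worth noting: byte `0x8b` at position 1 is the second byte of the gzip magic number (`1f 8b`), so this traceback suggests the raw compressed bytes are being decoded as text before any decompression happens. A small demonstration (the cp932 payload is made up):

```python
import gzip

# Any gzip stream starts with the magic bytes 1f 8b.
data = gzip.compress("テスト,1\n".encode("cp932"))
assert data[:2] == b"\x1f\x8b"

# Decoding the *compressed* bytes as cp932 fails exactly at position 1,
# matching the traceback above.
try:
    data.decode("cp932")
    raise AssertionError("expected UnicodeDecodeError")
except UnicodeDecodeError as e:
    assert e.start == 1
```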
Are you sure your compressed file is encoded as cp932?
```
gzip -cd file > file_test
# What is the output of file:
file file_test
```
@ghuls yes, it opened fine if uncompressed, or if I use `with gzip.open(path) as fp:`.
@stalkerg Could you share the file?
@ghuls I can't, but I can try to make a sample.
I have an example file that does not parse with `decompress` or `decompress-fast`, but parses just fine if gunzip'ed first. I have tried compressing with native (macOS) gzip and with flate2, and I have tried parsing with both `decompress` and `decompress-fast`. With `decompress-fast`, I get the error `invalid utf8 data in csv`. With `decompress`, I get an error that appears to indicate that a line is malformed or a newline character is missing.
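For readers unfamiliar with them, `decompress` and `decompress-fast` are Cargo features of polars selecting the decompression backend (miniz-based vs zlib-ng-based). In a Rust project they would be toggled roughly like this (hypothetical `Cargo.toml` fragment; feature names taken from this thread):

```toml
[dependencies]
polars = { version = "0.27", features = ["csv-file", "decompress"] }
# or, for the zlib-ng backed path:
# polars = { version = "0.27", features = ["csv-file", "decompress-fast"] }
```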
Error: ComputeError(Owned("Could not parse `402023-03-12T07:42:04.346613+00:00` as dtype Int64 at column 8.\nThe current offset in the file is 3785687038 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nenabling the `ignore_errors` flag, or adding\n`402023-03-12T07:42:04.346613+00:00` to the `null_values` list."))
The file is about 786M compressed w/ default gzip and 5.8G uncompressed. The error on decompress is near 3.5G. So far in my experience, other smaller files (de)compressed with the same mechanism have been successfully parsed.
I am directly using the Rust interface with polars 0.27.2, hitting the error with:
`CsvReader::from_path(&path)?.finish()?`
It seems to be a schema mismatch. Did you try ignoring errors or setting all dtypes to utf8?
I had discounted the schema as the actual problem, but that is spot on. At some point in the gunzip'ing and gzip'ing, the file became corrupted: pandas gives a similar loading error on it. After regenerating the file and repeating the process, using just the `decompress` feature, neither polars nor pandas errors.
When the file was gzip'ed from the CLI, i.e. `gzip data.csv`, the data was loaded successfully by both pandas and polars. When the file was compressed with the polars CSV writer, the CSV was corrupted.
This would be a separate issue.
Error: ComputeError(Owned("Could not parse `-` as dtype Int64 at column 8.\nThe current offset in the file is 4314125977 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nenabling the `ignore_errors` flag, or adding\n`-` to the `null_values` list."))
```
log::info!("Writing gzipped CSV file: {:?}", gz_path);
let gz_file = std::fs::File::create(&gz_path)?;
let gz = flate2::write::GzEncoder::new(gz_file, flate2::Compression::default());
let mut csv_writer = CsvWriter::new(gz);
let mut clone_df = df.clone();
csv_writer.finish(&mut clone_df)?;
```
Skimming through the data, I see invalid CSV like:
```
,,,,,,,712
,,,,,,,714
,,,,,,,,,,,399
,,,,,,,388
2023-03-10T14:40:20.145045+00:00,47,87,1,0,-746,-4176,624
,,,,,,,399
```
Something looks wrong with the buffers here.
I am not sure if the issue is fixed, but without a reproducible example it cannot be addressed, so I am closing this. If a reproducible example is found, please open a new issue.