Reading avro file returns ComputeError: OutOfSpec in both Rust and Python
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
Here is the error I am getting:
thread 'main' panicked at 'file not read: ComputeError(ErrString("OutOfSpec"))', src/main.rs:29:50
stack backtrace:
   0: rust_begin_unwind
      at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
      at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::result::unwrap_failed
      at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/result.rs:1790:5
   3: core::result::Result<T,E>::expect
      at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/result.rs:1069:23
   4: westend_example::main
      at ./src/main.rs:29:19
   5: core::ops::function::FnOnce::call_once
      at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/ops/function.rs:250:5
Reproducible example
use polars::io::avro::AvroReader; // requires the `avro` feature
use polars::prelude::*;
use std::fs::File;

let fname = "westend_westend_20230411_westend_blocks.avro";
let file = File::open(fname).expect("file not found");
println!("file found! {:?}", file);
let df_avro = AvroReader::new(file).finish().expect("file not read");
println!("df_avro: {:?}", df_avro);
Expected behavior
The same file was read just fine using pandas. Attachment: westend_westend_20230411_westend_blocks.txt (rename the .txt file to .avro; for some reason GitHub wouldn't allow me to upload an .avro file).
Installed versions
Are you 100% certain the file is correct?
At least I can confirm that the attachment can be opened with fastavro. I have no domain knowledge of the Avro file spec, though.
This keeps happening, FYI ^^^. Still present on the latest version, 0.35.4.
Can confirm this is a valid issue. From my investigation it appears to be the fault of the code that writes the avro file.
Roundtrip behavior across versions:
- A file written by 0.35 crashes when read by either 0.33 or 0.35.
- A file written by 0.33 reads fine in 0.33.
- A file written by 0.33 reads fine in 0.35.
I have this error too: some Avro files can be read and some cannot, even though Spark reads all of them without issue. I don't see any pattern yet.
Not sure if this covers all cases, since there are no reproducers, but here's one reproducer:
import sys

import polars as pl

# Usage: python 9249.py <number of rows>
file_path = "test.avro"
df = pl.DataFrame({"a": list(range(int(sys.argv[1])))})
df.write_avro(file_path)
# This fails for large row counts:
pl.read_avro(file_path)
print("Success!")
With fewer than 262,144 items this works fine; at 262,144 and above it breaks. Demonstration:
$ python 9249.py 262143
Success!
$ python 9249.py 262144
Traceback (most recent call last):
File "/home/itamarst/devel/polars/py-polars/9249.py", line 14, in <module>
pl.read_avro(file_path)
File "/home/itamarst/devel/polars/py-polars/polars/io/avro.py", line 38, in read_avro
return pl.DataFrame._read_avro(source, n_rows=n_rows, columns=columns)
File "/home/itamarst/devel/polars/py-polars/polars/dataframe/frame.py", line 782, in _read_avro
self._df = PyDataFrame.read_avro(source, columns, projection, n_rows)
polars.exceptions.ComputeError: avro-error: OutOfSpec
I believe this has something to do with the block size logic: if the DataFrame has multiple chunks, each chunk is written as a corresponding Avro block, and with two blocks the error appears at a much smaller per-block size threshold.
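For context on where a block-size bug would bite: per the Avro container file spec, each data block is a row count (long), a byte size (long), the serialized rows, and the file's 16-byte sync marker, so a writer that miscomputes either long produces a file a strict reader must reject. Here is a minimal stdlib-only sketch of writing and re-parsing one block of a long-typed schema; it is an independent reimplementation for illustration, not Polars' or avro-schema's code, and it omits the file header/metadata entirely (`SYNC` is a placeholder, not a real random marker):

```python
import io


def encode_long(n: int) -> bytes:
    """Avro long: zigzag, then little-endian base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps signed to unsigned
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        out.append(b | 0x80 if z else b)  # high bit = continuation
        if not z:
            return bytes(out)


def decode_long(stream: io.BytesIO) -> int:
    """Inverse of encode_long, reading from a byte stream."""
    z = shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise ValueError("truncated varint")
        z |= (byte[0] & 0x7F) << shift
        shift += 7
        if not byte[0] & 0x80:
            return (z >> 1) ^ -(z & 1)


SYNC = b"\x00" * 16  # placeholder; real writers use 16 random bytes


def write_block(rows: list[int]) -> bytes:
    body = b"".join(encode_long(r) for r in rows)
    # row count, then byte size of the serialized rows, then rows, then sync:
    return encode_long(len(rows)) + encode_long(len(body)) + body + SYNC


def read_block(data: bytes) -> list[int]:
    s = io.BytesIO(data)
    count = decode_long(s)
    size = decode_long(s)
    body = s.read(size)
    if len(body) != size or s.read(16) != SYNC:
        # the kind of mismatch a strict reader reports as OutOfSpec
        raise ValueError("block size or sync marker mismatch")
    bs = io.BytesIO(body)
    return [decode_long(bs) for _ in range(count)]


rows = list(range(262_144))  # the failing row count from the reproducer
assert read_block(write_block(rows)) == rows
print("block round-trip OK")
```

This round-trips fine in a direct reimplementation, which is consistent with the bug living in how the writer computes the count/size fields (or splits chunks into blocks) rather than in the encoding itself.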
Not going to look into this any more.
Investigation didn't lead anywhere obvious, and I got deep enough that I'd have to dig into avro-schema to see if there are bugs there; e.g. it's possible the zigzag decoding isn't implemented quite right. Unfortunately that project seems pretty moribund... https://docs.rs/apache-avro/ seems much better maintained.
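For anyone picking this up: Avro longs are zigzag-encoded base-128 varints, and the block count/size fields are exactly such longs, so a decoder bug there would explain OutOfSpec. A tiny independent decoder (again, not avro-schema's code) to compare implementations against, with spot checks taken from the encoding table in the Avro spec:

```python
def decode_zigzag_long(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one Avro long (zigzag varint) starting at pos.

    Returns (value, next_pos). Raises on a truncated varint, one
    plausible way a reader ends up reporting OutOfSpec.
    """
    z = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1), pos


# Values from the Avro spec's zigzag table, plus the failing row count:
assert decode_zigzag_long(b"\x00") == (0, 1)
assert decode_zigzag_long(b"\x01") == (-1, 1)
assert decode_zigzag_long(b"\x02") == (1, 1)
assert decode_zigzag_long(b"\x80\x80\x20") == (262_144, 3)
```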
A few more experiments with released Polars rather than main got slightly different results, but basically the same issue: some lengths work, some fail to roundtrip.
@stinodego Given how easy it is to create Avro files that aren't roundtrippable (or, if this is a reading rather than a writing bug, how easy it is to find files that aren't readable), this should probably be prioritized highly, or Avro support should be dropped, because it's very easy to break.