Reading avro file returns ComputeError: OutOfSpec in both Rust and Python

Open rogerjbos opened this issue 2 years ago • 5 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

Here is the error I am getting:

thread 'main' panicked at 'file not read: ComputeError(ErrString("OutOfSpec"))', src/main.rs:29:50
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::result::unwrap_failed
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/result.rs:1790:5
   3: core::result::Result<T,E>::expect
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/result.rs:1069:23
   4: westend_example::main
             at ./src/main.rs:29:19
   5: core::ops::function::FnOnce::call_once
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/ops/function.rs:250:5

Reproducible example

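use std::fs::File;

use polars_io::avro::AvroReader;
use polars_io::SerReader; // trait that provides `new` and `finish`

let fname = "westend_westend_20230411_westend_blocks.avro";
let file = File::open(fname).expect("file not found");
println!("file found! {:?}", file);

// Panics here with ComputeError(ErrString("OutOfSpec")).
let df_avro = AvroReader::new(file).finish().expect("file not read");
println!("df_avro: {:?}", df_avro);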

Expected behavior

The same file reads fine with pandas. Attachment: westend_westend_20230411_westend_blocks.txt (rename the .txt file to .avro; for some reason GitHub wouldn't allow me to upload an .avro file).

Installed versions

polars-io = { version = "0.30.0", features = ["avro"]}

rogerjbos • Jun 06 '23 02:06

Are you 100% certain the file is correct?

ritchie46 • Jun 06 '23 06:06

At least I can confirm that the attachment can be opened with fastavro. I have no domain knowledge of the Avro file spec, though.

cjackal • Aug 01 '23 13:08

FYI, this keeps happening (see above), including on the latest release, 0.35.4.

vertexclique • Dec 25 '23 14:12

Can confirm this is a valid issue. From my investigation it appears to be the fault of the code that writes the avro file.

An Avro file written by 0.35 crashed when read in both 0.33 and 0.35. An Avro file written by 0.33 read fine in both 0.33 and 0.35.

mangoleaf • Jan 02 '24 22:01

I have this error too. Some Avro files can be read and some cannot, even though Spark reads all of them without any issue. I don't see any pattern yet.

musa-karimli-m10 • Mar 17 '24 12:03

Not sure if this covers all cases, since there are no reproducers, but here's one reproducer:

import sys
import polars as pl

file_path = "test.avro"
df = pl.DataFrame({"a": list(range(int(sys.argv[1])))})

df.write_avro(file_path)

# This fails:
pl.read_avro(file_path)
print("Success!")

If the number of items is smaller than 262,144 (2^18), this works fine. Sizes of 262,144 and above break. Demonstration:

$ python 9249.py 262143
Success!
$ python 9249.py 262144
Traceback (most recent call last):
  File "/home/itamarst/devel/polars/py-polars/9249.py", line 14, in <module>
    pl.read_avro(file_path)
  File "/home/itamarst/devel/polars/py-polars/polars/io/avro.py", line 38, in read_avro
    return pl.DataFrame._read_avro(source, n_rows=n_rows, columns=columns)
  File "/home/itamarst/devel/polars/py-polars/polars/dataframe/frame.py", line 782, in _read_avro
    self._df = PyDataFrame.read_avro(source, columns, projection, n_rows)
polars.exceptions.ComputeError: avro-error: OutOfSpec

I believe this has something to do with the block size logic: if the DataFrame has multiple chunks, each chunk is written as a corresponding Avro block, and with two blocks you get this error at a much smaller block size threshold.

itamarst • Apr 04 '24 13:04
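
One way to probe the block-size hypothesis in the comment above is to list the blocks a written file actually contains. The following is a minimal sketch using only the Python standard library. It assumes the object container layout from the Avro specification (4-byte magic, metadata map, 16-byte sync marker, then blocks of object count, byte size, payload, sync marker). The names `read_long`, `read_header` and `scan_blocks` are hypothetical helpers, not part of Polars or any Avro library, and the sketch does not decode the objects themselves.

```python
import io

MAGIC = b"Obj\x01"  # magic bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long: base-128 varint, then zigzag decoding."""
    acc = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended inside a varint")
        byte = raw[0]
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro `bytes` / `string`)."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside a byte string")
    return data


def read_header(stream):
    """Return (metadata dict, sync marker) from the file header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a byte size follows, unused here
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync = stream.read(16)
    if len(sync) != 16:
        raise ValueError("truncated header")
    return meta, sync


def scan_blocks(stream):
    """List (object_count, byte_size, sync_marker_ok) for every block."""
    _, sync = read_header(stream)
    blocks = []
    while True:
        if not stream.read(1):  # clean end of file
            break
        stream.seek(-1, io.SEEK_CUR)
        count = read_long(stream)
        size = read_long(stream)
        stream.seek(size, io.SEEK_CUR)  # skip the serialized objects
        marker_ok = stream.read(16) == sync
        blocks.append((count, size, marker_ok))
        if not marker_ok:  # offsets are unreliable after a mismatch
            break
    return blocks
```

Calling `scan_blocks(open("test.avro", "rb"))` on the reproducer's output would show how many blocks were written and whether each declared byte size lands on a sync marker. A mismatch would point at the writer, while a clean scan would point at the reader.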

Not going to look into this any more.

Investigation didn't lead anywhere obvious, and I got deep enough that I'd have to dig into avro-schema to see if there are bugs there. E.g. it's possible the zigzag decoding isn't quite implemented right. Unfortunately avro-schema seems pretty moribund as a project... https://docs.rs/apache-avro/ seems a lot better maintained.

itamarst • Apr 04 '24 15:04
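
For reference, the zigzag encoding mentioned in the comment above is how the Avro specification serializes `int` and `long` values: the signed value is mapped to an unsigned one (0, -1, 1, -2 become 0, 1, 2, 3) and then written as a base-128 varint, least significant group first. The sketch below is a plain Python illustration of that scheme for 64-bit longs. The function names are hypothetical and it is not the avro-schema implementation, so it says nothing about where the actual bug lies.

```python
def zigzag_encode(n):
    """Map a signed 64-bit integer to an unsigned 64-bit integer."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def encode_long(n):
    """Serialize a signed 64-bit integer as an Avro long."""
    z = zigzag_encode(n)
    out = bytearray()
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)  # final byte has the high bit clear
    return bytes(out)


def decode_long(data):
    """Deserialize an Avro long; return (value, number of bytes consumed)."""
    acc = 0
    for i, byte in enumerate(data):
        acc |= (byte & 0x7F) << (7 * i)
        if byte < 0x80:
            return zigzag_decode(acc), i + 1
    raise ValueError("truncated varint")
```

A roundtrip check such as `decode_long(encode_long(n))[0] == n` over boundary values (for example 262143, 262144, and the 64-bit extremes) is a cheap way to compare another implementation against this reference.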

A few more experiments with released Polars rather than main got slightly different results, but basically the same issue: some lengths work, some fail to roundtrip.

@stinodego Given how easy it is to create Avro files that aren't roundtrippable (or if it's a reading rather than writing bug, how easy it is to find files that aren't readable), prioritization should probably be high... or Avro should be dropped. Because it's very easy to break.

itamarst • Apr 04 '24 15:04