parquet-wasm

Can't write an Arrow table if it contains list

Open timspro opened this issue 1 year ago • 5 comments

I'm expecting the following code to work but am getting an error "RuntimeError: unreachable" when running in Node.js v20.17.0, thrown by fromIPCStream().

import { tableFromArrays, tableToIPC } from "apache-arrow"
import { Table } from "parquet-wasm"

const table = tableFromArrays({
  column: [[1, 2], [3, 4]],
})
const ipc = tableToIPC(table, "stream")
Table.fromIPCStream(ipc)

I tried changing "stream" to "file", but that also failed, this time with the error "Io error: failed to fill whole buffer".

I was able to get other examples working locally that didn't have a list (for example, column: [1, 2] and column: [{a: 1}, {a: 2}]).

It does work when using typed arrays: column: [new Int32Array([1, 2]), new Int32Array([3, 4])]. So I do have a workaround. However, I originally wanted to write a list of structs with Int32 values and will now have to use a struct of typed arrays instead. Perhaps that is what is intended.

timspro avatar Sep 18 '24 15:09 timspro

If you compile with the --debug flag turned on, you can see the actual Rust error instead of just RuntimeError: unreachable.

With the test in https://github.com/kylebarron/parquet-wasm/pull/607, the error is:

stderr | tests/js/index.test.ts > should read IPC stream correctly
panicked at /Users/kyle/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-ipc-53.0.0/src/convert.rs:98:30:
called `Option::unwrap()` on a `None` value

So the Rust code is panicking on this line: https://github.com/apache/arrow-rs/blob/5414f1d7c0683c64d69cf721a83c17d677c78a71/arrow-ipc/src/convert.rs#L98

If we load this data in pyarrow, we see:

In [1]: import pyarrow as pa

In [3]: pa.ipc.open_stream("data.arrows").read_all()
Out[3]:
pyarrow.Table
column: list<: double>
  child 0, : double
----
column: [[[1,2],[3,4]]]

So the list's inner field does not have a name set. I'm not sure if that's allowed by the spec (it's rare at least). Either the JS IPC writer or the Rust IPC reader is incorrect.
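
The same check can be sketched on the JavaScript side before the table is serialized. This is a minimal sketch, not part of either library: findUnnamedFields is a hypothetical helper, and it assumes only that each field exposes name and type.children (null or missing for non-nested types), which is how I understand apache-arrow's Field and DataType objects to be shaped.

```javascript
// Hypothetical helper: walk a list of Arrow-like fields and report the
// path of every field whose name is the empty string.
// Assumption: each field looks like { name, type: { children } }, where
// children is an array of fields for nested types and null/undefined otherwise.
function findUnnamedFields(fields, parentPath = "") {
  const hits = []
  for (const field of fields) {
    // Use a visible placeholder so an empty name still shows up in the path.
    const label = field.name === "" ? "<unnamed>" : field.name
    const path = parentPath ? `${parentPath}.${label}` : label
    if (field.name === "") hits.push(path)
    // Recurse into nested types (list items, struct members).
    const children = (field.type && field.type.children) || []
    hits.push(...findUnnamedFields(children, path))
  }
  return hits
}
```

Assuming the table comes from apache-arrow, calling findUnnamedFields(table.schema.fields) before tableToIPC() would list any inner fields that were left without a name.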

kylebarron avatar Sep 18 '24 15:09 kylebarron

I checked with @jorisvandenbossche and saw that the IPC spec doesn't require a name to be set, so this is an issue on the Rust side. (Though there should be a default name set)

kylebarron avatar Sep 18 '24 15:09 kylebarron

Created https://github.com/apache/arrow-rs/issues/6415. Otherwise, you can work around this by manually setting a field name for any inner lists.

kylebarron avatar Sep 18 '24 16:09 kylebarron

Thanks for the commentary. The type inference done by tableFromArrays() is passing the empty name: https://github.com/apache/arrow/blob/main/js/src/factories.ts#L153.

I was then able to get around the issue by passing in the List type directly:

import { Field, Int32, List, tableFromArrays, tableToIPC, vectorFromArray } from "apache-arrow"
import { Table } from "parquet-wasm"

const table = tableFromArrays({
  column: vectorFromArray(
    [[1, 2], [3, 4]],
    new List(new Field("_", new Int32())) // fails if "" passed instead
  ),
})
const ipc = tableToIPC(table, "stream")
Table.fromIPCStream(ipc)

This is a fine workaround for me.

timspro avatar Sep 18 '24 16:09 timspro

This was fixed in https://github.com/apache/arrow-rs/pull/8557, so the next parquet-wasm release that includes the next release of parquet will solve this.

kylebarron avatar Oct 06 '25 11:10 kylebarron