nodejs-polars
[NodeJS]: readIPC from buffer fails with 'Arrow file does not contain correct header', while it works in ArrowJS
Using Node.JS
What version of polars are you using?
"nodejs-polars": "^0.2.0"
What operating system are you using polars on?
MacOS Big Sur 11.1
Describe your bug.
Reading a buffer from an .ipc (Arrow stream) file with readIPC fails with "Error: Arrow file does not contain correct header". The file itself is not corrupt, since it loads fine with apache-arrow's Table.from method.
What are the steps to reproduce the behavior?
See the code example below. I'll attach both the .arrow file (works) and the .ipc file (doesn't work).
const pl = require('nodejs-polars');
const { Table } = require('apache-arrow')
const { readFileSync } = require('fs');
const fromArrow = readFileSync('hits.arrow');
const fromIPC = readFileSync('hits.ipc');
// Read Arrow file by Arrow.js -> works
const df = Table.from([fromArrow])
console.log("df", df.count()) // 10
// Read Arrow file by polars -> works
const dfPolars = pl.readIPC(fromArrow)
console.log("dfPolars", dfPolars) // prints nice table with 10 entries
// Read IPC (ArrowStream) file by Arrow.js -> works
const dfIpc = Table.from([fromIPC])
console.log("dfIpc", dfIpc.count()) // 10
// Read IPC (ArrowStream) by polars -> Fails
const dfIpcPolars = pl.readIPC(fromIPC)
console.log("dfIpcPolars", dfIpcPolars) // Error: Arrow file does not contain correct header
- .arrow file that works correctly
- .ipc file that doesn't work
The IPC readers are implemented upstream. Could you make this issue here? https://github.com/jorgecarleitao/arrow2
I am a bit surprised about pl.readIPC(fromArrow) and pl.readIPC(fromIPC): shouldn't these be two different signatures? One thing is to read a stream (.ipc), the other is a file (.arrow). I think that we are just missing a readIPCStream in Polars' API that can read arrow streams (as opposed to arrow files).
Ah, no, Polars doesn't make that distinction. So the .ipc is the stream, and the .arrow is the feather file, i.e. the IPC data plus additional headers?
Then we must add this.
Hi!
I'm keen to get this into Polars: Snowflake uses this as its response format, and it would be awesome to read data straight from SF into Polars.
Here is a quick primer about the streaming files from Arrow: https://arrow.apache.org/docs/python/ipc.html And the guide here from arrow2 about reading the stream: https://jorgecarleitao.github.io/arrow2/io/ipc_stream_read.html
IMHO, supporting files initially is fine; other streaming support can come later.
I've started looking into this, and the major blocker I can see is projections.
In arrow2, projections are not supported here: https://github.com/jorgecarleitao/arrow2/blob/main/src/io/ipc/read/stream.rs#L185
So we will need to build the projection from the chunks.
Thoughts?
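As a rough illustration of what "building the projection from the chunks" could mean: if the stream reader only yields complete chunks, the column selection can be applied to each chunk after it has been read. The sketch below uses plain JavaScript arrays rather than arrow2 types. The chunk shape ({ columns: [...] }) and the names projectChunk and projectChunks are hypothetical and are not part of arrow2 or Polars.

```javascript
// Hypothetical chunk shape: { columns: [col0, col1, ...] }, one array per column.
// This is NOT an arrow2 or Polars type, only a stand-in for a decoded record batch.

// Keep only the columns whose indices are listed in `projection`, in that order.
function projectChunk(chunk, projection) {
  return {
    columns: projection.map((index) => {
      if (index < 0 || index >= chunk.columns.length) {
        throw new RangeError(`column index ${index} out of bounds`);
      }
      return chunk.columns[index];
    }),
  };
}

// Apply the same projection lazily to every chunk coming out of a stream.
function* projectChunks(chunks, projection) {
  for (const chunk of chunks) {
    yield projectChunk(chunk, projection);
  }
}

// Two fully decoded chunks with three columns each.
const chunks = [
  { columns: [[1, 2], ['a', 'b'], [true, false]] },
  { columns: [[3], ['c'], [true]] },
];

// Keep columns 2 and 0, in that order.
const projected = [...projectChunks(chunks, [2, 0])];
console.log(projected[0].columns);
```

Note that a projection applied this way still decodes every column of every chunk first; it only drops the unwanted columns afterwards.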
Transferring this to the NodeJS repo, as I have no way to reproduce this using Python/Rust. Not sure if this is still relevant.
@stinodego
Python Polars 0.19.2 throws the same error on this file: exceptions.ArrowErrorException: OutOfSpec("InvalidHeader")
df = pl.read_ipc('https://paste.c-net.org/ViperMoronic')
It seems that the .ipc file needs to start and end with ARROW1 for Polars to read it.
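The ARROW1 observation can be checked on a buffer before handing it to a reader. The following is a minimal sketch using only Node's built-in Buffer; the helper name isArrowFileFormat is made up for illustration and is not part of nodejs-polars or apache-arrow. It only tests the magic bytes at both ends and does not validate the rest of the file.

```javascript
// The Arrow IPC *file* format begins and ends with the ASCII magic "ARROW1".
// The Arrow IPC *stream* format carries no such magic.
const MAGIC = Buffer.from('ARROW1', 'ascii');

// Hypothetical helper: true if `buf` has the file-format magic at both ends.
function isArrowFileFormat(buf) {
  if (buf.length < MAGIC.length * 2) {
    return false; // too short to hold both markers
  }
  const head = buf.subarray(0, MAGIC.length);
  const tail = buf.subarray(buf.length - MAGIC.length);
  return head.equals(MAGIC) && tail.equals(MAGIC);
}

// Synthetic buffers standing in for real files (not valid Arrow data).
const fileLike = Buffer.concat([MAGIC, Buffer.alloc(10), MAGIC]);
const streamLike = Buffer.concat([Buffer.from([0xff, 0xff, 0xff, 0xff]), Buffer.alloc(16)]);

console.log(isArrowFileFormat(fileLike));   // true
console.log(isArrowFileFormat(streamLike)); // false
```

With the files from this issue, one would expect isArrowFileFormat(readFileSync('hits.arrow')) to be true and isArrowFileFormat(readFileSync('hits.ipc')) to be false, which would match the error above.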
@0xgeert Please try: pl.read_ipc_stream using py-polars as described here. It works fine for me. Thx