[FFI] - RangeError: byte length of BigInt64Array should be a multiple of 8
I tried to load a new Parquet table using the same method I always use, but it failed with the following error:
(venv) [crow@crow-pc ode]$ node misc/parquetFailing.js
file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:300
? new dataType.ArrayType(copyBuffer(dataView.buffer, dataPtr, length * byteWidth))
^
RangeError: byte length of BigInt64Array should be a multiple of 8
at new BigInt64Array (<anonymous>)
at parseDataContent (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:300:15)
at parseData (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:175:16)
at parseData (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:139:23)
at parseTable (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:935:28)
at file:///home/crow/repos/ode/misc/parquetFailing.js:25:19
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Node.js v18.20.4
This error is thrown when loading the table via FFI, but it does not happen when I load the same table through the original IPC-based path.
Since I already have a workaround, this bug isn't a high priority for me, but I thought you might want to know about it.
Here is code that reproduces both cases:
import * as arrow from 'apache-arrow'
import { parseTable } from 'arrow-js-ffi'
import { wasmMemory, readParquet } from 'parquet-wasm'

const url =
  'https://huggingface.co/api/datasets/tiiuae/falcon-refinedweb/parquet/default/train/320.parquet'

// This one will succeed: decode via the Arrow IPC stream
;(async () => {
  const resp = await fetch(url)
  const buffer = new Uint8Array(await resp.arrayBuffer())
  const arrowWasmTable = readParquet(buffer)
  // intoIPCStream() consumes the wasm table, so there is nothing left to free
  const table = arrow.tableFromIPC(arrowWasmTable.intoIPCStream())
  console.log('successfully loaded table via parquet-wasm')
})()

// This one will fail: parse the table through the C Data Interface (FFI)
;(async () => {
  const resp = await fetch(url)
  const buffer = new Uint8Array(await resp.arrayBuffer())
  const ffiTable = readParquet(buffer).intoFFI()
  // parseTable throws the RangeError shown above
  const table = parseTable(
    wasmMemory().buffer,
    ffiTable.arrayAddrs(),
    ffiTable.schemaAddr()
  )
  ffiTable.free()
  console.log('successfully loaded table via FFI')
})()
Versions:
- parquet-wasm v0.6.1
- arrow-js-ffi v0.4.2
- node v18.20.4
@Vectorrent I'm unable to reproduce this:
- Node v20.9.0
- arrow-js-ffi latest main (which is the same effectively as latest released)
- parquet-wasm 0.6.1
With this test case:
// issue129.test.ts
import { readFileSync } from "fs";
import { readParquet, wasmMemory } from "parquet-wasm";
import { describe, it, expect } from "vitest";
import * as arrow from "apache-arrow";
import * as wasm from "rust-arrow-ffi";
import { parseTable } from "../src";

wasm.setPanicHook();

describe("issue 129", (t) => {
  const buffer = readFileSync("0320.parquet");
  const ffiTable = readParquet(buffer).intoFFI();
  const memory = wasmMemory();
  const table = parseTable(
    memory.buffer,
    ffiTable.arrayAddrs(),
    ffiTable.schemaAddr()
  );
  ffiTable.free();

  console.log(table.schema);

  it("Should pass", () => {
    expect(true).toBeTruthy();
  });
});
The test runs without error and logs this schema:

Schema {
  fields: [
    Field {
      name: 'content',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'url',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'timestamp',
      type: [Timestamp_ [Timestamp]],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'dump',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'segment',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'image_urls',
      type: [List],
      nullable: true,
      metadata: Map(0) {}
    }
  ],
  metadata: Map(1) {
    'huggingface' => '{"info": {"features": {"content": {"dtype": "string", "_type": "Value"}, "url": {"dtype": "string", "_type": "Value"}, "timestamp": {"dtype": "timestamp[s]", "_type": "Value"}, "dump": {"dtype": "string", "_type": "Value"}, "segment": {"dtype": "string", "_type": "Value"}, "image_urls": {"feature": {"feature": {"dtype": "string", "_type": "Value"}, "_type": "Sequence"}, "_type": "Sequence"}}}}'
  },
  dictionaries: Map(0) {},
  metadataVersion: 4
}
Strange. I tried your code (i.e. loading from disk), and that fails too. I upgraded to Node v22 and apache-arrow v17.0.0, with no luck. Not sure what else to try; maybe it's an engine thing? I'm running on Linux.
Anyway, not a huge priority, since I do have a workaround. Just thought it was worth reporting.
Are you able to slice that data (i.e. take the first 5 rows) and save it as a Parquet file that also fails for you? Then we could check that file into Git and add it as a test case in this repo.
It's good that reading from IPC works, but I do want to make sure that arrow-js-ffi is stable!
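For example, one way to produce such a slice in JS would be something like the following (an untested sketch; it assumes parquet-wasm's documented Table.fromIPCStream / writeParquet API, and the file names are placeholders):

import { readFileSync, writeFileSync } from "fs";
import * as arrow from "apache-arrow";
import { readParquet, writeParquet, Table } from "parquet-wasm";

// Round-trip the original file through Arrow IPC so we can slice it in JS.
// "320.parquet" is a placeholder for the file downloaded from the URL above.
const parquetBytes = new Uint8Array(readFileSync("320.parquet"));
const table = arrow.tableFromIPC(readParquet(parquetBytes).intoIPCStream());

// Keep only the first 5 rows and write them back out as Parquet.
const sliced = table.slice(0, 5);
const slicedWasm = Table.fromIPCStream(arrow.tableToIPC(sliced, "stream"));
writeFileSync("320-sliced.parquet", writeParquet(slicedWasm));

PyArrow works just as well for this, of course.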
I sliced 5 rows with PyArrow, saved them to disk, then tried FFI again with the new file. No dice, it still fails.
Here's the sliced file: https://mega.nz/file/CRsFDJrC#3lRSoohQ1kohnqzX0O0TmVtjrsfgKRgj0KMLzxf2nU8
Ok, cool, thanks for making that file.
For reference, I find it much easier to zip a Parquet file and share it via GitHub in the issue itself.
I ran into the same error today and traced it back to the same line. My quick & dirty fix was to multiply byteWidth by 2 in the node_modules code ¯\_(ツ)_/¯. I've just started working with Arrow/Parquet, so this is more a question than a suggestion, but are Timestamps ever not BigInt64? It would make sense to have Int32 with the different units, so I mostly ask because of this definition in arrow-js:
interface Timestamp_<T extends Timestamps = Timestamps> extends DataType<T> {
  TArray: BigInt64Array;
  TValue: number;
  ArrayType: BigIntArrayConstructor<BigInt64Array>;
}
but that's as far as my investigation went (so far). Happy to help figure this out!
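For what it's worth, the error itself is easy to demonstrate in isolation: if the copy length for a 64-bit column is computed with a 4-byte width, the resulting buffer is half the required size and (for odd row counts) not a multiple of 8. This is just an illustration with made-up numbers, not arrow-js-ffi's actual code:

// Hypothetical numbers purely to illustrate the failure mode.
const length = 5;         // rows in the timestamp column
const wrongByteWidth = 4; // too small for a 64-bit timestamp value
const buf = new ArrayBuffer(length * wrongByteWidth); // 20 bytes
// Throws: RangeError: byte length of BigInt64Array should be a multiple of 8
new BigInt64Array(buf);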
Versions
- Node: v22.13.0
- arrow-js-ffi: 0.4.2
- parquet-wasm: 0.6.1
It looks like that is a mistake. Timestamp and Duration types should always be int64, while Time types can be int32 or int64, so the byteWidth for Timestamp and Duration should always be 8.
Would you like to make a PR?
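For reference, a rough TypeScript sketch of the mapping described above (a hypothetical helper built on apache-arrow's DataType type guards, not arrow-js-ffi's actual internals):

import * as arrow from "apache-arrow";

// Hypothetical helper illustrating the rule above: Timestamp and Duration are
// always 64-bit, while Time and Int carry their own bit width on the type.
function primitiveByteWidth(type: arrow.DataType): number {
  if (arrow.DataType.isTimestamp(type) || arrow.DataType.isDuration(type)) {
    return 8; // always int64, regardless of TimeUnit
  }
  if (arrow.DataType.isTime(type) || arrow.DataType.isInt(type)) {
    return type.bitWidth / 8; // Time32/Time64, Int8..Int64
  }
  throw new Error(`no fixed byte width for ${type}`);
}

console.log(primitiveByteWidth(new arrow.TimestampSecond())); // 8
console.log(primitiveByteWidth(new arrow.TimeSecond()));      // 4 (time32[s])
console.log(primitiveByteWidth(new arrow.Int64()));           // 8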