Reading from stdin doesn't work
I tried the new releases but get an error.
> csv2arrow data/simple.csv -n
Schema:
{
"fields": [
{
"name": "a",
"data_type": "Int64",
"nullable": true,
"dict_id": 0,
"dict_is_ordered": false,
"metadata": {}
},
{
"name": "b",
"data_type": "Boolean",
"nullable": true,
"dict_id": 0,
"dict_is_ordered": false,
"metadata": {}
}
],
"metadata": {}
}
> cat data/simple.csv | csv2arrow /dev/stdin -n
Error: SchemaError("Error inferring schema: Io error: Seeking outside of buffer, please report to https://github.com/domoritz/arrow-tools/issues/new")
I am on macOS.
I'm also hitting this on macOS with json2arrow, though with a slightly different error:
% cat test.json | json2arrow > test.arrow
error: the following required arguments were not provided:
<JSON>
Usage: json2arrow <JSON> [ARROW]
For more information, try '--help'.
% json2arrow --help
Usage: json2arrow [OPTIONS] <JSON> [ARROW]
Arguments:
<JSON> Input JSON file, stdin if not present
[ARROW] Output file, stdout if not present
Options:
-s, --schema-file <SCHEMA_FILE>
File with Arrow schema in JSON format
-m, --max-read-records <MAX_READ_RECORDS>
The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed
-p, --print-schema
Print the schema to stderr
-n, --dry
Only print the schema
-h, --help
Print help
-V, --version
Print version
% json2arrow -V
json2arrow 0.17.10
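The help text above describes `<JSON>` as "stdin if not present", while the usage line treats it as required. As a rough illustration of the documented behaviour, here is a minimal sketch using only the Rust standard library. The function name `open_input` is hypothetical and is not taken from the json2arrow source.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Hypothetical helper (not from the json2arrow source): open the given
/// path for reading, or fall back to stdin when no path was supplied.
pub fn open_input(path: Option<&str>) -> io::Result<Box<dyn Read>> {
    match path {
        // No positional argument: read from standard input.
        None => Ok(Box::new(io::stdin())),
        // A path was given: open it as a regular file.
        Some(p) => Ok(Box::new(File::open(p)?)),
    }
}
```

With something like this, the argument parser would only need to treat the input path as optional for `cat test.json | json2arrow` to be accepted.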
Yeah, for now don't use stdin. Would love a pull request to fix this.
Is it the same across all the tools?
Pretty much, yes. The seekable reader we use across the tools lives in a shared crate at https://github.com/domoritz/arrow-tools/tree/main/crates/arrow-tools. We can move any shared functionality into that crate.
This seems to work now.
$ cat data/simple.csv | csv2arrow /dev/stdin -n
Schema:
{
"fields": [
{
"name": "a",
"data_type": "Int64",
"nullable": true,
"dict_id": 0,
"dict_is_ordered": false,
"metadata": {}
},
{
"name": "b",
"data_type": "Boolean",
"nullable": true,
"dict_id": 0,
"dict_is_ordered": false,
"metadata": {}
}
],
"metadata": {}
}
What was the fix commit or PR?
I don't know. Could you do a git bisect to find out?
@domoritz no, sorry.
Unfortunately I'm still seeing this error with json2arrow 0.18.1 on macOS 13.6.9.
This fails:
% echo '{"a": 1, "b": 2}' | jq -c 'to_entries|.[]' | json2arrow -n /dev/stdin
Error: SchemaError("Error inferring schema: Io error: Seeking outside of buffer, please report to https://github.com/domoritz/arrow-tools/issues/new")
This works:
% echo '{"a": 1, "b": 2}' | jq -c 'to_entries|.[]' > test.ndjson
% json2arrow -n test.ndjson
Schema:
{
"fields": [
{
"name": "key",
"data_type": "Utf8",
"nullable": true,
"dict_id": 0,
"dict_is_ordered": false,
"metadata": {}
},
{
"name": "value",
"data_type": "Int64",
"nullable": true,
"dict_id": 0,
"dict_is_ordered": false,
"metadata": {}
}
],
"metadata": {}
}
I also tried running in Debian, with the same result:
docker run -it --rm debian
apt update && apt install -y curl xz-utils jq
curl -L https://github.com/domoritz/arrow-tools/releases/download/v0.18.1/json2arrow-x86_64-unknown-linux-gnu.tar.xz | unxz | tar x
echo '{"a": 1, "b": 2}' | jq -c 'to_entries|.[]' | ./json2arrow-x86_64-unknown-linux-gnu/json2arrow -n /dev/stdin
I think this check might need to change:
- if self.pos < self.buffered_bytes {
+ if self.pos <= self.buffered_bytes {
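The code surrounding that check is not shown in this thread, so the following is only a self-contained sketch of why the boundary matters, assuming the check compares a position against the number of buffered bytes. `RewindableReader` and its fields are hypothetical names, not the crate's actual types. The point it illustrates: a position equal to the buffered length is still valid (it is the end of the buffered data), so the bound has to be inclusive.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical reader (not the arrow-tools implementation) that remembers
/// every byte it has handed out, so a pipe can be rewound.
pub struct RewindableReader<R: Read> {
    inner: R,
    buffer: Vec<u8>, // all bytes read from `inner` so far
    pos: usize,      // current logical position within the stream
}

impl<R: Read> RewindableReader<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, buffer: Vec::new(), pos: 0 }
    }
}

impl<R: Read> Read for RewindableReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Replay already-buffered bytes first.
        if self.pos < self.buffer.len() {
            let n = out.len().min(self.buffer.len() - self.pos);
            out[..n].copy_from_slice(&self.buffer[self.pos..self.pos + n]);
            self.pos += n;
            return Ok(n);
        }
        // Otherwise pull fresh bytes from the source and remember them.
        let n = self.inner.read(out)?;
        self.buffer.extend_from_slice(&out[..n]);
        self.pos += n;
        Ok(n)
    }
}

impl<R: Read> Seek for RewindableReader<R> {
    fn seek(&mut self, target: SeekFrom) -> io::Result<u64> {
        let outside = || io::Error::new(io::ErrorKind::InvalidInput, "seeking outside of buffer");
        let new_pos = match target {
            SeekFrom::Start(p) => p as i128,
            SeekFrom::Current(d) => self.pos as i128 + d as i128,
            // The total length of a pipe is unknown, so this sketch rejects it.
            SeekFrom::End(_) => return Err(outside()),
        };
        // Inclusive upper bound: a position equal to the buffered length is
        // the end of the buffered data and is still a valid place to be.
        // An exclusive bound (`<`) would reject it.
        if new_pos < 0 || new_pos > self.buffer.len() as i128 {
            return Err(outside());
        }
        self.pos = new_pos as usize;
        Ok(self.pos as u64)
    }
}
```

In this sketch, reading the whole input and then calling `seek(SeekFrom::Current(0))` lands exactly on the buffered length and succeeds, while seeking one byte further returns an error.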
Thanks, that seems to fix it.
arrow-tools/crates/json2arrow on main [!⇡1] via 🦀 v1.81.0 ❯ echo '{"a": 1, "b": 2}' | jq -c 'to_entries|.[]' | cargo run -- -n /dev/stdin
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.05s
Running `/Users/dominik/Developer/arrow-tools/target/debug/json2arrow -n /dev/stdin`
Schema:
{
"fields": [
{
"name": "key",
"data_type": "Utf8",
"nullable": true,
"dict_id": 0,
"dict_is_ordered": false,
"metadata": {}
},
{
"name": "value",
"data_type": "Int64",
"nullable": true,
"dict_id": 0,
"dict_is_ordered": false,
"metadata": {}
}
],
"metadata": {}
}
arrow-tools/crates/json2arrow on main [!⇡1] via 🦀 v1.81.0 ❯ echo '{"a": 1, "b": 2}' | jq -c 'to_entries|.[]' | json2arrow -n /dev/stdin
Error: SchemaError("Error inferring schema: Io error: Seeking outside of buffer, please report to https://github.com/domoritz/arrow-tools/issues/new")