Reading from stdin doesn't work

Open domoritz opened this issue 3 years ago • 4 comments

I tried the new releases but get an error.

> csv2arrow data/simple.csv -n
Schema:

{
  "fields": [
    {
      "name": "a",
      "data_type": "Int64",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "b",
      "data_type": "Boolean",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}
> cat data/simple.csv | csv2arrow /dev/stdin -n
Error: SchemaError("Error inferring schema: Io error: Seeking outside of buffer, please report to https://github.com/domoritz/arrow-tools/issues/new")

I am on macOS.

domoritz avatar Apr 12 '23 03:04 domoritz

I'm also hitting this on macOS with json2arrow, though with a slightly different error:

% cat test.json | json2arrow > test.arrow
error: the following required arguments were not provided:
  <JSON>

Usage: json2arrow <JSON> [ARROW]

For more information, try '--help'.
% json2arrow --help
Usage: json2arrow [OPTIONS] <JSON> [ARROW]

Arguments:
  <JSON>   Input JSON file, stdin if not present
  [ARROW]  Output file, stdout if not present

Options:
  -s, --schema-file <SCHEMA_FILE>
          File with Arrow schema in JSON format
  -m, --max-read-records <MAX_READ_RECORDS>
          The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed
  -p, --print-schema
          Print the schema to stderr
  -n, --dry
          Only print the schema
  -h, --help
          Print help
  -V, --version
          Print version
% json2arrow -V
json2arrow 0.17.10

ukd1 avatar Mar 28 '24 20:03 ukd1

Yeah, for now don't use stdin. Would love a pull request to fix this.

domoritz avatar Mar 28 '24 21:03 domoritz

Is it the same across all the tools?

ukd1 avatar Mar 28 '24 21:03 ukd1

Pretty much, yes. The seekable reader we use across the libraries is in a shared crate in https://github.com/domoritz/arrow-tools/tree/main/crates/arrow-tools. We can move any shared functionality into that crate.

domoritz avatar Mar 29 '24 00:03 domoritz

This seems to work now.

$ cat data/simple.csv | csv2arrow /dev/stdin -n
Schema:

{
  "fields": [
    {
      "name": "a",
      "data_type": "Int64",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "b",
      "data_type": "Boolean",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}

domoritz avatar Jun 08 '24 15:06 domoritz

What was the fix commit / PR?

ukd1 avatar Jun 08 '24 17:06 ukd1

I don't know. Could you do a git bisect to find out?

domoritz avatar Jun 08 '24 19:06 domoritz

@domoritz no, sorry.

ukd1 avatar Jun 11 '24 15:06 ukd1

Unfortunately I'm still seeing this error with json2arrow 0.18.1 on macOS 13.6.9.

This fails:

% echo '{"a": 1, "b": 2}' | jq -c 'to_entries|.[]' | json2arrow -n /dev/stdin
Error: SchemaError("Error inferring schema: Io error: Seeking outside of buffer, please report to https://github.com/domoritz/arrow-tools/issues/new")

This works:

% echo '{"a": 1, "b": 2}' | jq -c 'to_entries|.[]' > test.ndjson             
% json2arrow -n test.ndjson 
Schema:
{
  "fields": [
    {
      "name": "key",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "value",
      "data_type": "Int64",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}

I also tried running in Debian, with the same result:

docker run -it --rm debian

apt update && apt install -y curl xz-utils jq
curl -L https://github.com/domoritz/arrow-tools/releases/download/v0.18.1/json2arrow-x86_64-unknown-linux-gnu.tar.xz | unxz | tar x
echo '{"a": 1, "b": 2}' | jq -c 'to_entries|.[]' | ./json2arrow-x86_64-unknown-linux-gnu/json2arrow -n /dev/stdin

mootari avatar Sep 14 '24 20:09 mootari

I think this check might need to change:

-            if self.pos < self.buffered_bytes {
+            if self.pos <= self.buffered_bytes {
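
To illustrate why the boundary matters, here is a minimal sketch. It is not the actual reader from the shared crate: `BufferedSeek`, its fields, the `inclusive` flag, and `seek_to_end_after_full_read` are hypothetical names made up for this example, and I have not checked which position the real check compares. The idea is only that once a short input has been read completely, the position equals the number of buffered bytes, so a strict `<` rejects a position that is still valid.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical, simplified reader that remembers everything it has read
/// from a non-seekable stream so callers can seek within that data.
struct BufferedSeek<R: Read> {
    inner: R,
    buffer: Vec<u8>, // all bytes read from `inner` so far
    pos: usize,      // current logical position
    inclusive: bool, // true: use `<=` in the bounds check, false: use `<`
}

impl<R: Read> BufferedSeek<R> {
    fn new(inner: R, inclusive: bool) -> Self {
        BufferedSeek { inner, buffer: Vec::new(), pos: 0, inclusive }
    }
}

impl<R: Read> Read for BufferedSeek<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Replay already-buffered bytes first.
        if self.pos < self.buffer.len() {
            let n = out.len().min(self.buffer.len() - self.pos);
            out[..n].copy_from_slice(&self.buffer[self.pos..self.pos + n]);
            self.pos += n;
            return Ok(n);
        }
        // Otherwise pull fresh bytes from the stream and remember them.
        let n = self.inner.read(out)?;
        self.buffer.extend_from_slice(&out[..n]);
        self.pos += n;
        Ok(n)
    }
}

impl<R: Read> Seek for BufferedSeek<R> {
    fn seek(&mut self, target: SeekFrom) -> io::Result<u64> {
        let new_pos = match target {
            SeekFrom::Start(p) => p as usize,
            // Kept small on purpose: other variants are out of scope here.
            _ => {
                return Err(io::Error::new(
                    io::ErrorKind::Unsupported,
                    "only SeekFrom::Start is supported in this sketch",
                ))
            }
        };
        let buffered_bytes = self.buffer.len();
        let in_bounds = if self.inclusive {
            new_pos <= buffered_bytes
        } else {
            new_pos < buffered_bytes
        };
        if in_bounds {
            self.pos = new_pos;
            Ok(new_pos as u64)
        } else {
            Err(io::Error::new(io::ErrorKind::Other, "Seeking outside of buffer"))
        }
    }
}

/// Reads the whole input, then seeks to the position right after the last
/// buffered byte, which is where a reader sits after consuming everything.
fn seek_to_end_after_full_read(input: &[u8], inclusive: bool) -> io::Result<u64> {
    let mut reader = BufferedSeek::new(input, inclusive);
    let mut sink = Vec::new();
    reader.read_to_end(&mut sink)?;
    reader.seek(SeekFrom::Start(sink.len() as u64))
}
```

With the strict comparison, `seek_to_end_after_full_read` returns the "Seeking outside of buffer" error for any fully consumed input; with `<=` it succeeds and returns the input length.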

mootari avatar Sep 15 '24 12:09 mootari

Thanks, that seems to fix it.

arrow-tools/crates/json2arrow on  main [!⇡1] via 🦀 v1.81.0 ❯ echo '{"a": 1, "b": 2}' | jq -c 'to_entries|.[]' | cargo run -- -n /dev/stdin
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.05s
     Running `/Users/dominik/Developer/arrow-tools/target/debug/json2arrow -n /dev/stdin`
Schema:
{
  "fields": [
    {
      "name": "key",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "value",
      "data_type": "Int64",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}
arrow-tools/crates/json2arrow on  main [!⇡1] via 🦀 v1.81.0 ❯ echo '{"a": 1, "b": 2}' | jq -c 'to_entries|.[]' | json2arrow -n /dev/stdin
Error: SchemaError("Error inferring schema: Io error: Seeking outside of buffer, please report to https://github.com/domoritz/arrow-tools/issues/new")

domoritz avatar Sep 18 '24 01:09 domoritz