GH-31186: [C++] file: Enable FIFO as an OSFile path.

Open pspoerri opened this issue 1 year ago • 5 comments

Rationale for this change

Allow reading from non-seekable FIFO paths (e.g. stdin).

Example: the following code snippet currently fails:

from pyarrow import csv, input_stream

stdin = input_stream('/dev/stdin')
data = csv.read_csv(stdin)
print(data)

Running this code will always trigger an OSError:

# cat test.csv | python test2.py
Traceback (most recent call last):
  File "/mnt/nvme0n1/psp/arrow-test/test.py", line 4, in <module>
    stdin = input_stream('/dev/stdin')
  File "pyarrow/io.pxi", line 2690, in pyarrow.lib.input_stream
  File "pyarrow/io.pxi", line 1164, in pyarrow.lib.OSFile.__cinit__
  File "pyarrow/io.pxi", line 1176, in pyarrow.lib.OSFile._open_readable
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: lseek failed

To work around this limitation, one has to read stdin through a Python file object:

import sys
import os

from pyarrow import csv

stdin = os.fdopen(sys.stdin.fileno(), "rb")
data = csv.read_csv(stdin)
print(data)

Example csv:

customer_id,customer
1,customer1
2,customer2

What changes are included in this PR?

Set the size of the OSFile to -1 if the stream is not seekable. The existing code already uses this value to mark non-seekable file descriptors: https://github.com/pspoerri/arrow/blob/main/cpp/src/arrow/io/file.cc#L94-L95
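
For illustration only, the idea can be sketched in Python rather than in the C++ of the actual change. The helper name `probe_size` is hypothetical and is not part of Arrow; the sketch assumes a POSIX system, where `lseek` on a pipe or FIFO fails with an `OSError`.

```python
import os


def probe_size(fd: int) -> int:
    """Return the size in bytes of `fd`, or -1 if `fd` is not seekable.

    Hypothetical helper: it mirrors the "size = -1 for non-seekable
    streams" idea described above and is not Arrow's implementation.
    """
    try:
        # Remember the current offset; this call fails on pipes and FIFOs.
        current = os.lseek(fd, 0, os.SEEK_CUR)
    except OSError:
        return -1
    # Seek to the end to learn the size, then restore the original offset.
    end = os.lseek(fd, 0, os.SEEK_END)
    os.lseek(fd, current, os.SEEK_SET)
    return end
```

A caller can then treat a result of -1 as "stream only": read sequentially and never seek.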

Are these changes tested?

I tested the code samples above.

Are there any user-facing changes?

Not that I am aware of.

  • GitHub Issue: #31186

pspoerri avatar May 16 '24 14:05 pspoerri

:warning: GitHub issue #31186 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar May 16 '24 14:05 github-actions[bot]

Note: it's not obvious that we should support this at all. ReadableFile implements the RandomAccessFile interface. This implies you can call not only Seek but also ReadAt on the resulting file.

pitrou avatar May 23 '24 15:05 pitrou
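
To make the concern above concrete, here is a minimal sketch at the OS level, assuming a POSIX system. It uses an anonymous pipe as a stand-in for a FIFO and plain `os` calls rather than any Arrow API; the helper `fails_with_espipe` is hypothetical. Both seeking and positional reads are rejected, while sequential reads work.

```python
import errno
import os

# An anonymous pipe behaves like a FIFO for the purposes of this sketch.
r, w = os.pipe()
os.write(w, b"a,b\n1,2\n")
os.close(w)


def fails_with_espipe(op) -> bool:
    """Return True if calling `op` raises OSError with errno ESPIPE."""
    try:
        op()
    except OSError as exc:
        return exc.errno == errno.ESPIPE
    return False


# Rough analogue of Seek: repositioning the descriptor is not possible.
seek_fails = fails_with_espipe(lambda: os.lseek(r, 0, os.SEEK_SET))
# Rough analogue of ReadAt: a positional read is not possible either.
pread_fails = fails_with_espipe(lambda: os.pread(r, 4, 0))

# A plain sequential read still returns the data written to the pipe.
data = os.read(r, 64)
os.close(r)
```

This is why a file opened on a FIFO can only honor the sequential-read part of a random-access interface.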