GH-31186: [C++] file: Enable FIFO as an OSFile path.
Rationale for this change
Allow reading from non-seekable FIFO paths (e.g. stdin).
Example: the following snippet currently fails:
from pyarrow import csv, input_stream
stdin = input_stream('/dev/stdin')
data = csv.read_csv(stdin)
print(data)
Running this code will always trigger an OSError:
# cat test.csv | python test2.py
Traceback (most recent call last):
File "/mnt/nvme0n1/psp/arrow-test/test.py", line 4, in <module>
stdin = input_stream('/dev/stdin')
File "pyarrow/io.pxi", line 2690, in pyarrow.lib.input_stream
File "pyarrow/io.pxi", line 1164, in pyarrow.lib.OSFile.__cinit__
File "pyarrow/io.pxi", line 1176, in pyarrow.lib.OSFile._open_readable
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: lseek failed
To work around this, one has to read stdin through a Python file object:
import sys
import os
from pyarrow import csv
stdin = os.fdopen(sys.stdin.fileno(), "rb")
data = csv.read_csv(stdin)
print(data)
Example csv:
customer_id,customer
1,customer1
2,customer2
What changes are included in this PR?
Set the size of the OSFile to -1 if the stream is not seekable. The same value is already used to mark non-seekable file descriptors here: https://github.com/pspoerri/arrow/blob/main/cpp/src/arrow/io/file.cc#L94-L95
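For illustration only, the idea can be sketched in Python rather than in the actual C++ change. The helper name `probe_size` is hypothetical and not part of Arrow; the sketch assumes a POSIX system, where `lseek` on a pipe or FIFO fails with `ESPIPE`.

```python
import errno
import os


def probe_size(fd: int) -> int:
    """Return the size of fd in bytes, or -1 if fd is not seekable.

    Hypothetical helper mirroring the "size = -1 for non-seekable
    streams" convention described above; not Arrow's implementation.
    """
    try:
        # A no-op seek: succeeds on regular files, fails on pipes/FIFOs.
        os.lseek(fd, 0, os.SEEK_CUR)
    except OSError as exc:
        if exc.errno == errno.ESPIPE:
            return -1  # non-seekable: size is unknown
        raise
    return os.fstat(fd).st_size


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    print(probe_size(read_end))  # a pipe is not seekable
    os.close(read_end)
    os.close(write_end)
```

With a size of -1, callers have to read until end of stream instead of relying on a known length.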
Are these changes tested?
I tested the code samples above.
Are there any user-facing changes?
Not that I am aware of.
- GitHub Issue: #31186
:warning: GitHub issue #31186 has been automatically assigned in GitHub to PR creator.
Note: it's not obvious that we should support this at all. ReadableFile implements the RandomAccessFile interface, which implies that callers can use not only Seek but also ReadAt on the resulting file, and neither operation works on a FIFO.
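A rough way to see the concern, sketched in Python under the assumption of a POSIX system: positional reads (`pread`, the closest OS-level analogue of ReadAt) are rejected on pipes with `ESPIPE`, just like `lseek`. The helper name `supports_read_at` is hypothetical and only serves this illustration.

```python
import errno
import os


def supports_read_at(fd: int) -> bool:
    """Return True if positional reads work on fd.

    Hypothetical helper for illustration; it probes the descriptor
    with a one-byte pread at offset 0.
    """
    try:
        os.pread(fd, 1, 0)
    except OSError as exc:
        if exc.errno == errno.ESPIPE:
            return False  # pipe or FIFO: no random access
        raise
    return True


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    print(supports_read_at(read_end))
    os.close(read_end)
    os.close(write_end)
```

So a file object backed by a FIFO can honor only the sequential part of the interface, which is the gap the note points at.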