arrow ARROW-13763: [Python] Close files in ParquetFile & ParquetDatasetPiece

Will fix ARROW-13763

A separate Jira issue will be made to address closing files in V2 ParquetDataset, which needs to be handled in the C++ layer.

Adds context manager to pq.ParquetFile to close input file, and ensure reads within pq.ParquetDataset and pq.read_table are closed.


# user opened file-like object will not be closed
with open('file.parquet', 'rb') as f:
    with pq.ParquetFile(f) as p:
        table = p.read()
        assert not f.closed  # did not inadvertently close the open file
        assert not p.closed
    assert not f.closed      # parquet context exit didn't close it
    assert not p.closed      # references the input file status
assert f.closed              # normal context exit close
assert p.closed              # ...

# path-like will be closed upon exit or `ParquetFile.close`
with pq.ParquetFile('file.parquet') as p:
    table = p.read()
    assert not p.closed
assert p.closed

Aug 09 '22 11:08 milesgranger

https://issues.apache.org/jira/browse/ARROW-13763

Aug 09 '22 11:08 github-actions[bot]

But in the case the user passes an file object, we should maybe not close it? (while such a PythonFile object will pass through a close() call to the underlying file object)

So this would imply calling ParquetFile.close would also not close the passed file? So my second example would be this?

p = pq.ParquetFile(buf)
table = p.read()
assert buf.closed is False

p.close()  # In the situation of a passed in open file object, this call would mean nothing, right?
assert buf.closed is False

Aug 10 '22 04:08 milesgranger

A separate Jira issue will be made to address closing files in V2 ParquetDataset, which needs to be handled in the C++ layer.

Has that issue been opened already?

Aug 17 '22 11:08 pitrou

It seems ARROW-16421 might cross over enough with the intended issue. See recent comments there on whether explicit closing ought to be done in C++.

Aug 17 '22 11:08 milesgranger

The windows failure is happening on master and other PRs as well.

Aug 17 '22 11:08 jorisvandenbossche

Benchmark runs are scheduled for baseline = b0422e5feb6d335f8e726f6221d42f2a044cd02c and contender = 951663a41c183c8fec5a4da9a8f9daf45ed85451. 951663a41c183c8fec5a4da9a8f9daf45ed85451 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed :arrow_down:3.95% :arrow_up:0.85%] test-mac-arm [Failed :arrow_down:3.01% :arrow_up:0.27%] ursa-i9-9960x [Finished :arrow_down:1.57% :arrow_up:0.92%] ursa-thinkcentre-m75q Buildkite builds: [Finished] 951663a4 ec2-t3-xlarge-us-east-2 [Finished] 951663a4 test-mac-arm [Failed] 951663a4 ursa-i9-9960x [Finished] 951663a4 ursa-thinkcentre-m75q [Finished] b0422e5f ec2-t3-xlarge-us-east-2 [Failed] b0422e5f test-mac-arm [Failed] b0422e5f ursa-i9-9960x [Finished] b0422e5f ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Aug 17 '22 20:08 ursabot

['Python', 'R'] benchmarks have high level of regressions. test-mac-arm ursa-i9-9960x

Aug 17 '22 20:08 ursabot

arrow arrow copied to clipboard

ARROW-13763: [Python] Close files in ParquetFile & ParquetDatasetPiece

arrow
arrow copied to clipboard