
ARROW-16719: [Python] Add path/URI + filesystem handling to parquet.read_metadata

Open kshitij12345 opened this issue 2 years ago • 10 comments

Add filesystem support to pq.read_metadata and pq.read_schema.

kshitij12345 avatar Jul 17 '22 08:07 kshitij12345

https://issues.apache.org/jira/browse/ARROW-16719

github-actions[bot] avatar Jul 17 '22 08:07 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Jul 17 '22 08:07 github-actions[bot]

macOS CI Failure looks unrelated.

kshitij12345 avatar Jul 17 '22 20:07 kshitij12345

Thank you for working on this issue @kshitij12345! LGTM +1

@jorisvandenbossche can you please have a look before we merge this PR?

AlenkaF avatar Jul 18 '22 18:07 AlenkaF

Failures look unrelated. Should I retrigger the CI?

kshitij12345 avatar Jul 20 '22 15:07 kshitij12345

I think some of the failures are related:

=================================== FAILURES ===================================
_______________________ test_metadata_schema_filesystem ________________________
tmpdir = local('/tmp/pytest-of-root/pytest-0/test_metadata_schema_filesyste0')
    def test_metadata_schema_filesystem(tmpdir):
        table = pa.table({"a": [1, 2, 3]})
        # URI writing to local file.
        fname = "data.parquet"
        file_path = 'file:///' + os.path.join(str(tmpdir), fname)
        pq.write_table(table, file_path)
        # Get expected `metadata` from path.
        metadata = pq.read_metadata(tmpdir / fname)
        schema = table.schema
        assert pq.read_metadata(file_path).equals(metadata)
>       assert pq.read_metadata(
            fname, filesystem=f'file:///{tmpdir}').equals(metadata)
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/parquet/test_metadata.py:553: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet/__init__.py:3425: in read_metadata
    file = ParquetFile(where, memory_map=memory_map,
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet/__init__.py:287: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:1225: in pyarrow._parquet.ParquetReader.open
    ???
pyarrow/io.pxi:1674: in pyarrow.lib.get_reader
    ???
pyarrow/io.pxi:1665: in pyarrow.lib.get_native_file
    ???
pyarrow/io.pxi:943: in pyarrow.lib.OSFile.__cinit__
    ???
pyarrow/io.pxi:953: in pyarrow.lib.OSFile._open_readable
    ???
pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   FileNotFoundError: [Errno 2] Failed to open local file 'data.parquet'. Detail: [errno 2] No such file or directory
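
The traceback ends in `OSFile` trying to open the bare name `data.parquet`, which suggests the `filesystem` argument was not applied to the relative path at that point. Below is a stdlib-only sketch of the resolution the test seems to expect. `resolve_local_path` is a hypothetical helper written for this comment, not a pyarrow function, and it only handles `file://` URIs on POSIX-style paths.

```python
import posixpath
from urllib.parse import unquote, urlparse


def resolve_local_path(where, filesystem=None):
    """Hypothetical helper: join `where` onto a file:// base URI if one is given."""
    if filesystem is None:
        # No filesystem given: the path is used as-is (relative to the cwd).
        return where
    parsed = urlparse(filesystem)
    if parsed.scheme != "file":
        raise ValueError(f"only file:// URIs are handled in this sketch: {filesystem!r}")
    # Decode percent-escapes and treat the URI path as the base directory.
    base = unquote(parsed.path)
    return posixpath.join(base, where)


# Without a filesystem the bare name is looked up in the current directory,
# which matches the FileNotFoundError in the traceback.
print(resolve_local_path("data.parquet"))
# With a base URI the relative name is anchored under that directory.
print(resolve_local_path("data.parquet", "file:///tmp/pytest-0"))
```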

AlenkaF avatar Jul 20 '22 17:07 AlenkaF

Thanks for catching that @AlenkaF!

~~And Windows strikes 😓! I don't have access to a Windows system. I think file:/// is not handled by Windows. Do you have any recommendations, or is it OK to skip that particular approach on Windows?~~

It looks like this is happening on other platforms as well. My bad, I misread the CI logs.

kshitij12345 avatar Jul 20 '22 17:07 kshitij12345

Gentle ping @jorisvandenbossche @AlenkaF

kshitij12345 avatar Jul 27 '22 12:07 kshitij12345

Gentle ping :)

kshitij12345 avatar Aug 04 '22 12:08 kshitij12345

@kshitij12345 there are some test failures that actually seem related

jorisvandenbossche avatar Aug 09 '22 18:08 jorisvandenbossche

@jorisvandenbossche CI failure looks irrelevant. PTAL :)

kshitij12345 avatar Aug 16 '22 13:08 kshitij12345

I pushed a small additional update to the test (mainly changing to use our internal tempdir fixture instead of tmpdir)

Thanks again for the PR @kshitij12345 !

jorisvandenbossche avatar Aug 17 '22 11:08 jorisvandenbossche

Benchmark runs are scheduled for baseline = f6127fca7ade9665f31493d37929346e651ed0e4 and contender = 42ed37e3fc84465f365531e611f1bf632b599e7b. 42ed37e3fc84465f365531e611f1bf632b599e7b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Finished :arrow_down:3.44% :arrow_up:2.89%] test-mac-arm
[Failed :arrow_down:4.38% :arrow_up:1.92%] ursa-i9-9960x
[Finished :arrow_down:5.12% :arrow_up:2.42%] ursa-thinkcentre-m75q

Buildkite builds:
[Finished] 42ed37e3 ec2-t3-xlarge-us-east-2
[Finished] 42ed37e3 test-mac-arm
[Failed] 42ed37e3 ursa-i9-9960x
[Finished] 42ed37e3 ursa-thinkcentre-m75q
[Finished] f6127fca ec2-t3-xlarge-us-east-2
[Finished] f6127fca test-mac-arm
[Failed] f6127fca ursa-i9-9960x
[Finished] f6127fca ursa-thinkcentre-m75q

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Aug 17 '22 16:08 ursabot

['Python', 'R'] benchmarks have high level of regressions. test-mac-arm ursa-i9-9960x

ursabot avatar Aug 17 '22 16:08 ursabot

@jorisvandenbossche Thank you very much :)

kshitij12345 avatar Aug 18 '22 07:08 kshitij12345