feat: support read_parquet for backend with no native support

Open jitingxu1 opened this issue 1 year ago • 9 comments

Description of changes

Support read_parquet for backends that do not have native support of the kind DuckDB has. This implementation leverages the PyArrow read_table function.

If a backend does not have its own version, it will fall back on this pyarrow implementation.
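The fallback described above can be pictured with a minimal sketch. The class names, the `tables` dict, and the `table_name` parameter are hypothetical and are not taken from the ibis code base; the only real library call is `pyarrow.parquet.read_table`.

```python
class BaseBackend:
    """Hypothetical base class; names are illustrative, not ibis internals."""

    def __init__(self):
        # Stand-in for whatever the backend uses to register tables.
        self.tables = {}

    def read_parquet(self, path, table_name="t", **kwargs):
        # Default implementation: load the file with PyArrow and register it.
        # Imported lazily so only backends using the fallback need pyarrow.
        import pyarrow.parquet as pq

        self.tables[table_name] = pq.read_table(path, **kwargs)
        return table_name


class NativeBackend(BaseBackend):
    def read_parquet(self, path, table_name="t", **kwargs):
        # A backend with its own reader overrides the default.
        self.tables[table_name] = f"native scan of {path}"
        return table_name


class PlainBackend(BaseBackend):
    # No override, so read_parquet resolves to the PyArrow fallback.
    pass
```

With this layout, a backend opts out of the fallback simply by defining its own `read_parquet`.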

Issues closed

This addresses part of issue #9448. Additional tasks related to this issue will be completed and submitted individually.

jitingxu1 avatar Aug 01 '24 00:08 jitingxu1

@jitingxu1 This PR has a lot of failures. Can you take a look so we can decide how to move forward?

cpcloud avatar Aug 01 '24 17:08 cpcloud

Rewrote it to support URLs as input:

  • regular URLs: https, ftp, and so on
  • fsspec-compatible URLs: s3, gcp
  • local files
    • supported: single file, directory, glob patterns
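As a rough illustration of the input kinds listed above, here is a sketch of how a path string might be classified before reading. `classify_source` is a hypothetical helper written for this example, not the PR's code, and the assumption that a directory means "all `*.parquet` files directly inside it" is mine.

```python
from __future__ import annotations

import glob
from pathlib import Path
from urllib.parse import urlparse


def classify_source(path: str) -> tuple[str, list[str]]:
    """Return (kind, paths) for a parquet input string (hypothetical helper)."""
    scheme = urlparse(path).scheme.lower()
    # A one-letter "scheme" is treated as a Windows drive letter, e.g. C:\data.
    if len(scheme) > 1 and scheme != "file":
        kind = "url" if scheme in {"http", "https", "ftp"} else "fsspec"
        return kind, [path]

    local = path[len("file://"):] if path.startswith("file://") else path
    if any(ch in local for ch in "*?["):
        # Glob pattern: expand to the matching files, in sorted order.
        return "glob", sorted(glob.glob(local))
    if Path(local).is_dir():
        # Assumption: a directory means every *.parquet file directly in it.
        return "directory", sorted(str(f) for f in Path(local).glob("*.parquet"))
    return "file", [local]
```

A reader could then pick a loader per kind, for example an fsspec filesystem for the "fsspec" kind and a plain local read for the rest.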

@cpcloud, this is ready for review again. Thanks.

jitingxu1 avatar Aug 08 '24 16:08 jitingxu1

Increased the test coverage. @cpcloud

jitingxu1 avatar Aug 21 '24 16:08 jitingxu1

Hi @cpcloud,

I got several timeout errors in the CI for this PR. Is there something we need to fix in another PR?

FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_74[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_72[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_94[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_92[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_07[trino]@tpcds - Failed: Timeout >90.0s

Is it related to the Trino setup in the CI?

jitingxu1 avatar Sep 19 '24 17:09 jitingxu1

Hi @cpcloud,

In this PR, I added a Trino/Impala test for read_parquet that reads about 7300 rows from functional_alltypes.parquet (inserting more than 10k rows seems to cause a performance issue). I suspect that reading this large parquet file in test_read_parquet degrades the Trino database's performance and causes the timeout errors below in other tests. Does this make sense?

Should I skip Trino and Impala in this test too, or do you have a better way to handle this?

These are the timeout errors from the CI for this PR:

FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_74[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_72[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_94[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_92[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_07[trino]@tpcds - Failed: Timeout >90.0s

Is it related to the Trino setup in the CI?

jitingxu1 avatar Sep 20 '24 19:09 jitingxu1

I'm going to try something here to see if I can isolate which test is (sometimes) leaving us in a broken state, only on the nix osx runs.

gforsyth avatar Sep 23 '24 17:09 gforsyth

Ok, that's one pass for the nix mac osx job. I'm going to cycle it a few times to make sure.

gforsyth avatar Sep 23 '24 18:09 gforsyth

Ok, skipping the mocked URL test on DuckDB seems to have resolved the nested transaction failures on the nix osx CI job.

gforsyth avatar Sep 23 '24 18:09 gforsyth

Thank you so much.

jitingxu1 avatar Sep 24 '24 18:09 jitingxu1

This PR is stale and has a number of unaddressed issues.

cpcloud avatar Dec 17 '24 13:12 cpcloud