feat: support read_parquet for backend with no native support
Description of changes
Support read_parquet for backends that do not have native support (like duckdb). This implementation leverages the PyArrow read_table function.
If a backend does not have its own version, it will fall back on this pyarrow implementation.
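For illustration, the fallback described above could look roughly like the sketch below. `FallbackBackend`, `NativeBackend`, the `loader` parameter, and the in-memory `tables` registry are hypothetical names and are not the ibis API; the only real library call is `pyarrow.parquet.read_table`.

```python
from typing import Any, Callable, Dict, Optional


def _pyarrow_read_table(path: str, **kwargs: Any) -> Any:
    """Load a Parquet source into an in-memory pyarrow.Table."""
    # Imported lazily so backends with a native reader never need pyarrow.
    import pyarrow.parquet as pq

    return pq.read_table(path, **kwargs)


class FallbackBackend:
    """Hypothetical backend without a native Parquet reader."""

    def __init__(self, loader: Callable[..., Any] = _pyarrow_read_table) -> None:
        self._loader = loader
        self.tables: Dict[str, Any] = {}  # stand-in for the backend's catalog

    def read_parquet(
        self, path: str, table_name: Optional[str] = None, **kwargs: Any
    ) -> str:
        # Generic path: read with PyArrow, then register the result.
        table = self._loader(path, **kwargs)
        name = table_name or f"parquet_{len(self.tables)}"
        self.tables[name] = table
        return name


class NativeBackend(FallbackBackend):
    """Hypothetical backend that overrides the fallback with its own reader."""

    def read_parquet(
        self, path: str, table_name: Optional[str] = None, **kwargs: Any
    ) -> str:
        # A real backend would call its engine's Parquet reader here.
        return f"native:{path}"
```

A backend that defines its own `read_parquet` never reaches the PyArrow path; every other backend inherits it unchanged.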
Issues closed
This addresses part of issue #9448. Additional tasks related to this issue will be completed and submitted individually.
@jitingxu1 This PR has a lot of failures. Can you take a look so we can decide how to move forward?
Rewrote it to support URLs as input:
- regular URLs: https, ftp, and so on
- fsspec-compatible URLs: s3, gcp
- local files
- supported inputs: single file, directory, glob patterns
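As a rough sketch of how the input kinds listed above could be told apart before reading, the helper below classifies a path by location and shape. `classify_source` and the scheme sets are hypothetical and are not taken from this PR; `gs`/`gcs` are assumed here as the protocol names for Google Cloud Storage.

```python
import os
from typing import Tuple
from urllib.parse import urlparse

PLAIN_URL_SCHEMES = {"http", "https", "ftp"}
FSSPEC_SCHEMES = {"s3", "gs", "gcs"}  # assumption: extend as needed
GLOB_CHARS = "*?["


def classify_source(path: str) -> Tuple[str, str]:
    """Return (location, shape) for a Parquet source string."""
    scheme = urlparse(path).scheme.lower()
    # A one-letter "scheme" is a Windows drive letter, so treat it as local.
    if len(scheme) <= 1 or scheme == "file":
        location = "local"
    elif scheme in PLAIN_URL_SCHEMES:
        location = "url"
    elif scheme in FSSPEC_SCHEMES:
        location = "fsspec"
    else:
        raise ValueError(f"unsupported scheme: {scheme!r}")

    if location == "url":
        # "?" in a plain URL is a query string, not a glob wildcard.
        shape = "file"
    elif any(ch in path for ch in GLOB_CHARS):
        shape = "glob"
    elif path.endswith("/") or (location == "local" and os.path.isdir(path)):
        shape = "directory"
    else:
        shape = "file"
    return location, shape
```

For example, `classify_source("s3://bucket/prefix/*.parquet")` yields `("fsspec", "glob")`, while a plain `https` URL is always treated as a single file.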
@cpcloud ready for review again, thanks.
Increased the test coverage. @cpcloud
Hi @cpcloud,
I got several timeout errors in the CI for this PR. Is there something we need to fix in another PR?
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_74[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_72[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_94[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_92[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_07[trino]@tpcds - Failed: Timeout >90.0s
Is it related to the Trino setup in the CI?
Hi @cpcloud
In this PR, I added a Trino/Impala test for read_parquet. It reads about 7300 rows from functional_alltypes.parquet (it seems that inserting more than 10k rows causes a performance issue). I suspect this degrades the Trino database's performance: the timeout errors reported above show up in other tests, and I think they are caused by reading the large Parquet file in test_read_parquet. Does this make sense?
Should I skip Trino and Impala in this test too? Or do you have a better way to handle this?
I'm going to try something here to see if I can isolate which test is leaving us in a (sometimes) broken state only on the nix osx runs
Ok, that's one pass for the nix mac osx job. I'm going to cycle it a few times to make sure.
Ok, skipping the mocked URL test on DuckDB seems to have resolved the nested transaction failures on the nix osx CI job
Thank you so much.
This PR is stale and had a number of unaddressed issues.