Add load_as methods for pyarrow dataset and table

Open chitralverma opened this issue 2 years ago • 5 comments

Adds separate implementations of load_as_pyarrow_table and load_as_pyarrow_dataset that allow users to read Delta Sharing tables as a pyarrow Table and a pyarrow Dataset, respectively.

  • [x] Add basic implementation
  • [x] Fix lint
  • [x] Refactor common code
  • [x] Verify performance with and without limit
  • [x] Add tests - converter
  • [x] Add tests - reader
  • [ ] Add tests - delta_sharing
  • [x] Add examples
  • [ ] Fix review comments

closes https://github.com/delta-io/delta-sharing/issues/238

chitralverma avatar Dec 23 '22 08:12 chitralverma

@goodwillpunning @linzhou-db From the build logs I can see that the PYARROW_VERSION has been pinned to 4.x somewhere in the environment variables. This version of pyarrow came out in May, 2021 and since then there have been 6 major version releases.

It seems there are some API inconsistencies in the pinned 4.x version, which is causing the build to fail on GitHub even though the test cases pass locally. I also verified with versions 5.x to 10.x and was not able to reproduce the issue. Can you please unpin or upgrade this PYARROW_VERSION?
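
One way to confirm which pyarrow version a given environment actually resolves is to log it at the start of a test run. The sketch below is only an illustration: the helper names are hypothetical, and it assumes PYARROW_VERSION is exposed as an ordinary environment variable, as the build logs suggest.

```python
import os
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.split(".")[0])


def installed_major(package: str = "pyarrow") -> Optional[int]:
    """Major version of the installed package, or None if it is not installed."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


def describe_environment() -> str:
    """Summarize the pinned value (if any) next to the version actually installed."""
    pinned = os.environ.get("PYARROW_VERSION", "<unset>")
    return f"PYARROW_VERSION={pinned}, installed pyarrow major={installed_major()}"


print(describe_environment())
```

Running this both locally and in CI would show whether the two environments differ in the pyarrow major version they install.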

chitralverma avatar Dec 28 '22 16:12 chitralverma

Thanks @chitralverma , will take a look once back in Jan. cc @zsxwing

linzhou-db avatar Dec 29 '22 23:12 linzhou-db

Also, what's your thought on loading CDF in pyarrow? Is it something not needed for now?

linzhou-db avatar Jan 11 '23 20:01 linzhou-db

> Also, what's your thought on loading CDF in pyarrow? Is it something not needed for now?

I would prefer to raise a separate PR for CDF to keep things simple and concise; this PR is just for the data.

chitralverma avatar Jan 11 '23 21:01 chitralverma

@chitralverma @linzhou-db can we revive this PR?

ion-elgreco avatar Feb 22 '24 14:02 ion-elgreco