polars icon indicating copy to clipboard operation
polars copied to clipboard

fix(python): fix delta issues

Open chitralverma opened this issue 2 years ago • 16 comments
trafficstars

Aims to fixes the following issues,

  • https://github.com/pola-rs/polars/issues/5790
  • https://github.com/pola-rs/polars/issues/5785

Checklist

  • [x] Fix imports issue
  • [x] Resolve paths before passing to delta for relative path support
  • [x] read_delta should rely on delta side implementation
  • [x] scan_delta should work as expected without relying on delta implementation.
  • [x] create internal test matrix of [local, s3, azure, gcs] X [read, scan] X [absolute, relative paths, absolute with external fs, relative with external fs]
  • [x] additional cases for relative paths
  • [x] perform tests in a separate env
  • [x] added example to read and scan from azure
  • [x] added example to read and scan from gcs

Notes

  1. Till the time https://github.com/delta-io/delta-rs/issues/1015 is not resolved, I have added a simple delta specific path resolution in scan_delta. Later we can remove this in favor of the implementation provided in delta-rs itself.
  2. read_delta now relies on the implementation provided in delta-rs itself.
  3. For GCS, Azure and S3 there is no relative path support, full URI must be provided.

Testing

  1. For LocalFS, GCS, Azure and S3, I've created an internal test matrix as described in the check list which contains over 35 additional tests other than the unit tests. I'll check how to mock different pyarrow.fs somehow and then add these to the unit tests later.
  2. This will matrix was also executed in a separate venv, after locally installing polars with fixes from a local full build (maturin build) as the import errors were not caught in the unit tests before.

Signed-off-by: chitralverma [email protected]

chitralverma avatar Dec 13 '22 17:12 chitralverma

CC @dominikpeter @stinodego @ritchie46

chitralverma avatar Dec 13 '22 21:12 chitralverma

I don't have any delta access. Could you give it a try @stinodego?

ritchie46 avatar Dec 14 '22 07:12 ritchie46

I'll check the implementation when I get home tonight :) Thanks for your work, @chitralverma!

stinodego avatar Dec 14 '22 11:12 stinodego

I'll check the implementation when I get home tonight :) Thanks for your work, @chitralverma!

Cool, thanks!

chitralverma avatar Dec 14 '22 13:12 chitralverma

In regards to Delta Lake write support, will that come later? Is there a plan for that too?

andrei-ionescu avatar Dec 14 '22 15:12 andrei-ionescu

In regards to Delta Lake write support, will that come later? Is there a plan for that too?

We could. Delta-rs accepts pyarrow recordbatches so we can feed it the arrow table.

ritchie46 avatar Dec 14 '22 15:12 ritchie46

@ritchie46: I'm interested in Rust approach. No Python for my use case. Will this come later? Any ticket or proposal for it?

andrei-ionescu avatar Dec 14 '22 15:12 andrei-ionescu

@ritchie46: I'm interested in Rust approach. No Python for my use case. Will this come later? Any ticket or proposal for it?

@andrei-ionescu there's https://github.com/pola-rs/polars/issues/2858, but its not specific to python or rust, but the functionality of read and write itself.

The read/ scan part was easy to add on the python side without bloating rust binaries, see here

Having said that I believe, the write will have to be written on the rust side and from there imported in to python side for max performance.

chitralverma avatar Dec 14 '22 16:12 chitralverma

@chitralverma Why this nice functionality of reading Delta and scan is added on the top layer (Python) and not in the lower layer of Rust with Python bindings?

andrei-ionescu avatar Dec 14 '22 16:12 andrei-ionescu

@chitralverma Why this nice functionality of reading Delta and scan is added on the top layer (Python) and not in the lower layer of Rust with Python bindings?

Because it is not needed to bloat our python binary with a deltareader.

You can open an issue request for a deltareader on the rust side that is behind a feature flag.

ritchie46 avatar Dec 14 '22 16:12 ritchie46

@stinodego I am done with the changes, you can now review this as per your convenience.

chitralverma avatar Dec 14 '22 17:12 chitralverma

I can confirm this fully resolves all my issues! Great work.

stinodego avatar Dec 14 '22 21:12 stinodego

I can confirm this fully resolves all my issues! Great work.

did you try the scan_delta for lazy reads as well? with the recent predicate push downs, I believe this should be quite nice.

chitralverma avatar Dec 15 '22 02:12 chitralverma

I can confirm this fully resolves all my issues! Great work.

did you try the scan_delta for lazy reads as well? with the recent predicate push downs, I believe this should be quite nice.

Actually it seems like scan_delta doesn't work yet!

Both abs path and relative path give the same error:

exceptions.ComputeError: PyErr { type: <class 'FileNotFoundError'>, value: FileNotFoundError(2, "Failed to open local file '/home/stijn/code/polars-delta/polars/py-polars/data/test_delta/home/stijn/code/polars-delta/polars/py-polars/data/test_delta/1-cda303ee-8a77-4d0b-94c1-e361b3ea75d2-0.parquet'. Detail: [errno 2] No such file or directory"), traceback: Some(<traceback object at 0x7fb5e00d0fc0>) }

Looks like it's misconstructing the path! It's in there twice.

And when using an Azure path, I get:

pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: az://bakplan/test_delta

stinodego avatar Dec 15 '22 08:12 stinodego

I can confirm this fully resolves all my issues! Great work.

did you try the scan_delta for lazy reads as well? with the recent predicate push downs, I believe this should be quite nice.

Actually it seems like scan_delta doesn't work yet!

Both abs path and relative path give the same error:

exceptions.ComputeError: PyErr { type: <class 'FileNotFoundError'>, value: FileNotFoundError(2, "Failed to open local file '/home/stijn/code/polars-delta/polars/py-polars/data/test_delta/home/stijn/code/polars-delta/polars/py-polars/data/test_delta/1-cda303ee-8a77-4d0b-94c1-e361b3ea75d2-0.parquet'. Detail: [errno 2] No such file or directory"), traceback: Some(<traceback object at 0x7fb5e00d0fc0>) }

Looks like it's misconstructing the path! It's in there twice.

And when using an Azure path, I get:

pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: az://bakplan/test_delta

Did you follow documentation with the example for azure scan_delta?

can you share the code you are using?

chitralverma avatar Dec 15 '22 09:12 chitralverma

Did you follow documentation with the example for azure scan_delta?

can you share the code you are using?

For the normal absolute/relative paths, my code is simply:

import polars as pl

abs_path = "/home/stijn/code/polars-delta/polars/py-polars/data/test_delta"
df = pl.scan_delta(abs_path).collect()
print(df)

rel_path = "data/test_delta"
df = pl.scan_delta(rel_path).collect()
print(df)

I cloned your repo, which resides in /home/stijn/code/polars-delta/polars, then I created this script in the py-polars directory, and a small test deltatable in py-polars/data/test_delta. There appears to be a simple bug in the parsing of the URI somewhere, as you can see the absolute path is in there twice (concatenated for some reason).

Then for Azure, I actually wasn't following the docs for Azure 🙈 but when I do, I get a new error. Will update in a moment, I have a few things to figure out.

stinodego avatar Dec 15 '22 09:12 stinodego

For the normal absolute/relative paths, my code is simply:

import polars as pl

abs_path = "/home/stijn/code/polars-delta/polars/py-polars/data/test_delta"
df = pl.scan_delta(abs_path).collect()
print(df)

rel_path = "data/test_delta"
df = pl.scan_delta(rel_path).collect()
print(df)

I cloned your repo, which resides in /home/stijn/code/polars-delta/polars, then I created this script in the py-polars directory, and a small test deltatable in py-polars/data/test_delta. There appears to be a simple bug in the parsing of the URI somewhere, as you can see the absolute path is in there twice (concatenated for some reason).

@stinodego I followed your exact steps and tried to run the script you mentioned and it worked on my end. even tried it in a separate venv and it works.

image

Can you check the branch that you are using for this ?

git clone  --branch fix-delta-issues --single-branch https://github.com/chitralverma/polars.git

Then for Azure, I actually wasn't following the docs for Azure 🙈 but when I do, I get a new error. Will update in a moment, I have a few things to figure out.

For Azure, I'm referring to these docs. Let me know if you we able to make it work?

chitralverma avatar Dec 17 '22 06:12 chitralverma

Maybe we should just merge this, since it already fixes all read_delta issues. I will take a closer look at the scan_delta when I have time; maybe I am doing something wrong and I don't want to hold up this improvement.

@ritchie46 Are you OK with that?

stinodego avatar Dec 17 '22 11:12 stinodego

Maybe we should just merge this, since it already fixes all read_delta issues. I will take a closer look at the scan_delta when I have time; maybe I am doing something wrong and I don't want to hold up this improvement.

@ritchie46 Are you OK with that?

Would also look forward to the import error being fixed.

dominikpeter avatar Dec 17 '22 11:12 dominikpeter

Yes, definitely! Thanks for the help on this @chitralverma and @stinodego 🙏

ritchie46 avatar Dec 17 '22 11:12 ritchie46

Thanks a lot!

chitralverma avatar Dec 17 '22 13:12 chitralverma

for some reason this PR was missed by the labeller and doesn't have any tags.

chitralverma avatar Dec 17 '22 13:12 chitralverma

for some reason this PR was missed by the labeller and doesn't have any tags.

Weird - I added them manually, so this PR will show up in the changelog 👍

stinodego avatar Dec 18 '22 07:12 stinodego