polars
polars copied to clipboard
fix(python): fix delta issues
Aims to fixes the following issues,
- https://github.com/pola-rs/polars/issues/5790
- https://github.com/pola-rs/polars/issues/5785
Checklist
- [x] Fix imports issue
- [x] Resolve paths before passing to delta for relative path support
- [x]
read_deltashould rely on delta side implementation - [x]
scan_deltashould work as expected without relying on delta implementation. - [x] create internal test matrix of [local, s3, azure, gcs] X [read, scan] X [absolute, relative paths, absolute with external fs, relative with external fs]
- [x] additional cases for relative paths
- [x] perform tests in a separate env
- [x] added example to read and scan from azure
- [x] added example to read and scan from gcs
Notes
- Till the time https://github.com/delta-io/delta-rs/issues/1015 is not resolved, I have added a simple delta specific path resolution in
scan_delta. Later we can remove this in favor of the implementation provided in delta-rs itself. read_deltanow relies on the implementation provided in delta-rs itself.- For GCS, Azure and S3 there is no relative path support, full URI must be provided.
Testing
- For LocalFS, GCS, Azure and S3, I've created an internal test matrix as described in the check list which contains over 35 additional tests other than the unit tests. I'll check how to mock different
pyarrow.fssomehow and then add these to the unit tests later. - This will matrix was also executed in a separate venv, after locally installing polars with fixes from a local full build (
maturin build) as the import errors were not caught in the unit tests before.
Signed-off-by: chitralverma [email protected]
CC @dominikpeter @stinodego @ritchie46
I don't have any delta access. Could you give it a try @stinodego?
I'll check the implementation when I get home tonight :) Thanks for your work, @chitralverma!
I'll check the implementation when I get home tonight :) Thanks for your work, @chitralverma!
Cool, thanks!
In regards to Delta Lake write support, will that come later? Is there a plan for that too?
In regards to Delta Lake write support, will that come later? Is there a plan for that too?
We could. Delta-rs accepts pyarrow recordbatches so we can feed it the arrow table.
@ritchie46: I'm interested in Rust approach. No Python for my use case. Will this come later? Any ticket or proposal for it?
@ritchie46: I'm interested in Rust approach. No Python for my use case. Will this come later? Any ticket or proposal for it?
@andrei-ionescu there's https://github.com/pola-rs/polars/issues/2858, but its not specific to python or rust, but the functionality of read and write itself.
The read/ scan part was easy to add on the python side without bloating rust binaries, see here
Having said that I believe, the write will have to be written on the rust side and from there imported in to python side for max performance.
@chitralverma Why this nice functionality of reading Delta and scan is added on the top layer (Python) and not in the lower layer of Rust with Python bindings?
@chitralverma Why this nice functionality of reading Delta and scan is added on the top layer (Python) and not in the lower layer of Rust with Python bindings?
Because it is not needed to bloat our python binary with a deltareader.
You can open an issue request for a deltareader on the rust side that is behind a feature flag.
@stinodego I am done with the changes, you can now review this as per your convenience.
I can confirm this fully resolves all my issues! Great work.
I can confirm this fully resolves all my issues! Great work.
did you try the scan_delta for lazy reads as well? with the recent predicate push downs, I believe this should be quite nice.
I can confirm this fully resolves all my issues! Great work.
did you try the scan_delta for lazy reads as well? with the recent predicate push downs, I believe this should be quite nice.
Actually it seems like scan_delta doesn't work yet!
Both abs path and relative path give the same error:
exceptions.ComputeError: PyErr { type: <class 'FileNotFoundError'>, value: FileNotFoundError(2, "Failed to open local file '/home/stijn/code/polars-delta/polars/py-polars/data/test_delta/home/stijn/code/polars-delta/polars/py-polars/data/test_delta/1-cda303ee-8a77-4d0b-94c1-e361b3ea75d2-0.parquet'. Detail: [errno 2] No such file or directory"), traceback: Some(<traceback object at 0x7fb5e00d0fc0>) }
Looks like it's misconstructing the path! It's in there twice.
And when using an Azure path, I get:
pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: az://bakplan/test_delta
I can confirm this fully resolves all my issues! Great work.
did you try the scan_delta for lazy reads as well? with the recent predicate push downs, I believe this should be quite nice.
Actually it seems like
scan_deltadoesn't work yet!Both abs path and relative path give the same error:
exceptions.ComputeError: PyErr { type: <class 'FileNotFoundError'>, value: FileNotFoundError(2, "Failed to open local file '/home/stijn/code/polars-delta/polars/py-polars/data/test_delta/home/stijn/code/polars-delta/polars/py-polars/data/test_delta/1-cda303ee-8a77-4d0b-94c1-e361b3ea75d2-0.parquet'. Detail: [errno 2] No such file or directory"), traceback: Some(<traceback object at 0x7fb5e00d0fc0>) }Looks like it's misconstructing the path! It's in there twice.
And when using an Azure path, I get:
pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: az://bakplan/test_delta
Did you follow documentation with the example for azure scan_delta?
can you share the code you are using?
Did you follow documentation with the example for azure scan_delta?
can you share the code you are using?
For the normal absolute/relative paths, my code is simply:
import polars as pl
abs_path = "/home/stijn/code/polars-delta/polars/py-polars/data/test_delta"
df = pl.scan_delta(abs_path).collect()
print(df)
rel_path = "data/test_delta"
df = pl.scan_delta(rel_path).collect()
print(df)
I cloned your repo, which resides in /home/stijn/code/polars-delta/polars, then I created this script in the py-polars directory, and a small test deltatable in py-polars/data/test_delta. There appears to be a simple bug in the parsing of the URI somewhere, as you can see the absolute path is in there twice (concatenated for some reason).
Then for Azure, I actually wasn't following the docs for Azure 🙈 but when I do, I get a new error. Will update in a moment, I have a few things to figure out.
For the normal absolute/relative paths, my code is simply:
import polars as pl abs_path = "/home/stijn/code/polars-delta/polars/py-polars/data/test_delta" df = pl.scan_delta(abs_path).collect() print(df) rel_path = "data/test_delta" df = pl.scan_delta(rel_path).collect() print(df)I cloned your repo, which resides in
/home/stijn/code/polars-delta/polars, then I created this script in thepy-polarsdirectory, and a small test deltatable inpy-polars/data/test_delta. There appears to be a simple bug in the parsing of theURIsomewhere, as you can see the absolute path is in there twice (concatenated for some reason).
@stinodego I followed your exact steps and tried to run the script you mentioned and it worked on my end. even tried it in a separate venv and it works.
Can you check the branch that you are using for this ?
git clone --branch fix-delta-issues --single-branch https://github.com/chitralverma/polars.git
Then for Azure, I actually wasn't following the docs for Azure 🙈 but when I do, I get a new error. Will update in a moment, I have a few things to figure out.
For Azure, I'm referring to these docs. Let me know if you we able to make it work?
Maybe we should just merge this, since it already fixes all read_delta issues. I will take a closer look at the scan_delta when I have time; maybe I am doing something wrong and I don't want to hold up this improvement.
@ritchie46 Are you OK with that?
Maybe we should just merge this, since it already fixes all
read_deltaissues. I will take a closer look at thescan_deltawhen I have time; maybe I am doing something wrong and I don't want to hold up this improvement.@ritchie46 Are you OK with that?
Would also look forward to the import error being fixed.
Yes, definitely! Thanks for the help on this @chitralverma and @stinodego 🙏
Thanks a lot!
for some reason this PR was missed by the labeller and doesn't have any tags.
for some reason this PR was missed by the labeller and doesn't have any tags.
Weird - I added them manually, so this PR will show up in the changelog 👍