feast
feast copied to clipboard
fix: Dask zero division error if parquet dataset has only one partition
Signed-off-by: Max Zwiessele [email protected]
What this PR does / why we need it:
When loading data in parquet dataset format it can happen that the dataset only has one partition in the event_timestamp
column. If that is the case, dask will fail to process the dataset, erroring with a ZeroDividionError
similar to
https://github.com/feast-dev/feast/blob/769c31869eb8d9bb693f8a2876cc68b8cdd16521/sdk/python/feast/infra/offline_stores/file.py#L327
This PR adds a try-catch
block to gracefully circumvent the error and process the data in only one partition.
Which issue(s) this PR fixes:
N/A
/assign @woop
Could you add a test for this case?
Yes, I need help though. We need a test folder with a parquet dataset in the test S3 bucket. That parquet dataset must have only one partition in the event_timestamp
column. The reason is simply that this code correction fixes exactly that edge case.
Related to #3235
Before merging this: Is it possible to update the Dask version feast relies on? Or in other words, why is the version restricted like this?
"dask>=2021.*,<2022.02.0"
@achals I'll need help with the tests. Happy to do the ground work. Please point me to the right testing suite (in the unit tests) for loading a local parquet file from a FileSource.
Codecov Report
Base: 67.50% // Head: 58.12% // Decreases project coverage by -9.38%
:warning:
Coverage data is based on head (
393bf5c
) compared to base (b48d36b
). Patch coverage: 45.45% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## master #3236 +/- ##
==========================================
- Coverage 67.50% 58.12% -9.39%
==========================================
Files 179 213 +34
Lines 16371 17832 +1461
==========================================
- Hits 11051 10364 -687
- Misses 5320 7468 +2148
Flag | Coverage Δ | |
---|---|---|
integrationtests | ? |
|
unittests | 58.12% <45.45%> (?) |
Flags with carried forward coverage won't be shown. Click here to find out more.
Impacted Files | Coverage Δ | |
---|---|---|
sdk/python/tests/utils/e2e_test_validation.py | 55.44% <ø> (-33.67%) |
:arrow_down: |
...ation/feature_repos/universal/data_sources/file.py | 45.63% <40.00%> (-24.27%) |
:arrow_down: |
sdk/python/feast/infra/offline_stores/file.py | 72.90% <50.00%> (-22.57%) |
:arrow_down: |
...sts/integration/registration/test_universal_cli.py | 20.20% <0.00%> (-79.80%) |
:arrow_down: |
...ts/integration/offline_store/test_offline_write.py | 26.08% <0.00%> (-73.92%) |
:arrow_down: |
...fline_store/test_universal_historical_retrieval.py | 28.75% <0.00%> (-71.25%) |
:arrow_down: |
...ests/integration/e2e/test_python_feature_server.py | 29.50% <0.00%> (-70.50%) |
:arrow_down: |
...dk/python/tests/integration/e2e/test_validation.py | 27.55% <0.00%> (-69.30%) |
:arrow_down: |
...s/integration/registration/test_universal_types.py | 32.25% <0.00%> (-67.75%) |
:arrow_down: |
sdk/python/feast/infra/online_stores/redis.py | 28.39% <0.00%> (-66.67%) |
:arrow_down: |
... and 169 more |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
@achals I've added the test, please take a look and tell me if this does the trick!
hey @mzwiessele thanks for the PR! a few responses to your comments above:
Yes, I need help though. We need a test folder with a parquet dataset in the test S3 bucket. That parquet dataset must have only one partition in the event_timestamp column. The reason is simply that this code correction fixes exactly that edge case.
unless I'm misunderstanding, there's no need for the parquet dataset to be in an S3 bucket, right? the only thing that matters is that the Parquet dataset doesn't have npartitions
set
if that's correct, I think the best way to write a test would be to do it locally - you can check out our unit tests in sdk/python/feast/infra/tests/unit
, all of which run locally. I would try to imitate test_e2e_local
in test_e2e_local.py
- it's a good example of setting up a dataset locally, constructing a feature repo + FeatureStore
object, and then running some tests, in your case I think you would just want to check that get_historical_features
works on a Parquet dataset with npartitions
unset
Before merging this: Is it possible to update the Dask version feast relies on? Or in other words, why is the version restricted like this?
I forget exactly why we restricted that version; I'll go back and check, and if there's no strong reason I'm happy to bump up the upper bound restriction (although I think that can happen in a follow up PR)
@mzwiessele also left some additional comments for you on #3235!
@felixwang9817 I've added the dataset source to the test suite as suggested at #3235 to test loading of parquet datasets.
Do you know if any of the created datasets feature only one partition in the event_timestamp column?
@felixwang9817 @achals kind ping :)
@felixwang9817 @achals kind ping :)
The DCO check still seems to be failing! Good to merge as soon as that's fixed. https://github.com/feast-dev/feast/pull/3236/checks?check_run_id=8744237672
@achals fixed :)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: achals, mzwiessele
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [achals]
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment