Signed-off-by: Max Zwiessele [email protected]

What this PR does / why we need it: When loading data in parquet dataset format it can happen that the dataset only has one partition in the event_timestamp column. If that is the case, dask will fail to process the dataset, erroring with a ZeroDividionError similar to

https://github.com/feast-dev/feast/blob/769c31869eb8d9bb693f8a2876cc68b8cdd16521/sdk/python/feast/infra/offline_stores/file.py#L327

This PR adds a try-catch block to gracefully circumvent the error and process the data in only one partition.

Which issue(s) this PR fixes:

N/A

Sep 20 '22 16:09 mzwiessele

/assign @woop

Sep 20 '22 16:09 mzwiessele

Could you add a test for this case?

Yes, I need help though. We need a test folder with a parquet dataset in the test S3 bucket. That parquet dataset must have only one partition in the event_timestamp column. The reason is simply that this code correction fixes exactly that edge case.

Related to #3235

Sep 21 '22 08:09 mzwiessele

Before merging this: Is it possible to update the Dask version feast relies on? Or in other words, why is the version restricted like this?

"dask>=2021.*,<2022.02.0"

Sep 21 '22 13:09 mzwiessele

@achals I'll need help with the tests. Happy to do the ground work. Please point me to the right testing suite (in the unit tests) for loading a local parquet file from a FileSource.

Sep 23 '22 09:09 mzwiessele

Codecov Report

Base: 67.50% // Head: 58.12% // Decreases project coverage by -9.38% :warning:

Coverage data is based on head (393bf5c) compared to base (b48d36b). Patch coverage: 45.45% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3236      +/-   ##
==========================================
- Coverage   67.50%   58.12%   -9.39%     
==========================================
  Files         179      213      +34     
  Lines       16371    17832    +1461     
==========================================
- Hits        11051    10364     -687     
- Misses       5320     7468    +2148

Flag	Coverage Δ
integrationtests	`?`
unittests	`58.12% <45.45%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
sdk/python/tests/utils/e2e_test_validation.py	`55.44% <ø> (-33.67%)`	:arrow_down:
...ation/feature_repos/universal/data_sources/file.py	`45.63% <40.00%> (-24.27%)`	:arrow_down:
sdk/python/feast/infra/offline_stores/file.py	`72.90% <50.00%> (-22.57%)`	:arrow_down:
...sts/integration/registration/test_universal_cli.py	`20.20% <0.00%> (-79.80%)`	:arrow_down:
...ts/integration/offline_store/test_offline_write.py	`26.08% <0.00%> (-73.92%)`	:arrow_down:
...fline_store/test_universal_historical_retrieval.py	`28.75% <0.00%> (-71.25%)`	:arrow_down:
...ests/integration/e2e/test_python_feature_server.py	`29.50% <0.00%> (-70.50%)`	:arrow_down:
...dk/python/tests/integration/e2e/test_validation.py	`27.55% <0.00%> (-69.30%)`	:arrow_down:
...s/integration/registration/test_universal_types.py	`32.25% <0.00%> (-67.75%)`	:arrow_down:
sdk/python/feast/infra/online_stores/redis.py	`28.39% <0.00%> (-66.67%)`	:arrow_down:
... and 169 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

Sep 23 '22 16:09 codecov-commenter

@achals I've added the test, please take a look and tell me if this does the trick!

Sep 23 '22 16:09 mzwiessele

hey @mzwiessele thanks for the PR! a few responses to your comments above:

Yes, I need help though. We need a test folder with a parquet dataset in the test S3 bucket. That parquet dataset must have only one partition in the event_timestamp column. The reason is simply that this code correction fixes exactly that edge case.

unless I'm misunderstanding, there's no need for the parquet dataset to be in an S3 bucket, right? the only thing that matters is that the Parquet dataset doesn't have npartitions set

if that's correct, I think the best way to write a test would be to do it locally - you can check out our unit tests in sdk/python/feast/infra/tests/unit, all of which run locally. I would try to imitate test_e2e_local in test_e2e_local.py - it's a good example of setting up a dataset locally, constructing a feature repo + FeatureStore object, and then running some tests, in your case I think you would just want to check that get_historical_features works on a Parquet dataset with npartitions unset

Before merging this: Is it possible to update the Dask version feast relies on? Or in other words, why is the version restricted like this?

I forget exactly why we restricted that version; I'll go back and check, and if there's no strong reason I'm happy to bump up the upper bound restriction (although I think that can happen in a follow up PR)

Sep 24 '22 01:09 felixwang9817

@mzwiessele also left some additional comments for you on #3235!

Sep 24 '22 01:09 felixwang9817

@felixwang9817 I've added the dataset source to the test suite as suggested at #3235 to test loading of parquet datasets.

Do you know if any of the created datasets feature only one partition in the event_timestamp column?

Oct 05 '22 09:10 mzwiessele

@felixwang9817 @achals kind ping :)

Oct 11 '22 16:10 mzwiessele

@felixwang9817 @achals kind ping :)

The DCO check still seems to be failing! Good to merge as soon as that's fixed. https://github.com/feast-dev/feast/pull/3236/checks?check_run_id=8744237672

Oct 11 '22 16:10 achals

@achals fixed :)

Oct 12 '22 15:10 mzwiessele

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: achals, mzwiessele

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [achals]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Oct 12 '22 16:10 feast-ci-bot

fix: Dask zero division error if parquet dataset has only one partition

Codecov Report