WIP: Dataframe API ML preprocessing notebook
- This PR adds a notebook that demonstrates how to use the Beam DataFrame API as a preprocessing tool for ML training
WIP:
- [ ] Find a way to one-hot encode categorical variables: related to ticket #22268
- [x] Fix bug that returns `ValueError: No producer for ref_PCollection_PCollection_265` when attempting to merge two deferred datasets: related to ticket #22267
- [ ] Have only one installation script for Beam with the latest implemented functions in the Dataframe API instead of installing from source
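
For context on the one-hot-encoding item above, here is a minimal plain-Python sketch (not Beam or pandas code) of encoding against a category list that is declared up front. The helper name `one_hot_encode` and the sample data are hypothetical, and the idea that fixing the categories in advance keeps the output columns independent of the data is an assumption for illustration, not something settled in ticket #22268.

```python
from typing import Dict, List, Sequence


def one_hot_encode(
    values: Sequence[str], categories: Sequence[str]
) -> List[Dict[str, int]]:
    """Encode each value as a row of 0/1 indicators, one column per category.

    Hypothetical helper for illustration only; it is not part of Beam or pandas.
    Values that are not in `categories` produce an all-zero row.
    """
    return [{cat: int(value == cat) for cat in categories} for value in values]


# The categories are declared up front (an assumption of this sketch),
# so the set of output columns does not depend on the data being encoded.
categories = ["red", "green", "blue"]
rows = one_hot_encode(["green", "blue", "purple"], categories)
```

Each row has exactly one column per declared category, so the output schema is known before any value is read.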
@rezarokni @TheNeuralBit
I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430
CC: @KevinGG
Commented in https://github.com/apache/beam/issues/21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219
> Have only one installation script for Beam with the latest implemented functions in the Dataframe API instead of installing from source
This is just blocked on the 2.41.0 release, right?
> Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219
Note if we have to do that to unblock this change, it will be blocked until 2.42.0 is out.
> This is just blocked on the 2.41.0 release, right?
I am aware of the particular release date of 2.41.0. I suppose it will depend on whether we manage to resolve all the friction points before that release.
> Note if we have to do that to unblock this change, it will be blocked until 2.42.0 is out.
Would it be easier to execute the work-around with ‘loc.setitem’? https://github.com/apache/beam/issues/22267
> Would it be easier to execute the work-around with ‘loc.setitem’? #22267
The work-around is applied to a specific typed composite transform. So the difficulty is the same.
> Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219
Is there any update on this or a potential workaround for merging Deferred dataframes?
> Is there any update on this or a potential workaround for merging Deferred dataframes?
Just sent out https://github.com/apache/beam/pull/23069; this should mitigate the unintended pruning issues.
@PhilippeMoussalli Could you please take a look at https://github.com/apache/beam/pull/23069?
> @PhilippeMoussalli Could you please take a look at #23069?
@KevinGG I just tested it out and it checks out! Thanks again for taking this up.
Implemented latest feedback @TheNeuralBit @davidcavazos :)
Run Website PreCommit