WIP: Dataframe API ML preprocessing notebook

Open PhilippeMoussalli opened this issue 2 years ago • 12 comments

  • This PR adds a notebook that demonstrates using the Beam DataFrame API as a preprocessing tool for ML training

WIP:

  • [ ] Find a way to one-hot encode categorical variables (related to ticket #22268)
  • [x] Fix the bug that raises ValueError: No producer for ref_PCollection_PCollection_265 when attempting to merge two deferred dataframes (related to ticket #22267)
  • [ ] Have only one installation script for Beam with the latest implemented functions in the Dataframe API instead of installing from source
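
For context on the first open item: one-hot encoding is awkward in a deferred setting because the output columns normally depend on which categories appear in the data, and that is not known until the pipeline runs. Below is a minimal plain-Python sketch of the usual way around this, assuming the category list can be declared up front. The function name `one_hot_encode`, the `is_` column prefix, and the sample values are hypothetical; none of them come from Beam or from this PR.

```python
from typing import Dict, List, Sequence


def one_hot_encode(
    values: Sequence[str], categories: Sequence[str]
) -> List[Dict[str, int]]:
    """One-hot encode `values` against a fixed, pre-declared category list.

    Because `categories` is known before any data is read, the output
    columns are also known ahead of time (hypothetical helper, not a Beam API).
    """
    columns = [f"is_{category}" for category in categories]
    known = set(categories)
    rows = []
    for value in values:
        if value not in known:
            # Fail loudly instead of silently adding a new column.
            raise ValueError(f"unexpected category: {value!r}")
        rows.append(
            {col: int(value == cat) for col, cat in zip(columns, categories)}
        )
    return rows


# Assumed example data: three rows of a single categorical feature.
CATEGORIES = ["red", "green", "blue"]
encoded = one_hot_encode(["green", "red", "green"], CATEGORIES)
```

Every row has the same three columns even though "blue" never appears in the input, which is the property a deferred API needs in order to know the output schema before execution.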

PhilippeMoussalli • Aug 04 '22 13:08

@rezarokni @TheNeuralBit

PhilippeMoussalli • Aug 04 '22 13:08

I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430

CC: @KevinGG

TheNeuralBit • Aug 04 '22 16:08

I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430

CC: @KevinGG

Commented in https://github.com/apache/beam/issues/21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219

kevingg • Aug 05 '22 17:08

Have only one installation script for Beam with the latest implemented functions in the Dataframe API instead of installing from source

This is just blocked on the 2.41.0 release, right?

TheNeuralBit • Aug 09 '22 23:08

I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430 CC: @KevinGG

Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219

Note if we have to do that to unblock this change, it will be blocked until 2.42.0 is out.

TheNeuralBit • Aug 09 '22 23:08

Have only one installation script for Beam with the latest implemented functions in the Dataframe API instead of installing from source

This is just blocked on the 2.41.0 release, right?

I am aware of the particular release date of 2.41.0. I suppose it will depend on whether we manage to resolve all the friction points before that release.

PhilippeMoussalli • Aug 17 '22 16:08

I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430 CC: @KevinGG

Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219

Note if we have to do that to unblock this change, it will be blocked until 2.42.0 is out.

Would it be easier to execute the work-around with ‘loc.setitem’? https://github.com/apache/beam/issues/22267

PhilippeMoussalli • Aug 17 '22 16:08

I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430 CC: @KevinGG

Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219

Note if we have to do that to unblock this change, it will be blocked until 2.42.0 is out.

Would it be easier to execute the work-around with ‘loc.setitem’? #22267

The work-around is applied to a specific typed composite transform, so the difficulty is the same.

kevingg • Aug 17 '22 16:08

I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430 CC: @KevinGG

Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219

Is there any update on this or a potential workaround for merging Deferred dataframes?

PhilippeMoussalli • Sep 07 '22 10:09

I think the "No producer" error is a bug with PCollection pruning in interactive beam: #21430 CC: @KevinGG

Commented in #21430, we can disable pruning for dataframe like what we did for TestStream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_fragment.py#L219

Is there any update on this or a potential workaround for merging Deferred dataframes?

Just sent out https://github.com/apache/beam/pull/23069; this should mitigate the unintended pruning issues.

kevingg • Sep 07 '22 19:09

@PhilippeMoussalli Could you please take a look at https://github.com/apache/beam/pull/23069?

kevingg • Sep 11 '22 17:09

@PhilippeMoussalli Could you please take a look at #23069?

@KevinGG I just tested it out and it checks out! Thanks again for taking this up.

PhilippeMoussalli • Sep 14 '22 09:09

Implemented the latest feedback @TheNeuralBit @davidcavazos :)

PhilippeMoussalli • Nov 03 '22 12:11

Run Website PreCommit

damccorm • Nov 03 '22 14:11