
Optionally defer Dask compute in dataprep.clean

Open amanderson opened this issue 3 years ago • 1 comment

Is your feature request related to a problem? Please describe. I'm happy to see that cleaning methods are implemented with Dask. I've noticed that most, if not all, cleaning methods include

    with ProgressBar(minimum=1, disable=not progress):
        df, stats = dask.compute(df, stats)

before returning a result. In practice, however, it'd be quite common to chain several cleaning methods and then manipulate the data further downstream. The current implementation prevents Dask from optimising the whole transformation pipeline, because Dask can't optimise across separate compute calls.

Describe the solution you'd like When passing a Dask dataframe to a clean method, it would be nice to optionally defer compute and get a Dask dataframe back. I understand this option would disable the progress bar and report, but these features are really only useful in an interactive notebook session.

Describe alternatives you've considered I've not come across alternatives.

amanderson, Feb 18 '22 10:02

Hi @amanderson. Thanks for your advice! We are currently considering optimizing the progress bar and removing the report part, so your advice will be very useful to us. I agree with your point that "it'd be quite common to use multiple cleaning methods in tandem with further manipulation of the data downstream". It would also be valuable for us to consider running multiple clean functions in parallel.

qidanrui, Feb 18 '22 20:02