dataprep
Optionally defer Dask compute in dataprep.clean
Is your feature request related to a problem? Please describe. I'm happy to see that cleaning methods are implemented with Dask. I've noticed that most, if not all, cleaning methods include
with ProgressBar(minimum=1, disable=not progress):
df, stats = dask.compute(df, stats)
before returning a result. In practice, however, it is quite common to chain several cleaning methods and then manipulate the data further downstream. The current implementation prevents Dask from optimising the whole transformation pipeline, because each call to dask.compute materialises its result and Dask cannot optimise across separate compute calls.
Describe the solution you'd like When passing a Dask dataframe to a clean method, it would be nice to optionally defer compute and get a Dask dataframe back. I understand this option would disable the progress bar and report, but these features are really only useful in an interactive notebook session.
Describe alternatives you've considered I've not come across alternatives.
Hi @amanderson, thanks for your advice! We are currently considering optimizing the progress bar and removing the report, so your suggestion is very useful to us. I agree that "it'd be quite common to use multiple cleaning methods in tandem with further manipulation of the data downstream". It would also be valuable for us to consider running multiple clean functions in parallel.