featuretools
Support cuDF dataframes
Related to supporting Dask and Koalas DataFrames, we should support cuDF DataFrames to enable us to take advantage of GPU acceleration.
I'm one of the cuDF maintainers, and I'm happy to answer any questions you have about cuDF.
I'd also add that cuDF integrates with Dask, so if you're supporting Dask you can use cuDF under Dask as well.
The current version already has Dask support. Could you elaborate on how to use cuDF under Dask?
Essentially, instead of using Pandas dataframes for the partitions in a dask.dataframe.DataFrame object, it can use cuDF dataframes transparently. We have a small shim package called dask-cudf (https://github.com/rapidsai/cudf/tree/branch-0.15/python/dask_cudf) for handling things like IO where Dask would otherwise need a module dispatch, but otherwise you use it exactly the same way you'd use Dask dataframes.
To go from a Dask dataframe backed by Pandas to a Dask dataframe backed by cuDF you could do (intentionally not using dask-cudf):
import dask
import dask.dataframe as dd
import cudf
df = dask.datasets.timeseries()
gdf = df.map_partitions(cudf.from_pandas)
# gdf is now a `dask.dataframe.DataFrame` backed by cuDF DataFrame objects and you can continue to work with it as you normally would a Dask DataFrame object.
I just wanted to share an update on this.
With the current work-in-progress minimal changes to support cudf, we are seeing decent performance boosts on the loan repayment benchmark shared here (run without the cutoff-times bit).
These are tentative numbers obtained on that workflow:
Featuretools (Deep Feature Synthesis) speedup: cudf-dfs = 4.65 s vs pandas = 157.49 s (33.86x)
E2E overall (reading + DFS + writing): 22.2 s vs 3 min 35 s (9.68x)
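As a quick sanity check, the speedup factors can be recomputed from the timings quoted above. This is only a sketch: the variable and function names are illustrative and not taken from the benchmark code, and it assumes that the pandas DFS time is in seconds and that the slower E2E time is the pandas run.

```python
# Timings as quoted in the comment above, converted to seconds.
# Assumption: 157.49 is in seconds, and 3 min 35 s is the pandas E2E run.
pandas_dfs_s = 157.49
cudf_dfs_s = 4.65
pandas_e2e_s = 3 * 60 + 35
cudf_e2e_s = 22.2


def speedup(baseline_s, accelerated_s):
    """Return how many times faster the accelerated run is than the baseline."""
    return baseline_s / accelerated_s


dfs_speedup = speedup(pandas_dfs_s, cudf_dfs_s)
e2e_speedup = speedup(pandas_e2e_s, cudf_e2e_s)

print(f"DFS speedup: {dfs_speedup:.2f}x")
print(f"E2E speedup: {e2e_speedup:.2f}x")
```

The recomputed ratios agree with the reported factors to within rounding of the quoted timings.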
I am still working towards adding cutoff_times and a bunch of code clean-up, completeness, and testing work, so this is very much a work in progress, but I wanted to share an initial update with you guys.
We’re really excited about the opportunity to help contribute to featuretools and hope we can help speed up Featuretools with GPUs.
@VibhuJawa Thanks for sharing this - very interesting and promising results!
We are pausing on this until GitHub Actions supports GPU instances.
One option for using GPU instances with GitHub Actions is to use self-hosted GPU runners. The cuDF CI and other libraries have been migrated to GPU-enabled GitHub Actions, in case that helps:
https://docs.rapids.ai/resources/github-actions/#gpu-label-combinations
Not planned at this time.