Support cuDF dataframes

Open kmax12 opened this issue 4 years ago • 7 comments

Related to supporting Dask and Koalas DataFrames, we should support cuDF DataFrames to enable us to take advantage of GPU acceleration.

kmax12 avatar Apr 15 '20 16:04 kmax12

I'm one of the cuDF maintainers and am happy to answer any questions you have about cuDF.

I'd also add that cuDF integrates with Dask, so if you're supporting Dask, you can use cuDF under Dask as well.

kkraus14 avatar Apr 15 '20 17:04 kkraus14

> I'm one of the cuDF maintainers where I'm happy to answer any questions that you have about cuDF.
>
> I'd also add that cuDF integrates with Dask as well, so if you're supporting Dask there's an ability to use cuDF under Dask as well.

The current version already has Dask support. Could you elaborate on how to use cuDF under Dask?

imadcat avatar Jul 25 '20 21:07 imadcat

> The current version has Dask support already, could you elaborate how to use cuDF under Dask?

Essentially, instead of using pandas DataFrames for the partitions in a dask.dataframe.DataFrame object, Dask can use cuDF DataFrames transparently. We have a small shim package called dask-cudf (https://github.com/rapidsai/cudf/tree/branch-0.15/python/dask_cudf) for handling things like IO, where Dask would otherwise need a module dispatch, but otherwise you use it exactly as you would use Dask DataFrames.

To go from a Dask DataFrame backed by pandas to a Dask DataFrame backed by cuDF, you could do the following (intentionally not using dask-cudf):

import dask
import dask.dataframe as dd
import cudf

df = dask.datasets.timeseries()
gdf = df.map_partitions(cudf.from_pandas)

# gdf is now a `dask.dataframe.DataFrame` backed by cuDF DataFrame objects and you can continue to work with it as you normally would a Dask DataFrame object.
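
A minimal sketch, added for illustration rather than taken from the thread, of how one might confirm which library backs the partitions after the conversion above and then convert back to pandas. It assumes a CUDA-capable GPU with `cudf` and `dask` installed; the `dask-cudf` shim probably needs to be installed too, since Dask may load it lazily for dispatch even when it is not imported explicitly. `backing_library` and `main` are hypothetical helper names introduced here.

```python
def backing_library(partition):
    """Return the top-level package that defines the partition's type."""
    return type(partition).__module__.split(".")[0]


def main():
    # Imports are local so the helper above stays usable without a GPU.
    import cudf
    import dask

    df = dask.datasets.timeseries()            # pandas-backed Dask DataFrame
    gdf = df.map_partitions(cudf.from_pandas)  # cuDF-backed Dask DataFrame

    # Materialize one partition of each and report where its type comes from.
    print(backing_library(df.partitions[0].compute()))
    print(backing_library(gdf.partitions[0].compute()))

    # Ordinary Dask operations are then applied to the cuDF partitions.
    print(gdf.x.mean().compute())

    # Going back to pandas mirrors the conversion, using cuDF's to_pandas().
    pdf = gdf.map_partitions(lambda part: part.to_pandas())
    print(backing_library(pdf.partitions[0].compute()))


if __name__ == "__main__":
    try:
        main()
    except ImportError as exc:
        # Assumption: on a machine without cudf/dask the demo is skipped.
        print(f"Skipping GPU demo: {exc}")
```

Checking one materialized partition is a cheap way to see which backend is in use without computing the whole collection.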

kkraus14 avatar Jul 27 '20 01:07 kkraus14

I just wanted to share an update on this.

With the current work-in-progress minimal changes to support cuDF, we are seeing decent performance boosts on the loan repayment benchmark shared here (without the cutoff-times bit).

These are tentative numbers obtained on the workflow.

Featuretools (Deep Feature Synthesis) speedup: cuDF DFS = 4.65 s vs. pandas = 157.49 s (33.86x)
End-to-end overall (reading + DFS + writing): cuDF = 22.2 s vs. pandas = 3 min 35 s (9.68x)
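
As a quick sanity check on the figures above, the speedups can be recomputed from the reported timings. This is a sketch of the arithmetic only; the function name `speedup` is introduced here for illustration, and the inputs are the rounded values quoted in this comment, so the last digit can differ slightly from the reported ratio (157.49 / 4.65 comes out near 33.87 rather than 33.86).

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


# Timings as quoted in the comment (already rounded by the author).
dfs_pandas_s, dfs_cudf_s = 157.49, 4.65
e2e_pandas_s, e2e_cudf_s = 3 * 60 + 35, 22.2  # 3 min 35 s = 215 s

dfs_speedup = speedup(dfs_pandas_s, dfs_cudf_s)
e2e_speedup = speedup(e2e_pandas_s, e2e_cudf_s)

print(f"DFS speedup: {dfs_speedup:.2f}x")
print(f"End-to-end speedup: {e2e_speedup:.2f}x")
```

The end-to-end ratio is much smaller than the DFS-only ratio because reading and writing are included in both totals.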

I am still working towards adding cutoff_times and a bunch of code clean-up, completeness, and testing work, so this is very much a work in progress, but I wanted to share an initial update with you guys.

We’re really excited about the opportunity to contribute to Featuretools and hope we can help speed it up with GPUs.

VibhuJawa avatar Nov 20 '20 17:11 VibhuJawa

@VibhuJawa Thanks for sharing this - very interesting and promising results!

thehomebrewnerd avatar Nov 20 '20 21:11 thehomebrewnerd

We are pausing on this until GitHub Actions supports GPU instances.

gsheni avatar Dec 09 '22 20:12 gsheni

> We are pausing on this until GitHub actions supports GPU instances

One option for using GPU instances with GitHub Actions is self-hosted GPU runners. The CI for cuDF and other libraries has been migrated to GPU-enabled GitHub Actions, in case that helps.

https://docs.rapids.ai/resources/github-actions/#gpu-label-combinations

VibhuJawa avatar Jan 04 '23 22:01 VibhuJawa

Not planned at this time.

thehomebrewnerd avatar May 10 '24 15:05 thehomebrewnerd