
[FEA] Add TimeSeriesSplit

Open: ZeroCool2u opened this issue 1 year ago • 1 comment

Is your feature request related to a problem? Please describe.
I would like to use cuML for time series problems, especially ones that require time-series-aware train/test splits. To do this I currently need the sklearn TimeSeriesSplit object.

Describe the solution you'd like
I would like a cuML equivalent of sklearn's TimeSeriesSplit class that can be used directly as part of a cuML Pipeline object and with the cross_val_score method.

Describe alternatives you've considered
I could reimplement this myself from scratch, but that would be error-prone and generally risky, since incorrect time series splitting is a common source of data leakage in ML problems.
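For reference, the from-scratch alternative could look roughly like the sketch below. It is a minimal pure-Python illustration of expanding-window splitting, modeled on the default behavior of sklearn's TimeSeriesSplit (n_splits, test_size, gap); it is not cuML API. The function name expanding_window_splits is hypothetical, and max_train_size is deliberately left out.

```python
def expanding_window_splits(n_samples, n_splits=5, test_size=None, gap=0):
    """Yield (train_indices, test_indices) pairs as range objects.

    Hypothetical helper, not part of cuML or sklearn. Each training
    window starts at index 0 and ends before its test window, so a
    model never sees observations from the future.
    """
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    if test_size is None:
        # Assumed default: split the data into n_splits + 1 equal folds.
        test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    if test_size < 1 or first_test_start - gap <= 0:
        raise ValueError("too few samples for the requested splits")

    for test_start in range(first_test_start, n_samples, test_size):
        # Drop `gap` observations between the train and test windows.
        train_end = test_start - gap
        yield range(train_end), range(test_start, test_start + test_size)


if __name__ == "__main__":
    for train_idx, test_idx in expanding_window_splits(6, n_splits=3):
        print(list(train_idx), list(test_idx))
```

Yielding range objects keeps memory use low when there are many splits over a large dataset; they can be converted to index arrays or slices before indexing host or device data. Even a short helper like this needs careful testing around the gap and test_size edge cases, which is part of why a maintained implementation would be preferable.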

Additional context
I have an SVR model that takes ~8 minutes to train per split. My dataset has ~1 million observations, and I need to train across thousands of splits, so my runtime is 8 min * N where N is large. On a system with a 7950X3D running 32 processes and 32 GB of 6000 MT/s RAM, cross validation using TimeSeriesSplit ran for more than 5 days. Running the SVR model on the same system using cuML with an RTX 3090 (via WSL2) cut per-split training time to less than 30 seconds. However, I cannot completely migrate to cuML without a TimeSeriesSplit implementation.

ZeroCool2u, Jul 31 '24 14:07

Thanks for the issue! We will look into it. SVMs in general are an area of personal interest, so I would love to see this here.

dantegd, Aug 19 '24 00:08