[FEA] Add Cross Validators to cuml

Open tanaymeh opened this issue 2 years ago • 23 comments

Is your feature request related to a problem? Please describe.
I would really love to see cross-validators such as KFold, StratifiedKFold, GroupKFold, etc. in cuml. It would help make RAPIDS data science pipelines more independent of scikit-learn.

Describe the solution you'd like
I would like to add the following in the first iteration (since there are many cross-validators in scikit-learn):

Describe alternatives you've considered
I am currently not aware of any alternatives for using the above cross-validators natively in cuml.

tanaymeh avatar Mar 26 '22 15:03 tanaymeh

@heytanay you should be able to use scikit-learn's cross validators directly with a cuML model. Can you try that and let me know if it works?

divyegala avatar Mar 28 '22 17:03 divyegala

Hi, if I pass a cuDF DataFrame to a scikit-learn cross-validator, I get an "Implicit conversion to a host NumPy array" error. Here is an example snippet:

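import cudf
from sklearn.model_selection import StratifiedKFold

df = cudf.read_csv("train.csv")

X = df.drop(['target'], axis=1)

kfold = StratifiedKFold(n_splits=5)
# This call raises the TypeError shown below when X and y are cuDF objects.
for train_idx, valid_idx in kfold.split(X=X, y=df['target']):
    print(train_idx, valid_idx)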

Error:

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU array, consider using cupy.asarray(...)
To explicitly construct a host array, consider using .to_array()

To get around this, I believe I will have to convert the cuDF DataFrame to a regular pandas DataFrame, which would be impractical for large tabular files.
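
A possible narrower workaround, offered as a sketch rather than a confirmed cuML recipe: scikit-learn documents that the labels alone are sufficient to generate stratified splits, so it may be enough to copy only the label column to the host and pass a zero-filled placeholder in place of X. In the snippet below, `y_host` is a synthetic NumPy array standing in for that host copy (with a recent cuDF it might come from something like `df['target'].to_numpy()`, which is an assumption about the reader's setup).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a host copy of the label column (assumption:
# only this one column is copied to the host, not the whole DataFrame).
y_host = np.array([0, 1] * 50)

# The labels alone determine the stratified splits, so a zero-filled
# placeholder can be passed in place of the real feature matrix.
X_placeholder = np.zeros(len(y_host))

kfold = StratifiedKFold(n_splits=5)
folds = list(kfold.split(X_placeholder, y_host))

for train_idx, valid_idx in folds:
    # train_idx and valid_idx are host integer arrays of row positions.
    print(len(train_idx), len(valid_idx))
```

The resulting row positions could then be used to select rows from the GPU DataFrame, assuming positional indexing with host integer arrays is acceptable for the data size involved.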

tanaymeh avatar Mar 29 '22 04:03 tanaymeh

Thanks for filing this feature request. This is a limitation of the current approach.

Often, even when the data transfer time is non-trivial, estimator training still accounts for the bulk of the time spent. E.g., time spent training a CPU model on a 3 GB dataset >> time spent transferring a 3 GB dataset, which lets cuML still provide large speedups. Are you in a scenario in which this isn't the case? Would you be able to share a bit more information?

beckernick avatar Mar 29 '22 13:03 beckernick

@beckernick correct me if I am wrong, but shouldn't we be able to get away with using CuPy arrays, since they have the same mechanisms as NumPy arrays?

divyegala avatar Mar 29 '22 14:03 divyegala

@beckernick I don't have a specific dataset in mind, but there have been a few instances, both on and off Kaggle, where I was dealing with really large datasets or where converting the DataFrame to a pandas DataFrame would not have been practical.

That's the reason I proposed implementing these cross-validators.

tanaymeh avatar Mar 29 '22 14:03 tanaymeh

@divyegala CuPy arrays will fail the internal _validate_data checks in scikit-learn that ultimately call down to np.asarray and run into the equivalent implicit conversion error. There's been some recent discussion about paths forward, though.
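
The mechanism can be illustrated without a GPU. In this minimal sketch, `DeviceArrayStandIn` and `try_validate` are hypothetical names made up for illustration (they are not CuPy, cuDF, or scikit-learn code): the class refuses implicit host conversion in `__array__`, and the function mimics a validation step that coerces its input with `np.asarray`.

```python
import numpy as np


class DeviceArrayStandIn:
    """Hypothetical stand-in for a GPU array; not a real CuPy or cuDF class."""

    def __array__(self, dtype=None, copy=None):
        # Refuse implicit device-to-host copies, mirroring the behavior
        # described in this thread.
        raise TypeError("Implicit conversion to a host NumPy array is not allowed.")


def try_validate(obj):
    """Mimic a validation step that coerces its input with np.asarray."""
    try:
        np.asarray(obj)
    except TypeError as exc:
        return f"rejected: {exc}"
    return "accepted"


print(try_validate([1.0, 2.0, 3.0]))
print(try_validate(DeviceArrayStandIn()))
```

Any code path that ends in `np.asarray` will hit the same error, no matter how NumPy-like the rest of the object's interface is.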

@heytanay thanks for the additional context. I agree that using GPU data structures in cross validators is a reasonable feature request. I just wanted to provide some color on the impact of training time vs. transfer time.

beckernick avatar Mar 29 '22 15:03 beckernick

@divyegala, I was thinking of the explicit cross-validation utilities like cross_val_score. KFold and similar splitters should work, as you suggested.

from cuml.datasets import make_regression
from sklearn.model_selection import KFold
import cuml

X, y = make_regression()

clf = cuml.neighbors.KNeighborsRegressor()
kf = KFold(n_splits=2)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

Output:

TRAIN: [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
 98 99] TEST: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49] TEST: [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
 98 99]

beckernick avatar Mar 29 '22 19:03 beckernick

Hi @beckernick, @divyegala. Is this something that can be worked on, then? I would love to implement the cross-validators if I have the green light.

tanaymeh avatar Apr 02 '22 08:04 tanaymeh

@heytanay we would absolutely welcome a contribution from your side here. Let me know if I can be a resource to you in any way during your PR process, be it with questions about build, code, examples, etc.

divyegala avatar Apr 04 '22 15:04 divyegala

@divyegala Thanks! I will start working on it and open a draft PR post-haste. I wanted to clear up one doubt: I don't suppose I'll have to write CUDA kernels here, since this looks like something that we can do using only Python.

Would love to get your views on it.

tanaymeh avatar Apr 05 '22 03:04 tanaymeh

@heytanay yep, I don't foresee any need for CUDA here. You should be able to leverage features from cuML or our dependencies to build this feature out directly in Python.
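
For illustration only, the index generation behind an unshuffled KFold can be sketched in a few lines of array code. `kfold_indices` and its `xp` parameter are hypothetical names, not cuML API. NumPy is used so the sketch runs anywhere; passing CuPy as `xp` is an assumption about how it might run on the GPU.

```python
import numpy as np  # assumption: cupy could be passed as `xp` instead


def kfold_indices(n_samples, n_splits, xp=np):
    """Yield (train_idx, test_idx) pairs for unshuffled k-fold splitting.

    Hypothetical helper, not part of cuML. `xp` is the array module to use.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = xp.arange(n_samples)
    # Spread the remainder over the first folds, as scikit-learn's KFold does.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = indices[start:stop]
        train_idx = xp.concatenate([indices[:start], indices[stop:]])
        yield train_idx, test_idx
        start = stop


for train_idx, test_idx in kfold_indices(10, 3):
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

Stratified and grouped variants need more bookkeeping (per-class or per-group counts), but they follow the same pattern of plain array operations.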

divyegala avatar Apr 05 '22 05:04 divyegala

@divyegala I've opened a draft PR here. Currently, I have just copied all the necessary functions as-is from the scikit-learn source code into cuml; I will be adapting them to cuml and adding tests as we go.

tanaymeh avatar Apr 05 '22 07:04 tanaymeh

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar May 05 '22 08:05 github-actions[bot]

Still working on this.

tanaymeh avatar May 06 '22 03:05 tanaymeh

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Sep 03 '22 05:09 github-actions[bot]

Still on this

tanaymeh avatar Sep 03 '22 14:09 tanaymeh

Hi everyone! Really sorry for closing this issue, but I have been caught up with my job and research work and won't be able to complete the implementation by myself. I have left the PR open in case anyone wants to pick up where I left off and complete the implementation.

tanaymeh avatar Nov 10 '22 12:11 tanaymeh

No problem at all! I'm going to reopen the issue, as this is still a valid feature request.

beckernick avatar Nov 10 '22 16:11 beckernick

How is it going? Any progress on this? @beckernick

AnVuTrong avatar Mar 05 '24 18:03 AnVuTrong

Some related work is happening in https://github.com/rapidsai/cuml/pull/5743.

trivialfis avatar Mar 06 '24 08:03 trivialfis