Describe the bug

I was comparing the results of my work after converting it from scikit-learn to cuML, specifically for kNN classification. With cuML and a test size of 10%, my test accuracy crosses above my training accuracy around k=100, but the same code run on plain scikit-learn keeps the accuracy curves strictly separated with no crossover. When I increase the test size to 20%, I get the opposite result: the cuML accuracy curves stay strictly separated and the scikit-learn curves begin to cross over around k=60. I will include a screenshot in the attachments.

Steps/Code to reproduce bug

I have provided both sets of code, one using cuML and one using scikit-learn.

Expected behavior

I would expect the accuracy to be roughly the same with cuML and scikit-learn, but the two produce different results.

Environment details (please complete the following information):

  • Environment location: home pc

  • Linux Distro/Architecture: Pop!_OS 22.04 LTS x86_64

  • GPU Model/Driver: NVIDIA GeForce RTX 4090

  • CPU Model: Ryzen 9 7950x

  • CUDA: When I run nvcc -V inside the rapids environment I get: Cuda compilation tools, release 11.5, V11.5.119 Build cuda_11.5.r11.5/compiler.30672275_0. When I run nvidia-smi I get: NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3

aethyn@pop-os:~$ neofetch
aethyn@pop-os
-------------
OS: Pop!_OS 22.04 LTS x86_64
Kernel: 6.6.10-76060610-generic
Uptime: 5 hours, 9 mins
Packages: 1982 (dpkg), 25 (flatpak)
Shell: bash 5.1.16
Resolution: 3840x2160, 3840x2160, 3840x2160
DE: GNOME 42.5
WM: Mutter
WM Theme: Pop
Theme: Pop-dark [GTK2/3]
Icons: Pop [GTK2/3]
Terminal: gnome-terminal
CPU: AMD Ryzen 9 7950X (32) @ 5.881GHz
GPU: AMD ATI 6c:00.0 Device 164e
GPU: NVIDIA 01:00.0 NVIDIA Corporation Device 2684
Memory: 18801MiB / 63423MiB

  • Method of cuDF & cuML install: miniconda3 (rapids-23.12)

packages in environment at /home/aethyn/miniconda3/envs/rapids-23.12:

Name Version Build Channel

@evanhowington Thanks for the issue! You mentioned on Slack that the zip file with your data wasn't uploaded. Can you try that again? There is a 25 MB file size limit for zip files, so you may need to split up the data (you mentioned the size was a few megabytes).

@bdice I updated the original post to include the zip file at the bottom of it under "Additional Context".

I did some digging and it appears scikit-learn uses a numpy random state instance while cuML uses a cupy random state instance by default with an option of using a numpy random state instance.

I have not had a chance to test the numpy random state instance on cuML yet. I'm still trying to figure out how to invoke the optional numpy random state instance in cuML. Is it just a matter of passing numpy.random.RandomState to cuML, as follows: random_state = numpy.random.RandomState?

If the random_state is what causes the discrepancy, perhaps something like train_test_split(X, y, test_size=0.1, random_state=42, random_state_environment={"cupy", "numpy"}) would help, where one specifies which library to pull the random state from. The default could also be numpy, so that results match those of someone running the same code on scikit-learn, with cupy available as an option. I only suggest that because, if the goal is for cuML to produce equivalent results out of the box while offering a speedup, scikit-learn can't always use a cupy random state on all devices, so a numpy default in cuML would favor reproducible results.
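One way to take the random state out of the comparison entirely, independent of which library generates it, is to compute the train/test indices once on the host and reuse them for both the scikit-learn run and the cuML run. The sketch below is only an illustration: `shared_split_indices` is a hypothetical helper, not part of scikit-learn or cuML, and it uses Python's standard `random` module rather than either library's generator.

```python
import random


def shared_split_indices(n_samples, test_size, seed):
    """Hypothetical helper: shuffle row indices once and cut them into
    train/test parts, so both libraries see exactly the same split."""
    indices = list(range(n_samples))
    # A private Random instance keeps the shuffle reproducible and
    # independent of any global seed.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return train_idx, test_idx


# Example: a 10% test split over 100 rows.
train_idx, test_idx = shared_split_indices(n_samples=100, test_size=0.1, seed=42)

# The same index lists can then be used to select rows from the arrays
# passed to each library (for instance X[train_idx] and X[test_idx] on
# array types that support integer-list indexing).
```

If the accuracy curves still differ when both libraries are fed identical splits, the split's random state can be ruled out as the cause.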

Thanks for the issue @evanhowington, I had written a response and closed my tab before submitting :(.

The issue is very likely not coming from the random state, whether from numpy or cupy. I haven't tested it myself yet, but given the differences in the parallel/CUDA code, it might just be an inherent difference.
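One possible mechanism for such an inherent difference (an assumption for illustration, not something confirmed in this thread) is that floating-point addition is not associative. A distance is a sum over features, and parallel code may add the terms in a different order than serial code. The resulting last-bit differences could change which of two nearly tied neighbors ranks first. The helper `sum_in_order` below is hypothetical and only demonstrates the arithmetic effect in plain Python.

```python
import math


def sum_in_order(values):
    """Hypothetical helper: add the values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


terms = [0.1, 0.2, 0.3]

forward = sum_in_order(terms)         # (0.1 + 0.2) + 0.3
backward = sum_in_order(terms[::-1])  # (0.3 + 0.2) + 0.1

# Same terms, different order: the results differ in the last bit.
order_changes_result = forward != backward

# The difference is tiny, so the two sums still agree within tolerance.
close_enough = math.isclose(forward, backward, rel_tol=1e-12)
```

Differences of this size only matter when two candidate neighbors are almost exactly equidistant, so on their own they would not necessarily explain a large gap between the accuracy curves.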

