
[BUG] cudf-cuda11 not working in Databricks DBR 13.3 ML LTS on GPU instance

Open jcampabadal-db opened this issue 1 year ago • 4 comments

Describe the bug

cudf-cuda11 is not using GPU while running on a Databricks DBR 13.3 ML LTS with GPU instance.

Steps/Code to reproduce bug

Using DBR 14.3 ML with GPU fails with error:

Internal error message: Spark error: Driver down cause: java.lang.IllegalArgumentException: This RAPIDS Plugin build does not support Spark build 3.5.0-databricks. Supported Spark versions: 3.1.1 {buildver=311}, 3.1.2 {buildver=312}, 3.1.3 {buildver=313}, 3.2.0 {buildver=320}, 3.2.1 {buildver=321}, 3.2.1-cloudera-3.2.7171000 {buildver=321cdh}, 3.2.2 {buildver=322}, 3.2.3 {buildver=323}, 3.2.4 {buildver=324}, 3.3.0 {buildver=330}, 3.3.0-cloudera-3.3.7180 {buildver=330cdh}, 3.3.0-databricks {buildver=330db}, 3.3.1 {buildver=331}, 3.3.2 {buildver=332}, 3.3.2-cloudera-3.3.7190 {buildver=332cdh}, 3.3.2-databricks {buildver=332db}, 3.3.3 {buildver=333}, 3.3.4 {buildver=334}, 3.4.0 {buildver=340}, 3.4.1 {buildver=341}, 3.4.1-databricks {buildver=341db}, 3.4.2 {buildver=342}, 3.5.0 {buildver=350}, 3.5.1 {buildver=351}. Consult the Release documentation at https://nvidia.github.io/spark-rapids/docs/download.html
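The error message lists every Spark build the plugin jar supports, so the mismatch can be checked mechanically. A minimal sketch, assuming the message format shown above; `supported_builds` and `is_supported` are hypothetical helper names for illustration, not part of the RAPIDS plugin:

```python
import re


def supported_builds(message: str) -> dict:
    """Map each Spark version in the error message to its buildver tag."""
    # Only look at the text after "Supported Spark versions:" so the
    # unsupported build named at the start of the message is ignored.
    _, _, tail = message.partition("Supported Spark versions:")
    # Entries look like "3.4.1-databricks {buildver=341db}".
    return dict(re.findall(r"([\w.\-]+) \{buildver=(\w+)\}", tail))


def is_supported(message: str, spark_version: str) -> bool:
    """Return True if spark_version appears in the supported list."""
    return spark_version in supported_builds(message)


# Shortened copy of the message from the failing cluster.
msg = (
    "This RAPIDS Plugin build does not support Spark build 3.5.0-databricks. "
    "Supported Spark versions: 3.4.1 {buildver=341}, "
    "3.4.1-databricks {buildver=341db}, 3.5.0 {buildver=350}, 3.5.1 {buildver=351}."
)
print(is_supported(msg, "3.5.0-databricks"))  # False
print(is_supported(msg, "3.4.1-databricks"))  # True
```

Plain `3.5.0` is in the list while `3.5.0-databricks` is not, which matches the failure on DBR 14.3.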

We are following these guides:

https://docs.rapids.ai/deployment/stable/platforms/databricks/

https://docs.nvidia.com/spark-rapids/user-guide/23.12/getting-started/databricks.html

Expected behavior

For cudf-cuda11 package to utilize GPU to perform pandas operations.

Environment overview (please complete the following information)

  • Environment location: AWS cloud
  • Method of cuDF install: /databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com cuml-cu11==24.6.* cudf-cu11==24.6.*

Here I load cudf and I made sure it shows <module 'pandas' (ModuleAccelerator(fast=cudf, slow=pandas))> when printing pd.

image

How can I debug why cuDF shows 0% per-GPU utilization and only per-GPU frame buffer usage (in bytes)? It appears to be using only the CPU. Please advise: cudf-cuda11 appears to support CUDA 11.2+, which the DBR release includes, and the library loads without errors.

We are using this NVIDIA notebook for testing rapid cudf pandas accelerator:

https://colab.research.google.com/drive/12tCzP94zFG2BRduACucn5Q_OcX1TUKY3

jcampabadal-db avatar Jun 14 '24 23:06 jcampabadal-db

Can you try using the cudf.pandas.profile magic? https://docs.rapids.ai/api/cudf/stable/cudf_pandas/usage/#understanding-performance-the-cudf-pandas-profiler

I think this should tell you which operations are running on the GPU and which are running on CPU.

lithomas1 avatar Jun 17 '24 16:06 lithomas1

Thank you @lithomas1, will check that

jcampabadal-db avatar Jun 17 '24 18:06 jcampabadal-db

@lithomas1 I have been working with @jcampabadal-db on this. I observed very slow GPU performance on both Databricks DBR 13.3 ML (CUDA 11.7) and Databricks DBR 14.3 ML (CUDA 11.8) on an AWS EC2 g5.xlarge (A10G), running the same example from https://docs.rapids.ai/api/cudf/stable/cudf_pandas/usage/#understanding-performance-the-cudf-pandas-profiler

The output is below; it took several minutes to run. How can we work around or resolve this performance issue?

/databricks/python/lib/python3.10/site-packages/cupy/cuda/compiler.py:233: PerformanceWarning: Jitify is performing a one-time only warm-up to populate the persistent cache, this may take a few seconds and will be improved in a future release...
  jitify._init_module()

                                       Total time elapsed: 225.300 seconds                                 
                                       3 GPU function calls in 224.665 seconds                               
                                        1 CPU function calls in 0.012 seconds                                
                                                                                                             
                                                        Stats                                                
                                                                                                             
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Function                ┃ GPU ncalls ┃ GPU cumtime ┃ GPU percall ┃ CPU ncalls ┃ CPU cumtime ┃ CPU percall ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ DataFrame               │ 1          │ 0.145       │ 0.145       │ 0          │ 0.000       │ 0.000       │
│ DataFrame.min           │ 1          │ 224.520     │ 224.520     │ 0          │ 0.000       │ 0.000       │
│ DataFrame.groupby       │ 1          │ 0.000       │ 0.000       │ 0          │ 0.000       │ 0.000       │
│ DataFrameGroupBy.filter │ 0          │ 0.000       │ 0.000       │ 1          │ 0.012       │ 0.012       │
└─────────────────────────┴────────────┴─────────────┴─────────────┴────────────┴─────────────┴─────────────┘

Not all pandas operations ran on the GPU. The following functions required CPU fallback:

  • DataFrameGroupBy.filter
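The Jitify warning in this output mentions a one-time warm-up, so one way to check whether the long `DataFrame.min` time is a first-call cost or a steady-state cost is to time the same call several times. A minimal sketch; `time_repeated` and `fake_workload` are hypothetical names, and the stand-in workload is an assumption that only mimics a slow first call:

```python
import time


def time_repeated(fn, repeats=3):
    """Call fn several times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload (an assumption for illustration): slow only on its
# first call, mimicking a one-time compilation step.
_state = {"warm": False}


def fake_workload():
    if not _state["warm"]:
        time.sleep(0.2)
        _state["warm"] = True


first, *rest = time_repeated(fake_workload)
print(f"first call: {first:.3f}s, later calls: {max(rest):.3f}s")
```

In the notebook, passing `lambda: df.min(axis=1)` instead of `fake_workload` would show whether only the first call is slow.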

ericwong2965 avatar Jun 17 '24 22:06 ericwong2965

Also, when I follow https://docs.nvidia.com/spark-rapids/user-guide/23.12/getting-started/databricks.html

I sometimes run into an OOM error even when loading a small dataset:

import cudf
import requests
from io import StringIO

url = "https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode("utf-8")

tips_df = cudf.read_csv(StringIO(content))

MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /__w/cudf/cudf/python/cudf/build/cp310-cp310-linux_x86_64/_deps/rmm-src/include/rmm/mr/device/cuda_memory_resource.hpp:60: cudaErrorMemoryAllocation out of memory

ericwong2965 avatar Jun 18 '24 00:06 ericwong2965

Are you using cuDF Pandas alongside Spark RAPIDS in a single application/workflow or is this independent of Spark?

Would be curious to know if you experience this error when following only this guide https://docs.rapids.ai/deployment/stable/platforms/databricks/ (or if it's perhaps related to some combination).

beckernick avatar Jul 03 '24 16:07 beckernick

@beckernick thanks for the reply. This ticket describes the issues we encountered after following the guide linked above. The latest RAPIDS release supports Databricks only up to 13.3 ML, as described in https://nvidia.github.io/spark-rapids/docs/download.html; on later runtimes the Databricks Spark cluster fails to boot.

ericwong2965 avatar Jul 03 '24 18:07 ericwong2965

I just ran through the steps outlined above to try and reproduce the 15 second time and no GPU usage mentioned in https://github.com/rapidsai/cudf/issues/16041#issue-2354273462 and also the >200 second time mentioned in https://github.com/rapidsai/cudf/issues/16041#issuecomment-2174551352.

Setup

I followed the single-node deployment documentation and launched a cluster with a g5.xlarge (A10G) using the 14.3 LTS ML runtime.

I used the following init script outlined in the documentation to install RAPIDS.

#!/bin/bash
set -e

# Install RAPIDS libraries
pip install --extra-index-url=https://pypi.nvidia.com \
    "cudf-cu11" \
    "cuml-cu11" \
    "dask-cudf-cu11" \
    "dask-cuda==24.06"

15 second pd.read_parquet() and no GPU usage

I was able to reproduce the 15 second load time, but not the zero GPU usage.

In a new notebook I copied some of the cells from the Colab example notebook.

%load_ext cudf.pandas
import pandas as pd
!wget https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet

The read_parquet() cell took around 15 seconds for me.

df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
)

However, I found that this was due to the data being downloaded to the /Workspace network filesystem, which is very slow. If I instead downloaded the data to local storage at /local_disk0, the read_parquet() call took around 200ms.

image
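The local-storage workaround can be sketched as a small staging helper that copies a file to fast local disk before reading it. This is an illustration only: `stage_locally` and `timed` are hypothetical names, and the demo uses temporary directories in place of /Workspace and /local_disk0.

```python
import shutil
import tempfile
import time
from pathlib import Path


def stage_locally(src, local_dir):
    """Copy src into local_dir (if not already there) and return the local path."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    dst = local_dir / Path(src).name
    if not dst.exists():
        shutil.copy2(src, dst)
    return dst


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Demo with temporary directories standing in for /Workspace and /local_disk0.
with tempfile.TemporaryDirectory() as slow_dir, tempfile.TemporaryDirectory() as fast_dir:
    src = Path(slow_dir) / "data.bin"
    src.write_bytes(b"\0" * 1024)
    local_path = stage_locally(src, fast_dir)
    data, seconds = timed(local_path.read_bytes)
    print(local_path.name, len(data), f"{seconds:.4f}s")
```

On the cluster, the same idea would be `pd.read_parquet(stage_locally("nyc_parking_violations_2022.parquet", "/local_disk0/data"))`, where the `data` subdirectory is an arbitrary choice.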

I then ran through the other operations from the notebook. None of them took a particularly long time, and the profiler shows they ran on the GPU.

image

>200 second pd.min()

Then I tried copying the profiler example and also observed the df.min(axis=1) call taking an unusually long time.

%%cudf.pandas.profile
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3,4,3]})

df.min(axis=1)
out = df.groupby('a').filter(
    lambda group: len(group) > 1
)
image

Environment

Here's a pip freeze of the Python environment.

`pip freeze`
absl-py==1.0.0
accelerate==0.25.0
aiohttp==3.9.1
aiosignal==1.3.1
anyio==3.5.0
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
astor==0.8.1
asttokens==2.0.5
astunparse==1.6.3
async-timeout==4.0.3
attrs==22.1.0
audioread==3.0.1
azure-core==1.29.1
azure-cosmos==4.3.1
azure-storage-blob==12.19.0
azure-storage-file-datalake==12.14.0
backcall==0.2.0
bcrypt==3.2.0
beautifulsoup4==4.11.1
black==22.6.0
bleach==4.1.0
blinker==1.4
blis==0.7.11
boto3==1.24.28
botocore==1.27.96
cachetools==5.3.2
catalogue==2.0.10
category-encoders==2.6.3
certifi==2022.12.7
cffi==1.15.1
chardet==4.0.0
charset-normalizer==2.0.4
click==8.1.7
cloudpathlib==0.16.0
cloudpickle==2.0.0
cmake==3.28.1
cmdstanpy==1.2.0
comm==0.1.2
confection==0.1.4
configparser==5.2.0
contourpy==1.0.5
cryptography==39.0.1
cubinlinker-cu11==0.3.0.post2
cuda-python==11.8.3
cudf-cu11==24.6.1
cuml-cu11==24.6.1
cupy-cuda11x==13.2.0
cycler==0.11.0
cymem==2.0.8
Cython==0.29.32
dacite==1.8.1
dask==2024.5.1
dask-cuda==24.6.0
dask-cudf-cu11==24.6.1
dask-expr==1.1.1
databricks-automl-runtime==0.2.20
databricks-cli==0.18.0
databricks-feature-engineering==0.2.1
databricks-sdk==0.1.6
dataclasses-json==0.6.3
datasets==2.15.0
dbl-tempo==0.1.26
dbus-python==1.2.18
debugpy==1.6.7
decorator==5.1.1
deepspeed==0.12.4
defusedxml==0.7.1
dill==0.3.6
diskcache==5.6.3
distlib==0.3.7
distributed==2024.5.1
distributed-ucxx-cu11==0.38.0
distro==1.7.0
distro-info==1.1+ubuntu0.2
docstring-to-markdown==0.11
einops==0.7.0
entrypoints==0.4
evaluate==0.4.1
executing==0.8.3
facets-overview==1.1.1
fastjsonschema==2.19.1
fastrlock==0.8.2
fasttext==0.9.2
filelock==3.9.0
flash-attn==2.3.6
Flask==2.2.5
flatbuffers==23.5.26
fonttools==4.25.0
frozenlist==1.4.1
fsspec==2023.6.0
future==0.18.3
gast==0.4.0
gitdb==4.0.11
GitPython==3.1.27
google-api-core==2.15.0
google-auth==2.21.0
google-auth-oauthlib==1.0.0
google-cloud-core==2.4.1
google-cloud-storage==2.11.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
greenlet==2.0.1
grpcio==1.48.2
grpcio-status==1.48.1
gunicorn==20.1.0
gviz-api==1.10.0
h5py==3.7.0
hjson==3.1.0
holidays==0.38
horovod==0.28.1
htmlmin==0.1.12
httplib2==0.20.2
huggingface-hub==0.19.4
idna==3.4
ImageHash==4.3.1
imbalanced-learn==0.11.0
importlib-resources==6.1.1
importlib_metadata==8.0.0
ipykernel==6.25.0
ipython==8.14.0
ipython-genutils==0.2.0
ipywidgets==7.7.2
isodate==0.6.1
itsdangerous==2.0.1
jedi==0.18.1
jeepney==0.7.1
Jinja2==3.1.2
jmespath==0.10.0
joblib==1.2.0
joblibspark==0.5.1
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.17.3
jupyter-client==7.3.4
jupyter-server==1.23.4
jupyter_core==5.2.0
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.0
keras==2.14.0
keyring==23.5.0
kiwisolver==1.4.4
langchain==0.0.348
langchain-core==0.0.13
langcodes==3.3.0
langsmith==0.0.79
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lazy_loader==0.3
libclang==15.0.6.1
librosa==0.10.1
libucx-cu11==1.15.0.post1
lightgbm==4.1.0
lit==17.0.6
llvmlite==0.43.0
locket==1.0.0
lxml==4.9.1
Mako==1.2.0
Markdown==3.4.1
markdown-it-py==3.0.0
MarkupSafe==2.1.1
marshmallow==3.20.2
matplotlib==3.7.0
matplotlib-inline==0.1.6
mccabe==0.7.0
mdurl==0.1.2
mistune==0.8.4
ml-dtypes==0.2.0
mlflow-skinny==2.9.2
more-itertools==8.10.0
mpmath==1.2.1
msgpack==1.0.7
multidict==6.0.4
multimethod==1.10
multiprocess==0.70.14
murmurhash==1.0.10
mypy-extensions==0.4.3
nbclassic==0.5.2
nbclient==0.5.13
nbconvert==6.5.4
nbformat==5.7.0
nest-asyncio==1.5.6
networkx==2.8.4
ninja==1.11.1.1
nltk==3.7
nodeenv==1.8.0
notebook==6.5.2
notebook_shim==0.2.2
numba==0.60.0
numpy==1.23.5
nvtx==0.2.10
oauthlib==3.2.0
openai==0.28.1
opt-einsum==3.3.0
packaging==23.2
pandas==2.2.2
pandocfilters==1.5.0
paramiko==2.9.2
parso==0.8.3
partd==1.4.2
pathspec==0.10.3
patsy==0.5.3
petastorm==0.12.1
pexpect==4.8.0
phik==0.12.4
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==2.5.2
plotly==5.9.0
pluggy==1.0.0
pmdarima==2.0.4
pooch==1.4.0
preshed==3.0.9
prompt-toolkit==3.0.36
prophet==1.1.5
protobuf==4.24.0
psutil==5.9.0
psycopg2==2.9.3
ptxcompiler-cu11==0.8.1.post1
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==16.1.0
pyarrow-hotfix==0.5
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.11.1
pycparser==2.21
pydantic==1.10.6
pyflakes==3.1.0
Pygments==2.18.0
PyGObject==3.42.1
PyJWT==2.3.0
pylibraft-cu11==24.6.0
PyNaCl==1.5.0
pynvml==11.4.1
pyodbc==4.0.32
pyparsing==3.0.9
pyright==1.1.294
pyrsistent==0.18.0
pytesseract==0.3.10
python-apt==2.4.0+ubuntu3
python-dateutil==2.8.2
python-editor==1.0.4
python-lsp-jsonrpc==1.1.1
python-lsp-server==1.8.0
pytoolconfig==1.2.5
pytz==2022.7
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==23.2.0
raft-dask-cu11==24.6.0
rapids-dask-dependency==24.6.0
regex==2022.7.9
requests==2.28.1
requests-oauthlib==1.3.1
responses==0.18.0
rich==13.7.1
rmm-cu11==24.6.0
rope==1.7.0
rsa==4.9
s3transfer==0.6.2
safetensors==0.4.1
scikit-learn==1.1.1
scipy==1.10.0
seaborn==0.12.2
SecretStorage==3.3.1
Send2Trash==1.8.0
sentence-transformers==2.2.2
sentencepiece==0.1.99
shap==0.44.0
simplejson==3.17.6
six==1.16.0
slicer==0.0.7
smart-open==5.2.1
smmap==5.0.0
sniffio==1.2.0
sortedcontainers==2.4.0
soundfile==0.12.1
soupsieve==2.3.2.post1
soxr==0.3.7
spacy==3.7.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
spark-tensorflow-distributor==1.0.0
SQLAlchemy==1.4.39
sqlparse==0.4.2
srsly==2.4.8
ssh-import-id==5.11
stack-data==0.2.0
stanio==0.3.0
statsmodels==0.13.5
sympy==1.11.1
tabulate==0.8.10
tangled-up-in-unicode==0.2.0
tblib==3.0.0
tenacity==8.1.0
tensorboard==2.14.1
tensorboard-data-server==0.7.2
tensorboard-plugin-profile==2.14.0
tensorflow==2.14.1
tensorflow-estimator==2.14.0
tensorflow-io-gcs-filesystem==0.35.0
termcolor==2.4.0
terminado==0.17.1
thinc==8.2.2
threadpoolctl==2.2.0
tiktoken==0.5.2
tinycss2==1.2.1
tokenize-rt==4.2.1
tokenizers==0.15.0
tomli==2.0.1
toolz==0.12.1
torch==2.0.1+cu118
torchvision==0.15.2+cu118
tornado==6.1
tqdm==4.64.1
traitlets==5.7.1
transformers==4.36.1
treelite==4.1.2
triton==2.0.0
typeguard==2.13.3
typer==0.9.0
typing-inspect==0.9.0
typing_extensions==4.4.0
tzdata==2024.1
ucx-py-cu11==0.38.0
ucxx-cu11==0.38.0
ujson==5.4.0
unattended-upgrades==0.1
urllib3==1.26.14
virtualenv==20.16.7
visions==0.7.5
wadllib==1.3.6
wasabi==1.1.2
wcwidth==0.2.5
weasel==0.3.4
webencodings==0.5.1
websocket-client==0.58.0
Werkzeug==2.2.2
whatthepatch==1.0.2
widgetsnbextension==3.6.1
wordcloud==1.9.3
wrapt==1.14.1
xgboost==1.7.6
xxhash==3.4.1
yapf==0.33.0
yarl==1.9.4
ydata-profiling==4.2.0
zict==3.0.0
zipp==3.11.0

And an nvidia-smi output.

Fri Jul 12 13:21:13 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   30C    P0              56W / 300W |    258MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
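For anyone comparing GPU memory and utilization before and after a cuDF call, the relevant fields can be pulled out of text like the above. A minimal sketch, assuming the table layout shown in this output; `parse_gpu_rows` is a hypothetical helper and the regular expression is tied to this particular format:

```python
import re


def parse_gpu_rows(smi_text):
    """Extract (used_mib, total_mib, util_percent) for each GPU row in nvidia-smi text."""
    # Matches the "258MiB / 23028MiB |      0%" part of a GPU row.
    pattern = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB\s*\|\s*(\d+)%")
    return [tuple(int(g) for g in m.groups()) for m in pattern.finditer(smi_text)]


# The GPU row copied from the output above.
sample = "|  0%   30C    P0              56W / 300W |    258MiB / 23028MiB |      0%      Default |"
print(parse_gpu_rows(sample))  # [(258, 23028, 0)]
```

On a live node the text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.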

It's also worth noting that, although nvidia-smi lists the CUDA version as 12.2 (the highest version the installed driver supports), the 14.3 LTS ML runtime only has the CUDA toolkit for 11.8.

$ ls -ld /usr/local/cuda*
lrwxrwxrwx  1 root root   22 Jul 12 13:38 /usr/local/cuda -> /etc/alternatives/cuda
lrwxrwxrwx  1 root root   25 Jul 12 13:38 /usr/local/cuda-11 -> /etc/alternatives/cuda-11
drwxr-xr-x 10 root root 4096 Jul 12 13:38 /usr/local/cuda-11.8
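The same check can be done from Python, for example in an init-time sanity test. A minimal sketch, assuming toolkits live in `cuda-X.Y` directories as in the listing above; `installed_cuda_toolkits` and `wheel_suffix` are hypothetical helpers, and mapping the toolkit's major version to a `cu11`/`cu12` package suffix is an assumption based on the package names used in this thread:

```python
import re
import tempfile
from pathlib import Path


def installed_cuda_toolkits(root="/usr/local"):
    """Return sorted (major, minor) versions for cuda-X.Y directories under root."""
    versions = []
    for entry in Path(root).glob("cuda-*"):
        match = re.fullmatch(r"cuda-(\d+)\.(\d+)", entry.name)
        if match and entry.is_dir():
            versions.append((int(match.group(1)), int(match.group(2))))
    return sorted(versions)


def wheel_suffix(versions):
    """Pick a package suffix such as 'cu11' from the newest installed toolkit."""
    if not versions:
        raise ValueError("no CUDA toolkit directories found")
    return f"cu{versions[-1][0]}"


# Demo against a throwaway directory laid out like the listing above.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "cuda-11.8").mkdir()
    (Path(root) / "cuda-11").mkdir()  # stands in for the symlink; skipped, no minor version
    demo_versions = installed_cuda_toolkits(root)

print(demo_versions, wheel_suffix(demo_versions))  # [(11, 8)] cu11
```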

jacobtomlinson avatar Jul 12 '24 13:07 jacobtomlinson

Switching to the 15.3 ML runtime gives us cuda-toolkit==12.1 which allows us to use the CUDA 12 packages for cudf.

#!/bin/bash
set -e

# Install RAPIDS libraries
pip install --extra-index-url=https://pypi.nvidia.com \
    "cudf-cu12" \
    "cuml-cu12" \
    "dask-cudf-cu12" \
    "dask-cuda==24.06"

However, the >200 second issue persists.

%%cudf.pandas.profile
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3,4,3]})

df.min(axis=1)
out = df.groupby('a').filter(
    lambda group: len(group) > 1
)
image

jacobtomlinson avatar Jul 12 '24 14:07 jacobtomlinson