[BUG] cudf-cuda11 not working in Databricks DBR 13.3 ML LTS on GPU instance
Describe the bug
cudf-cuda11 is not using the GPU when running on Databricks DBR 13.3 ML LTS with a GPU instance.
Steps/Code to reproduce bug
Using DBR 14.3 ML with GPU fails with this error:
Internal error message: Spark error: Driver down cause: java.lang.IllegalArgumentException: This RAPIDS Plugin build does not support Spark build 3.5.0-databricks. Supported Spark versions: 3.1.1 {buildver=311}, 3.1.2 {buildver=312}, 3.1.3 {buildver=313}, 3.2.0 {buildver=320}, 3.2.1 {buildver=321}, 3.2.1-cloudera-3.2.7171000 {buildver=321cdh}, 3.2.2 {buildver=322}, 3.2.3 {buildver=323}, 3.2.4 {buildver=324}, 3.3.0 {buildver=330}, 3.3.0-cloudera-3.3.7180 {buildver=330cdh}, 3.3.0-databricks {buildver=330db}, 3.3.1 {buildver=331}, 3.3.2 {buildver=332}, 3.3.2-cloudera-3.3.7190 {buildver=332cdh}, 3.3.2-databricks {buildver=332db}, 3.3.3 {buildver=333}, 3.3.4 {buildver=334}, 3.4.0 {buildver=340}, 3.4.1 {buildver=341}, 3.4.1-databricks {buildver=341db}, 3.4.2 {buildver=342}, 3.5.0 {buildver=350}, 3.5.1 {buildver=351}. Consult the Release documentation at https://nvidia.github.io/spark-rapids/docs/download.html
We are following these guides:
https://docs.rapids.ai/deployment/stable/platforms/databricks/
https://docs.nvidia.com/spark-rapids/user-guide/23.12/getting-started/databricks.html
Expected behavior
The cudf-cuda11 package should utilize the GPU to perform pandas operations.
Environment overview (please complete the following information)
- Environment location: AWS cloud
- Method of cuDF install: /databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com cuml-cu11==24.6.* cudf-cu11==24.6.*
Here I load cudf, and I made sure that printing pd shows <module 'pandas' (ModuleAccelerator(fast=cudf, slow=pandas))>.
How can I debug why cuDF shows 0% per-GPU utilization and only per-GPU frame buffer bytes? It seems to be using only the CPU. Please advise: cudf-cuda11 supports CUDA 11.2+, which this DBR release provides, and the library loads just fine.
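For reference, a minimal sketch of that check in a notebook (the printed proxy is what confirms the accelerator loaded):
%load_ext cudf.pandas
import pandas as pd
# When the accelerator is active, this prints the proxy module, e.g.
# <module 'pandas' (ModuleAccelerator(fast=cudf, slow=pandas))>
print(pd)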
We are using this NVIDIA notebook for testing the RAPIDS cudf.pandas accelerator:
https://colab.research.google.com/drive/12tCzP94zFG2BRduACucn5Q_OcX1TUKY3
Can you try using the cudf.pandas.profile magic? https://docs.rapids.ai/api/cudf/stable/cudf_pandas/usage/#understanding-performance-the-cudf-pandas-profiler
I think this should tell you which operations are running on the GPU and which are running on the CPU.
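For reference, a minimal sketch of the magic in a notebook (the toy DataFrame is illustrative only; %%cudf.pandas.line_profile is the per-line variant, and standalone scripts can be run under the accelerator with python -m cudf.pandas script.py):
# Cell 1: load the accelerator before importing pandas.
%load_ext cudf.pandas
import pandas as pd

# Cell 2: profile the cell; the report lists GPU vs. CPU calls per function.
%%cudf.pandas.profile
df = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 3]})
df.min(axis=1)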
Thank you @lithomas1, will check that
@lithomas1 I had been working with @jcampabadal-db on this. I observed very slow GPU performance, with the following output, on both Databricks DBR 13.3 ML (CUDA 11.7) and DBR 14.3 ML (CUDA 11.8) on an AWS EC2 g5.xlarge (A10G), following the same commands from https://docs.rapids.ai/api/cudf/stable/cudf_pandas/usage/#understanding-performance-the-cudf-pandas-profiler
The output is below (note it took several minutes). How can we work around or resolve this performance issue?
/databricks/python/lib/python3.10/site-packages/cupy/cuda/compiler.py:233: PerformanceWarning: Jitify is performing a one-time only warm-up to populate the persistent cache, this may take a few seconds and will be improved in a future release...
jitify._init_module()
Total time elapsed: 225.300 seconds
3 GPU function calls in 224.665 seconds
1 CPU function calls in 0.012 seconds
Stats
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Function                ┃ GPU ncalls ┃ GPU cumtime ┃ GPU percall ┃ CPU ncalls ┃ CPU cumtime ┃ CPU percall ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ DataFrame               │          1 │       0.145 │       0.145 │          0 │       0.000 │       0.000 │
│ DataFrame.min           │          1 │     224.520 │     224.520 │          0 │       0.000 │       0.000 │
│ DataFrame.groupby       │          1 │       0.000 │       0.000 │          0 │       0.000 │       0.000 │
│ DataFrameGroupBy.filter │          0 │       0.000 │       0.000 │          1 │       0.012 │       0.012 │
└─────────────────────────┴────────────┴─────────────┴─────────────┴────────────┴─────────────┴─────────────┘
Not all pandas operations ran on the GPU. The following functions required CPU fallback:
- DataFrameGroupBy.filter
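As a hedged aside, fallbacks like this can sometimes be avoided by expressing the operation with accelerated primitives. A sketch, assuming transform("size") runs on the GPU in this cudf version:
# Instead of DataFrameGroupBy.filter with a Python lambda (which falls back
# to the CPU), compute group sizes with transform and apply a boolean mask.
group_sizes = df.groupby("a")["a"].transform("size")
out = df[group_sizes > 1]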
Also, if I follow https://docs.nvidia.com/spark-rapids/user-guide/23.12/getting-started/databricks.html,
I sometimes run into an OOM error even when loading a small dataset:
import cudf
import requests
from io import StringIO
url = "https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode("utf-8")
tips_df = cudf.read_csv(StringIO(content))
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /__w/cudf/cudf/python/cudf/build/cp310-cp310-linux_x86_64/_deps/rmm-src/include/rmm/mr/device/cuda_memory_resource.hpp:60: cudaErrorMemoryAllocation out of memory
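One mitigation worth trying, assuming something else (for example the Spark RAPIDS plugin) has already reserved most of the GPU's memory: switch RMM to managed (unified) memory, which can oversubscribe device memory instead of failing outright. A minimal sketch; it must run before any other GPU allocations:
# Reconfigure RMM to use CUDA managed memory before any cuDF work.
import rmm
rmm.reinitialize(managed_memory=True)

import cudf  # subsequent cuDF allocations go through the managed resource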
Are you using cuDF Pandas alongside Spark RAPIDS in a single application/workflow, or is this independent of Spark?
I would be curious to know if you experience this error when following only this guide https://docs.rapids.ai/deployment/stable/platforms/databricks/ (or if it's perhaps related to some combination).
@beckernick thanks for the reply. This ticket actually covers the issues we encountered after following the guide pointed to above. The latest RAPIDS release only supports Databricks up to 13.3 ML, as described in https://nvidia.github.io/spark-rapids/docs/download.html; otherwise the Databricks Spark cluster fails to boot up.
I just ran through the steps outlined above to try and reproduce the 15 second time and no GPU usage mentioned in https://github.com/rapidsai/cudf/issues/16041#issue-2354273462 and also the >200 second time mentioned in https://github.com/rapidsai/cudf/issues/16041#issuecomment-2174551352.
Setup
I followed the single-node deployment documentation and launched a cluster with a g5.xlarge (A10G) using the 14.3 LTS ML runtime.
I used the following init script from the documentation to install RAPIDS:
#!/bin/bash
set -e
# Install RAPIDS libraries
pip install --extra-index-url=https://pypi.nvidia.com \
    "cudf-cu11" \
    "cuml-cu11" \
    "dask-cudf-cu11" \
    "dask-cuda==24.06"
15 second pd.read_parquet() and no GPU usage
I was able to reproduce the 15 second load time, but not the zero GPU usage.
In a new notebook I copied some of the cells from the Colab example notebook.
%load_ext cudf.pandas
import pandas as pd
!wget https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet
The read_parquet() cell took around 15 seconds for me.
df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
)
However, I found that this was because the data had been downloaded to the /Workspace network filesystem, which is very slow. If I instead downloaded the data to local storage at /local_disk0, the read_parquet() call took around 200ms.
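For reference, a sketch of that change using a pure-Python download in place of wget (same dataset URL as above):
# Download to instance-local storage rather than the slow /Workspace
# network filesystem, then read from the local path.
import urllib.request

url = "https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet"
local_path = "/local_disk0/nyc_parking_violations_2022.parquet"
urllib.request.urlretrieve(url, local_path)

df = pd.read_parquet(local_path)  # ~200ms from local disk vs ~15s from /Workspace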
I then ran through the other operations from the notebook, none of which took particularly long, and the profiler showed they successfully used the GPU.
>200 second df.min()
Then I tried copying the profiler example and also observed the df.min() operation taking an unusually long time.
%%cudf.pandas.profile
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3,4,3]})
df.min(axis=1)
out = df.groupby('a').filter(
    lambda group: len(group) > 1
)
Environment
Here's a pip freeze of the Python environment.
absl-py==1.0.0
accelerate==0.25.0
aiohttp==3.9.1
aiosignal==1.3.1
anyio==3.5.0
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
astor==0.8.1
asttokens==2.0.5
astunparse==1.6.3
async-timeout==4.0.3
attrs==22.1.0
audioread==3.0.1
azure-core==1.29.1
azure-cosmos==4.3.1
azure-storage-blob==12.19.0
azure-storage-file-datalake==12.14.0
backcall==0.2.0
bcrypt==3.2.0
beautifulsoup4==4.11.1
black==22.6.0
bleach==4.1.0
blinker==1.4
blis==0.7.11
boto3==1.24.28
botocore==1.27.96
cachetools==5.3.2
catalogue==2.0.10
category-encoders==2.6.3
certifi==2022.12.7
cffi==1.15.1
chardet==4.0.0
charset-normalizer==2.0.4
click==8.1.7
cloudpathlib==0.16.0
cloudpickle==2.0.0
cmake==3.28.1
cmdstanpy==1.2.0
comm==0.1.2
confection==0.1.4
configparser==5.2.0
contourpy==1.0.5
cryptography==39.0.1
cubinlinker-cu11==0.3.0.post2
cuda-python==11.8.3
cudf-cu11==24.6.1
cuml-cu11==24.6.1
cupy-cuda11x==13.2.0
cycler==0.11.0
cymem==2.0.8
Cython==0.29.32
dacite==1.8.1
dask==2024.5.1
dask-cuda==24.6.0
dask-cudf-cu11==24.6.1
dask-expr==1.1.1
databricks-automl-runtime==0.2.20
databricks-cli==0.18.0
databricks-feature-engineering==0.2.1
databricks-sdk==0.1.6
dataclasses-json==0.6.3
datasets==2.15.0
dbl-tempo==0.1.26
dbus-python==1.2.18
debugpy==1.6.7
decorator==5.1.1
deepspeed==0.12.4
defusedxml==0.7.1
dill==0.3.6
diskcache==5.6.3
distlib==0.3.7
distributed==2024.5.1
distributed-ucxx-cu11==0.38.0
distro==1.7.0
distro-info==1.1+ubuntu0.2
docstring-to-markdown==0.11
einops==0.7.0
entrypoints==0.4
evaluate==0.4.1
executing==0.8.3
facets-overview==1.1.1
fastjsonschema==2.19.1
fastrlock==0.8.2
fasttext==0.9.2
filelock==3.9.0
flash-attn==2.3.6
Flask==2.2.5
flatbuffers==23.5.26
fonttools==4.25.0
frozenlist==1.4.1
fsspec==2023.6.0
future==0.18.3
gast==0.4.0
gitdb==4.0.11
GitPython==3.1.27
google-api-core==2.15.0
google-auth==2.21.0
google-auth-oauthlib==1.0.0
google-cloud-core==2.4.1
google-cloud-storage==2.11.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
greenlet==2.0.1
grpcio==1.48.2
grpcio-status==1.48.1
gunicorn==20.1.0
gviz-api==1.10.0
h5py==3.7.0
hjson==3.1.0
holidays==0.38
horovod==0.28.1
htmlmin==0.1.12
httplib2==0.20.2
huggingface-hub==0.19.4
idna==3.4
ImageHash==4.3.1
imbalanced-learn==0.11.0
importlib-resources==6.1.1
importlib_metadata==8.0.0
ipykernel==6.25.0
ipython==8.14.0
ipython-genutils==0.2.0
ipywidgets==7.7.2
isodate==0.6.1
itsdangerous==2.0.1
jedi==0.18.1
jeepney==0.7.1
Jinja2==3.1.2
jmespath==0.10.0
joblib==1.2.0
joblibspark==0.5.1
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.17.3
jupyter-client==7.3.4
jupyter-server==1.23.4
jupyter_core==5.2.0
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.0
keras==2.14.0
keyring==23.5.0
kiwisolver==1.4.4
langchain==0.0.348
langchain-core==0.0.13
langcodes==3.3.0
langsmith==0.0.79
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lazy_loader==0.3
libclang==15.0.6.1
librosa==0.10.1
libucx-cu11==1.15.0.post1
lightgbm==4.1.0
lit==17.0.6
llvmlite==0.43.0
locket==1.0.0
lxml==4.9.1
Mako==1.2.0
Markdown==3.4.1
markdown-it-py==3.0.0
MarkupSafe==2.1.1
marshmallow==3.20.2
matplotlib==3.7.0
matplotlib-inline==0.1.6
mccabe==0.7.0
mdurl==0.1.2
mistune==0.8.4
ml-dtypes==0.2.0
mlflow-skinny==2.9.2
more-itertools==8.10.0
mpmath==1.2.1
msgpack==1.0.7
multidict==6.0.4
multimethod==1.10
multiprocess==0.70.14
murmurhash==1.0.10
mypy-extensions==0.4.3
nbclassic==0.5.2
nbclient==0.5.13
nbconvert==6.5.4
nbformat==5.7.0
nest-asyncio==1.5.6
networkx==2.8.4
ninja==1.11.1.1
nltk==3.7
nodeenv==1.8.0
notebook==6.5.2
notebook_shim==0.2.2
numba==0.60.0
numpy==1.23.5
nvtx==0.2.10
oauthlib==3.2.0
openai==0.28.1
opt-einsum==3.3.0
packaging==23.2
pandas==2.2.2
pandocfilters==1.5.0
paramiko==2.9.2
parso==0.8.3
partd==1.4.2
pathspec==0.10.3
patsy==0.5.3
petastorm==0.12.1
pexpect==4.8.0
phik==0.12.4
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==2.5.2
plotly==5.9.0
pluggy==1.0.0
pmdarima==2.0.4
pooch==1.4.0
preshed==3.0.9
prompt-toolkit==3.0.36
prophet==1.1.5
protobuf==4.24.0
psutil==5.9.0
psycopg2==2.9.3
ptxcompiler-cu11==0.8.1.post1
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==16.1.0
pyarrow-hotfix==0.5
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.11.1
pycparser==2.21
pydantic==1.10.6
pyflakes==3.1.0
Pygments==2.18.0
PyGObject==3.42.1
PyJWT==2.3.0
pylibraft-cu11==24.6.0
PyNaCl==1.5.0
pynvml==11.4.1
pyodbc==4.0.32
pyparsing==3.0.9
pyright==1.1.294
pyrsistent==0.18.0
pytesseract==0.3.10
python-apt==2.4.0+ubuntu3
python-dateutil==2.8.2
python-editor==1.0.4
python-lsp-jsonrpc==1.1.1
python-lsp-server==1.8.0
pytoolconfig==1.2.5
pytz==2022.7
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==23.2.0
raft-dask-cu11==24.6.0
rapids-dask-dependency==24.6.0
regex==2022.7.9
requests==2.28.1
requests-oauthlib==1.3.1
responses==0.18.0
rich==13.7.1
rmm-cu11==24.6.0
rope==1.7.0
rsa==4.9
s3transfer==0.6.2
safetensors==0.4.1
scikit-learn==1.1.1
scipy==1.10.0
seaborn==0.12.2
SecretStorage==3.3.1
Send2Trash==1.8.0
sentence-transformers==2.2.2
sentencepiece==0.1.99
shap==0.44.0
simplejson==3.17.6
six==1.16.0
slicer==0.0.7
smart-open==5.2.1
smmap==5.0.0
sniffio==1.2.0
sortedcontainers==2.4.0
soundfile==0.12.1
soupsieve==2.3.2.post1
soxr==0.3.7
spacy==3.7.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
spark-tensorflow-distributor==1.0.0
SQLAlchemy==1.4.39
sqlparse==0.4.2
srsly==2.4.8
ssh-import-id==5.11
stack-data==0.2.0
stanio==0.3.0
statsmodels==0.13.5
sympy==1.11.1
tabulate==0.8.10
tangled-up-in-unicode==0.2.0
tblib==3.0.0
tenacity==8.1.0
tensorboard==2.14.1
tensorboard-data-server==0.7.2
tensorboard-plugin-profile==2.14.0
tensorflow==2.14.1
tensorflow-estimator==2.14.0
tensorflow-io-gcs-filesystem==0.35.0
termcolor==2.4.0
terminado==0.17.1
thinc==8.2.2
threadpoolctl==2.2.0
tiktoken==0.5.2
tinycss2==1.2.1
tokenize-rt==4.2.1
tokenizers==0.15.0
tomli==2.0.1
toolz==0.12.1
torch==2.0.1+cu118
torchvision==0.15.2+cu118
tornado==6.1
tqdm==4.64.1
traitlets==5.7.1
transformers==4.36.1
treelite==4.1.2
triton==2.0.0
typeguard==2.13.3
typer==0.9.0
typing-inspect==0.9.0
typing_extensions==4.4.0
tzdata==2024.1
ucx-py-cu11==0.38.0
ucxx-cu11==0.38.0
ujson==5.4.0
unattended-upgrades==0.1
urllib3==1.26.14
virtualenv==20.16.7
visions==0.7.5
wadllib==1.3.6
wasabi==1.1.2
wcwidth==0.2.5
weasel==0.3.4
webencodings==0.5.1
websocket-client==0.58.0
Werkzeug==2.2.2
whatthepatch==1.0.2
widgetsnbextension==3.6.1
wordcloud==1.9.3
wrapt==1.14.1
xgboost==1.7.6
xxhash==3.4.1
yapf==0.33.0
yarl==1.9.4
ydata-profiling==4.2.0
zict==3.0.0
zipp==3.11.0
And the nvidia-smi output:
Fri Jul 12 13:21:13 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   30C    P0              56W / 300W |    258MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
It's also worth noting that despite the CUDA version being listed as 12.2, the 14.3 LTS ML runtime only ships the CUDA toolkit for 11.8.
$ ls -ld /usr/local/cuda*
lrwxrwxrwx 1 root root 22 Jul 12 13:38 /usr/local/cuda -> /etc/alternatives/cuda
lrwxrwxrwx 1 root root 25 Jul 12 13:38 /usr/local/cuda-11 -> /etc/alternatives/cuda-11
drwxr-xr-x 10 root root 4096 Jul 12 13:38 /usr/local/cuda-11.8
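A quick way to confirm the mismatch from Python, assuming cupy is installed (it is in the environment above):
# Compare the driver-supported CUDA version (what nvidia-smi reports)
# with the CUDA runtime version the Python stack is built against.
import cupy
print(cupy.cuda.runtime.driverGetVersion())   # e.g. 12020 -> driver supports CUDA 12.2
print(cupy.cuda.runtime.runtimeGetVersion())  # e.g. 11080 -> CUDA 11.8 runtime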
Switching to the 15.3 ML runtime gives us cuda-toolkit==12.1, which allows us to use the CUDA 12 packages for cudf.
#!/bin/bash
set -e
# Install RAPIDS libraries
pip install --extra-index-url=https://pypi.nvidia.com \
    "cudf-cu12" \
    "cuml-cu12" \
    "dask-cudf-cu12" \
    "dask-cuda==24.06"
However, the >200 second issue still persists.
%%cudf.pandas.profile
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3,4,3]})
df.min(axis=1)
out = df.groupby('a').filter(
    lambda group: len(group) > 1
)