[BUG] merlin models on vertex ai training - cuda error
Describe the bug
I've been able to train a Merlin model in a Vertex Notebook (using the Merlin base image). Now I'm trying to train the same model with Vertex AI Training. When training begins, I get this error:
numba.cuda.cudadrv.driver.LinkerError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR
ptxas application ptx input, line 9; fatal : Unsupported .version 7.7; current version is '7.4'
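For context on where this comes from: the LinkerError is raised while numba links freshly generated PTX through the CUDA driver's JIT linker, and the "Unsupported .version 7.7; current version is '7.4'" message suggests PTX produced by a newer toolkit (CUDA 11.7) than the driver stack available to the job can link. A tiny standalone kernel (an illustrative sketch, not taken from the training code) exercises the same path inside the container:

```python
# Sketch only: minimal numba CUDA kernel to check whether the container's
# driver/compat setup accepts the PTX that numba generates at runtime.
import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.shape[0]:
        x[i] += 1

d_x = cuda.to_device(np.zeros(32, dtype=np.float32))
add_one[1, 32](d_x)              # JIT compile + link happens here
print(d_x.copy_to_host()[:4])    # prints [1. 1. 1. 1.] when CUDA is healthy
```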
Steps/Code to reproduce bug
- I'm building a Dockerfile (complete version further below), using the following as a base image:
FROM nvcr.io/nvidia/merlin/merlin-tensorflow:22.07
- The error occurs in the `model.fit()` call in `training_task.py` (a simplified sketch of that call follows this list)
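A simplified sketch of the relevant part of the training script (the model architecture, column names, and placeholder values here are illustrative, not the exact code):

```python
# Sketch only: placeholder DLRM model built with merlin.models.tf.
# These values stand in for the CLI flags passed via worker_pool_specs below.
TRAIN_DATA = "gs://<bucket>/train"
VALID_DATA = "gs://<bucket>/valid"
PER_GPU_BATCH_SIZE = 1024
NUM_EPOCHS = 1

import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset(TRAIN_DATA, engine="parquet")
valid = Dataset(VALID_DATA, engine="parquet")

model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([128, 64, 32]),
    prediction_tasks=mm.BinaryClassificationTask("click"),  # placeholder target
)
model.compile(optimizer="adam")
# The LinkerError above is raised inside this call.
model.fit(train, validation_data=valid,
          batch_size=PER_GPU_BATCH_SIZE, epochs=NUM_EPOCHS)
```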
Expected behavior
Training executes successfully.
Environment details (please complete the following information):
- Environment location: Google Cloud, Vertex AI Training Job
- Driver: (nvidia-smi output attached as a screenshot)
Additional context
Actually, I'm confused about the driver version. You can see the results from nvidia-smi in the screenshot above.
However, when I run `env | grep CUDA` I get the following:
env | grep CUDA
NVIDIA_REQUIRE_CUDA=cuda>=9.0
CUDA_CACHE_DISABLE=1
_CUDA_COMPAT_STATUS=CUDA Driver OK
CUDA_CUDA_LIBRARY=/usr/local/cuda/lib64/stubs
CUDA_VERSION=11.7.1.014
CUDA_PATH=/usr/local/cuda
CUDA_DRIVER_VERSION=515.48.08
_CUDA_COMPAT_PATH=/usr/local/cuda/compat
CUDA_HOME=/usr/local/cuda
So here the driver version seems to be 515.48.08?
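A quick way to untangle the two numbers from inside the container (a sketch; it assumes nvidia-smi and nvcc are both on PATH):

```python
# Sketch only: print the driver the job actually runs on vs. the CUDA toolkit
# the image was built with. CUDA_DRIVER_VERSION above is presumably the driver
# version the image's compat libraries target, not necessarily the driver
# installed on the Vertex VM.
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

print("host driver  :", run(["nvidia-smi", "--query-gpu=driver_version",
                             "--format=csv,noheader"]))
print("image toolkit:", run(["nvcc", "--version"]).splitlines()[-1])
```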
I'm not sure why this makes a difference, but here is what I've done to get around this error and start model training. The two `worker_pool_specs` below use `command` and `args` differently.
This first one results in the error:
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_NUM,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": IMAGE_URI,
            "command": ["python", "-m", "train_task"],
            "args": [
                f'--per_gpu_batch_size={PER_GPU_BATCH_SIZE}',
                f'--model_name={MODEL_NAME}',
                f'--train_dir={TRAIN_DATA}',
                f'--valid_dir={VALID_DATA}',
                f'--schema={SCHEMA_PATH}',
                f'--workflow_dir={WORKFLOW_DIR}',
                f'--max_iter={MAX_ITERATIONS}',
                f'--num_epochs={NUM_EPOCHS}',
                f'--gpus={gpus}',
            ],
        },
    }
]
This does not produce the error, and successfully begins training:
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_NUM,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": IMAGE_URI,
            'command': ['sh', '-euc', f'''
python -m train_task --per_gpu_batch_size={PER_GPU_BATCH_SIZE} \
    --model_name={MODEL_NAME} --train_dir={TRAIN_DATA} \
    --valid_dir={VALID_DATA} \
    --schema={SCHEMA_PATH} \
    --workflow_dir={WORKFLOW_DIR} \
    --max_iter={MAX_ITERATIONS} --num_epochs={NUM_EPOCHS} --gpus={gpus}
'''],
        },
    }
]
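For completeness, both specs are submitted as a Vertex AI custom job in the same way, roughly like this (a sketch; the display name, project, region, and staging bucket are placeholders, not from my actual pipeline):

```python
# Sketch of the job submission; PROJECT_ID, REGION, and STAGING_BUCKET
# are placeholders.
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

job = aiplatform.CustomJob(
    display_name="merlin-train",
    worker_pool_specs=worker_pool_specs,
)
job.run()  # blocks until the training job completes
```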
For reference, here is the complete Dockerfile:
FROM nvcr.io/nvidia/merlin/merlin-tensorflow:22.07
WORKDIR /src
RUN pip install -U pip
RUN pip install google-cloud-bigquery gcsfs cloudml-hypertune
RUN pip install google-cloud-aiplatform kfp
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && apt-get update -y && apt-get install google-cloud-sdk -y
COPY training/* ./
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/compat/lib.real:/usr/local/hugectr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib
Hello @tottenjordan, what is the base driver version on the machine? Is the original picture of the nvidia-smi output from bare metal or from inside the actual container?
Hey @jperez999 - this should be from the container. I think because this is running with bash it shouldn't be a driver problem, right?
I think it has to do with either:
- how `LD_LIBRARY_PATH` is being set/not set
- how the other CUDA artifacts are being found/loaded
@tottenjordan
CUDA artifacts are loaded via /opt/nvidia/nvidia_entrypoint.sh. I built your Dockerfile, and it does not change your entrypoint. This means you should be loading the correct CUDA version and all of its relevant artifacts.
And looking through the container's LD_LIBRARY_PATH, you have some entries doubled/tripled up and an extra path (/usr/local/cuda/compat/lib.real), but you are not missing anything.
Normal tensorflow container's LD_LIBRARY_PATH:
/usr/local/hugectr/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib:/opt/tritonserver/lib
Container built from your dockerfile's LD_LIBRARY_PATH:
/usr/local/hugectr/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib:/opt/tritonserver/lib:/usr/local/cuda/compat/lib.real:/usr/local/hugectr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib
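Purely illustrative: the duplicates are harmless, but if you want a clean value to set explicitly in the Dockerfile, they can be collapsed while preserving order with something like this:

```python
# Illustrative only: print LD_LIBRARY_PATH with duplicate entries collapsed,
# keeping the first occurrence of each path.
import os

entries = os.environ.get("LD_LIBRARY_PATH", "").split(":")
print(":".join(e for e in dict.fromkeys(entries) if e))
```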
Next step will be to attempt to repro your issue by producing a synthetic dataset with the same attributes as your dataset and running the same training task.
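Something along these lines (a rough sketch; generate_data and the "e-commerce" preset are from merlin-models' synthetic data utilities, and the row count is a placeholder):

```python
# Sketch: build a synthetic train/valid split to reproduce the training run.
# Passing the real schema instead of the "e-commerce" preset would match the
# original dataset's attributes more closely.
from merlin.datasets.synthetic import generate_data

train, valid = generate_data("e-commerce", 100_000, set_sizes=(0.8, 0.2))
# ...then feed train/valid into the same training task as above.
```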
@tottenjordan is this still an issue? can we close the ticket?
thanks @rnyak