[BUG] merlin models on vertex ai training - cuda error
Describe the bug
I've been able to train a Merlin model in a Vertex Notebook (using the Merlin base image). Now I'm trying to train the same model with Vertex AI Training. When training begins, I get this error:
numba.cuda.cudadrv.driver.LinkerError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR
ptxas application ptx input, line 9; fatal : Unsupported .version 7.7; current version is '7.4'
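For context on where this comes from: the LinkerError is raised while numba links freshly generated PTX through the CUDA driver's JIT linker, and the "Unsupported .version 7.7; current version is '7.4'" message suggests PTX produced by a newer toolkit (CUDA 11.7) than the driver stack available to the job can link. A tiny standalone kernel (an illustrative sketch, not taken from the training code) exercises the same path inside the container:

```python
# Sketch only: minimal numba CUDA kernel to check whether the container's
# driver/compat setup accepts the PTX that numba generates at runtime.
import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.shape[0]:
        x[i] += 1

d_x = cuda.to_device(np.zeros(32, dtype=np.float32))
add_one[1, 32](d_x)              # JIT compile + link happens here
print(d_x.copy_to_host()[:4])    # prints [1. 1. 1. 1.] when CUDA is healthy
```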
Steps/Code to reproduce bug
- I'm building a Dockerfile (complete version further below), using the following as a base image:
FROM nvcr.io/nvidia/merlin/merlin-tensorflow:22.07
- The error occurs in the `model.fit()` call in `training_task.py` (a simplified sketch of that call follows this list)
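A simplified sketch of the relevant part of the training script (the model architecture, column names, and placeholder values here are illustrative, not the exact code):

```python
# Sketch only: placeholder DLRM model built with merlin.models.tf.
# These values stand in for the CLI flags passed via worker_pool_specs below.
TRAIN_DATA = "gs://<bucket>/train"
VALID_DATA = "gs://<bucket>/valid"
PER_GPU_BATCH_SIZE = 1024
NUM_EPOCHS = 1

import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset(TRAIN_DATA, engine="parquet")
valid = Dataset(VALID_DATA, engine="parquet")

model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([128, 64, 32]),
    prediction_tasks=mm.BinaryClassificationTask("click"),  # placeholder target
)
model.compile(optimizer="adam")
# The LinkerError above is raised inside this call.
model.fit(train, validation_data=valid,
          batch_size=PER_GPU_BATCH_SIZE, epochs=NUM_EPOCHS)
```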
Expected behavior
Training executes successfully.
Environment details (please complete the following information):
- Environment location: Google Cloud, Vertex AI Training Job
- Driver: (nvidia-smi output attached as a screenshot)
Additional context
Actually, I'm confused about the driver version. You can see the results from nvidia-smi in the screenshot above.
However, when I run `env | grep CUDA` I get the following:
env | grep CUDA
NVIDIA_REQUIRE_CUDA=cuda>=9.0
CUDA_CACHE_DISABLE=1
_CUDA_COMPAT_STATUS=CUDA Driver OK
CUDA_CUDA_LIBRARY=/usr/local/cuda/lib64/stubs
CUDA_VERSION=11.7.1.014
CUDA_PATH=/usr/local/cuda
CUDA_DRIVER_VERSION=515.48.08
_CUDA_COMPAT_PATH=/usr/local/cuda/compat
CUDA_HOME=/usr/local/cuda
So here the driver version seems to be 515.48.08?
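A quick way to untangle the two numbers from inside the container (a sketch; it assumes nvidia-smi and nvcc are both on PATH):

```python
# Sketch only: print the driver the job actually runs on vs. the CUDA toolkit
# the image was built with. CUDA_DRIVER_VERSION above is presumably the driver
# version the image's compat libraries target, not necessarily the driver
# installed on the Vertex VM.
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

print("host driver  :", run(["nvidia-smi", "--query-gpu=driver_version",
                             "--format=csv,noheader"]))
print("image toolkit:", run(["nvcc", "--version"]).splitlines()[-1])
```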
I'm not sure why this makes a difference, but here is what I've done to get around this error and start model training. The two `worker_pool_specs` below use `command` and `args` differently.
This first one results in the error:
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_NUM,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": IMAGE_URI,
            "command": ["python", "-m", "train_task"],
            "args": [
                f'--per_gpu_batch_size={PER_GPU_BATCH_SIZE}',
                f'--model_name={MODEL_NAME}',
                f'--train_dir={TRAIN_DATA}',
                f'--valid_dir={VALID_DATA}',
                f'--schema={SCHEMA_PATH}',
                f'--workflow_dir={WORKFLOW_DIR}',
                f'--max_iter={MAX_ITERATIONS}',
                f'--num_epochs={NUM_EPOCHS}',
                f'--gpus={gpus}',
            ],
        },
    }
]
This does not produce the error, and successfully begins training:
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_NUM,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": IMAGE_URI,
            'command': ['sh', '-euc', f'''
python -m train_task --per_gpu_batch_size={PER_GPU_BATCH_SIZE} \
    --model_name={MODEL_NAME} --train_dir={TRAIN_DATA} \
    --valid_dir={VALID_DATA} \
    --schema={SCHEMA_PATH} \
    --workflow_dir={WORKFLOW_DIR} \
    --max_iter={MAX_ITERATIONS} --num_epochs={NUM_EPOCHS} --gpus={gpus}
'''],
        },
    }
]
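For completeness, both specs are submitted as a Vertex AI custom job in the same way, roughly like this (a sketch; the display name, project, region, and staging bucket are placeholders, not from my actual pipeline):

```python
# Sketch of the job submission; PROJECT_ID, REGION, and STAGING_BUCKET
# are placeholders.
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

job = aiplatform.CustomJob(
    display_name="merlin-train",
    worker_pool_specs=worker_pool_specs,
)
job.run()  # blocks until the training job completes
```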
For reference, here is the complete Dockerfile:
FROM nvcr.io/nvidia/merlin/merlin-tensorflow:22.07
WORKDIR /src
RUN pip install -U pip
RUN pip install google-cloud-bigquery gcsfs cloudml-hypertune
RUN pip install google-cloud-aiplatform kfp
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && apt-get update -y && apt-get install google-cloud-sdk -y
COPY training/* ./
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/compat/lib.real:/usr/local/hugectr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib
Hello @tottenjordan, what is the base driver version on the machine? Is the original picture of the nvidia-smi output from bare metal or from inside the actual container?
Hey @jperez999 - this should be from the container. I think because this is running with bash it shouldn't be a driver problem, right?
I think it has to do with either:
- how `LD_LIBRARY_PATH` is being set/not set
- how the other CUDA artifacts are being found/loaded
@tottenjordan
CUDA artifacts are loaded via /opt/nvidia/nvidia_entrypoint.sh. I built your Dockerfile, and it does not change your entrypoint. This means you should be loading the correct CUDA version and all of its relevant artifacts.
And looking through the container's LD_LIBRARY_PATH, you have some entries doubled/tripled up and an extra path (/usr/local/cuda/compat/lib.real), but you are not missing anything.
Normal tensorflow container's LD_LIBRARY_PATH:
/usr/local/hugectr/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib:/opt/tritonserver/lib
Container built from your dockerfile's LD_LIBRARY_PATH:
/usr/local/hugectr/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib:/opt/tritonserver/lib:/usr/local/cuda/compat/lib.real:/usr/local/hugectr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib
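Purely illustrative: the duplicates are harmless, but if you want a clean value to set explicitly in the Dockerfile, they can be collapsed while preserving order with something like this:

```python
# Illustrative only: print LD_LIBRARY_PATH with duplicate entries collapsed,
# keeping the first occurrence of each path.
import os

entries = os.environ.get("LD_LIBRARY_PATH", "").split(":")
print(":".join(e for e in dict.fromkeys(entries) if e))
```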
Next step will be to attempt to repro your issue by producing a synthetic dataset with the same attributes as your dataset and running the same training task.
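Something along these lines (a rough sketch; generate_data and the "e-commerce" preset are from merlin-models' synthetic data utilities, and the row count is a placeholder):

```python
# Sketch: build a synthetic train/valid split to reproduce the training run.
# Passing the real schema instead of the "e-commerce" preset would match the
# original dataset's attributes more closely.
from merlin.datasets.synthetic import generate_data

train, valid = generate_data("e-commerce", 100_000, set_sizes=(0.8, 0.2))
# ...then feed train/valid into the same training task as above.
```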
@tottenjordan is this still an issue? can we close the ticket?
thanks @rnyak