
[BUG] merlin models on vertex ai training - cuda error


Describe the bug

I've been able to train a Merlin model in a Vertex Notebook (using the Merlin base image). Now I'm trying to train the same model in Vertex AI Training. When training begins, I get this error:

numba.cuda.cudadrv.driver.LinkerError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR
ptxas application ptx input, line 9; fatal   : Unsupported .version 7.7; current version is '7.4'

Steps/Code to reproduce bug

FROM nvcr.io/nvidia/merlin/merlin-tensorflow:22.07

Expected behavior

Training executes successfully.

Environment details (please complete the following information):

  • Environment location: Google Cloud, Vertex AI Training Job
  • Driver:

[nvidia-smi output screenshot]

Additional context
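The ptxas line suggests a toolkit/driver mismatch: PTX ISA 7.7 is what the CUDA 11.7 toolkit in merlin-tensorflow:22.07 emits, while a driver that only understands PTX 7.4 would be from the CUDA 11.4 era, so the driver-side JIT rejects the newer PTX. As a quick sanity check, something like this could be run inside the training job (an untested sketch of my own; it assumes nvidia-smi and nvcc are on PATH in the container):

import subprocess

# Host/driver side: the kernel driver version the container is running against.
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("host driver version:", driver)

# Toolkit side: the CUDA compiler baked into the image (this is what emits the PTX).
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)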

tottenjordan avatar Aug 23 '22 23:08 tottenjordan

Actually, I'm confused about the driver version. You can see the results from nvidia-smi above.

However, when I run env | grep CUDA I get the following:

env | grep CUDA
NVIDIA_REQUIRE_CUDA=cuda>=9.0
CUDA_CACHE_DISABLE=1
_CUDA_COMPAT_STATUS=CUDA Driver OK
CUDA_CUDA_LIBRARY=/usr/local/cuda/lib64/stubs
CUDA_VERSION=11.7.1.014
CUDA_PATH=/usr/local/cuda
CUDA_DRIVER_VERSION=515.48.08
_CUDA_COMPAT_PATH=/usr/local/cuda/compat
CUDA_HOME=/usr/local/cuda

So the driver version here seems to be 515.48.08?
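As far as I can tell, CUDA_DRIVER_VERSION is baked into the NGC image at build time (it tracks the compat driver the image ships under /usr/local/cuda/compat), while nvidia-smi reports the host's installed driver, so the two can legitimately differ. To see which libcuda the process actually loads and what it supports, a small ctypes sketch (my own, not from the Merlin codebase) could help:

import ctypes

# Load whichever libcuda the dynamic linker resolves (host driver or the
# forward-compat copy under /usr/local/cuda/compat) and ask it for the
# maximum CUDA version it supports.
libcuda = ctypes.CDLL("libcuda.so.1")
version = ctypes.c_int()
libcuda.cuDriverGetVersion(ctypes.byref(version))
print(f"libcuda supports CUDA {version.value // 1000}.{(version.value % 1000) // 10}")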

tottenjordan avatar Aug 24 '22 15:08 tottenjordan

I'm not sure why this makes a difference, but here is what I've done to get around this error and start model training:

The two worker pool specs below use command and args differently.

This first spec results in the error:

worker_pool_specs =  [
    {
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_NUM,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": IMAGE_URI,
            "command": ["python", "-m", "train_task"],
            "args": [
                f'--per_gpu_batch_size={PER_GPU_BATCH_SIZE}',
                f'--model_name={MODEL_NAME}',
                f'--train_dir={TRAIN_DATA}',
                f'--valid_dir={VALID_DATA}',
                f'--schema={SCHEMA_PATH}',
                f'--workflow_dir={WORKFLOW_DIR}',
                f'--max_iter={MAX_ITERATIONS}',
                f'--num_epochs={NUM_EPOCHS}',
                f'--gpus={gpus}',
            ],
        },
    }
]

This one does not produce the error and successfully begins training:

worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_NUM,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": IMAGE_URI,
            "command": ["sh", "-euc", f'''
                python -m train_task --per_gpu_batch_size={PER_GPU_BATCH_SIZE} \
                    --model_name={MODEL_NAME} --train_dir={TRAIN_DATA} \
                    --valid_dir={VALID_DATA} \
                    --schema={SCHEMA_PATH} \
                    --workflow_dir={WORKFLOW_DIR} \
                    --max_iter={MAX_ITERATIONS} --num_epochs={NUM_EPOCHS} --gpus={gpus}
            '''],
        },
    }
]
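Not verified here, but a third variant might be worth comparing: if I remember the Vertex AI container_spec semantics correctly, command overrides the image ENTRYPOINT, while args alone are passed to it, so leaving command unset should keep the NGC entrypoint (/opt/nvidia/nvidia_entrypoint.sh) in the startup path. A sketch using the same variables as above:

# Untested sketch: no "command", so the image ENTRYPOINT still runs and
# receives these args; variable names follow the specs above.
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_NUM,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": IMAGE_URI,
            "args": [
                "python", "-m", "train_task",
                f"--per_gpu_batch_size={PER_GPU_BATCH_SIZE}",
                f"--model_name={MODEL_NAME}",
                f"--train_dir={TRAIN_DATA}",
                f"--valid_dir={VALID_DATA}",
                f"--schema={SCHEMA_PATH}",
                f"--workflow_dir={WORKFLOW_DIR}",
                f"--max_iter={MAX_ITERATIONS}",
                f"--num_epochs={NUM_EPOCHS}",
                f"--gpus={gpus}",
            ],
        },
    }
]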

For reference, here is the complete Dockerfile for the image:


FROM nvcr.io/nvidia/merlin/merlin-tensorflow:22.07

WORKDIR /src

RUN pip install -U pip
RUN pip install google-cloud-bigquery gcsfs cloudml-hypertune
RUN pip install google-cloud-aiplatform kfp
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg  add - && apt-get update -y && apt-get install google-cloud-sdk -y

COPY training/* ./

ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/compat/lib.real:/usr/local/hugectr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib

tottenjordan avatar Aug 24 '22 20:08 tottenjordan

Hello @tottenjordan, what is the base driver version on the machine? Is that original nvidia-smi screenshot from bare metal or from inside the actual container?

jperez999 avatar Aug 25 '22 23:08 jperez999

Hey @jperez999 - this should be the container. I think because this is running through a shell (sh -euc), it shouldn't be a driver problem, right?

I think it has to do with either of the following (a quick check is sketched below):

  • how LD_LIBRARY_PATH is being set/not set
  • how the other CUDA artifacts are being found/loaded
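A minimal logging sketch (my addition, not part of train_task today) that could go at the top of the training script so the Vertex job logs show exactly what the process sees:

import os

# Print the library search path and the NGC compat status as the job starts,
# so they show up in the Vertex AI Training logs.
for var in ("LD_LIBRARY_PATH", "_CUDA_COMPAT_STATUS", "CUDA_DRIVER_VERSION"):
    print(f"{var}={os.environ.get(var)}")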

tottenjordan avatar Sep 01 '22 20:09 tottenjordan

@tottenjordan

CUDA artifacts are loaded via /opt/nvidia/nvidia_entrypoint.sh. I built your Dockerfile, and it does not change your entrypoint. This means you should be loading the correct CUDA version and all of its relevant artifacts.
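For anyone who wants to double-check that locally, a quick sketch (assumes Docker is available and the image was built and tagged locally; the tag below is a placeholder):

import json, subprocess

IMAGE_URI = "merlin-vertex-train:latest"  # placeholder local tag

# Inspect the built image and confirm the ENTRYPOINT is still the NGC script.
entrypoint = subprocess.run(
    ["docker", "inspect", "--format", "{{json .Config.Entrypoint}}", IMAGE_URI],
    capture_output=True, text=True, check=True,
).stdout
print(json.loads(entrypoint))  # expected (per the comment above): ["/opt/nvidia/nvidia_entrypoint.sh"]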

Looking through the container's LD_LIBRARY_PATH, you have some paths doubled or tripled up and one extra path, /usr/local/cuda/compat/lib.real, but you are not missing anything.

The stock merlin-tensorflow container's LD_LIBRARY_PATH:

/usr/local/hugectr/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib:/opt/tritonserver/lib

The LD_LIBRARY_PATH in the container built from your Dockerfile:

/usr/local/hugectr/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib:/opt/tritonserver/lib:/usr/local/cuda/compat/lib.real:/usr/local/hugectr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib
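A small helper (my own sketch) to de-duplicate the path while preserving order, which makes it easier to see what the Dockerfile's ENV line actually adds on top of the stock image:

import os

def dedupe_path(path: str) -> str:
    """Drop repeated entries from a colon-separated path, keeping first occurrences."""
    seen, out = set(), []
    for entry in path.split(":"):
        if entry and entry not in seen:
            seen.add(entry)
            out.append(entry)
    return ":".join(out)

print(dedupe_path(os.environ.get("LD_LIBRARY_PATH", "")))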

The next step will be to try to reproduce your issue by generating a synthetic dataset with the same attributes as yours and running the same training task.

jperez999 avatar Sep 18 '22 00:09 jperez999

@tottenjordan is this still an issue? Can we close the ticket?

rnyak avatar Oct 26 '22 15:10 rnyak

thanks @rnyak

tottenjordan avatar Oct 27 '22 17:10 tottenjordan