text-generation-inference
llama3-70B-Instruct-AWQ causing CUDA error: an illegal memory access was encountered
System Info
Hello Team,
I am using the following command to load the AWQ-quantized version of the Llama 3 70B model on a 4 x A100 (40GB) GCP machine. I cannot increase --max-batch-prefill-tokens, because doing so triggers CUDA error: an illegal memory access was encountered. I also see via nvidia-smi that the server is nowhere near consuming the full GPU memory, yet it still hits the illegal memory access error.
# Load LLama 3 casperhansen/llama-3-70b-instruct-awq
DOCKER_IMAGE=ghcr.io/huggingface/text-generation-inference:2.0.2
CONTAINER_NAME=eval_llama_3
HF_TOKEN=<my-token>
CUDA_VISIBLE_DEVICES=0,1,2,3
MODEL_ID="casperhansen/llama-3-70b-instruct-awq"
QUANTIZE=awq
VOLUME=~/.cache/huggingface/hub
docker run --rm \
--name ${CONTAINER_NAME} \
--shm-size 4g \
--env HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
--env CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
-p 8080:80 \
-v ${VOLUME}:/data \
--gpus all \
$DOCKER_IMAGE \
--model-id ${MODEL_ID} \
--num-shard 4 \
--sharded true \
--max-concurrent-requests 3 \
--max-batch-prefill-tokens 24000 \
--max-stop-sequences 20 \
--trust-remote-code \
--quantize ${QUANTIZE}
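For what it's worth, this is roughly how I collect the failure details. It is standard docker/CUDA tooling, nothing TGI-specific, and the log file name is just an example:

# Stream the server logs to a file while the container runs, so the full
# traceback from the crashing shard is kept even though the container uses --rm.
docker logs -f ${CONTAINER_NAME} 2>&1 | tee tgi_crash.log

# Optionally rerun with synchronous kernel launches so the illegal memory
# access is reported at the faulting call instead of at a later CUDA call.
# (Generic CUDA/PyTorch setting; add it to the docker run command above.)
#   --env CUDA_LAUNCH_BLOCKING=1 \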
The GPUs are not even half utilized, though:
Wed May  8 09:32:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0              58W / 400W |  13321MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P0              75W / 400W |  13465MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          Off | 00000000:00:06.0 Off |                    0 |
| N/A   34C    P0              71W / 400W |  13465MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          Off | 00000000:00:07.0 Off |                    0 |
| N/A   35C    P0              70W / 400W |  13321MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     26648      C   /opt/conda/bin/python3.10                 13312MiB |
|    1   N/A  N/A     26649      C   /opt/conda/bin/python3.10                 13456MiB |
|    2   N/A  N/A     26650      C   /opt/conda/bin/python3.10                 13456MiB |
|    3   N/A  N/A     26652      C   /opt/conda/bin/python3.10                 13312MiB |
+---------------------------------------------------------------------------------------+
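The table above is a single snapshot; to watch memory while the server loads and warms up, a loop like the following (standard nvidia-smi query flags) can be used:

# Log per-GPU memory and utilization every 5 seconds during startup/warmup.
nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu \
           --format=csv -l 5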
Information
- [ ] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
Steps are provided in the problem description above.
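For completeness, once the server is up it is exercised with plain /generate requests of roughly this shape (placeholder prompt and parameters; the exact content does not matter):

# Example request against the TGI /generate endpoint exposed on port 8080.
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'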
Expected behavior
The model should load and run with --max-batch-prefill-tokens 24000 without hitting a CUDA illegal memory access error.