Fail to detect GPU on Bottlerocket v1.19 within AWS g4dn instance

Open Discipe opened this issue 9 months ago • 8 comments

We are running Bottlerocket on an AWS EKS g4dn instance. Because we share a single GPU instance across multiple pods, we specify only CPU and memory resources for our pods and no GPU resource. Example:

resources:
  limits:
    memory: 1000Mi
  requests:
    cpu: 100m
    memory: 1000Mi

It worked fine with Bottlerocket 1.17 and stopped working on 1.19 (we didn't test it on 1.18).
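To make explicit what the resource spec above does and does not ask for, here is a minimal sketch that lists the resource names it mentions. The helper `requested_resources` is hypothetical, and `nvidia.com/gpu` is assumed to be the GPU resource name in use (it is the name advertised by the NVIDIA device plugin); neither appears in the original report.

```python
# Assumed GPU resource name (NVIDIA device plugin); not taken from the report.
GPU_RESOURCE = "nvidia.com/gpu"


def requested_resources(resources):
    """Return the set of resource names listed under limits or requests."""
    names = set()
    for section in ("limits", "requests"):
        names.update(resources.get(section, {}))
    return names


# The spec from this report, written as a Python dict.
pod_resources = {
    "limits": {"memory": "1000Mi"},
    "requests": {"cpu": "100m", "memory": "1000Mi"},
}

if __name__ == "__main__":
    names = requested_resources(pod_resources)
    print("resources named in the spec:", sorted(names))
    print("asks for a GPU resource:", GPU_RESOURCE in names)
```

Under these assumptions the spec names only `cpu` and `memory`, so the pod relies on the GPU being visible to the container without an explicit GPU resource.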

Image I'm using: We have a minimal reproduction example that works on 1.17 and breaks on 1.19. The Python script (test_main.py) used in the Dockerfile below:

import torch
import sys
import logging

log_format = "[%(asctime)s][%(levelname)s] %(message)s"
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format=log_format)
logger = logging.getLogger()

class Model:
    def __init__(self):
        self.log = logger
        self.log.info('Init')
        self.log.info('Python version: {}'.format(sys.version))
        self.log.info('Pytorch version: {}'.format(torch.__version__))
        self.log.info('Cuda available: {}'.format(torch.cuda.is_available()))
        t1 = torch.rand(2, 3).to(torch.device('cuda'))
        t2 = torch.rand(3, 2).to(torch.device('cuda'))
        res = torch.matmul(t1, t2)
        self.log.info('Matmul successed (device {})'.format(res.device))


if __name__ == '__main__':
    Model()
    logger.info('Done.')

Dockerfile (yes, this is as short as it gets with all this CUDA stuff):

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

RUN apt-get update --fix-missing && DEBIAN_FRONTEND=noninteractive apt-get upgrade -y

RUN DEBIAN_FRONTEND=noninteractive apt-get install --fix-missing -y \
    software-properties-common \
    wget \
    curl

RUN DEBIAN_FRONTEND=noninteractive apt-get install -yy \
    python3 \
    python3-dev

RUN wget https://bootstrap.pypa.io/pip/get-pip.py && \
    python3 ./get-pip.py && rm ./get-pip.py
RUN pip3 install --upgrade pip

# CUDA 11.7, torch 1.13.1
RUN pip3 install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu117

COPY test_main.py .

ENTRYPOINT ["python3", "test_main.py"]

What is expected to happen:

[2024-05-01 23:04:25,810][INFO] Init
[2024-05-01 23:04:25,810][INFO] Python version: 3.8.10 (default, Nov 22 2023, 10:22:35) 
[GCC 9.4.0]
[2024-05-01 23:04:25,810][INFO] Pytorch version: 1.13.1+cu117
[2024-05-01 23:04:25,954][INFO] Cuda available: True
[2024-05-01 23:04:28,416][INFO] Matmul successed (device cuda:0)
[2024-05-01 23:04:28,416][INFO] Done.

What actually happened:

{"levelname": "INFO", "time": "2024-05-02T19:12:41.737742Z", "message": "Init"}
{"levelname": "INFO", "time": "2024-05-02T19:12:41.737893Z", "message": "Python version: 3.8.10 (default, Nov 22 2023, 10:22:35) \n[GCC 9.4.0]"}
{"levelname": "INFO", "time": "2024-05-02T19:12:41.737960Z", "message": "Pytorch version: 1.13.1+cu117"}
{"levelname": "INFO", "time": "2024-05-02T19:12:41.738159Z", "message": "Cuda available: False"}

<... skip long stack trace ...>

  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
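One way to narrow down an error like this is to check, from inside the failing container, which NVIDIA pieces are visible at all. The following is a rough diagnostic sketch using only the Python standard library, not a confirmed root-cause check. The function name `gpu_visibility_report` is hypothetical, and it assumes the usual setup in which the container runtime exposes device nodes under `/dev/nvidia*`, makes `libcuda` findable by the dynamic linker, and reads the `NVIDIA_VISIBLE_DEVICES` environment variable.

```python
import ctypes.util
import glob
import os


def gpu_visibility_report(dev_glob="/dev/nvidia*", environ=None):
    """Collect basic facts about which NVIDIA pieces this container can see."""
    env = os.environ if environ is None else environ
    return {
        # Device nodes such as /dev/nvidia0 and /dev/nvidiactl, if present.
        "device_nodes": sorted(glob.glob(dev_glob)),
        # Library name if the linker can locate libcuda, otherwise None.
        "libcuda": ctypes.util.find_library("cuda"),
        # Environment variable read by the NVIDIA container runtime, if set.
        "NVIDIA_VISIBLE_DEVICES": env.get("NVIDIA_VISIBLE_DEVICES"),
    }


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

Running this in the same image on 1.17 and on 1.19 and comparing the two reports would show whether the device nodes, the driver library, or both are missing in the failing case.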

How to reproduce the problem: Run the container on Bottlerocket 1.19.

The issue looks similar to #3916

May 02 '24 19:05 Discipe
