
Azure Batch Ubuntu-HPC 22.04 pool with containers cannot detect CUDA GPUs

Open soniadasfaro opened this issue 8 months ago • 4 comments

Problem Description

We have an Azure Batch pool configured as follows:

OS: Linux
NodeAgentSKUId: batch.node.ubuntu 22.04
ImageReference:

  {
    "publisher": "microsoft-dsvm",
    "offer":     "ubuntu-hpc",
    "sku":       "2204",
    "version":   "latest"
  }

ContainerConfiguration: Docker-compatible enabled.
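
For reference, a minimal sketch of how a pool like this could be created with the Azure CLI, assuming the REST-style pool JSON accepted by az batch pool create --json-file; the pool ID, node count, and container image name are placeholders, not our exact configuration:

# Sketch only: assumes a prior "az batch account login" (or account key/endpoint flags).
cat > pool.json <<'EOF'
{
  "id": "gpu-container-pool",
  "vmSize": "STANDARD_NC8as_T4_V3",
  "targetDedicatedNodes": 1,
  "virtualMachineConfiguration": {
    "imageReference": {
      "publisher": "microsoft-dsvm",
      "offer": "ubuntu-hpc",
      "sku": "2204",
      "version": "latest"
    },
    "nodeAgentSKUId": "batch.node.ubuntu 22.04",
    "containerConfiguration": {
      "type": "dockerCompatible",
      "containerImageNames": ["nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04"]
    }
  }
}
EOF
az batch pool create --json-file pool.json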

Actual Results

This pool was created to replace our Ubuntu 20.04 nodes, which reach the end of Ubuntu standard support on 31 May 2025. After the upgrade, any container task that tries to invoke CUDA code (e.g. via nvidia-smi or an FFmpeg/CUDA pipeline) logs the error shown below.

Additional Logs

libevent use pthreads: 0
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/app/VModeApp/vmode/stitched.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    creation_time   : 2024-01-29T10:48:33.000000Z
    encoder         : Lavf60.3.100
  Duration: 00:00:10.44, start: 0.000000, bitrate: 139581 kb/s
  Stream #0:0[0x1](und): Video: h264 (Constrained Baseline) (avc1 / 0x31637661), yuvj420p(pc, bt709/unknown/unknown, progressive), 7680x3840, 137474 kb/s, 29.97 fps, 29.97 tbr, 90k tbn (default)
      Metadata:
        creation_time   : 2024-01-29T10:48:33.000000Z
        handler_name    : VideoHandler
        vendor_id       : [0][0][0][0]
      Side data:
        stereo3d: 2D, view: packed, primary eye: none
        spherical: equirectangular 
[swscaler @ 0x4d24c00] deprecated pixel format used, make sure you did set range correctly
/usr/local/vcpkg/buildtrees/popsift/src/v0.9-f30485bff3.clean/src/popsift/common/device_prop.cu:23
    Cannot get the current CUDA device: no CUDA-capable device is detected
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds
Press Ctrl-C to quit

Additional Comments

Despite running on N-series VMs and using the verified DSVM HPC image, the containerized workload cannot see the GPU.

We need guidance on why the Ubuntu 22.04 DSVM HPC image isn’t exposing CUDA devices within containers and how to resolve it.
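
For reference, this is the kind of check that would separate a host-driver problem from a container-runtime problem; a minimal sketch, assuming SSH access to a compute node, Docker 19.03+, and the NVIDIA Container Toolkit on the image (the CUDA image tag is illustrative):

# On a compute node (e.g. via SSH):
nvidia-smi                      # host driver should list the Tesla T4
lsmod | grep -i nvidia          # NVIDIA kernel modules should be loaded
docker run --rm --gpus all \
  nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi   # the same GPU should appear inside the container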

Thanks in advance!

soniadasfaro avatar Apr 23 '25 20:04 soniadasfaro

Can you please confirm which VM family (and specific VM size) you are attempting to use with this image?

alfpark avatar Apr 24 '25 17:04 alfpark

Can you please confirm which VM family (and specific VM size) you are attempting to use with this image?

@alfpark Sure, the VM size I am trying to use is STANDARD_NC8as_T4_V3.

soniadasfaro avatar Apr 25 '25 06:04 soniadasfaro

@soniadasfaro Before the upgrade to the DSVM Ubuntu 22.04 image, what image were you using? The 20.04 DSVM image?

staer avatar Apr 28 '25 16:04 staer

A few things that need to be verified:

  1. Do you see the GPU on the host VM? You can log in to the machine and run nvidia-smi, or run it as a task with elevated privilege. You should see something like:
Mon Apr 28 17:39:06 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000001:00:00.0 Off |                  Off |
| N/A   33C    P8             11W /   70W |       1MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  2. If not, what is shown for the nvidia driver during boot? You can type dmesg | grep nvidia. Please paste the output. You should see something like:
[    6.007814] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[    6.018518] nvidia 0001:00:00.0: enabling device (0000 -> 0002)
[    6.312465] nvidia 0001:00:00.0: can't derive routing for PCI INT A
[    6.312468] nvidia 0001:00:00.0: PCI INT A: no GSI
[    6.516227] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  560.35.03  Release Build  (dvs-builder@U16-I1-N07-12-3)  Fri Aug 16 21:22:33 UTC 2024
[    6.711838] [drm] [nvidia-drm] [GPU ID 0x00010000] Loading driver
[    6.711841] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0001:00:00.0 on minor 1
  3. Can you run a standard container task to validate? You can run this container nvcr.io/nvidia/k8s/cuda-sample:nbody with this command line nbody -gpu -benchmark (see the sketch after this list for one way to submit it). The stdout.txt file should end with:
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Turing" with compute capability 7.5

> Compute 7.5 CUDA device: [Tesla T4]
40960 bodies, total time for 10 iterations: 92.453 ms
= 181.468 billion interactions per second
= 3629.354 single-precision GFLOP/s at 20 flops per interaction
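
For step 3, a minimal sketch of submitting that validation container as a Batch task with the Azure CLI; it assumes the CLI is already logged in to the Batch account and that a job (here called gpu-validation-job) exists on the pool, so the IDs are placeholders:

cat > gpu-check-task.json <<'EOF'
{
  "id": "nbody-gpu-check",
  "commandLine": "nbody -gpu -benchmark",
  "containerSettings": {
    "imageName": "nvcr.io/nvidia/k8s/cuda-sample:nbody"
  }
}
EOF
az batch task create --job-id gpu-validation-job --json-file gpu-check-task.json
# After the task completes, stdout.txt should end with the nbody summary shown above.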

alfpark avatar Apr 28 '25 17:04 alfpark

Closing due to no response. Please see #174 for more info.

alfpark avatar May 27 '25 21:05 alfpark