
hostRequirements: gpu: optional is broken on Windows 11 and 10

Open sarphiv opened this issue 1 year ago • 11 comments

  • VSCode Version: >=1.84.2
  • Local OS Version: Multiple OS
  • Remote OS Version: ?
  • Remote Extension/Connection Type: Containers and WSL
  • Logs: N/A

Does this issue occur when you try this locally?: Yes
Does this issue occur when you try this locally and all extensions are disabled?: Yes

This issue is a continuation of #9220, which appears to have regressed recently. Read the previous issue for more context.

Steps to Reproduce:

  1. Set up Docker to support CUDA containers according to NVIDIA's official instructions
  2. Create a devcontainer.json with "hostRequirements": { "gpu": "optional" }
  3. Open a devcontainer that is supposed to support CUDA with the above config
  4. Check for CUDA support in PyTorch, or by running nvidia-smi

On Linux Fedora 38 the above works - the container has access to the GPU. On Windows 11 + WSL2 the above does not work. Troubleshooting steps have been described in #9220.

Adding "runArgs": [ "--gpus", "all" ] to devcontainer.json makes Windows 11 + WSL2 work. However, using the runArgs trick breaks the devcontainer for machines without GPUs (confirmed on Windows 11, macOS, and Linux Fedora).

As a temporary workaround, we are therefore currently maintaining two files: .devcontainer/gpu/devcontainer.json and .devcontainer/cpu/devcontainer.json.

sarphiv avatar Jan 11 '24 21:01 sarphiv

What do you get for running docker info -f '{{.Runtimes.nvidia}}' on the command line?

chrmarti avatar Jan 25 '24 09:01 chrmarti

What do you get for running docker info -f '{{.Runtimes.nvidia}}' on the command line? @chrmarti

The team member who experienced the issues on Windows 11 + WSL2 is currently on leave.

However, I found a Windows 10 machine with a GPU that has never had anything Docker- or NVIDIA-container-related installed on it. I installed Docker Desktop with WSL2 support, and oddly enough GPU passthrough appears to be supported by default, so I did nothing further.

Anyways, I ran your command and it gave:

> docker info -f '{{.Runtimes.nvidia}}'
'<no value>'

I guess your suspicion from the previous issue was correct.

To ensure that this machine was also affected by the bug, I created a folder with the following contents. Note that I just took some existing files and started deleting things, so there are probably some unrelated lines in the following:

.devcontainer/devcontainer.json

{
    "name": "Dockerfile devcontainer gpu",
    "build": {
        "context": "..",
        "dockerfile": "Dockerfile"
    },
    "workspaceFolder": "/workspace",
    "workspaceMount": "source=.,target=/workspace,type=bind",
    "hostRequirements": {
        "gpu": "optional"
    },
    "runArgs": [
        "--shm-size=4gb",
        "--gpus=all"
    ]
}

.devcontainer/Dockerfile

# Setup environment basics
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime


# Install packages
RUN apt update -y \
    && apt install -y sudo \
    && apt clean


# Set up user
ARG USERNAME=user
ARG USER_UID=1000
ARG USER_GID=$USER_UID

RUN groupadd --gid $USER_GID $USERNAME \
    && useradd --uid $USER_UID --gid $USER_GID -m $USERNAME \
    && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME \
    && chmod 0440 /etc/sudoers.d/$USERNAME

USER $USERNAME


# Set up working directory
WORKDIR /workspace

# Set up environment variables
ENV PYTHONUNBUFFERED=True

Then I rebuilt and reopened the folder in a devcontainer via VSCode, and ran the following command to confirm I had access to a GPU (I also separately ensured PyTorch had access to CUDA acceleration):

> nvidia-smi

Everything worked perfectly. Afterwards, I commented out the "runArgs" key from the devcontainer.json file and repeated the above. This time nvidia-smi did not work and PyTorch had no CUDA acceleration.
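For completeness, the PyTorch check mentioned above is just the usual one-liner, run inside the container:

# Prints True when PyTorch can reach a CUDA device, False otherwise
python -c "import torch; print(torch.cuda.is_available())"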

sarphiv avatar Jan 25 '24 14:01 sarphiv

Great, what do you get for docker info -f '{{json .}}' on that machine? Thanks.

chrmarti avatar Jan 25 '24 15:01 chrmarti

I'm assuming you meant docker info -f json, because the other command fails. Here's the output.json. Sadly, I don't see any GPU or NVIDIA references in it.

I also checked docker info -f '{{.Runtimes.nvidia}}' on Linux Fedora. Its output contains the string "nvidia-container-runtime", so I guess that's why it works on Linux. I then checked docker info -f json on Linux too, and it does contain the nvidia runtime, so I guess Windows is being weird.
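For anyone comparing the two hosts, listing every configured runtime makes the difference easier to spot than querying only the nvidia key:

# On the working Linux host the output includes an "nvidia" entry;
# on the affected Windows/WSL2 host it does not.
docker info -f '{{json .Runtimes}}'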

sarphiv avatar Jan 25 '24 15:01 sarphiv

We could add a machine-scoped setting to tell us if a GPU is present, absent, or (as the default, like today) should be detected. That would give users a good out-of-the-box experience where the detection works, others could use the setting, and we could gradually (where possible) improve the detection.

chrmarti avatar Jan 30 '24 15:01 chrmarti

I am running into the same issue on my Windows machine. nvidia-smi -L correctly returns the GPU info. docker info doesn't return anything related to the GPU.

Shall we use nvidia-smi to detect the NVIDIA GPU instead?

sidecus avatar Mar 05 '24 13:03 sidecus

I am running into the same issue on my Windows machine. nvidia-smi -L correctly returns the GPU info. docker info doesn't return anything related to the GPU.

Shall we use nvidia-smi to detect the NVIDIA GPU instead?

If we only used nvidia-smi, the detection might then fail on Linux, where you may have the NVIDIA drivers (nvidia-smi works) but not the NVIDIA Container Runtime (no GPU inside containers).
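A sanity check that covers both pieces (host driver and container passthrough) could look like the following sketch; the plain ubuntu image is only an example and works because the NVIDIA runtime injects nvidia-smi into the container:

# 1) Host driver check - should list the GPU(s)
nvidia-smi -L

# 2) Container passthrough check - should print the usual nvidia-smi table;
#    this exercises the NVIDIA Container Runtime, not just the driver
docker run --rm --gpus all ubuntu nvidia-smi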

sarphiv avatar Mar 05 '24 14:03 sarphiv

@chrmarti I am using an Ubuntu 22.04 machine with an NVIDIA GPU (non-WSL), but the hostRequirements: gpu: optional is not working. The output of docker info -f '{{.Runtimes.nvidia}}' is <no value>, indicating that I am experiencing the same issue as in this case. The output of docker info is as follows:

docker-info.json

sangotaro avatar May 22 '24 03:05 sangotaro

Stumbled upon this again in the last few days, after having had a solution via #9220 in January.

I'm now working on a Windows workstation and cannot get a dev container running via WSL with GPU support.

What about the intermediate solution of a machine-specific configuration that @chrmarti mentioned above?

pascal456 avatar Jul 25 '24 22:07 pascal456

I agree that whether or not an SSH server machine can use its GPU in a Docker container should be a setting on the SSH server machine. It doesn't belong on the local machine.

One difficulty with the machine setting is that when connecting through an SSH server (or Tunnel), we can't access its machine settings through VS Code's API, because that only knows the local and the dev container settings (calling these "machine settings"). We can check for and read the machine settings.json in the extension though. /cc @sandy081

chrmarti avatar Jul 26 '24 06:07 chrmarti

Here is my hacky fix for docker compose in the meantime :) https://github.com/microsoft/vscode-remote-release/issues/10124#issuecomment-2304669818

Dev Containers 0.386.0-pre-release adds a user setting (GPU Availability) to override the automatic detection of a GPU.

chrmarti avatar Sep 11 '24 05:09 chrmarti

@chrmarti DevContainers v0.386.0 (pre-release)

Hello,

It seems that this feature is still broken (v0.386.0). If I create a remote machine (GCP) with a GPU and a fully installed NVIDIA stack, I can build and run the devcontainer using

"hostRequirements": {
    "gpu": "optional"
},

But if I remove the GPU from my remote machine, I can't start the Docker container anymore, as it claims to have detected a GPU even though no GPU is attached:

Output of the devcontainer console is:

[21551 ms] Start: Run: docker info -f {{.Runtimes.nvidia}}
[21755 ms] GPU support found, add GPU flags to docker call.
...

If I run the command you use in your TypeScript code on the machine (no GPU attached anymore), I get: {nvidia-container-runtime [] }

I think you are just checking whether the nvidia-container-runtime is available, but not whether an actual GPU is attached:

const runtimeFound = result.stdout.includes('nvidia-container-runtime');

So,

export async function extraRunArgs(common: ResolverParameters, params: DockerResolverParameters, config: DevContainerFromDockerfileConfig | DevContainerFromImageConfig) {
    const extraArguments: string[] = [];
    if (config.hostRequirements?.gpu) {
        if (await checkDockerSupportForGPU(params)) {
            common.output.write(`GPU support found, add GPU flags to docker call.`);
            extraArguments.push('--gpus', 'all');
        } else {
            if (config.hostRequirements?.gpu !== 'optional') {
                common.output.write('No GPU support found yet a GPU was required - consider marking it as "optional"', LogLevel.Warning);
            }
        }
    }
    return extraArguments;
}

This will add --gpus all if the runtime is available, even if no GPU is attached. Unfortunately the container won't start if --gpus all is given but no GPU is attached to the computer. Am I missing something here?
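A minimal sketch of the kind of combined check being suggested here (not the extension's actual code): only treat the GPU as usable when the nvidia runtime is configured and the host driver actually sees a device.

#!/bin/sh
# Hypothetical combined check: nvidia runtime configured AND a GPU visible to the driver
if docker info -f '{{.Runtimes.nvidia}}' | grep -q nvidia-container-runtime \
    && nvidia-smi -L > /dev/null 2>&1; then
    echo "GPU usable in containers - safe to pass --gpus all"
else
    echo "no usable GPU - skip --gpus all"
fi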

maro-otto avatar Sep 23 '24 10:09 maro-otto

@maro-otto Good catch, I'll open a new issue for this. Thanks.

chrmarti avatar Sep 24 '24 07:09 chrmarti

Hello! Are users able to verify that this works (minus the new bug caught by @maro-otto)?

eleanorjboyd avatar Sep 26 '24 17:09 eleanorjboyd

Also @chrmarti, if no user is able to, could you clarify the steps? The originally filed issue is comprehensive, but I was wondering whether this can be tested without setting up CUDA containers according to NVIDIA's instructions (since it seems like the setting would apply in other dev container scenarios). Thanks!

eleanorjboyd avatar Sep 26 '24 17:09 eleanorjboyd

Without a GPU, I suggest setting GPU Availability to all and verifying that a new dev container with "hostRequirements": { "gpu": "optional" } tries to enable the GPU for the container and fails.

With a GPU, you could set GPU Availability to none and verify that such a dev container indeed does not get the GPU (cross-check that it does get the GPU with all).
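One way to confirm what was actually passed to Docker (independent of what nvidia-smi reports inside the container) is to inspect the created container for GPU device requests; the container name below is a placeholder:

# An empty or null result means no --gpus flag was passed for this container
docker inspect <container-name-or-id> -f '{{json .HostConfig.DeviceRequests}}'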

chrmarti avatar Sep 27 '24 06:09 chrmarti

My laptop has a GPU. When GPU Availability is set to none, the dev container with optional gpu host requirements still gets a GPU:

[2024-09-27T17:32:07.572Z] GPU support found, add GPU flags to docker call.

Host: Windows
Remote: Node.js & JavaScript container

rzhao271 avatar Sep 27 '24 17:09 rzhao271

@rzhao271 Could you rebuild the container and append the log from that? (F1 > Dev Containers: Show Container Log)

chrmarti avatar Sep 30 '24 10:09 chrmarti

Closing this issue. GPU Availability had to be set to none within the WSL settings, not the User settings.

rzhao271 avatar Oct 02 '24 15:10 rzhao271