
Is it possible to set `--pytorch-computation-backend` using an environment variable?

Open amitkparekh opened this issue 3 years ago • 8 comments

Hi! I'm using this package within a Dockerfile to automatically install the correct CUDA version for the system but I'm having issues getting the build process to recognise the GPUs when building.

Is it possible to set an environment variable for the pytorch-computation-backend, so that if the option is not provided but an environment variable called PYTORCH_COMPUTATION_BACKEND exists, ltt will use that?

amitkparekh avatar Oct 19 '22 08:10 amitkparekh

Hey @amitkparekh :wave:

I'm using this package within a Dockerfile to automatically install the correct CUDA version for the system but I'm having issues getting the build process to recognise the GPUs when building.

Do you mean you install the correct PyTorch binary for the system? Because light-the-torch does not install CUDA. Could you elaborate on what you are trying to build?

pmeier avatar Oct 19 '22 08:10 pmeier

Sorry, that was poor wording on my part. Let me clarify properly!

My current understanding of light-the-torch is that by providing the --pytorch-computation-backend option, it will install that variant of torch. For example, if I ran ltt install --pytorch-computation-backend=cu113 torch, it would take the version of torch I currently have installed and install the CUDA 11.3 build of it, e.g. going from torch==1.11.0 to torch==1.11.0+cu113.

For my current setup, I am using Poe the Poet to add a postinstall hook to Poetry so that it will automatically install the torch version that best matches the computation backend. However, when running ltt during the build process of a Dockerfile, it is not able to detect the current CUDA version from the host machine.

Our production environment needs torch with CUDA 11.3, but I do not want to add the --pytorch-computation-backend=cu113 option within the postinstall hook directly, as not all developers have machines which support that CUDA version.

My current workaround within the Dockerfile is to do the following (after installing the project dependencies):

RUN TORCH_VERSION="$(pip show torch | grep Version | cut -d ':' -f2 | xargs)${TORCH_VERSION_SUFFIX}" \
	&& TORCHVISION_VERSION="$(pip show torchvision | grep Version | cut -d ':' -f2 | xargs)${TORCH_VERSION_SUFFIX}" \
	&& pip install --no-cache-dir torch=="${TORCH_VERSION}" torchvision=="${TORCHVISION_VERSION}" -f https://download.pytorch.org/whl/torch_stable.html

To get around this issue and simplify the developer experience, I'd like to specify the pytorch-computation-backend option using an environment variable, such that if the --pytorch-computation-backend option does not exist but an environment variable called PYTORCH_COMPUTATION_BACKEND does exist, it will parse the option from the environment variable.

I hope that makes more sense?

amitkparekh avatar Oct 19 '22 09:10 amitkparekh

For example, if I ran ltt install --pytorch-computation-backend=cu113 torch, it would take the version of torch I currently have installed and install the CUDA 11.3 build of it, e.g. going from torch==1.11.0 to torch==1.11.0+cu113.

That is unfortunately not the case. The PyTorch wheels contain all the libraries you need at runtime for CUDA; the only thing you need on your machine is the NVIDIA driver. There is thus no need to install CUDA unless you need nvcc and the other libraries to build something else afterwards, which is why ltt does not install it. On the flip side, if you actually need the CUDA compiler and libraries, you have pretty good control over what you want, so installing the right version with pip is trivial with the options provided by PyTorch.

So, what is it? Do you just want PyTorch installed or do you actually need CUDA?

it is not able to detect the current CUDA version from the host machine.

That is somewhat surprising. Internally we use

https://github.com/pmeier/light-the-torch/blob/eda21f3d1398e0551546f6fde0e79f309de0951d/light_the_torch/_cb.py#L135-L140

to detect the driver version. Could you post the output of that command inside your image?
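For readers following along: the detection boils down to shelling out to nvidia-smi and parsing the driver version out of its output. The sketch below only illustrates the idea; the exact flags and the parse_driver_version helper are illustrative, and the linked _cb.py is the authoritative source:

```python
import re
import subprocess


def parse_driver_version(smi_output):
    """Pull a dotted driver version such as '510.47.03' out of raw output."""
    match = re.search(r"\d+\.\d+(?:\.\d+)?", smi_output)
    if match is None:
        raise RuntimeError("could not parse a driver version")
    return match.group(0)


try:
    # Illustrative invocation; --query-gpu=driver_version asks nvidia-smi
    # for the driver version only, without the usual table output.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    print(parse_driver_version(result.stdout))
except (FileNotFoundError, subprocess.CalledProcessError):
    # This is exactly the Docker build situation below: no nvidia-smi on
    # PATH, so auto-detection cannot see a driver.
    print("nvidia-smi not available")
```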

To get around this issue and simplify the developer experience, I'd like to specify the pytorch-computation-backend option using an environment variable, such that if the --pytorch-computation-backend option does not exist but an environment variable called PYTORCH_COMPUTATION_BACKEND does exist, it will parse the option from the environment variable.

I'm not against this, but let's see if you actually need it. Or better, whether or not ltt can do what you want it to do.

pmeier avatar Oct 19 '22 09:10 pmeier

That is unfortunately not the case. The PyTorch wheels contain all the libraries you need at runtime for CUDA; the only thing you need on your machine is the NVIDIA driver. There is thus no need to install CUDA unless you need nvcc and the other libraries to build something else afterwards, which is why ltt does not install it. On the flip side, if you actually need the CUDA compiler and libraries, you have pretty good control over what you want, so installing the right version with pip is trivial with the options provided by PyTorch.

So, what is it? Do you just want PyTorch installed or do you actually need CUDA?

I am specifically referring to the torch version that is best compatible with a given CUDA driver version/computation backend. I don't need ltt to install CUDA, but just install the correct torch library for the computation backend I have installed on my machine.

I'm sorry for any confusion here! I think that I am using the wrong terminology to refer to the specific torch wheel that is compiled for a specific CUDA driver version/computation backend. e.g., I want to install the torch==1.11.0+cu113 wheel.

Could you post the output of that command inside your image?

Given our current setup to keep Docker images small and not include anything unnecessary, nvidia-smi is not available when the image is being built. However, it is available when I run the image. Therefore, if I specify that I want to install torch==1.11.0+cu113, it will work when I run the image and not raise any errors during the build process since I am not compiling anything.

The output for the command is

#5 [2/4] RUN nvidia-smi
#5 0.210 /bin/bash: line 1: nvidia-smi: command not found
#5 ERROR: executor failed running [/bin/bash -o pipefail -c nvidia-smi]: exit code: 127

I'm not against this, but let's see if you actually need it. Or better, whether or not ltt can do what you want it to do.

I appreciate this a lot. I've just been playing with ltt and I'd ideally like to simplify this command:

python -m light_the_torch install --upgrade --pytorch-computation-backend=$PYTORCH_COMPUTATION_BACKEND torch=="$(pip show torch | grep Version | cut -d ':' -f2 | xargs)"

which simply downloads and installs the torch wheel that matches the provided computation backend, without updating the torch version:

$ python -m light_the_torch install --upgrade --pytorch-computation-backend=cu113 torch=="$(pip show torch | grep Version | cut -d ':' -f2 | xargs)"
Requirement already satisfied: torch==1.11.0 in ./.venv/lib/python3.9/site-packages (1.11.0)
Collecting torch==1.11.0
  Downloading https://download.pytorch.org/whl/cu113/torch-1.11.0%2Bcu113-cp39-cp39-linux_x86_64.whl (1637.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 GB 846.0 kB/s eta 0:00:00
Requirement already satisfied: typing-extensions in ./.venv/lib/python3.9/site-packages (from torch==1.11.0) (4.3.0)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.11.0
    Uninstalling torch-1.11.0:
      Successfully uninstalled torch-1.11.0
Successfully installed torch-1.11.0+cu113

The simplest solution for ltt without changing how it works is to be able to provide the pytorch computation backend as an environment variable, which will 100% solve the issues I am facing.

amitkparekh avatar Oct 19 '22 09:10 amitkparekh

I am specifically referring to the torch version that is best compatible with a given CUDA driver version/computation backend. I don't need ltt to install CUDA, but just install the correct torch library for the computation backend I have installed on my machine.

Ok, in that case ltt can help. Just a note of advice: there is no such thing as best compatible. Either something is supported or not. ltt will install the latest compatible by default.

nvidia-smi is not available when the image is being built.

Ok, that explains why auto-detection is not working. Just to make sure we are not hitting any pitfalls later on: the nvidia driver is installed and working correctly, right? I always thought you could use nvidia-smi to check that. Or are you installing PyTorch before the driver is installed?

I've just been playing with ltt and I'd ideally like to do simplify this command:

IIUC, the only thing that would change with the env var you are proposing is that you would no longer have to specify --pytorch-computation-backend=$PYTORCH_COMPUTATION_BACKEND in the command, right? You would still have to specify export PYTORCH_COMPUTATION_BACKEND=cu113 somewhere. Do you feel that is really that much of an improvement?


Again, I don't want to come across as defensive here. I just want to understand the use case correctly so I don't add a feature that in the end does not really help you or others.

pmeier avatar Oct 19 '22 19:10 pmeier

python -m light_the_torch install --upgrade --pytorch-computation-backend=$PYTORCH_COMPUTATION_BACKEND torch=="$(pip show torch | grep Version | cut -d ':' -f2 | xargs)"

Another thing I only now realized: why do you have to re-install torch? Wouldn't it be better to install the correct version right away? Or does the image you start from come with some version of torch pre-installed?

pmeier avatar Oct 20 '22 07:10 pmeier

Ok, in that case ltt can help. Just a note of advice: there is no such thing as best compatible. Either something is supported or not. ltt will install the latest compatible by default.

Okay, that makes more sense! Thanks for correcting me on that one!

Ok, that explains why auto-detection is not working. Just to make sure we are not hitting any pitfalls later on: the nvidia driver is installed and working correct, right? I always thought you can use nvidia-smi to check that. Or are you installing PyTorch before the driver is installed?

The nvidia driver is installed and available on the host, but it is not accessible while the image is being built. When running the image as a container, I use the --gpus flag (or similar) so the container can access the nvidia GPUs, and then I can run nvidia-smi successfully.

Another thing I only now realized: why do you have to re-install torch? Wouldn't it be better to install the correct version right away? Or does the image you start from come with some version of torch pre-installed?

A large reason for requesting this feature is specifically because of Poetry. This thread explains some of the main annoyances of people who use Poetry and want to easily install the torch wheel that is compatible with their CUDA driver version.

If I was the only developer, or I was only ever using a single machine, I would opt for referring to the specific wheel I need within my pyproject.toml, but I'm trying to figure out a solution that I can use going forwards.

Poetry is currently used to manage dependencies; however, it does not easily allow us to install the correct torch wheel for a given system. One suggested solution pointed me to your project, but running python3 -m light_the_torch install --upgrade torch torchvision would install the correct wheel for the system and upgrade the torch version.

However, I do not want to upgrade the torch version: we have previously trained models on a specific version and do not want to change any dependencies that alter the training loop, just to ensure all experiments remain comparable. We only need to install the torch wheel that is compatible with the system. In that case, could we just run python3 -m light_the_torch install torch torchvision?

Using the Poetry command hooks from Poe the Poet, we run a post-install hook which executes python -m light_the_torch install --upgrade torch=="$(pip show torch | grep Version | cut -d ':' -f2 | xargs)" after installing all the dependencies. This means that, regardless of the torch version being used, this one hook always works and ensures that the correct torch wheel is installed on the given system.
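As an aside, the pip show torch | grep Version | cut ... pipeline above only recovers the installed version string; done from Python, the standard library's importlib.metadata yields the same value. This is a hypothetical alternative, not what the hook in this thread uses:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# A distribution that is not installed simply yields None instead of crashing.
print(installed_version("surely-not-a-real-package"))  # → None
```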

IIUC, the only thing that would change with the env var you are proposing is that you would no longer have to specify --pytorch-computation-backend=$PYTORCH_COMPUTATION_BACKEND in the command, right? You would still have specify export PYTORCH_COMPUTATION_BACKEND=cu113 somewhere. Do you feel like that is really that much of an improvement?

For me personally, yes. From a DevOps perspective, it makes it simpler for developers to ensure they have the correct torch wheel, and it means a single Dockerfile can be used across multiple projects without needing minor changes per file.

The post-install hook will not work properly while building Docker images because it cannot detect the CUDA driver version of the host system. Additionally, we might want to specify the installed torch wheel variant instead of the "correct" one due to possible compatibility issues. The environment variable PYTORCH_COMPUTATION_BACKEND can be provided at the top of the Dockerfile or through a build-arg, and the same post-install hook then works without any change.

Basically, moving this aspect of dependency management into the "it just works" territory.

Again, I don't want to come across as defensive here. I just want to understand the use case correctly so I don't add a feature that in the end does not really help you or others.

I completely understand and respect that. I realise that this is probably a bit of a pain to get your head around and I really appreciate you hearing me out!

amitkparekh avatar Oct 20 '22 13:10 amitkparekh

It seems poetry is a pain rather than a help here. I recall that there was a problem with it before in #32.

I don't really understand why you think that

ENV LTT_PYTORCH_COMPUTATION_BACKEND=cu113

RUN python -m light_the_torch install \
     --upgrade \
     --pytorch-computation-backend=$LTT_PYTORCH_COMPUTATION_BACKEND \
     torch=="$(pip show torch | grep Version | cut -d ':' -f2 | xargs)"

is that much of a problem compared to

ENV LTT_PYTORCH_COMPUTATION_BACKEND=cu113

RUN python -m light_the_torch install \
     --upgrade \
     torch=="$(pip show torch | grep Version | cut -d ':' -f2 | xargs)"

but at this point I think it comes down to personal preference.

Since I currently don't see a downside to having the env var, I will add it to support your use case. A couple of points though:

  1. I'm going to name it LTT_PYTORCH_COMPUTATION_BACKEND to avoid any future clashes with PyTorch env vars.
  2. Priority will be CLI arg > env var > auto detection.
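The priority in point 2 can be sketched as follows. This is an illustration of the stated resolution order rather than ltt's actual implementation, and detect_backend is a stand-in for the real nvidia-smi based auto-detection:

```python
import os


def detect_backend():
    """Stand-in for nvidia-smi based auto-detection; returns 'cpu' here so
    the example runs on any machine."""
    return "cpu"


def computation_backend(cli_value=None):
    """Resolve the backend with priority: CLI arg > env var > auto detection."""
    if cli_value is not None:
        return cli_value
    env_value = os.environ.get("LTT_PYTORCH_COMPUTATION_BACKEND")
    if env_value is not None:
        return env_value
    return detect_backend()


os.environ["LTT_PYTORCH_COMPUTATION_BACKEND"] = "cu113"
print(computation_backend())         # → cu113 (env var used)
print(computation_backend("cu116"))  # → cu116 (explicit CLI arg wins)
```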

I'm pretty swamped at the moment. I'll try to get this change in soon, but I can't make any promises.

pmeier avatar Oct 20 '22 20:10 pmeier

@amitkparekh I just cut 0.5.0rc0. Could you pip install --pre light-the-torch and see if the LTT_PYTORCH_COMPUTATION_BACKEND variable does what you wanted it to do?

If that is confirmed and #103 is merged in some way or form, I will cut 0.5.0.

pmeier avatar Oct 23 '22 21:10 pmeier

@amitkparekh I just cut 0.5.0rc0. Could you pip install --pre light-the-torch and see if the LTT_PYTORCH_COMPUTATION_BACKEND variable does what you wanted it to do?

If that is confirmed and #103 is merged in some way or form, I will cut 0.5.0.

This works exactly how I need it to, thank you so much!

amitkparekh avatar Oct 25 '22 10:10 amitkparekh

Let me release 0.5.0 then.

pmeier avatar Oct 25 '22 11:10 pmeier

Release is live on PyPI.

pmeier avatar Oct 25 '22 12:10 pmeier