
pip should not cache large files in /tmp or TMPDIR

Open lee-b opened this issue 1 year ago • 14 comments

Description

Pip violates system specifications, and therefore essentially only works by accident right now, when it "gets lucky" with packages being small enough for /tmp.

This becomes a serious problem when working with larger pip installations, such as for vllm, pytorch with cuda acceleration, etc.

#5816 was closed with the advice to ensure that /tmp is large enough. #4462 was on the same topic and also closed without action. Pip also blames the user/system, with this output:

OSError: [Errno 28] No space left on device
...
note: This error originates from a subprocess, and is likely not a problem with pip.

However, these analyses are NOT correct.

If one reads the Linux file-hierarchy (7) man page specification (i.e., runs man 7 file-hierarchy) on Linux (Debian 12, at least), it states:

       /tmp/
           The place for small temporary files. This directory is usually mounted as a "tmpfs" instance, and should
           hence not be used for larger files. (Use /var/tmp/ for larger files.) 

This document also refers readers to:

https://systemd.io/TEMPORARY_DIRECTORIES/

Which similarly states:

/tmp/ and /var/tmp/ are two world-writable directories Linux systems provide for temporary files. The former is typically on tmpfs and thus backed by RAM/swap, and flushed out on each reboot. The latter is typically a proper, persistent file system, and thus backed by physical storage. This means:

    /tmp/ should be used for smaller, size-bounded files only; /var/tmp/ should be used for everything else.

Moreover, the data in question seems to be cached data, not normal temporary files, and should therefore go in /var/cache (probably only if a daemon is writing it, I believe), or into the XDG cache directory (e.g., ~/.cache/pip/).

Expected behavior

Pip should follow all relevant specifications when creating files, rather than putting large files in the wrong place and overloading filesystems that are not intended for large files.

pip version

24.0

Python version

3.12.2

OS

Debian 12

How to Reproduce

bin/pip3 install vllm when /tmp has 1.7GB available.

Output

Collecting vllm
  Using cached vllm-0.5.2-cp38-abi3-manylinux1_x86_64.whl.metadata (1.8 kB)
Collecting aiohttp (from vllm)
  Downloading aiohttp-3.9.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.5 kB)
Collecting cmake>=3.21 (from vllm)
  Using cached cmake-3.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.1 kB)
Collecting fastapi (from vllm)
  Using cached fastapi-0.111.1-py3-none-any.whl.metadata (26 kB)
Collecting filelock>=3.10.4 (from vllm)
  Using cached filelock-3.15.4-py3-none-any.whl.metadata (2.9 kB)
Collecting lm-format-enforcer==0.10.3 (from vllm)
  Using cached lm_format_enforcer-0.10.3-py3-none-any.whl.metadata (16 kB)
Collecting ninja (from vllm)
  Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Collecting numpy<2.0.0 (from vllm)
  Using cached numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting nvidia-ml-py (from vllm)
  Using cached nvidia_ml_py-12.555.43-py3-none-any.whl.metadata (8.6 kB)
Collecting openai (from vllm)
  Using cached openai-1.36.1-py3-none-any.whl.metadata (22 kB)
Collecting outlines<0.1,>=0.0.43 (from vllm)
  Using cached outlines-0.0.46-py3-none-any.whl.metadata (15 kB)
Collecting pillow (from vllm)
  Using cached pillow-10.4.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (9.2 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Using cached prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl.metadata (13 kB)
Collecting prometheus-client>=0.18.0 (from vllm)
  Using cached prometheus_client-0.20.0-py3-none-any.whl.metadata (1.8 kB)
Collecting psutil (from vllm)
  Using cached psutil-6.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB)
Collecting py-cpuinfo (from vllm)
  Using cached py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Collecting pydantic>=2.0 (from vllm)
  Using cached pydantic-2.8.2-py3-none-any.whl.metadata (125 kB)
Collecting pyzmq (from vllm)
  Downloading pyzmq-26.0.3-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (6.1 kB)
Collecting ray>=2.9 (from vllm)
  Downloading ray-2.32.0-cp312-cp312-manylinux2014_x86_64.whl.metadata (13 kB)
Collecting requests (from vllm)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting sentencepiece (from vllm)
  Downloading sentencepiece-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting tiktoken>=0.6.0 (from vllm)
  Downloading tiktoken-0.7.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting tokenizers>=0.19.1 (from vllm)
  Downloading tokenizers-0.19.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting torch==2.3.1 (from vllm)
  Downloading torch-2.3.1-cp312-cp312-manylinux1_x86_64.whl.metadata (26 kB)
Collecting torchvision==0.18.1 (from vllm)
  Downloading torchvision-0.18.1-cp312-cp312-manylinux1_x86_64.whl.metadata (6.6 kB)
Collecting tqdm (from vllm)
  Using cached tqdm-4.66.4-py3-none-any.whl.metadata (57 kB)
Collecting transformers>=4.42.4 (from vllm)
  Using cached transformers-4.42.4-py3-none-any.whl.metadata (43 kB)
Collecting typing-extensions (from vllm)
  Using cached typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
Collecting uvicorn[standard] (from vllm)
  Using cached uvicorn-0.30.3-py3-none-any.whl.metadata (6.5 kB)
INFO: pip is looking at multiple versions of vllm to determine which version is compatible with other requirements. This could take a while.
Collecting vllm
  Downloading vllm-0.5.1.tar.gz (790 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 790.6/790.6 kB 7.0 MB/s eta 0:00:00
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [64 lines of output]
      Collecting cmake>=3.21
        Using cached cmake-3.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.1 kB)
      Collecting ninja
        Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
      Collecting packaging
        Using cached packaging-24.1-py3-none-any.whl.metadata (3.2 kB)
      Collecting setuptools>=49.4.0
        Downloading setuptools-71.0.4-py3-none-any.whl.metadata (6.5 kB)
      Collecting torch==2.3.0
        Downloading torch-2.3.0-cp312-cp312-manylinux1_x86_64.whl.metadata (26 kB)
      Collecting wheel
        Using cached wheel-0.43.0-py3-none-any.whl.metadata (2.2 kB)
      Collecting filelock (from torch==2.3.0)
        Using cached filelock-3.15.4-py3-none-any.whl.metadata (2.9 kB)
      Collecting typing-extensions>=4.8.0 (from torch==2.3.0)
        Using cached typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
      Collecting sympy (from torch==2.3.0)
        Using cached sympy-1.13.1-py3-none-any.whl.metadata (12 kB)
      Collecting networkx (from torch==2.3.0)
        Using cached networkx-3.3-py3-none-any.whl.metadata (5.1 kB)
      Collecting jinja2 (from torch==2.3.0)
        Using cached jinja2-3.1.4-py3-none-any.whl.metadata (2.6 kB)
      Collecting fsspec (from torch==2.3.0)
        Using cached fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
      Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.3.0)
        Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
      Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.3.0)
        Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
      Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.3.0)
        Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
      Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.3.0)
        Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
      Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.3.0)
        Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
      Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.3.0)
        Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
      Collecting nvidia-curand-cu12==10.3.2.106 (from torch==2.3.0)
        Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
      Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch==2.3.0)
        Using cached nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
      Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch==2.3.0)
        Using cached nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
      Collecting nvidia-nccl-cu12==2.20.5 (from torch==2.3.0)
        Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)
      Collecting nvidia-nvtx-cu12==12.1.105 (from torch==2.3.0)
        Using cached nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.7 kB)
      Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch==2.3.0)
        Using cached nvidia_nvjitlink_cu12-12.5.82-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
      Collecting MarkupSafe>=2.0 (from jinja2->torch==2.3.0)
        Using cached MarkupSafe-2.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
      Collecting mpmath<1.4,>=1.1.0 (from sympy->torch==2.3.0)
        Using cached mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
      Downloading torch-2.3.0-cp312-cp312-manylinux1_x86_64.whl (779.1 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 779.1/779.1 MB 5.3 MB/s eta 0:00:00
      Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
      Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
      Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
      Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
      Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
      ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device


      [notice] A new release of pip is available: 24.0 -> 24.1.2
      [notice] To update, run: /mnt/nvme1/home/lb/.local/.venvs/vllm/bin/python3 -m pip install --upgrade pip
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

lee-b avatar Jul 21 '24 10:07 lee-b

The data is temporary files (it's the unpacked source of vllm-0.5.1.tar.gz which appears to need a lot of space to build). Pip respects the standard $TEMP/$TMPDIR directories, which as far as I am aware are not defined as being limited to "small files only". That would suggest to me that $TEMP should be set to /var/tmp rather than to /tmp on systems that limit the space available to /tmp. But I'm not a Unix expert, so I don't know the nuances of how that environment variable is intended to be set if the filesystem standards define two "levels" of temporary filesystem space.

In fact, pip simply uses Python's standard temporary file management functions, which are also not limited to "small files only". So I would suggest that if you disagree with setting $TEMP to /var/tmp, you'd need to take it up with the Python project.
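For reference, the stdlib tempfile module resolves its directory from the TMPDIR, TEMP, and TMP environment variables before falling back to platform defaults such as /tmp. A minimal sketch (assuming /var/tmp exists and is writable, as it is on a typical Linux system):

```python
import os
import tempfile

# tempfile.gettempdir() consults TMPDIR, TEMP, and TMP before
# falling back to platform defaults such as /tmp.
os.environ["TMPDIR"] = "/var/tmp"
tempfile.tempdir = None  # clear the cached value so the env var is re-read
print(tempfile.gettempdir())  # /var/tmp on a typical Linux system
```

So anything built on tempfile (pip included) already honors a user-level redirection; the disagreement is only about what the default should point at.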

pfmoore avatar Jul 21 '24 11:07 pfmoore

which as far as I am aware are not defined as being limited to "small files only".

I mean, I literally gave you the definitions, "The place for small temporary files", "/tmp/ should be used for smaller, size-bounded files only; /var/tmp/ should be used for everything else."

lee-b avatar Jul 21 '24 14:07 lee-b

I mean, I literally gave you the definitions

For the filesystem locations, yes. I'm not disputing that. But why is $TEMP set to /tmp if that location is only for small files? I've never seen any documentation that states that programs should not store large files in the directory pointed at by $TEMP. If there is such documentation, then:

  1. Please provide a link.
  2. Please raise this with the CPython project, as they absolutely do not document that the tempfile module must only be used for small files (see here).

But realistically, I would say that you should simply set $TEMP to /var/tmp, and that would solve the issue for you.
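As a sketch of that workaround (the /var/tmp path is an example; pip itself just inherits the environment):

```python
import os
import subprocess
import sys

# Run pip with its temporary directory redirected to persistent storage,
# so large build trees don't land on a RAM-backed /tmp.
env = dict(os.environ, TMPDIR="/var/tmp")
subprocess.run(
    [sys.executable, "-m", "pip", "--version"],  # substitute the real install command
    env=env,
    check=True,
)
```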

pfmoore avatar Jul 21 '24 14:07 pfmoore

Just because I was curious about the spec, here is some reading:

  • https://en.wikipedia.org/wiki/TMPDIR
  • https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_03

notatallshaw avatar Jul 21 '24 15:07 notatallshaw

While systemd defines the two temp directories as you mentioned, I don’t see the distinction being made anywhere by systems that actually create the two directories for software to use. Even the reference they ultimately link to, https://refspecs.linuxfoundation.org/FHS_3.0/fhs-3.0.html, has nothing to distinguish between /tmp and /var/tmp, so at best the entire situation is [citation needed] and borderline systemd is just making rules up.

uranusjr avatar Jul 22 '24 06:07 uranusjr

Yea, it sounds like the situation is the exact same as it was back when I'd posted https://github.com/pypa/pip/issues/5816#issuecomment-587302775.


I think one thing called out in OP that's worth figuring out how to improve is:

blames the user/system, with this output:

note: This error originates from a subprocess, and is likely not a problem with pip.

I think it's worth improving this line by removing it when pip calls itself in a subprocess (the error will contain this message at the relevant place).


#5816 was closed with the advice to ensure that /tmp is large enough.

As the person who wrote the closing comment, this is not what I wrote or meant. I'd explicitly referenced $TMPDIR for a reason in that comment. :)

pradyunsg avatar Jul 22 '24 17:07 pradyunsg

Note the reasoning and context (of Solaris etc.) given here, as well as the advice to fix applications that write large files to use /var/tmp:

https://fedoraproject.org/wiki/Features/tmp-on-tmpfs#Comments_and_Discussion

lee-b avatar Jul 23 '24 00:07 lee-b

I’ll repeat my comment from above - there is nothing actionable for pip here (except possibly an improved error message or documentation, as noted by @pradyunsg). We aren’t going to stop using the stdlib functionality, so the only realistic approaches here are a stdlib change (unlikely, IMO) or a user config change (the recommended approach)

pfmoore avatar Jul 23 '24 07:07 pfmoore

While systemd defines the two temp directories as you mentioned, I don’t see the distinction being made anywhere by systems that actually create the two directories for software to use. Even the reference they ultimately link to, https://refspecs.linuxfoundation.org/FHS_3.0/fhs-3.0.html, has nothing to distinguish between /tmp and /var/tmp, so at best the entire situation is [citation needed] and borderline systemd is just making rules up.

If you reread the FHS citation, you will notice that there is in fact something to distinguish between /tmp and /var/tmp. That thing is, that /var/tmp is specified to survive system reboots (and therefore logically cannot be on a RAM-backed tmpfs as it will not survive a reboot).

If you assume a bunch of things:

  • /tmp is a temporary filesystem
  • /tmp/ is a SMALL temporary filesystem (reasonable: RAM is more scarce as a resource than HDD space)
  • /var/tmp resides on the same filesystem as /home

then /var/tmp may have much more available space than /tmp. It is unlikely to have less space.

That is probably the reason why systemd "requires" the rules it has made up -- because systemd likes to assume things based on what its authors would choose to do, and then rule out all other possible scenarios.

But the systemd advice is a very, very bad idea to follow if you have 64 GB of RAM and your /var/tmp has a quota / dedicated partition and can "only" store 8 GB -- you'll get 4x less space in /var/tmp than you'll get in /tmp, because /tmp usually allows half your available RAM...

eli-schwartz avatar Jul 29 '24 23:07 eli-schwartz

It would be very, very wrong for pip to declare that its work-in-progress compiled packages are de-facto "temporary files or directories that are preserved between system reboots", unless pip has changed its approach quite a bit...

eli-schwartz avatar Jul 29 '24 23:07 eli-schwartz

The actual, correct solution here is for someone to write a spec such as https://devmanual.gentoo.org/eclass-reference/check-reqs.eclass/index.html

It would allow python source code that is automatically built into a wheel, to declare in advance the amount of space it estimates it will probably require, so that tools such as pip can query the underlying filesystem for the directory that has been created by tempfile and check if there is, in fact, enough free space for the build to succeed.

pip could then error out before doing anything at all, with a clever error message such as:

ERROR: vllm asked for 3GB of temporary storage, but this is not available. Cannot build vllm. Please set the $TMPDIR environment variable to some location that has the required disk space, and try again.
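A hedged sketch of such a preflight check, assuming a hypothetical `required_bytes` value that a project would declare in its metadata:

```python
import shutil
import tempfile

def enough_temp_space(required_bytes: int) -> bool:
    """Compare a project's declared scratch-space needs (a hypothetical
    'required_bytes' field) against free space in the temp directory."""
    free = shutil.disk_usage(tempfile.gettempdir()).free
    return free >= required_bytes

# A tool could then error out up front instead of failing mid-build:
if not enough_temp_space(3 * 1024**3):  # e.g. a project declaring 3 GB
    print("ERROR: 3GB of temporary storage requested but not available; "
          "set $TMPDIR to a location with the required disk space.")
```

The hard part, of course, is the spec for declaring the estimate, not the check itself.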

eli-schwartz avatar Jul 29 '24 23:07 eli-schwartz

I'm having a look through the issue, as I've experienced it quite a bit too. It's not a pip issue, it's an issue with the user's system being misconfigured on setup, and it's highly specific to Linux.

The user's system was misconfigured with too small partitions (I've had endless problems in some companies with infra teams provisioning ridiculously small partitions on VMs). The system needs to be resized.

There may be a broader issue with some Linux distributions defaulting to smaller /tmp in memory, since recently, which would be an issue to report to distributions.

  • I think it's false to say that /tmp is in memory; I don't think it was in memory historically. Most computers in the 2010s had 1-4 GB of memory (single digit) and there wasn't enough memory to hold anything.
  • I don't recall /tmp being limited to single-digit size and being in memory until recently. It seems to me the issue was introduced recently.
  • tensorflow/torch/cuda require 10+GB to install and use, they don't play well with small partitions. Users simply won't be able to use them if they have ridiculously small disks.
  • /var/tmp is not a standard as far as I know. I've seen it in some places but have not in most, maybe it's specific to some distributions or some companies.

There was a similar issue with extremely small disks around 2015 when AWS became popular. VM images had 10GB disks to fit with the free tier, which created a lot of problems with running out of disks all the time and everything crashing. Eventually the provisioning tools became able to resize VM on creation and people learned to use it.

Users can run df -h to see partition sizes and resize any partition that's single-digit sized. Users can set TMPDIR to specify a different location to build packages.

@eli-schwartz what is the output of df -h for you? Is it a personal machine you setup yourself or a machine at work setup by someone else?

morotti avatar Aug 03 '24 13:08 morotti

@morotti I think you misunderstood my stake in this. :)

I saw a recent issue where someone was spreading FUD about systemd and claiming that pip isn't "compliant with systemd requirements". I like systemd quite well as it happens, I simply don't think that this one specific rationale in the systemd documentation holds water.

In fact, I don't use pip to install large packages. I generally regard the need to compile C/C++/Fortran software as a sign that pip is the wrong tool for the job and you should be using conda or a linux distro's prebuilt packages.

But I don't think that pip is doing the wrong thing here and I don't think that pip's developers should have to be lectured about how they're doing the wrong thing because systemd.

...

All that being said:

I think it's false to say that /tmp is in memory, I don't think it was in memory historically. Most computers in the 2010s had 1-4 GB of memory (single digit) and there wasn't enough memory to hold anything.

No one claimed it is? The claim was that it "usually" is -- and this claim is correct in the sense that it is a popular mechanism for system distributors, and also, if your system happens to use systemd, systemd will mount /tmp as a tmpfs backed by RAM automatically unless you go out of your way to change the defaults by masking it.

1-4 GB of memory is certainly enough memory to hold plenty of things regardless. Especially if you try to optimize for the average desktop user, who doesn't use /tmp interactively but has various programs writing short-lived temporary files there, which on average don't go above 100mb.

I don't recall /tmp being being limited to single digit size and being in memory until recently. It seems to me the issue was introduced recently.

It's not limited to single digit size, and it's not recent. Fedora has been doing it since 2012, systemd provided the configs for it since 2010. Early on, Debian disabled this based on initial feedback and hasn't really revisited the topic in 12 years -- but now they're finally changing and will do the same.

tensorflow/torch/cuda require 10+GB to install and use, they don't play well with small partitions. Users simply won't be able to use them if they have ridiculously small disks.

This bug report was about COMPILING packages, not using them.

Due to pip's policy of COMPILING packages in tempfile.mkdtemp() using an isolated venv, if you want to COMPILE a package that requires cuda at COMPILATION time, you must have enough space to install the files into a temporary virtualenv which by default is in /tmp.

No usage is implied. And if you can get past that hurdle and install it, you are not installing your workspace to $TMPDIR so the question becomes moot.

I'm having a look through the issue, as I've experienced it quite a bit too. It's not a pip issue, it's an issue with the user's system being misconfigured on setup, and it's highly specific to Linux.

The user's system was misconfigured with too small partitions (I've had endless problems in some companies with infra teams provisioning ridiculously small partitions on VMs). The system needs to be resized.

There may be a broader issue with some Linux distributions defaulting to smaller /tmp in memory, since recently, which would be an issue to report to distributions.

It is not specific to Linux, there is an entire industry of Windows software for creating ramdisks and moving %TEMP% to it. Solaris started making /tmp a tmpfs back in 1994, long before it became popular on Linux.

eli-schwartz avatar Aug 04 '24 03:08 eli-schwartz

Users can set TMPDIR to specify a different location to build packages.

Correct, this has been repeatedly suggested. :)

eli-schwartz avatar Aug 04 '24 03:08 eli-schwartz

Even after setting the TMPDIR, TEMP and TMP environment variables, files downloaded by pip continue to be downloaded to the default temp directory chosen by Python's tempfile module.

See:

TMPDIR=/project/tmp TMP=/project/tmp TEMP=/project/tmp  pip install torch
Collecting torch
  Downloading torch-2.5.1-cp311-cp311-manylinux1_x86_64.whl (906.5 MB)
     ━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 126.0/906.5 MB 252.2 MB/s eta 0:00:04ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device

/project/tmp folder is on a 1.0TB volume. /tmp/ mountpoint is considerably smaller. Why is pip downloading the torch .whl file to the /tmp/ mountpoint?

alehuo avatar Jan 27 '25 11:01 alehuo

It can also be --cache-dir. I often find on problematic installs I need to set both cache-dir in pip and the TMPDIR envvar.
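For completeness, a sketch combining both knobs (the /big-volume paths are examples, not defaults):

```python
import os
import sys

# Point both the wheel cache (--cache-dir) and the build scratch space
# (TMPDIR) at a large volume; pip reads TMPDIR from the environment.
env = dict(os.environ, TMPDIR="/big-volume/tmp")
cmd = [sys.executable, "-m", "pip", "install",
       "--cache-dir", "/big-volume/pip-cache", "torch"]
print("TMPDIR=/big-volume/tmp", " ".join(cmd))
# subprocess.run(cmd, env=env, check=True)  # uncomment for the real install
```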

OffBy0x01 avatar Sep 11 '25 09:09 OffBy0x01

FYI, there's discussion about moving where wheels are initially downloaded to, which I think mostly solves this issue: https://github.com/pypa/pip/pull/13541

notatallshaw avatar Sep 11 '25 13:09 notatallshaw

As far as I can tell, that proposal doesn't even solve this issue "a little bit".

The core problem is that projects exist which aren't written in "pure python" (*.py) and their build backend must do more work than simply rearrange the source code into a wheel directory structure and subsequently zip them.

Consequently, the disk space utilized in building a wheel doesn't cleanly map as a simple "upper bounds of three times the size of the sdist" formula.

This manifests in two ways:

  • the build backend will create arbitrary (possibly large) numbers of files in a filesystem location computed based on the directory the sdist is extracted to (which is in tempdir). Most commonly, the result of a C or C++ or Rust compiler.
  • isolated build envs must reinstall additional copies of all "build plus runtime" dependencies that are already installed on their system, such as multi-gb sized wheels for GPU stuff. These will be in build-system.requires because they provide a compilation API+ABI. isolated build envs are likewise in tempdir.

The resulting "error, no space left on device" error inevitably occurs long before a wheel begins to be created.

The original report occurred when trying to pip install package A from sdist, which depended on package B (another sdist) which depended on package C, D, E, and F (wheels). While attempting to create an isolated build env for building B, pip raised an error stating that there wasn't enough disk space in tempdir to unzip C, D, E, and F into said isolated build env.

eli-schwartz avatar Sep 11 '25 14:09 eli-schwartz

The core problem is that projects exist which aren't written in "pure python" (*.py) and their build backend must do more work than simply rearrange the source code into a wheel directory structure and subsequently zip them.

Ah, yes, it doesn't solve that problem.

But there is a second problem: some users' /tmp directories seem to be smaller than the size of the wheels themselves, so simply downloading wheels already hits that limit, and this would fix that.

notatallshaw avatar Sep 11 '25 14:09 notatallshaw

IMO all of this is missing the key point here, which is[^1] that when pip creates files which are, from pip's perspective, temporary, it uses the stdlib temporary file management functions. If the stdlib functions aren't suitable for that purpose, that seems to be a problem for the stdlib to solve. And if they are suitable, but require correct user configuration, that's a user problem (possibly combined with insufficient stdlib documentation on how to configure things).

If there's a limit on the size of what should be considered a "temporary file", or two separate classes of "temporary file" based on size, then I have two questions - first of all, what's the API for telling the stdlib whether you're providing a "small" or a "large" temporary file, and second, what should a tool do if it (like pip) has no way to know whether the file being created is small or large[^2]?

Because based on the above, it seems to me that there is nothing useful that pip can do here unless the stdlib API changes (or a more comprehensive 3rd party library to replace the stdlib functions becomes available). I think it's important to set expectations - the current discussion is very interesting, but nobody who is in a position to make an effective change to the status quo is reading it. If you want to reach the correct audience, you need to find a forum where the Python core developers are involved - maybe the Ideas category of discuss.python.org.

[^1]: If we ignore the question of cached files, which should be placed in pip's cache dir, which should be configured correctly. If there's a problem with any of that, it's a separate question.

[^2]: I'd guess "assume large if you're not sure", but then won't nearly everyone just choose that for safety, defeating the purpose of having two categories?

pfmoore avatar Sep 11 '25 15:09 pfmoore

I'm not too familiar with the pip internals, so forgive me if this is a stupid question. But it looks like that PR only affects the build_wheel PEP 517 hook, by passing in an alternative directory for a build backend to create a new wheel. I don't see how it affects downloading a wheel from a wheel index.

I would anyways be surprised if downloading a wheel from a wheel index was vulnerable to this problem. Build backends need the ability to control the destination file name, which is why the interface only permits forcing a wheel directory. When downloading from an index, pip knows the destination filename before needing to do any work such as downloading.

So for solving race conditions in a network download there's already an incredibly simple solution: download to tmp-download-XXXX anonymous files in the same directory as the cache directory, and then perform a guaranteed atomic rename to the final stable name. No worries about moving potentially falling back to copying, since it's a single directory (and impossible for someone to do something weird like bind mount .cache/pip/tmp as a separate filesystem, even if they're in the mood to perform humorous "fuzzing").
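That download-then-rename pattern, sketched (function and file names are hypothetical):

```python
import os
import tempfile

def atomic_download(cache_dir: str, final_name: str, data: bytes) -> str:
    """Write to an anonymous tmp-download-XXXX file in the cache directory,
    then rename into place. os.replace() is atomic when source and
    destination are on the same filesystem, which holds here by construction."""
    fd, tmp_path = tempfile.mkstemp(prefix="tmp-download-", dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)  # in real code: stream the network I/O here
        final_path = os.path.join(cache_dir, final_name)
        os.replace(tmp_path, final_path)
        return final_path
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial file on any failure
        raise
```

Readers either see no file under the final name, or the complete one; never a partial download.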

I'd naively assume pip would already do this. Or if it didn't, that it would write network I/O directly to the final cachedir location. I would not expect it to take a side detour to the tempdir.

eli-schwartz avatar Sep 11 '25 15:09 eli-schwartz

IMO all of this is missing the key point here, which is1 that when pip creates files which are, from pip's perspective, temporary, it uses the stdlib temporary file management functions. If the stdlib functions aren't suitable for that purpose, that seems to be a problem for the stdlib to solve. And if they are suitable, but require correct user configuration, that's a user problem (possibly combined with insufficient stdlib documentation on how to configure things).

Hi,

I absolutely agree with you in nearly every respect. The point I was trying to make is, rather, that this is a hard enough problem to solve that I don't see how some PR to change the wheel destination directory is supposed to solve such a fundamentally hard problem.

There is one, and only one, thing that I do believe is absolutely under pip's control to change that would help solve this problem.

Pip could stop using isolated build envs. They are a lot of trouble and in my incredibly biased opinion, one of the most elaborate problems the python ecosystem ever created for itself in order to solve the two non-problems of:

  • projects that don't use CI to guarantee packages haven't forgotten to list some of their build-system.requires
  • inherent misdesign in the setuptools plugin API, whereby setuptools has a function that iterates over installed entrypoints rather than iterating over a configuration field in setup.cfg / pyproject.toml, to select which plugins the project actually intends to use (rather than the sideloaded plugins that have bizarre and surprising effects)

(Note that this only solves the problem of giant wheels that can't be installed as build-system.requires inside the isolated build env. It doesn't solve the problem of projects that compile multiple GBs of their own cuda-enabled C extensions.)

eli-schwartz avatar Sep 11 '25 15:09 eli-schwartz

Pip could stop using isolated build envs

PEP 517 very specifically recommends to use isolated build envs, as pip is attempting to be a standards bearer there would need to be a new standard on how to handle this.

Such standards discussion would need to take place at https://discuss.python.org/c/packaging/14, as standards are community driven, not pip driven.

notatallshaw avatar Sep 11 '25 15:09 notatallshaw