PIP cache should cache the installed packages as well

Open crabhi opened this issue 3 years ago • 34 comments

Description: Currently, setup-python caches only the ~/.cache/pip directory to avoid redownloads. However, it doesn't cache the installed packages. As some packages have lengthy installation steps, this leads to delays in builds.

You can see the current behaviour for example in https://github.com/crabhi/setup-python-cache-test/actions/runs/1789016634 (or in attached build.txt) - the pip install output shows "Collecting" and "Installing" instead of "Requirement already satisfied" for all packages.

Justification: For example installing the ansible package takes well over a minute even if it's already downloaded.

Are you willing to submit a PR? Yes, I can try.

crabhi avatar Feb 03 '22 11:02 crabhi

Hello @crabhi, thanks for your request! We will look at it.

nikita-bykov avatar Feb 04 '22 14:02 nikita-bykov

would be also nice to follow https://github.com/actions/cache#outputs and provide an output cache-hit so we can use if: steps.[id].outputs.cache-hit != 'true' to avoid calling pip altogether.

barbieri avatar Mar 24 '22 19:03 barbieri

which is done at https://github.com/actions/cache/blob/2d8d0d1c9b41812b6fd3d4ae064360e7d8762c7b/src/utils/actionUtils.ts#L25-L27 and https://github.com/actions/cache/blob/main/src/restore.ts#L55

barbieri avatar Mar 24 '22 19:03 barbieri

This pattern affects more languages (actions/setup-node works the same - only caches downloads, not installs) - would love to see a general consensus towards caching installs, not tarballs (perhaps behind a flag/attribute for future compat cache: pip-install).

jbergstroem avatar Mar 29 '22 13:03 jbergstroem

for pip I think it makes even more sense as there are no postinstall actions... with setup-node I'm also caching the node_modules instead of the packages, but it "broke" some flows where there was a postinstall script to configure other things (like pre-build some typescript scripts). The solution is simple, just run that script manually (or in my case, cache the built scripts)... but not "one size fits all".

For pip AFAIR there are no postinstall scripts, so this would not be an issue.

barbieri avatar Mar 29 '22 14:03 barbieri

For pip AFAIR there are no postinstall scripts, so this would not be an issue.

I'm experimenting with this at the moment and caching site-packages (read: pip output) isn't straightforward either; for instance binary wrappers (black, ..) won't work (python -m black works fine tho). Might be one of those YMMV cases that makes it hard to standardize for everyone.

jbergstroem avatar Mar 29 '22 14:03 jbergstroem

would be also nice to follow https://github.com/actions/cache#outputs and provide an output cache-hit so we can use if: steps.[id].outputs.cache-hit != 'true' to avoid calling pip altogether.

Hey, this feature was merged today and should be a part of the near-future release
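
A minimal sketch of how that output might be consumed, assuming it is exposed as cache-hit on the setup-python step (mirroring actions/cache) and that dependencies live in requirements.txt; the step id setup_python is a placeholder. A hit here only means pip's download cache was restored, so pip install still has to run:

```yaml
- uses: actions/checkout@v3

- id: setup_python
  uses: actions/setup-python@v3
  with:
    python-version: '3.10'
    cache: 'pip'  # caches pip's download cache, not the installed packages

# Assumption: the step exposes a cache-hit output like actions/cache does
- if: steps.setup_python.outputs.cache-hit != 'true'
  run: echo "pip download cache missed; packages will be downloaded again"

# Still needed on every run, because only the downloads are cached
- run: python -m pip install -r requirements.txt
```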

dhvcc avatar Apr 05 '22 16:04 dhvcc

but the cache-hit is just for the packages, not the installation, right? IOW: do I still need to call pip install?

barbieri avatar Apr 05 '22 16:04 barbieri

but the cache-hit is just for the packages, not the installation, right? IOW: do I still need to call pip install?

Oh you're talking pip. Well yeah, then you'll have to wait for this action to support caching venvs out of the box. It's a case for pipenv and poetry though. The best thing for now is to cache manually.

dhvcc avatar Apr 05 '22 16:04 dhvcc

I have a case where building packages for pypy (grpcio, grpcio-tools) takes about 6 minutes-- it's way too slow to introduce a matrix.

If anyone has a manual example using actions/cache, please share it.

belm0 avatar Apr 16 '22 23:04 belm0

I was creating a python venv and then caching that directory, however I hit an issue where that was broken once restored (behaviour was inconsistent).

I currently have a job that takes ~6 min to complete, 4 min of which is installation of pip packages. An effective caching of installed packages would be a great boost.

rashidnhm avatar May 09 '22 20:05 rashidnhm

I was creating a python venv and then caching that directory, however I hit an issue where that was broken once restored (behaviour was inconsistent).

I currently have a job that takes ~6 min to complete, 4 min of which is installation of pip packages. An effective caching of installed packages would be a great boost.

Could you share the workflow so people can take a look at it? I think it's possible to hack around this while the feature is not here.

dhvcc avatar May 10 '22 10:05 dhvcc

I was creating a python venv and then caching that directory, however I hit an issue where that was broken once restored (behaviour was inconsistent). I currently have a job that takes ~6 min to complete, 4 min of which is installation of pip packages. An effective caching of installed packages would be a great boost.

Could you share the workflow so people can take a look at it? I think it's possible to hack around this while the feature is not here.

- uses: actions/checkout@v3

- id: setup_python
  uses: actions/setup-python@v3
  with:
    python-version: 3.7

- id: python_cache
  uses: actions/cache@v3
  with:
    path: venv
    key: pip-${{ steps.setup_python.outputs.python-version }}-${{ hashFiles('requirements.txt') }}

- if: steps.python_cache.outputs.cache-hit != 'true'
  run: |
    python3 -m venv venv

- run: |
    venv/bin/python3 -m pip install -r requirements.txt

This worked quite well for me for the most part, but after a while I started getting errors such as:

Error: [Errno 2] No such file or directory: '/home/runner/work/myrepo/myrepo/venv/bin/python3': '/home/runner/work/myrepo/myrepo/venv/bin/python3'

rashidnhm avatar May 10 '22 13:05 rashidnhm

@rashidnhm have you tried debugging this issue? It seems like the problem may be not in this action.

dhvcc avatar May 13 '22 19:05 dhvcc

@rashidnhm have you tried debugging this issue? It seems like the problem may be not in this action.

So weirdly enough, I have not been able to reproduce the issue. To fix it I simply removed the venv code, then recreated and re-cached it. I'm not even sure what caused it in the first place.

My only thought was that maybe the cache somehow got corrupted and it kept restoring that. Really can't say.

For now I've kept the code I sent above, it's been working well since and haven't hit any other issues

rashidnhm avatar May 13 '22 19:05 rashidnhm

Ok, nice. The code seemed ok, so that was strange. I'd only advise you to maybe not run pip install if the cache was hit: you don't want to modify a restored cache in any way, to avoid corruption.

dhvcc avatar May 13 '22 19:05 dhvcc

Ok, nice. The code seemed ok, so that was strange. I'd only advise you to maybe not run pip install if the cache was hit: you don't want to modify a restored cache in any way, to avoid corruption.

So I have done quite a deep dive into the venv corruption issue, and I believe I know what happened, and how to avoid it as well.

The version of Python between when my cache was created and when it was restored changed. And I had a generic restore key which matched the old cache key. See detailed explanation below.

This is how my YAML file looked when I hit this error:

# BAD CONFIG DO NOT USE (Illustrative purposes only)

- uses: actions/checkout@v3

- id: setup_python
  uses: actions/setup-python@v3
  with:
    python-version: 3.7

- id: python_cache
  uses: actions/cache@v3
  with:
    path: venv
    key: pip-${{ steps.setup_python.outputs.python-version }}-${{ hashFiles('requirements.txt') }}
    # The generic "pip-" restore key below was specifically the cause of the issue
    restore-keys: |
      pip-${{ steps.setup_python.outputs.python-version }}-
      pip-

- if: steps.python_cache.outputs.cache-hit != 'true'
  run: |
    python3 -m venv venv

- run: |
    venv/bin/python3 -m pip install -r requirements.txt

When this workflow initially ran and saved the venv to cache, the latest release of Python 3.7 was 3.7.12, meaning the venv created had symlinks to 3.7.12.

However, a few days later when the workflow ran again, the latest release of Python 3.7 was 3.7.13.

Notice in my workflow I don't pin my Python patch version, so actions/setup-python downloaded the latest available patch release of Python 3.7 (as expected).

However, my restore key pip- matched the old cache, which restored the old venv created for Python 3.7.12, meaning all the symlinks inside were now broken! I had set up Python 3.7.13 but was trying to use a venv with symlinks to 3.7.12! That is why, when I tried to call the python executable from the venv, it could not find the file!

The resolution is to ensure that the python-version output of setup-python is always part of the cache key, and of any restore key, so that any change in Python version (even a patch version bump) creates a new cache key.

This is the code I have now, it has been working well without any issues. I have updated the workflow with the advice @dhvcc gave in the above comment. The venv is not touched if there is a cache hit.

- uses: actions/checkout@v3

- id: setup_python
  uses: actions/setup-python@v3
  with:
    python-version: 3.7

- id: python_cache
  uses: actions/cache@v3
  with:
    path: venv
    key: pip-${{ steps.setup_python.outputs.python-version }}-${{ hashFiles('requirements.txt') }}

- if: steps.python_cache.outputs.cache-hit != 'true'
  run: |
    # Check if venv exists (restored from secondary keys if any, and delete)
    # You might not need this line if you only have one primary key for the venv caching
    # I kept it in my code as a fail-safe
    if [ -d "venv" ]; then rm -rf venv; fi
    
    # Re-create the venv
    python3 -m venv venv

    # Install dependencies
    venv/bin/python3 -m pip install -r requirements.txt
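
As a complement to the workflow above, a sketch of a guard that does not rely on the cache key alone. It assumes a Linux or macOS runner, where the venv's bin/python3 is a symlink into the setup-python installation; the step name is made up. The step would replace the conditional create-and-install step rather than run alongside it:

```yaml
- name: Create venv unless a usable one was restored
  run: |
    # [ -x ] follows symlinks, so it is false both when venv is missing and
    # when venv/bin/python3 is a dangling link to a removed Python build
    if [ ! -x venv/bin/python3 ]; then
      rm -rf venv
      python3 -m venv venv
      venv/bin/python3 -m pip install -r requirements.txt
    fi
```

Because the check looks at the restored files rather than at cache-hit, it also covers a venv restored through a secondary restore key. It does not detect a venv built from an older requirements.txt, so the advice about exact keys still applies.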

rashidnhm avatar May 14 '22 04:05 rashidnhm

Hi, @rashidnhm 👋 Thanks a lot for such a detailed explanation, it should help others who encountered such issues.

IvanZosimov avatar May 18 '22 14:05 IvanZosimov

Any news on how to flag to cache the installed packages, and not only the downloaded ones, with actions/setup-python@v4? I am not seeing any flags for that in the documentation

Axeln78 avatar Jul 12 '22 13:07 Axeln78

Any news on how to flag to cache the installed packages, and not only the downloaded ones, with actions/setup-python@v4? I am not seeing any flags for that in the documentation

What exactly do you mean by that? A bit more context would be helpful to avoid misunderstandings.

dhvcc avatar Jul 12 '22 19:07 dhvcc

Sorry, @dhvcc if I didn't manage to make myself clear. actions/setup-python@v4 uses actions/cache@v3 under the hood and users do not need to call on the actions/cache@v3 module in an example such as:

    - uses: actions/checkout@v3
    - name: Set up Python 3.10 and caches
      id: setup_and_cache
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
        cache: 'pip'

It would be great if the installed packages could be cached as well (the purpose of this issue #330) through actions/setup-python@v4

Axeln78 avatar Jul 13 '22 07:07 Axeln78

I wonder if it's actually worth it. Here I cached the content of ${{ env.pythonLocation }}/lib/site-packages and ${{ env.pythonLocation }}/Scripts using actions/cache:

No caching: image

Caching, no cache hit (+1m): image

Caching, cache hit (+18s): image
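
For context, a sketch of what such a setup could look like. The lib/site-packages and Scripts layout suggests a Windows runner, which is an assumption here, as are the step id, the key format and the requirements.txt file name:

```yaml
- uses: actions/setup-python@v4
  with:
    python-version: '3.10'

- id: site_packages_cache
  uses: actions/cache@v3
  with:
    # Installed packages plus the generated launcher executables
    path: |
      ${{ env.pythonLocation }}/lib/site-packages
      ${{ env.pythonLocation }}/Scripts
    key: site-packages-${{ runner.os }}-${{ env.pythonLocation }}-${{ hashFiles('requirements.txt') }}

# Skip installation entirely when the exact key was restored
- if: steps.site_packages_cache.outputs.cache-hit != 'true'
  run: python -m pip install -r requirements.txt
```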

Avasam avatar Nov 24 '22 21:11 Avasam

@Avasam possibly at least less strain on pypi. Also we should test small and big amounts of dependencies

dhvcc avatar Dec 02 '22 09:12 dhvcc

Just wanted to add an anecdote of my own experience. TorchGeo has a long list of dependencies:

Install times without caching vary quite a bit by OS and Python version:

Python Linux macOS Windows
3.10 2m 30s 2m 23s 5m 4s
3.9 2m 50s 4m 50s 5m 49s
3.8 2m 29s 2m 12s 3m 19s

We first tried using the cache feature of setup-python:

    - name: Set up python
      uses: actions/setup-python@v4
      with:
        python-version: ${{ matrix.python-version }}
        cache: 'pip'
        cache-dependency-path: |
          requirements/required.txt
          requirements/datasets.txt
          requirements/tests.txt
    - name: Install pip dependencies
      run: pip install -r requirements/required.txt -r requirements/datasets.txt -r requirements/tests.txt

Not only do install times not significantly improve, in many cases they're actually worse!

Python Linux macOS Windows
3.10 2m 42s 1m 53s 5m 50s
3.9 2m 50s 2m 11s 5m 46s
3.8 2m 39s 3m 21s 2m 35s

Finally, we tried the setup proposed in this blog that manually caches the entire Python installation:

    - name: Set up python
      uses: actions/setup-python@v4
      with:
        python-version: ${{ matrix.python-version }}
    - name: Cache dependencies
      uses: actions/cache@v3
      id: cache
      with:
        path: ${{ env.pythonLocation }}
        key: ${{ env.pythonLocation }}-${{ hashFiles('requirements/required.txt') }}-${{ hashFiles('requirements/datasets.txt') }}-${{ hashFiles('requirements/tests.txt') }}
    - name: Install pip dependencies
      if: steps.cache.outputs.cache-hit != 'true'
      run: pip install -r requirements/required.txt -r requirements/datasets.txt -r requirements/tests.txt

This resulted in significantly faster installation times, which could likely be further improved by only caching the site-packages directory:

Python Linux macOS Windows
3.10 38s 39s 4m 11s
3.9 53s 45s 4m 20s
3.8 1m 2s 1m 15s 1m 21s

Apparently slower Windows caching is a known issue: https://github.com/actions/cache/issues/752.

So yes, if setup-python also cached installed packages, that would be awesome!

adamjstewart avatar Feb 05 '23 00:02 adamjstewart

which could likely be further improved by only caching the site-packages directory

In hindsight, this is a bad idea, many tools like black or flake8 also install files into bin so we'll at least need to cache bin too.
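
A sketch of caching both directories on a Linux or macOS runner, assuming the usual lib/pythonX.Y/site-packages layout under pythonLocation and relying on the wildcard support of the actions/cache path input. The step id and key format are illustrative, and this has not been timed against caching the whole installation:

```yaml
- name: Cache installed packages and entry-point scripts
  id: cache
  uses: actions/cache@v3
  with:
    path: |
      ${{ env.pythonLocation }}/lib/python*/site-packages
      ${{ env.pythonLocation }}/bin
    # On GitHub-hosted runners pythonLocation includes the full Python
    # version, so a patch bump changes the key. The glob hashes every
    # .txt file under requirements/, not only the three installed below.
    key: ${{ env.pythonLocation }}-${{ hashFiles('requirements/*.txt') }}
- name: Install pip dependencies
  if: steps.cache.outputs.cache-hit != 'true'
  run: pip install -r requirements/required.txt -r requirements/datasets.txt -r requirements/tests.txt
```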

adamjstewart avatar Feb 07 '23 15:02 adamjstewart

In hindsight, this is a bad idea, many tools like black or flake8 also install files into bin so we'll at least need to cache bin too.

I addressed this point a while ago (above) - recap here:

I'm experimenting with this at the moment and caching site-packages (read: pip output) isn't straightforward either; for instance binary wrappers (black, ..) won't work (python -m black works fine tho). Might be one of those YMMV cases that makes it hard to standardize for everyone.

So, instead of invoking black, do python -m black.

jbergstroem avatar Feb 07 '23 19:02 jbergstroem

That's a decent workaround, but I don't think it's realistic to expect all users to change how they invoke other steps later in their workflow. I think we would have to cache bin too. Possibly everything. Bonus of caching everything is that we have to install Python from a cache anyway.

adamjstewart avatar Feb 07 '23 19:02 adamjstewart

That's a decent workaround, but I don't think it's realistic to expect all users to change how they invoke other steps later in their workflow. I think we would have to cache bin too. Possibly everything. Bonus of caching everything is that we have to install Python from a cache anyway.

Most definitely not a catch-all! To be honest I'm not confident there's a straightforward solution..

jbergstroem avatar Feb 07 '23 19:02 jbergstroem

The workaround from @adamjstewart seems to work wonders indeed! But I think a standard implementation from this repository would be a great addition. Any updates on it from the dev team?

Seluj78 avatar Feb 22 '23 08:02 Seluj78

Just wanted to add an anecdote of my own experience. TorchGeo has a long list of dependencies:

Install times without caching vary quite a bit by OS and Python version:

Python Linux macOS Windows
3.10 2m 30s 2m 23s 5m 4s
3.9 2m 50s 4m 50s 5m 49s
3.8 2m 29s 2m 12s 3m 19s

We first tried using the cache feature of setup-python:

    - name: Set up python
      uses: actions/setup-python@v4
      with:
        python-version: ${{ matrix.python-version }}
        cache: 'pip'
        cache-dependency-path: |
          requirements/required.txt
          requirements/datasets.txt
          requirements/tests.txt
    - name: Install pip dependencies
      run: pip install -r requirements/required.txt -r requirements/datasets.txt -r requirements/tests.txt

Not only do install times not significantly improve, in many cases they're actually worse!

Python Linux macOS Windows
3.10 2m 42s 1m 53s 5m 50s
3.9 2m 50s 2m 11s 5m 46s
3.8 2m 39s 3m 21s 2m 35s

Finally, we tried the setup proposed in this blog that manually caches the entire Python installation:

    - name: Set up python
      uses: actions/setup-python@v4
      with:
        python-version: ${{ matrix.python-version }}
    - name: Cache dependencies
      uses: actions/cache@v3
      id: cache
      with:
        path: ${{ env.pythonLocation }}
        key: ${{ env.pythonLocation }}-${{ hashFiles('requirements/required.txt') }}-${{ hashFiles('requirements/datasets.txt') }}-${{ hashFiles('requirements/tests.txt') }}
    - name: Install pip dependencies
      if: steps.cache.outputs.cache-hit != 'true'
      run: pip install -r requirements/required.txt -r requirements/datasets.txt -r requirements/tests.txt

This resulted in significantly faster installation times, which could likely be further improved by only caching the site-packages directory:

Python Linux macOS Windows
3.10 38s 39s 4m 11s
3.9 53s 45s 4m 20s
3.8 1m 2s 1m 15s 1m 21s

Apparently slower Windows caching is a known issue: actions/cache#752.

So yes, if setup-python also cached installed packages, that would be awesome!

This right here has been a life saver for me - I toiled over this caching for so long, but this got me there!! Thank you so so so much!!

CoreyGaunt avatar Apr 07 '23 21:04 CoreyGaunt