Possible to add an option to limit the wheel cache size?
Just throwing this out there: we have a server that has run out of disk space a few times because the pip wheel cache grew to 42 GB.
I just wrote a cron job to clear out the old stuff on occasion, but I wonder if it makes sense to have an option for a maximum cache size, with pip itself deleting older wheels when it goes over the limit.
If having pip do this is out of scope, then we might just document the cron job in the wheel cache section of the docs, along with whatever the equivalent recipe would be on Windows.
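For the curious, the cron job is roughly along these lines (just a sketch; the cache path assumes the Linux default, and the 30-day window is arbitrary):

```
# Weekly, Sunday at 03:00: delete cached wheels that haven't been touched in 30 days.
# $HOME/.cache/pip is pip's default cache location on Linux; adjust as needed.
0 3 * * 0  find "$HOME/.cache/pip/wheels" -type f -mtime +30 -delete
```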
42GB o.O That doesn't sound right... is it caching something it's not supposed to be?
My entire pip cache is only 186MB, and only 53MB of that is from wheels. Can you give more information about what's going on?
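(For reference, on Linux the default cache lives under ~/.cache/pip, so something like this would show where the space is going; adjust the path if you've relocated the cache:)

```
du -sh ~/.cache/pip/*
```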
Well yes, there is perhaps something a little unique here: we have one particular internal package that is very large because it contains a bunch of GIFs and JPGs and such. This means installing it creates very large wheels; wheels that we don't actually need.
I am going to look into passing --no-binary anonweb to prevent pip from creating these wheels.
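If I understand the option correctly, that would look something like this (same placeholder path as we actually use):

```
# Intended to keep pip from building (and caching) a wheel for anonweb,
# installing straight from the sdist instead.
pip install --no-binary anonweb /path/to/anonweb-version-number-branch-names-etc.tar.gz
```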
But if other folks start creating large packages, it could crop up again.
If this is too edge case for pip to handle, that's fine; just throwing it out there in case other folks have large packages or small (embedded?) systems with limited disk.
Oh, I'm not opposed to the idea in general; I was just confused about why it was growing to 42 GB, which was surprising to me.
Are you installing this package like pip install anonweb or is it pip install -e git+https://something/anonweb.git#egg=anonweb?
IIRC, we're doing pip install /path/to/anonweb-version-number-branch-names-etc.tar.gz.
We're basically victims of the unfortunate combination of two factors:
- large packages
- Continuous Deployment
Even in our environment, there is only one package that has this combination.
Might be helpful to version your assets separately from code... but this could be a useful opt-in feature.
-e might even work as a hack, so you're not duplicating files all over the place.
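Something like this, assuming you have a local checkout (the path is illustrative):

```
# Editable install from a local checkout: no wheel gets built,
# so nothing new lands in the wheel cache.
pip install -e /path/to/anonweb
```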
Ooh, the -e might work to prevent this. Going to look into that. Thanks, @Ivoz!
@msabramo, one option for you is to author your package to download that data on demand, or to provide your own entry point (console command) that downloads the large data. This would keep it out of the wheel (cache), but still allow you to install it into the environment.
The nltk package externalizes its model data, but I think that is because of governance and provenance issues.
As a note, there's been a pip cache command for inspecting and managing the cache for a while. This means, as a workaround, you can periodically do pip cache remove <pattern> (e.g. pip cache remove anonweb if the problematic package is called "anonweb").
It doesn't fix the underlying problem, but it should at least make it less disruptive.
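For example, using the "anonweb" name from above:

```
pip cache info            # show cache location, size, and number of wheels
pip cache remove anonweb  # remove cached wheels matching "anonweb"
pip cache purge           # or clear the entire cache
```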
pls
For CI systems with a persistent cache, it would be very desirable to have the cache size checked and limited by pip itself. Other tools like ccache have such a limit and even enable it by default. I would be very happy if this feature could be added to the roadmap. 😊
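Until pip can do this itself, here is a rough sketch of what a CI job could run to keep the cache under a size limit (the limit, path, and deletion strategy are all just illustrative):

```
#!/bin/sh
# Trim pip's cache to roughly MAX_KB by deleting the oldest files first.
CACHE_DIR="${PIP_CACHE_DIR:-$HOME/.cache/pip}"
MAX_KB=$((5 * 1024 * 1024))   # ~5 GB; pick whatever fits your runner

while [ "$(du -sk "$CACHE_DIR" | cut -f1)" -gt "$MAX_KB" ]; do
    # GNU find: print "<mtime> <path>" and pick the least recently modified file.
    oldest=$(find "$CACHE_DIR" -type f -printf '%T@ %p\n' | sort -n | head -n 1 | cut -d' ' -f2-)
    [ -n "$oldest" ] || break
    rm -f "$oldest"
done
```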
Pip currently operates on an all-volunteer basis, with a few active maintainers contributing in their spare time, so there aren't any active "projects" or "roadmaps".
If someone has a good idea for a clear design and they or someone else can submit a simple, high-quality PR, I would be happy to review it. But just as a heads up, there is unlikely to be review capacity for any PR that involves a large change or refactor of the current cache design.