pip-audit
Performance is dramatically worse for `-r requirements` than without it.
Bug description
It is very slow.
Reproduction steps
`pip-audit` vs `pip-audit -r requirements.txt`: the `-r` version is unbearably slow (I've never seen it finish). I've reproduced the hang in Alpine and off VPN, so my `.pypirc` and VPN are not the cause.
Clone this. https://github.com/matthewdeanmartin/pip_audit_in_docker/
```dockerfile
FROM python:3.9-alpine
RUN pip install pip-audit
RUN mkdir -p /app
# Thrashing, not sure why it fails
RUN chown root:root /app
RUN chown root:root /tmp
# pip-audit will build cffi on 1st run, needs a C toolchain etc.
RUN apk add --no-cache libffi-dev build-base
WORKDIR /app
ENTRYPOINT ["pip-audit"]
```
Build and run like this:
```shell
docker build -t pip-audit .
docker run -v $PWD:/app -e PIP_AUDIT_LOGLEVEL=debug pip-audit -r requirements_for_safety.txt
```
With `-r` it is very slow (never finishes); without `-r` it finishes, but that isn't a scenario I care about: if I have to install malicious code before I can audit it, what good is that, eh?
Expected behavior
It is very fast.
> With -r it is very slow (never finishes), without -r it finishes, but it isn't a scenario I care about- if I have to install malicious code before I can audit it, what good is that, eh?
Just to make sure you aren't making an incorrect safety assumption here: `pip-audit -r`'s mode is not a security boundary.
For various historical reasons, it is impossible to fully resolve a Python package's dependency tree without executing arbitrary code in the general case. When you do `pip-audit -r ...`, all we do is perform the dependency tree resolution steps for you behind the scenes, with a very small amount of non-security isolation to keep packages from clobbering each other.
`pip-audit` is not designed or intended to prevent malicious code, i.e. code that intends to exploit your system. It's designed to audit packages for known vulnerabilities, i.e. code that might be exploited by a different agent.
All that being said: the `-r` mode can definitely be made faster, and we'll work on that! But it's no more secure against malicious code than `pip install` is, because that's all it does behind the scenes.
As a thought on improving this: @tetsuo-cpp what do you think about trying to share virtual environments between packages, whenever we need to fall back to source distributions? I suspect that would save us a great deal of time, since currently we install and upgrade `pip` and `wheel` each time we fall back to sdists.
My thought is that that should be sound, since we're operating from a requirements source that should be internally consistent to begin with (i.e., anything that should `pip install -r ...` correctly in a venv should also `pip audit -r ...` in a single hidden venv).
> `pip-audit` is not designed or intended to prevent malicious code, i.e. code that intends to exploit your system. It's designed to audit packages for known vulnerabilities, i.e. code that might be exploited by a different agent.
Wait, so `pip-audit` runs `pip install` on each package provided by `-r`? (I think the code that does so is here.) Obviously `pip install` would have been run by someone if the venv is live and pre-existing, but even if you just want to say "hey pip-audit, is `noblesse` malicious?", it will download it, install it, and then tell me, "why yes, it is, in fact it just stole your credit card numbers"?
I guess that would explain why `-r` is slow: it is installing n packages?
Anyhow, I think only ochrona (safe import mode) is thinking about proactive supply chain checks, so the general idea isn't really on anyone's radar.
In any case, a tool should not install anything without telling the user that it is installing something. Even pip gives you some trace of what it is up to.
> Wait, so `pip-audit` runs `pip install` on each package provided by `-r`? (I think the code that does so is here.) Obviously `pip install` would have been run by someone if the venv is live and pre-existing, but even if you just want to say "hey pip-audit, is `noblesse` malicious?", it will download it, install it, and then tell me, "why yes, it is, in fact it just stole your credit card numbers"?
Thanks for explaining @matthewdeanmartin, I agree that we need to elaborate on this. We've just merged in a change to the README to explain more about `pip-audit`'s security model and what assumptions are safe to make in https://github.com/trailofbits/pip-audit/commit/c7921461f50cab037c1b3fcd2f5b70aaf4af3111.
> As a thought on improving this: @tetsuo-cpp what do you think about trying to share virtual environments between packages, whenever we need to fall back to source distributions? I suspect that would save us a great deal of time, since currently we install and upgrade `pip` and `wheel` each time we fall back to sdists.
@woodruffw, that idea seems promising. Would you try to uninstall each package after we're done with it to keep the environment clean?
> Would you try to uninstall each package after we're done with it to keep the environment clean?
I think it would be okay to skip the uninstall step: all of the packages being installed during a `pip-audit` run should be consistent with each other anyway, so we don't care how cluttered the temporary venv gets. But perhaps I'm underthinking it.
Actually, I suppose this use case could cause problems:
`pip-audit -r foo.txt -r bar.txt`
...where `foo.txt` and `bar.txt` are both internally consistent, but not mutually compatible. So in that case, we'd probably want to have one virtual environment per requirements-style input fed into `pip-audit`.
> `pip-audit -r foo.txt -r bar.txt`

In this case, we should resolve the combined requirements together, not separately (similar to how `pip install` works). If they're not mutually compatible, that should be a dependency resolution error.
Another incremental improvement here: #194 will make `pip-audit -r` about 10% faster on contrived benchmarks, and probably even faster on real workloads.
@woodruffw
But if you keep packages installed in the virtual environment, isn't your `pip list` output going to get cluttered by other packages? That data ends up being fed into the resolver, so for a given package, you might be assigning it dependencies which actually belong to a bunch of things earlier in the requirements file.
Sorry, I'm probably just misunderstanding.
Nope, you understood correctly! That's a problem with what I'm proposing, and I don't have a good answer to it yet.
Braindump follows:
The big problem with our current approach is that we're essentially running `resolvelib` N + 1 times: once at the virtualenv level each time we attempt an sdist resolution step (via `pip install`), and then one final time at the "top" as we collect the entire dependency graph.
Composing them in this way is unsound, since `resolvelib` is tasked with making global (concretizing) dependency decisions at each individual step, while the "top" resolver has its own selection requirements that can't be simultaneously honored. The end result is #197.
I think there are two ways we can fix this:
- We can carry more information in our `Resolver` implementation, enabling it to backtrack further. Using #197 as an example, the trick might be keeping two sets of resolution information: the concrete versions at each step, and the (narrowed?) constraints at each step. Then, instead of our "top" `resolvelib` seeing two separate concrete versions of e.g. `setuptools`, it could instead check the compatibility of the constraints and select the maximal version that satisfies both. I'm not 100% sure what that looks like yet, partially because I don't fully understand the different knobs that `resolvelib` offers 😅
- We can give up on using `resolvelib` entirely, and go with a "naive" approach where we just create a new venv and forward the requirement files into it, collecting the fully resolved dependencies at the very end. We could make this roughly as responsive as the current approach by using the `communicate()` APIs for subprocesses, updating our `AuditState` each time we receive a line from the underlying `pip install` process.
#228 should make the `-r` mode much more responsive, if not actually faster.
A few things have changed since the last activity here: `pip-audit -r ...` now uses `pip` under the hood rather than its own `resolvelib`-based resolver, and thus should be approximately as fast as the equivalent `pip install -r` invocation.
Unless others are continuing to see significant performance issues here (i.e., beyond those commensurate with running `pip install -r ...`), I'm inclined to close this (and allow new issues to be opened for performance, where appropriate).