pip-audit Performance is dramatically worse for `-r requirements` than without it.

Bug description

It is very slow.

Reproduction steps

pip-audit vs pip-audit -r requirements.txt and the -r version is unbearably slow (I've never seen it finish.) So I've repro'd the hang in alpine and OFF VPN, so my .pypirc and VPN isn not the cause.

Clone this. https://github.com/matthewdeanmartin/pip_audit_in_docker/

FROM python:3.9-alpine
RUN pip install pip-audit
RUN mkdir -p app
# Thrashing, not sure why it fails
RUN chown root:root /app
RUN chown root:root /tmp
# pip-audit will install ccfi on 1st run, needs C++ etc.
RUN apk add --no-cache libffi-dev build-base
WORKDIR /app
ENTRYPOINT ["pip-audit"]

Build and run like this:

docker build -t pip-audit .
docker run -v $PWD:/app -e PIP_AUDIT_LOGLEVEL=debug pip-audit -r requirements_for_safety.txt

With -r it is very slow (never finishes), without -r it finishes, but it isn't a scenario I care about- if I have to install malicious code before I can audit it, what good is that, eh?

Expected behavior

It is very fast.

Dec 03 '21 20:12 matthewdeanmartin

With -r it is very slow (never finishes), without -r it finishes, but it isn't a scenario I care about- if I have to install malicious code before I can audit it, what good is that, eh?

Just to make sure you aren't making an incorrect safety assumption here: pip-audit -r's mode is not a security boundary.

For various historical reasons, it is impossible to fully resolve a Python package's dependency tree without executing arbitrary code in the general case. When you do pip-audit -r ..., all we do is perform the dependency tree resolution steps for you behind the scene, with a very small amount of non-security isolation to keep packages from clobbering each other.

pip-audit is not designed or intended to prevent malicious code, i.e. code that intends to exploit your system. It's designed to audit packages for known vulnerabilities, i.e. code that might be exploited by a different agent.

All that being said: the -r mode can definitely be made faster, and we'll work on that! But it's no more secure against malicious code than pip install is, because that's all it does behind the scenes.

Dec 03 '21 20:12 woodruffw

As a thought on improving this: @tetsuo-cpp what do you think about trying to share virtual environments between packages, whenever we need to fall back to source distributions? I suspect that would save us a great deal of time, since currently we install and upgrade pip and wheel each time we fall back to sdists.

My thought is that that should be sound, since we're operating from a requirements source that should be internally consistent to begin with (i.e., anything that should pip install -r ... correctly in a venv should also pip audit -r ... in a single hidden venv).

Dec 03 '21 21:12 woodruffw

pip-audit is not designed or intended to prevent malicious code, i.e. code that intends to exploit your system. It's designed to audit packages for known vulnerabilities, i.e. code that might be exploited by a different agent.

Wait, so pip-audit runs pip install on each package provided by -r? (I think the code that does so is here) Obviously pip install would have been run by someone if the venv is live and pre-existing, but even if you just want to say "hey pip-audit, is noblesse malicious, it will download it, install it and then tell me, "why yes, it is, in fact it just stole your credit card numbers"?

I guess that would explain why -r is slow, it is installing n packages?

Anyhow, I think only ochrona (safe import mode) is thinking about proactive supply chain checks, so the general idea isn't really on anyone's radar.

Anyhow, a tool should not install anything without telling the user that they are installing something. Even pip gives you some trace about what it is up to.

Dec 04 '21 01:12 matthewdeanmartin

Wait, so pip-audit runs pip install on each package provided by -r? (I think the code that does so is here) Obviously pip install would have been run by someone if the venv is live and pre-existing, but even if you just want to say "hey pip-audit, is noblesse malicious, it will download it, install it and then tell me, "why yes, it is, in fact it just stole your credit card numbers"?

Thanks for explaining @matthewdeanmartin, I agree that we need to elaborate on this. We've just merged in a change to the README to explain more about pip-audit's security model and what assumptions are safe to make in https://github.com/trailofbits/pip-audit/commit/c7921461f50cab037c1b3fcd2f5b70aaf4af3111.

Dec 04 '21 04:12 tetsuo-cpp

As a thought on improving this: @tetsuo-cpp what do you think about trying to share virtual environments between packages, whenever we need to fall back to source distributions? I suspect that would save us a great deal of time, since currently we install and upgrade pip and wheel each time we fall back to sdists.

@woodruffw, that idea seems promising. Would you try to uninstall each package after we're done with it to keep the environment clean?

Dec 04 '21 05:12 tetsuo-cpp

Would you try to uninstall each package after we're done with it to keep the environment clean?

I think it would be okay to skip the uninstall step: all of the packages being installed during a pip-audit run should be consistent with each other anyways, so we don't care how cluttered the temporary venv gets. But perhaps I'm underthinking it.

Actually, I suppose this usecase could cause problems:

pip-audit -r foo.txt -r bar.txt

...where foo.txt and bar.txt are both internally consistent, but not mutually compatible. So in that case, we'd probably want to have one virtual environment per requirements-style input fed into pip-audit.

Dec 06 '21 15:12 woodruffw

pip-audit -r foo.txt -r bar.txt

In this case, we should resolve the combined requirements together, not separately (similar to how pip install works). If they're not mutually compatible, that should be a dependency resolution error.

Dec 06 '21 16:12 di

Another incremental improvement here: #194 will make pip-audit -r about 10% faster on contrived benchmarks, and probably even faster on real workloads.

Dec 07 '21 17:12 woodruffw

@woodruffw But if you keep packages installed on the virtual environment, isn't your pip list output going to get cluttered by other packages? That data ends up being fed into the resolver. So for a package, you might be assigning it dependencies which actually belong to a bunch of things earlier in the requirements file.

Sorry, I'm probably just misunderstanding.

Dec 07 '21 22:12 tetsuo-cpp

Nope, you understood correctly! That's a problem with what I'm proposing, and I don't have a good answer to it yet.

Braindump follows:

The big problem with our current approach is that we're essentially running resolvelib N + 1 times: one at the virtualenv level each time we attempt an sdist resolution step (via pip install), and then one final time at the "top" as we collect the entire dependency graph.

Composing them in this way is unsound, since resolvelib is tasked with making global (concretizing) dependency decisions at each individual step while the "top" resolver has its own selection requirements that can't be simultaneously honored. The end result is #197.

I think there are two ways we can fix this:

We can carry more information in our Resolver implementation, enabling it to backtrack further. Using #197 as an example, the trick might be keeping two sets of resolution information: the concrete versions at each step, and the (narrowed?) constraints at each step. Then, instead of our "top" resolvelib seeing two separate concrete versions of e.g. setuptools, it could instead check the compatibility of the constraints and select the maximal version that satisfies both. I'm not 100% sure what that looks like yet, partially because I don't fully understand the different knobs that resolvelib offers 😅
We can give up on using resolvelib entirely, and go with a "naive" approach where we just create a new venv and forward the requirement files into it, collecting the fully resolved dependencies at the very end. We could make this roughly as responsive as the current approach by using the communicate() APIs for subprocesses, updating our AuditState each time we receive a line from the underlying pip install process.

Dec 07 '21 22:12 woodruffw

#228 should make the -r mode much more responsive, if not actually faster.

Jan 31 '22 20:01 woodruffw

A few things have changed since the last activity here: pip-audit -r ... now uses pip under the hood rather than its own resolvelib-based resolver, and thus should be approximately as fast as the equivalent pip install -r invocation.

Unless others are continuing to see significant performance issues here (i.e., beyond those commensurate with running pip install -r ...), I'm inclined to close this (and allow new issues to be opened for performance, where appropriate).

Aug 02 '23 03:08 woodruffw

pip-audit pip-audit copied to clipboard

Performance is dramatically worse for `-r requirements` than without it.

Bug description

Reproduction steps

Expected behavior

pip-audit
pip-audit copied to clipboard