
Multiple (duplicate?) distributions in trace files

Open chaselgrove opened this issue 6 years ago • 1 comments

In the traces at http://www.onerussian.com/tmp/niceman-traces-1.tgz, there are multiple venv distributions (with the same path and venv_version) and multiple debian distributions.

chaselgrove avatar Feb 07 '19 19:02 chaselgrove

duplicate?

Based on diffing the individual distributions in those files, there seem to be some differences (e.g., whether there are files associated with a package), though I've also found some that are identical.

The (or at least a) core issue here is that in 2cf46df0 we punted on merging the multiple iterations:

ENH: iterate over tracers until nothing new is discovered

Note: this is not yet complete since result would duplicate distributions
and possibly packages within them, since every pass is independent from
previous ones.

On a joint meeting of two of us we decided that we should leave the
resolution to the "unification" step which would pull those together
since "unification" was envisioned, and it would come handy anyways
for other usecases (e.g. joining multiple specs)

AFAIK this hasn't been revisited at all.
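For illustration only, here is a rough sketch of what merging the results of several passes could look like. This is not reproman's "unification" code: the function name `merge_distributions`, the plain-dict layout, and the choice of `(name, path, version)` as the identity key are all assumptions made for the example.

```python
def merge_distributions(distributions):
    """Merge distributions that share the same identity.

    Hypothetical data model: each distribution is a dict with "name",
    optional "path" and "version", and a "packages" list; each package is a
    dict with "name", optional "version", and a "files" list.
    """
    merged = {}  # (name, path, version) -> {(pkg name, pkg version) -> set of files}
    for dist in distributions:
        key = (dist["name"], dist.get("path"), dist.get("version"))
        packages = merged.setdefault(key, {})
        for pkg in dist.get("packages", []):
            pkg_key = (pkg["name"], pkg.get("version"))
            # Union the files so a later pass that saw no files for a
            # package does not erase what an earlier pass recorded.
            packages.setdefault(pkg_key, set()).update(pkg.get("files", []))
    return [
        {
            "name": name,
            "path": path,
            "version": version,
            "packages": [
                {"name": pname, "version": pversion, "files": sorted(files)}
                for (pname, pversion), files in packages.items()
            ],
        }
        for (name, path, version), packages in merged.items()
    ]
```

With something along these lines, two venv entries with the same path and version would collapse into one, and a package that has files in one pass but none in another would keep its files.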

Here's what seems to be going on. $ENV/bin/python was given as a path to trace. This isn't recognized as a virtualenv file, so it is added to the unknown files [*]. On the next iteration, this passes through DebTracer as an unknown file and so on until we get back to the virtualenv tracer. It detects that the $ENV/bin/python file is under the $ENV virtualenv, so it returns that distribution (this time without paths that were claimed by virtualenv in the previous iteration). I think any Debian or Git repo tracing that was triggered by the initial virtualenv tracing will be triggered again, leading to the multiple Debian and Git distributions.

So, it's an interaction between us not merging distributions after multiple retracing iterations and the virtualenv tracer getting confused by $ENV/bin/python. Based on a quick test, this patch would fix the virtualenv confusion:

diff --git a/reproman/distributions/venv.py b/reproman/distributions/venv.py
index f66b65e54..242c83ad0 100644
--- a/reproman/distributions/venv.py
+++ b/reproman/distributions/venv.py
@@ -160,7 +160,9 @@ def identify_distributions(self, files):
             # system wide installation of python
             for path in unknown_files.copy():
                 if is_subpath(path, venv_path) and op.islink(path):
-                    unknown_files.add(op.realpath(path))
+                    rpath = op.realpath(path)
+                    if not is_subpath(rpath, venv_path):
+                        unknown_files.add(rpath)
                     unknown_files.remove(path)
 
             packages = []

[*] It's a bit more complicated because we actually call realpath on it and get $ENV/bin/python2 or $ENV/bin/python3 and add that to the unknown files. We do this so that we can unlinkify, say, types.py -> /usr/lib/python2.7/types.py and then the DebTracer can use it on the next iteration.
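To make the effect of the patch above easier to see in isolation, here is a self-contained sketch of the link-resolution step. It is not the code in venv.py: `resolve_venv_links` is a made-up name, and `is_subpath` here is a simple lexical stand-in for reproman's helper, so its exact behavior may differ. It also assumes a POSIX system where symlinks are available.

```python
import os
import os.path as op


def is_subpath(path, directory):
    """Stand-in helper: True if `path` is `directory` or lies beneath it."""
    path = op.normpath(path)
    directory = op.normpath(directory)
    return path == directory or path.startswith(directory.rstrip(os.sep) + os.sep)


def resolve_venv_links(unknown_files, venv_path):
    """Replace symlinks under the venv with their targets, as in the patch.

    A link whose target is still inside the venv (e.g. bin/python ->
    bin/python3) is dropped rather than re-added, so it cannot make the
    same virtualenv show up again on a later pass.
    """
    remaining = set(unknown_files)
    for path in unknown_files:
        if is_subpath(path, venv_path) and op.islink(path):
            rpath = op.realpath(path)
            if not is_subpath(rpath, venv_path):
                # The target lives outside the venv (e.g. a system-wide
                # types.py), so leave it for another tracer to claim.
                remaining.add(rpath)
            remaining.remove(path)
    return remaining
```

The only change relative to the pre-patch behavior is the inner `is_subpath(rpath, venv_path)` check: targets outside the environment are still passed on, while targets inside it are no longer put back into the unknown files.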

kyleam avatar Feb 08 '19 16:02 kyleam