Scanning multiple directories scans too much

Open rspier opened this issue 2 years ago • 6 comments

Description

Please leave a brief description of the bug or feature request:

How To Reproduce

Tell us how to reproduce the issue.

We have a giant third_party/ directory. GIANT! Scanning one package works fine, but when scanning two at once, scancode scans things outside of those directories.

$ scancode -n 129 --copyright --license --package --json /tmp/out.json  --max-in-memory 0 third_party/curl third_party/zlib
Setup plugins...
Collect file inventory...

It appears to hang there, but strace shows that it's actually scanning things outside of the curl and zlib directories, which will take a long time.

System configuration

For bug reports, it really helps us to know:

  • What OS are you running on? (Windows/MacOS/Linux): Debian testing based system.
  • What version of scancode-toolkit was used to generate the scan file? 32.0.4
  • What installation method was used to install/run scancode? (pip/source download/other)

Jul 12 '23 23:07 rspier

Ah, that's a flaw alright. When passing multiple input paths, I think that the current behaviour is to find the shared common root ancestor directory and "ignore" all parts that are not in the provided paths. That's a bad and stupid behaviour indeed.
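For illustration only, here is a minimal Python sketch of the behaviour described above, using just the standard library. It is not scancode's actual code; the function name `walk_common_root_then_filter` is hypothetical. It shows why the cost grows with the size of the shared ancestor rather than with the size of the requested inputs: every file under the common root is visited, and the filter is applied afterwards.

```python
import os


def walk_common_root_then_filter(input_paths):
    """Hypothetical sketch: walk the shared ancestor, then filter.

    Returns (root, number_of_files_visited, files_kept).
    Assumes every input path is a directory.
    """
    inputs = [os.path.abspath(p) for p in input_paths]
    # The shared common ancestor, e.g. third_party/ for curl and zlib.
    root = os.path.commonpath(inputs)

    visited = 0
    kept = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            visited += 1  # every file under root costs a visit
            full = os.path.join(dirpath, name)
            # Keep only files located under one of the requested inputs.
            if any(full.startswith(p + os.sep) for p in inputs):
                kept.append(full)
    return root, visited, sorted(kept)
```

With inputs such as third_party/curl and third_party/zlib, the walk covers all of third_party/, so `visited` counts every file in that tree even though only the two requested subtrees end up in `kept`.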

Jul 13 '23 14:07 pombredanne

@JonoYang @AyanSinhaMahapatra what do you think could be the way to improve this?

Aug 03 '23 07:08 pombredanne

@pombredanne there are the new paths you added to the Codebase model in https://github.com/nexB/commoncode/pull/42. Instead of using the include plugin to handle multiple paths, can't we use those directly? I'm looking into this more.

Aug 03 '23 09:08 AyanSinhaMahapatra

I had some time to poke at this this afternoon, and it's not straightforward.

@pombredanne @AyanSinhaMahapatra Do you have any documentation on how paths is supposed to work? If I'm understanding properly, it's intended to be a set of subdirectories of the root (common_prefix) to filter to. On the surface, this seems more complicated than just iterating over multiple directories and concatenating the results. (So I'm trying to understand the rationale.)
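To make the "iterate over multiple directories and concatenate" alternative concrete, here is a rough standard-library sketch. It is an assumption-laden illustration, not code from scancode or commoncode: the name `walk_each_input` is hypothetical, inputs are assumed to be directories, and de-duplication is added so that overlapping inputs do not yield the same file twice.

```python
import os


def walk_each_input(input_paths):
    """Hypothetical sketch: walk each input directory on its own.

    Only the requested trees are visited; results are concatenated
    in input order, skipping files already seen.
    """
    seen = set()
    results = []
    for path in input_paths:
        top = os.path.abspath(path)
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                full = os.path.join(dirpath, name)
                if full not in seen:  # guards against overlapping inputs
                    seen.add(full)
                    results.append(full)
    return results
```

This sketch never touches sibling directories of the inputs, but it also produces a flat list with no single root, which is the part that conflicts with a design built around one root directory per scan.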

It also looks like this isn't fully wired up yet. I started with commit 822cc91d895f1f, and started working through failures. There seem to be some mismatched assumptions about absolute vs relative paths and representation.

I went looking for tests for _create_resources_from_paths (which I think is where the main issues are), but there aren't any that look quite like what I'm looking for. (Although there are some for Codebase).

Anyway, wanted to reach out before I went any deeper...

Thanks-

Oct 20 '23 23:10 rspier

> On the surface, this seems more complicated than just iterating over multiple directories and concatenating the results. (So I'm trying to understand the rationale.)

That's an inherited technical wart and debt. The original design was to say that a scan would always have a single root directory.

Nov 18 '23 08:11 pombredanne

Related: https://github.com/nexB/commoncode/issues/35

Nov 20 '23 13:11 AyanSinhaMahapatra