chore: improving efficiency of similar projects analyzer
Summary
The goal is to improve the efficiency of the SimilarProjectsAnalyzer, which currently downloads the source code tarball for every package of every maintainer it finds. Instead of downloading each package, this PR uses the tarball/wheel file structure shown on the package's inspector.pypi.io page, making web requests to extract that structure.
Description of changes
This PR changes how PyPI inspector links are handled by creating a separate PyPIInspectorAsset object, which contains information about the PyPI inspector URLs and can extract the project structure from a package URL. The WheelAbsenceAnalyzer is modified to use this object, which simplifies it, and the SimilarProjectsAnalyzer uses it to analyze the package structure.
The SimilarProjectsAnalyzer normalizes the structure by doing the following:
- Only considering Python files.
- Removing the `<package_name>-<version>` prefix.
- Removing the `<package_name>` from the top-level folder, resulting in a folder structure that does not contain the package name at the top level.
- Removing `setup.py` from tarballs.
This makes wheels and tarballs comparable when looking at the package structure. A unit test is written to demonstrate this. The SimilarProjectsAnalyzer then computes a hash of the normalized folder structure and compares it against the hashes of other projects made by the maintainers of the analyzed package. If at least one structure matches, the analyzer fails; it still continues looping so that all similar projects are collected.
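The normalization and hashing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `normalize_structure` and `structure_hash` are hypothetical, SHA-256 over the sorted paths is an assumed choice of hash, and the sketch assumes the package name matches the top-level folder name exactly (no hyphen/underscore normalization).

```python
import hashlib
from pathlib import PurePosixPath


def normalize_structure(paths, package_name, version):
    """Return sorted, normalized .py paths (hypothetical helper, not the PR's code)."""
    sdist_prefix = f"{package_name}-{version}"
    normalized = set()
    for raw in paths:
        parts = list(PurePosixPath(raw).parts)
        # Only consider Python files.
        if not parts or not parts[-1].endswith(".py"):
            continue
        # Remove the <package_name>-<version> prefix found in tarballs.
        if parts[0] == sdist_prefix:
            parts = parts[1:]
        # Drop the top-level setup.py, which wheels do not ship.
        if parts == ["setup.py"]:
            continue
        # Remove the package name from the top-level folder.
        if parts and parts[0] == package_name:
            parts = parts[1:]
        if parts:
            normalized.add("/".join(parts))
    return sorted(normalized)


def structure_hash(paths, package_name, version):
    """Hash the normalized structure (SHA-256 is an assumption for this sketch)."""
    joined = "\n".join(normalize_structure(paths, package_name, version))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

With this normalization, a tarball listing such as `demo-1.0/setup.py`, `demo-1.0/demo/core.py` and a wheel listing such as `demo/core.py`, `demo-1.0.dist-info/METADATA` reduce to the same set of paths and therefore the same hash.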
A known complication is that PyPI sits behind the Fastly CDN, which can return a JavaScript challenge response. Since PyPI inspector uses URLs rerouted from PyPI, programmatic requests made from Python to a PyPI inspector URL can receive these JavaScript challenges. This does not always happen, but it is a frequent occurrence. To accommodate this, the analyzer is written so that it does not raise HeuristicAnalyzerValueError, and it returns a SKIP result when it is unable to obtain package information.
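The SKIP-on-failure behaviour described above might look something like the sketch below. Everything here is illustrative: `HeuristicResult` is a stand-in enum, `fetch_text`, `compare_with_project`, and the `extract_hash` callback are hypothetical names, and the sketch does not try to recognise a challenge page directly. It simply treats any response from which no structure can be extracted as "unavailable".

```python
import urllib.error
import urllib.request
from enum import Enum
from typing import Callable, Optional


class HeuristicResult(Enum):
    """Stand-in result type for this sketch."""
    PASS = "PASS"
    FAIL = "FAIL"
    SKIP = "SKIP"


def fetch_text(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the response body, or None on any network or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None


def compare_with_project(
    url: str,
    own_hash: str,
    extract_hash: Callable[[str], Optional[str]],
    fetch: Callable[[str], Optional[str]] = fetch_text,
) -> HeuristicResult:
    """Compare one other project's structure hash against own_hash.

    extract_hash is a hypothetical callback that returns None when the page
    has no usable file listing (for example, a challenge page was served).
    """
    body = fetch(url)
    if body is None:
        return HeuristicResult.SKIP  # request failed: skip instead of raising
    other_hash = extract_hash(body)
    if other_hash is None:
        return HeuristicResult.SKIP  # page carried no package structure
    return HeuristicResult.FAIL if other_hash == own_hash else HeuristicResult.PASS
```

Passing `fetch` as a parameter keeps the network call replaceable in unit tests, which matters here because the real endpoint responds inconsistently.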
Checklist
- [x] I have reviewed the contribution guide.
- [x] My PR title and commits follow the Conventional Commits convention.
- [x] My commits include the "Signed-off-by" line.
- [x] I have signed my commits following the instructions provided by GitHub. Note that we run GitHub's commit verification tool to check the commit signatures. A green `verified` label should appear next to all of your commits on GitHub.
- [x] I have updated the relevant documentation, if applicable.
- [x] I have tested my changes and verified they work as expected.