
feat: use `packaging` to parse requirements

Open mkniewallner opened this issue 1 year ago • 3 comments

PR Checklist

  • [x] A description of the changes is added to the description of this PR.
  • [ ] If there is a related issue, make sure it is linked to this PR.
  • [x] If you've fixed a bug or added code that should be tested, add tests!
  • [ ] If you've added or modified a feature, documentation in docs is updated

Description of changes

This is something that has been on my mind for quite some time now.

We currently rely on several regexes to parse dependencies in requirements files. Although this lets us parse formats that pip handles, many of those formats are not valid PEP 508, since PEP 508 requires both remote and local dependencies to follow the `<package> @ <url-or-path>` syntax. Even the pip documentation suggests using the PEP 508 format.

Using regexes makes the parsing best-effort at best, and it can also create false positives. For instance, for what looks like a git URL, we try to guess the package name from the git project name in the URL, which can vary depending on the git server used; worse, the git project name can differ from the actual Python package name.

This PR suggests using packaging, maintained by PyPA, to parse dependencies wherever we expect the PEP 508 format (requirements files, PEP 621 metadata). This would drop support for URLs that do not follow PEP 508, so it is a breaking change we would have to mention in the changelog, if we effectively decide to go this way.
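To illustrate the idea, here is a minimal sketch (not deptry's actual code) of how `packaging` handles both cases, assuming the `packaging` library is installed:

```python
# Sketch: rely on packaging (a PyPA project) to parse PEP 508 requirement
# strings instead of hand-rolled regexes.
from packaging.requirements import InvalidRequirement, Requirement

# A PEP 508 direct reference: the package name is explicit, no guessing needed.
req = Requirement("foo-bar @ git+https://github.com/baz/foo-bar.git@asd")
print(req.name)  # foo-bar
print(req.url)   # git+https://github.com/baz/foo-bar.git@asd

# A bare pip-style VCS URL is *not* valid PEP 508, so parsing fails loudly
# instead of guessing a name from the URL.
try:
    Requirement("git+https://github.com/baz/foo-bar.git@asd#egg=foo-bar")
except InvalidRequirement:
    print("not a PEP 508 requirement")
```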

mkniewallner avatar Jun 16 '24 15:06 mkniewallner

Codecov Report

Attention: Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.1%. Comparing base (0f0a1c6) to head (9f0bb47). Report is 173 commits behind head on main.

Files with missing lines Patch % Lines
python/deptry/dependency_getter/pep_621.py 87.5% 0 Missing and 1 partial :warning:
Additional details and impacted files
@@           Coverage Diff           @@
##            main    #735     +/-   ##
=======================================
+ Coverage   92.8%   93.1%   +0.3%     
=======================================
  Files         35      35             
  Lines        920     888     -32     
  Branches     165     154     -11     
=======================================
- Hits         854     827     -27     
+ Misses        52      49      -3     
+ Partials      14      12      -2     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jun 16 '24 15:06 codecov[bot]

I do like the idea of using packaging to extract the dependencies instead of using our own regexes; I think that is an improvement. As I understand it, the only breaking change we are aware of is for parsing requirements in requirements.txt in one of the following forms, right?

https://github.com/urllib3/urllib3/archive/refs/tags/1.26.8.zip
git+https://github.com/baz/foo-bar.git@asd#egg=foo-bar

In the pip documentation you linked, they suggest using the PEP 508 format for installing from a package index. But they also show that other formats are supported for packages that do not come from a package index. So I do think it would be good to keep supporting the formats in requirements.txt that we currently support, to reduce the risk of a breaking change.

Can we maybe do both for requirements.txt files? First try to extract the dependency with packaging, and if that fails, fall back to a regex to extract it from the URL? Or maybe we could use something different entirely, e.g. https://pypi.org/project/requirements-parser/?

fpgmaas avatar Jun 17 '24 06:06 fpgmaas

Can we maybe do both for requirements.txt files? First try to extract the dependency with packaging, and if that fails, fall back to a regex to extract it from the URL? Or maybe we could use something different entirely, e.g. https://pypi.org/project/requirements-parser/?

Between the 2 options, I'd personally prefer the first one, as packaging would not only be used to parse dependencies in requirements.txt files, but also in other formats that support PEP 508 (for instance [project.dependencies] in pyproject.toml).

I still think, though, that trying to guess the package name from a random URL over which we have no real control feels quite hacky, even if it gives the user the expected result most of the time.

I'll put back the PR as a draft for now until I find the time to get back to this.

mkniewallner avatar Jul 16 '24 22:07 mkniewallner