
Switch Python `/simple` and `/<project>/` APIs to using the JSON-based format (PEP-691)

Open jeffwidman opened this issue 2 years ago • 3 comments

Code improvement description

This is mostly a brain dump of research I did this evening into the current state of the PyPI APIs, as part of https://github.com/dependabot/dependabot-core/issues/5723, and into whether any changes can or should be made in Dependabot:

Warehouse / PyPI exposes several APIs that can return JSON:

/simple

  • Provides an index of all packages.
  • Defaults to HTML
  • Can be requested in JSON format, as codified in PEP-691.
  • The HTML version is currently used by Dependabot for fetching available versions: https://github.com/dependabot/dependabot-core/blob/efc538ce5f1b424c20b30083875d426b61aaa4b7/python/lib/dependabot/python/update_checker/latest_version_finder.rb#L141-L231
  • Would be nice to migrate to the JSON variant as parsing static HTML is never fun
  • Probably blocked by lack of support in private registry implementations:
    • https://github.com/pypiserver/pypiserver/issues/508
    • https://github.com/devpi/devpi/issues/986
    • Artifactory: unknown
    • GitLab's pypi implementation: unknown
    • Cloudsmith: unknown
    • GemFury: unknown
    • Sonatype Nexus: unknown
    • Others??
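For reference, a minimal sketch of what PEP-691 content negotiation looks like from the client side. The content types and the `Accept` header follow the example in the PEP; the function name `response_format` is made up for illustration and is not anything that exists in Dependabot.

```python
from typing import Optional

# Content types defined by PEP 691, most preferred first.
JSON_V1 = "application/vnd.pypi.simple.v1+json"
HTML_V1 = "application/vnd.pypi.simple.v1+html"
LEGACY_HTML = "text/html"

# Accept header with quality values, following the example in PEP 691.
ACCEPT = f"{JSON_V1}, {HTML_V1};q=0.2, {LEGACY_HTML};q=0.01"


def response_format(content_type: Optional[str]) -> str:
    """Classify a Simple API response as "json" or "html" (hypothetical helper)."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type == JSON_V1:
        return "json"
    # Assumption: a registry without PEP 691 support answers with HTML,
    # so anything that is not the JSON type is treated as HTML here.
    return "html"
```

A client would send `ACCEPT` with the request and pick a parser based on the `Content-Type` it gets back, which is exactly the dual-path handling that makes a migration awkward while registries lag behind.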

/<package-name>/

  • Provides some version details about the package
  • Can be requested in JSON format, as codified in PEP-691.
  • Used by Dependabot here: https://github.com/dependabot/dependabot-core/blob/efc538ce5f1b424c20b30083875d426b61aaa4b7/python/lib/dependabot/python/update_checker/latest_version_finder.rb#L219-L224
  • As with /simple, we probably can't migrate this to the JSON API until/unless private registries support it.
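To show why the JSON variant is attractive, here is a rough sketch of pulling versions out of a PEP-691 project response. The `files`, `filename`, and `yanked` keys come from the PEP; the sample payload, the `example.invalid` URLs, and both function names are invented for illustration. The filename parsing is deliberately naive and assumes standard wheel and sdist naming.

```python
import json
from typing import List, Optional

# Hypothetical PEP 691 project response; the URLs are placeholders.
SAMPLE = json.dumps({
    "meta": {"api-version": "1.0"},
    "name": "smart-open",
    "files": [
        {"filename": "smart_open-5.2.1.tar.gz",
         "url": "https://files.example.invalid/a", "hashes": {}},
        {"filename": "smart_open-7.0.4-py3-none-any.whl",
         "url": "https://files.example.invalid/b", "hashes": {}},
        {"filename": "smart_open-7.0.3-py3-none-any.whl",
         "url": "https://files.example.invalid/c", "hashes": {},
         "yanked": "hypothetical reason"},
    ],
})


def version_from_filename(filename: str) -> Optional[str]:
    """Naively extract the version from a wheel or sdist file name."""
    if filename.endswith(".whl"):
        # name-version(-build)?-python-abi-platform.whl
        parts = filename[: -len(".whl")].split("-")
        return parts[1] if len(parts) >= 5 else None
    for ext in (".tar.gz", ".zip"):
        if filename.endswith(ext):
            # name-version.ext; the version is whatever follows the last hyphen.
            _, sep, version = filename[: -len(ext)].rpartition("-")
            return version if sep else None
    return None


def available_versions(body: str) -> List[str]:
    """Return distinct versions of non-yanked files, in order of appearance."""
    files = json.loads(body).get("files", [])
    # "yanked" may be a boolean or a reason string; any truthy value means yanked.
    versions = (version_from_filename(f["filename"])
                for f in files if not f.get("yanked"))
    return list(dict.fromkeys(v for v in versions if v))
```

Compared with scraping anchors out of HTML, everything needed is already a structured field, which is the main appeal of the migration.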

/pypi/<package-name>/json

  • A per-project JSON API: https://warehouse.pypa.io/api-reference/json.html#project.
  • This has not yet been codified into a standard and, per the comments on https://github.com/pypa/packaging-problems/issues/367, is heavily affected by PyPI implementation details, so it may never be.
  • Today Dependabot uses this to fetch metadata for Python packages: https://github.com/dependabot/dependabot-core/blob/efc538ce5f1b424c20b30083875d426b61aaa4b7/python/lib/dependabot/python/metadata_finder.rb#L165
  • Not supported by most private registry implementations:
    • https://github.com/devpi/devpi/issues/801
    • https://github.com/pypiserver/pypiserver/issues/437
    • I'd be surprised if any of the hosted providers support this endpoint
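As a rough illustration of the kind of metadata this endpoint provides, the sketch below collects URLs from the `info` object of a response. The `home_page` and `project_urls` fields are part of the Warehouse JSON response; the function name is hypothetical, and this does not claim to mirror which fields Dependabot's metadata finder actually reads.

```python
from typing import Any, Dict, List


def candidate_source_urls(payload: Dict[str, Any]) -> List[str]:
    """Collect distinct URLs from the `info` object (hypothetical helper)."""
    info = payload.get("info") or {}
    urls: List[str] = []
    home_page = info.get("home_page")
    if home_page:
        urls.append(home_page)
    # `project_urls` may be null, so fall back to an empty mapping.
    for url in (info.get("project_urls") or {}).values():
        if url and url not in urls:
            urls.append(url)
    return urls
```

A registry that only serves the simple index has no equivalent of this data, so there is nothing to fall back to when the endpoint is missing.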

Conclusion:

  1. The one JSON API that's non-standard is the one we use in Dependabot, because that's the only way to retrieve the metadata information.
  2. The other APIs for fetching available versions now have a PEP standardizing how they should expose their data via JSON in addition to static HTML.
  3. Today Dependabot fetches these via static HTML.
  4. We are probably blocked for the foreseeable future from migrating those APIs to use JSON because it'd break all the private registries.
  5. And running both JSON and HTML parsing paths doesn't make a lot of sense, at least right now, because it adds complexity with no real benefit.

jeffwidman avatar Aug 01 '23 05:08 jeffwidman

This will likely have to sit on the backlog for several years until PEP-691 (adopted last year) sees more widespread adoption.

jeffwidman avatar Aug 01 '23 05:08 jeffwidman

Related ticket with more technical info:

  • https://github.com/devpi/devpi/issues/1018

jeffwidman avatar Apr 12 '24 17:04 jeffwidman

FWIW, Dependabot is already broken on simple-only indexes. We're currently trying to use Dependabot with an internal Artifactory (with a simple index, no JSON), and Dependabot tries to use JSON. I think the request is coming from here:

https://github.com/dependabot/dependabot-core/blob/e5ec7e979/python/lib/dependabot/python/update_checker.rb#L270

But with config like this:

version: 2
registries:
  python-artifactory:
    type: python-index
    url: https://redacted-internal-server-name/artifactory/api/pypi/pypi/simple/
    replaces-base: true
updates:
  - package-ecosystem: "pip"
    # dependabot will not run without this
    # https://docs.github.com/en/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file#insecure-external-code-execution
    insecure-external-code-execution: allow
    directory: "/"
    registries:
      - python-artifactory
    schedule:
      interval: "daily"

We get errors like this in the log when Dependabot runs on our internal runners:

  proxy | 2024/05/09 11***17***05 [021] GET https***//redacted-internal-server-name***443/artifactory/api/pypi/pypi/simple/smart-open/
2024/05/09 11***17***05 [021] 200 https***//redacted-internal-server-name***443/artifactory/api/pypi/pypi/simple/smart-open/
updater | 2024/05/09 11***17***05 INFO <job_826040134> Filtered out 2 pre-release versions
updater | 2024/05/09 11***17***05 INFO <job_826040134> Requirements to unlock own
2024/05/09 11***17***05 INFO <job_826040134> Requirements update strategy bump_versions
updater | 2024/05/09 11***17***05 INFO <job_826040134> Updating smart-open from 5.2.1 to 7.0.4
  proxy | 2024/05/09 11***17***06 [023] GET https***//redacted-internal-server-name***443/pypi/smart-open/json
  proxy | 2024/05/09 11***17***06 [023] 404 https***//redacted-internal-server-name***443/pypi/smart-open/json

Note those last two URLs. Dependabot is not respecting the config where we've told it the main URL to use; instead it invents its own path prefix for the lookup. However, since this is a simple-only index, the request would 404 even if Dependabot used the correct path prefix.
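To make the difference concrete, here is a small sketch that reproduces the shape of the two URLs in the log from a configured index URL. The host `index.example.invalid` and all function names are placeholders; this only illustrates the observed requests and is not a claim about how Dependabot builds them internally. The name normalization is the PEP 503 rule.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_project_url(index_url: str, name: str) -> str:
    """Project page under the configured simple index (first request in the log)."""
    return index_url.rstrip("/") + "/" + normalize(name) + "/"


def host_root_json_url(index_url: str, name: str) -> str:
    """Keep only scheme and host, then append /pypi/<name>/json (the 404 in the log)."""
    parts = urlsplit(index_url)
    path = f"/pypi/{normalize(name)}/json"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

With an index URL of `https://index.example.invalid/artifactory/api/pypi/pypi/simple/`, the first function stays under the configured prefix while the second drops `/artifactory/api/pypi/pypi/simple/` entirely, which matches what the proxy log shows.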

bewinsnw avatar May 09 '24 15:05 bewinsnw