Switch Python `/simple` and `/<project>/` APIs to use the JSON-based format (PEP-691)
Code improvement description
This is mostly a brain dump of research I did this evening into the current state of PyPI APIs as part of https://github.com/dependabot/dependabot-core/issues/5723 and whether any changes can/should be made in Dependabot:
Warehouse / PyPI exposes several APIs that can return JSON:
/simple
- Provides an index of all packages.
- Defaults to HTML
- Can be requested in JSON format, as codified in PEP-691.
- The HTML version is currently used by Dependabot for fetching available versions: https://github.com/dependabot/dependabot-core/blob/efc538ce5f1b424c20b30083875d426b61aaa4b7/python/lib/dependabot/python/update_checker/latest_version_finder.rb#L141-L231
- Would be nice to migrate to the JSON variant as parsing static HTML is never fun
- Probably blocked by lack of support in private registry implementations:
- https://github.com/pypiserver/pypiserver/issues/508
- https://github.com/devpi/devpi/issues/986
- Artifactory: unknown
- GitLab's pypi implementation: unknown
- Cloudsmith: unknown
- GemFury: unknown
- Sonatype Nexus: unknown
- Others??
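As a rough illustration of what "requested in JSON format" means for `/simple`: PEP-691 uses the `Accept` header to select the representation. The sketch below only builds the request and parses a hand-written sample body, so it makes no network call. The helper names (`build_index_request`, `project_names`) and the sample payload are made up for illustration and are not Dependabot code.

```python
import json
import urllib.request

# Content type defined by PEP 691 for version 1 of the JSON simple API.
PEP691_JSON = "application/vnd.pypi.simple.v1+json"


def build_index_request(index_url: str) -> urllib.request.Request:
    """Build a GET request that asks the index for its JSON representation."""
    return urllib.request.Request(index_url, headers={"Accept": PEP691_JSON})


def project_names(body: str) -> list[str]:
    """Extract project names from a PEP 691 project-list response body."""
    payload = json.loads(body)
    return [project["name"] for project in payload["projects"]]


# Hand-written sample in the PEP 691 shape; NOT real PyPI data.
SAMPLE_INDEX = json.dumps(
    {
        "meta": {"api-version": "1.0"},
        "projects": [{"name": "smart-open"}, {"name": "requests"}],
    }
)

request = build_index_request("https://pypi.org/simple/")
names = project_names(SAMPLE_INDEX)
```

A registry that does not implement PEP-691 would typically answer such a request with HTML (or an error) instead, which is the compatibility concern listed above.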
/<package-name>/
- Provides some version details about the package
- Can be requested in JSON format, as codified in PEP-691.
- Used by Dependabot here: https://github.com/dependabot/dependabot-core/blob/efc538ce5f1b424c20b30083875d426b61aaa4b7/python/lib/dependabot/python/update_checker/latest_version_finder.rb#L219-L224
- Like `/simple`, we probably can't migrate this to the JSON API until/unless private registries support it.
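For comparison with the HTML parsing Dependabot does today, here is a minimal sketch of reading the per-project PEP-691 JSON response. It assumes only the keys PEP-691 defines (`files`, `filename`, and the optional `yanked`); the function name and the sample payload (including the URLs and hash placeholders) are invented for illustration.

```python
import json


def available_files(body: str, include_yanked: bool = False) -> list[str]:
    """Return the filenames listed in a PEP 691 project-detail response."""
    payload = json.loads(body)
    filenames = []
    for entry in payload["files"]:
        # "yanked" is optional: either a boolean or a non-empty reason string.
        if entry.get("yanked") and not include_yanked:
            continue
        filenames.append(entry["filename"])
    return filenames


# Hand-written sample in the PEP 691 shape; NOT real registry data.
SAMPLE_PROJECT = json.dumps(
    {
        "meta": {"api-version": "1.0"},
        "name": "smart-open",
        "files": [
            {
                "filename": "smart_open-7.0.4-py3-none-any.whl",
                "url": "https://files.example.invalid/smart_open-7.0.4-py3-none-any.whl",
                "hashes": {"sha256": "placeholder"},
            },
            {
                "filename": "smart_open-7.0.4.tar.gz",
                "url": "https://files.example.invalid/smart_open-7.0.4.tar.gz",
                "hashes": {"sha256": "placeholder"},
            },
            {
                "filename": "smart_open-0.0.1.tar.gz",
                "url": "https://files.example.invalid/smart_open-0.0.1.tar.gz",
                "hashes": {"sha256": "placeholder"},
                "yanked": "illustrative reason",
            },
        ],
    }
)

files = available_files(SAMPLE_PROJECT)
all_files = available_files(SAMPLE_PROJECT, include_yanked=True)
```

Versions would still have to be derived from the filenames, much as with the HTML variant; the gain is mainly that no HTML parsing is needed.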
/pypi/<package-name>/json
- A per-project JSON API: https://warehouse.pypa.io/api-reference/json.html#project.
- This has not yet been codified into a standard and per the comments on https://github.com/pypa/packaging-problems/issues/367 is heavily affected by PyPI implementation details, so may never get codified into a standard.
- Today we use this in Dependabot for Metadata fetching for Python packages: https://github.com/dependabot/dependabot-core/blob/efc538ce5f1b424c20b30083875d426b61aaa4b7/python/lib/dependabot/python/metadata_finder.rb#L165
- Not supported by most private registry implementations:
- https://github.com/devpi/devpi/issues/801
- https://github.com/pypiserver/pypiserver/issues/437
- I'd be surprised if any of the hosted providers support this endpoint
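To make the metadata use case concrete, here is a small sketch of pulling candidate source/homepage URLs out of a `/pypi/<package-name>/json` response. It relies on the `info.home_page` and `info.project_urls` fields that the Warehouse JSON API returns; the helper names and the sample payload are assumptions for illustration, and this is not how `metadata_finder.rb` is actually written.

```python
import json


def json_api_url(package_name: str) -> str:
    """URL of the per-project JSON API on pypi.org."""
    return f"https://pypi.org/pypi/{package_name}/json"


def candidate_urls(body: str) -> list[str]:
    """Collect homepage and project URLs from a Warehouse JSON API response."""
    info = json.loads(body)["info"]
    urls = []
    if info.get("home_page"):
        urls.append(info["home_page"])
    # "project_urls" may be null when a project declares none.
    urls.extend((info.get("project_urls") or {}).values())
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(urls))


# Hand-written sample; the URLs are placeholders, NOT real project metadata.
SAMPLE_METADATA = json.dumps(
    {
        "info": {
            "name": "smart-open",
            "version": "7.0.4",
            "home_page": "https://example.invalid/smart-open",
            "project_urls": {
                "Homepage": "https://example.invalid/smart-open",
                "Source": "https://example.invalid/smart-open/source",
            },
        }
    }
)

urls = candidate_urls(SAMPLE_METADATA)
```

Since there is no standard behind this endpoint, a private registry has nothing to implement against, which fits the lack of support noted above.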
Conclusion:
- The one JSON API that's non-standard is the one we use in Dependabot, because that's the only way to retrieve the metadata information.
- The other APIs for fetching available versions now have a PEP standardizing how they should expose their data via JSON in addition to static HTML.
- Today Dependabot fetches these via static HTML.
- We are probably blocked for the foreseeable future from migrating those APIs to use JSON because it'd break all the private registries.
- And running both JSON and HTML parsing paths doesn't make a lot of sense, at least right now, because it adds complexity with no real benefit.
This will likely have to sit on the backlog for several years until PEP-691 (adopted last year) sees more widespread adoption.
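For reference, this is roughly what running both paths side by side would involve, which is the extra branch the conclusion describes as added complexity. The `Accept` value follows the content-negotiation recommendation in PEP-691, and the client dispatches on the response `Content-Type`. This is a sketch with hypothetical names and hand-written sample bodies, not a proposal for Dependabot's implementation.

```python
import json
from html.parser import HTMLParser

PEP691_JSON = "application/vnd.pypi.simple.v1+json"

# Prefer JSON, but tell the server that HTML is acceptable as a fallback.
ACCEPT = ", ".join(
    [
        PEP691_JSON,
        "application/vnd.pypi.simple.v1+html;q=0.2",
        "text/html;q=0.01",
    ]
)


class _AnchorText(HTMLParser):
    """Collect the text of every <a> element (the filename, per PEP 503)."""

    def __init__(self):
        super().__init__()
        self.filenames = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor and data.strip():
            self.filenames.append(data.strip())


def filenames_from_response(content_type: str, body: str) -> list[str]:
    """Parse a project page as JSON or HTML, depending on what was returned."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == PEP691_JSON:
        return [entry["filename"] for entry in json.loads(body)["files"]]
    # Anything else is treated as a PEP 503 HTML page.
    parser = _AnchorText()
    parser.feed(body)
    parser.close()
    return parser.filenames


# Hand-written sample bodies; NOT real registry responses.
JSON_BODY = json.dumps(
    {"meta": {"api-version": "1.0"}, "name": "smart-open",
     "files": [{"filename": "smart_open-7.0.4.tar.gz",
                "url": "https://files.example.invalid/smart_open-7.0.4.tar.gz",
                "hashes": {}}]}
)
HTML_BODY = (
    "<!DOCTYPE html><html><body>"
    '<a href="https://files.example.invalid/smart_open-7.0.4.tar.gz">'
    "smart_open-7.0.4.tar.gz</a><br/>"
    "</body></html>"
)

from_json = filenames_from_response(PEP691_JSON, JSON_BODY)
from_html = filenames_from_response("text/html; charset=utf-8", HTML_BODY)
```

Both branches have to be kept and tested for as long as any supported registry only speaks HTML.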
Related ticket with more technical info:
- https://github.com/devpi/devpi/issues/1018
FWIW, Dependabot is already broken on simple-only indexes. We're currently trying to use Dependabot with an internal Artifactory (with a simple index, no JSON) and Dependabot tries to use JSON. I think the request is coming from here:
https://github.com/dependabot/dependabot-core/blob/e5ec7e979/python/lib/dependabot/python/update_checker.rb#L270
But with config like this:
version: 2
registries:
  python-artifactory:
    type: python-index
    url: https://redacted-internal-server-name/artifactory/api/pypi/pypi/simple/
    replaces-base: true
updates:
  - package-ecosystem: "pip"
    # dependabot will not run without this
    # https://docs.github.com/en/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file#insecure-external-code-execution
    insecure-external-code-execution: allow
    directory: "/"
    registries:
      - python-artifactory
    schedule:
      interval: "daily"
We get errors like this in the log when running Dependabot on internal runners:
proxy | 2024/05/09 11***17***05 [021] GET https***//redacted-internal-server-name***443/artifactory/api/pypi/pypi/simple/smart-open/
2024/05/09 11***17***05 [021] 200 https***//redacted-internal-server-name***443/artifactory/api/pypi/pypi/simple/smart-open/
updater | 2024/05/09 11***17***05 INFO <job_826040134> Filtered out 2 pre-release versions
updater | 2024/05/09 11***17***05 INFO <job_826040134> Requirements to unlock own
2024/05/09 11***17***05 INFO <job_826040134> Requirements update strategy bump_versions
updater | 2024/05/09 11***17***05 INFO <job_826040134> Updating smart-open from 5.2.1 to 7.0.4
proxy | 2024/05/09 11***17***06 [023] GET https***//redacted-internal-server-name***443/pypi/smart-open/json
proxy | 2024/05/09 11***17***06 [023] 404 https***//redacted-internal-server-name***443/pypi/smart-open/json
Note those last two URLs. Dependabot is not respecting the config where we've told it the index URL to use; instead it is inventing its own path prefix (`/pypi/<package-name>/json` at the host root) for the JSON lookup. However, since this index is a simple-only index, the request would 404 even if Dependabot were using the correct path prefix.
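To spell out the difference between the two kinds of URL in the log, here is a small sketch. The first helper keeps the configured index path, which matches the successful `/simple/smart-open/` request; the second discards the path and appends `/pypi/<name>/json` at the host root, which reproduces the 404ing URL. Both helpers are hypothetical reconstructions from the log output, not Dependabot's actual code.

```python
import re
from urllib.parse import urlsplit, urlunsplit

INDEX_URL = "https://redacted-internal-server-name/artifactory/api/pypi/pypi/simple/"


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_project_url(index_url: str, name: str) -> str:
    """Project page under the configured index; the path prefix is preserved."""
    return index_url.rstrip("/") + "/" + normalize(name) + "/"


def host_root_json_url(index_url: str, name: str) -> str:
    """JSON URL built from scheme and host only; the index path is dropped."""
    parts = urlsplit(index_url)
    path = f"/pypi/{normalize(name)}/json"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


simple_url = simple_project_url(INDEX_URL, "smart_open")
json_url = host_root_json_url(INDEX_URL, "smart_open")
```

With the config above, `simple_url` stays under `/artifactory/api/pypi/pypi/simple/`, while `json_url` points at `/pypi/smart-open/json` on the bare host, matching the last two log lines.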