CEP idea: Standardized user agent strings
It's really tricky to get accurate usage or download data for specific package versions, platforms, etc. that could inform decision making (e.g. dropping osx-64).
Different conda clients use different user agent strings for their repodata and package downloads, so even if we could query those, we could only do so for conda requests. Other tools like mamba, pixi or rattler-build do not provide as much information. We would also need a mechanism that lets specific contexts extend the user agent with custom values (e.g. conda-forge might want to flag its internal CI jobs so they don't add noise to the real user data).
For example, in conda, the lack of a standard way to do so results in runtime patches like this.
I propose two things:
- Standardize which contents must be present in every user agent
- Recommend a mechanism to extend it with added value pairs (similar to pip's custom JSON user agent strings)
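To make the two points above concrete, here is a minimal sketch of a pip-style user agent: a `name/version` prefix followed by a compact JSON payload. The function name `build_user_agent`, the field names and the `extra` key are assumptions for illustration only, not an existing conda or pip API, and the set of required fields is exactly what the CEP would have to decide.

```python
import json
import platform


def build_user_agent(tool, version, extra=None):
    """Return '<tool>/<version> <compact JSON>' (hypothetical schema)."""
    # Fields every client would be required to send (illustrative choice).
    data = {
        "installer": {"name": tool, "version": version},
        "python": platform.python_version(),
        "system": {"name": platform.system(), "release": platform.release()},
        "cpu": platform.machine(),
    }
    # Context-specific additions, e.g. a flag for conda-forge CI jobs.
    if extra:
        data["extra"] = dict(extra)
    payload = json.dumps(data, separators=(",", ":"), sort_keys=True)
    return f"{tool}/{version} {payload}"


if __name__ == "__main__":
    print(build_user_agent("conda", "0.0.0", {"context": "conda-forge/ci"}))
```

Keeping the extension data under a single key means servers can parse the standardized part without knowing which custom values a given context adds.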
References:
- https://github.com/anaconda/anaconda-package-data/issues/64
- https://github.com/conda/infrastructure/issues/1018
- https://discuss.python.org/t/pre-pep-user-agent-schema-for-http-requests-against-remote-package-indices/104006
I believe pip has a mechanism to add some telemetry information, e.g. whether it runs in CI or not.
I checked the pip code, this is what they do:
https://github.com/pypa/pip/blob/7e49dca9277bf4e325b85cfb9ebe70401f194fb6/src/pip/_internal/network/session.py#L109
Yep, this bit in particular, for the CI detection heuristics. But that would also match legitimate CI usage, e.g. the testing pipelines of other projects.
What we want to say is "this is a conda-forge build job", so it would be passed explicitly to the build tool via e.g. --user-agent-data conda-forge/ci.
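The difference between the two approaches can be sketched as follows. The environment variable names, `looks_like_ci` and `append_user_agent_data` are illustrative assumptions, not pip's or conda's actual code. The heuristic can only answer "is this some CI?", while an explicit token says whose CI it is. The token check follows the `product/version` grammar for User-Agent product tokens.

```python
import os
import re

# Illustrative list of variables that CI systems commonly set (assumption).
CI_ENV_VARS = ("CI", "BUILD_ID", "BUILD_BUILDID")

# product = token ["/" product-version]; token excludes separators and spaces.
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
_PRODUCT = re.compile(rf"{_TOKEN}(/{_TOKEN})?")


def looks_like_ci(environ=None):
    """Heuristic: True for any CI, including other projects' pipelines."""
    env = os.environ if environ is None else environ
    return any(name in env for name in CI_ENV_VARS)


def append_user_agent_data(base, extra_tokens):
    """Append explicitly supplied product tokens to a user agent string."""
    for token in extra_tokens:
        if not _PRODUCT.fullmatch(token):
            raise ValueError(f"not a valid product token: {token!r}")
    return " ".join([base, *extra_tokens])


if __name__ == "__main__":
    print(looks_like_ci({"CI": "true"}))
    print(append_user_agent_data("conda/0.0.0", ["conda-forge/ci"]))
```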
FWIW, as long as we stay close to https://datatracker.ietf.org/doc/html/rfc1945#page-46, I don't mind. That said, it's totally normal to also define other request headers if we want to stuff more information into requests. This is what https://github.com/anaconda/conda-anaconda-telemetry is doing to avoid overloading the User-Agent header. Remember that headers have a maximum length: past a certain size, some servers ignore them, cut them off or even respond with a 500. IIRC @travishathaway did some digging into this when we built the HTTP version of conda-anaconda-telemetry.
Hm, true, I like that option too. And conda already has a headers plugin hook, apparently. So we "only" need to standardize some of those decisions there (e.g. the ; separator for fields seen in anaconda-telemetry).
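A rough sketch of what a dedicated header value could look like, assuming `;`-separated fields and a fixed size budget. The function name `build_info_header` and the 512-byte limit are made up for illustration. Real limits depend on the server, and this is not how conda-anaconda-telemetry is implemented.

```python
MAX_HEADER_VALUE_BYTES = 512  # assumed budget, not a standardized value


def build_info_header(fields, max_bytes=MAX_HEADER_VALUE_BYTES):
    """Join fields with ';', dropping trailing fields that exceed the budget."""
    parts = []
    size = 0
    for value in fields:
        if ";" in value:
            raise ValueError(f"field must not contain the separator: {value!r}")
        # Account for the ';' that precedes every field except the first.
        needed = len(value.encode("utf-8")) + (1 if parts else 0)
        if size + needed > max_bytes:
            break  # stop instead of sending an oversized header
        parts.append(value)
        size += needed
    return ";".join(parts)


if __name__ == "__main__":
    print(build_info_header(["conda-forge/ci", "osx-64"]))
```

Dropping whole fields at the client keeps the value parseable, whereas a server-side cutoff could split a field in the middle.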