Set User-Agent header for outgoing HTTP requests
Discussed in https://github.com/apache/jena/discussions/3139
Originally posted by Abbe98 April 22, 2025
Is there a way to configure Fuseki so that it sets a User-Agent header on outgoing requests?
Wikidata.org's SPARQL endpoint recently started enforcing its User-Agent policy. As a result, one is heavily rate limited when federating with Wikidata without a User-Agent header.
See also T402959 on the Wikimedia issue tracker. Enforcement of the user agent policy on WDQS has been temporarily stayed, but the stay is expected to be lifted at the start of 2026, so after that point, anyone who wants to federate with WDQS will need to use Jena 5.5.0 or later (if I read the above commit correctly).
Also, IMHO it would be useful if Fuseki allowed server operators to customize the user agent by which the server identifies itself. ApacheJena/5.5.0 is much better than nothing, but it’s still somewhat generic (depending on how common Apache Jena servers are, I guess ^^) and doesn’t allow WDQS to distinguish between different calling servers, and the policy advises against “generic agents” (even “pywikibot” is “likely to be somewhat vague”).
Hi @lucaswerkmeister,
For Jena, it is more than federation and more than Jena Fuseki.
Jena also produces a Java library (where it is possible to set the User-Agent) that could be used for direct calls to Wikidata. Jena also provides command line tools, and these tools are used by non-programmers as well as programmers.
What does Wikidata want the setting to be in the latter case?
The current (5.5.0) setting follows the commonly used format, at least as far as I could tell from searching the web.
IMHO it would make sense for each of the CLI tools to have a user agent like
rsparql ApacheJena/5.5.0
following the “space-separated components by decreasing specificity” format. (You could also include a URL / comment in parentheses, but in this case it doesn’t seem necessary to me – ApacheJena is straightforward to search for.) Or a single user agent for all the tools, like jena-cli ApacheJena/5.5.0, would probably also be fine.
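To make the suggested format concrete, here is a minimal Java sketch that assembles a User-Agent value from components ordered by decreasing specificity. The class `UserAgentFormat` and its methods are hypothetical helpers for illustration only; they are not part of Jena, and the tool names are just the examples from this comment.

```java
/** Hypothetical helper (not a Jena class): builds a User-Agent value, most specific component first. */
final class UserAgentFormat {
    private UserAgentFormat() {}

    /** Formats a product token such as "ApacheJena/5.5.0". */
    static String product(String name, String version) {
        return name + "/" + version;
    }

    /** Joins the non-blank components with single spaces, keeping the given order. */
    static String join(String... components) {
        StringBuilder sb = new StringBuilder();
        for (String c : components) {
            if (c == null || c.isBlank()) {
                continue; // skip missing components instead of emitting double spaces
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Per-tool agent and shared CLI agent, as suggested in the comment above.
        System.out.println(join("rsparql", product("ApacheJena", "5.5.0")));
        System.out.println(join("jena-cli", product("ApacheJena", "5.5.0")));
    }
}
```

With this shape, each CLI tool would only need to pass its own name as the first component.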
(This is not an official Wikimedia position ^^ if you want a more authoritative answer, I think you could try asking in #wikimedia-tech on Libera Chat, or open a Phabricator task.)
@lucaswerkmeister -- The User-Agent policy page focuses on bots so your response is helpful.
I'm not keen on including email addresses.
Yeah, I think that only makes sense for individual bots. (It would make sense, for instance, for someone writing a bot using the Jena libraries – you already mentioned it’s possible to set the User-Agent there.)
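For the individual-bot case, a rough sketch of what setting a descriptive User-Agent on a direct SPARQL request could look like. It deliberately uses only the JDK's `java.net.http` classes rather than Jena's own API, so nothing here should be read as how Jena does it. The bot name `example-bot/0.1` and the class `BotRequest` are made up for illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical example class: builds a SPARQL GET request that identifies the calling bot. */
final class BotRequest {
    private BotRequest() {}

    static HttpRequest sparqlGet(String endpoint, String query, String userAgent) {
        // SPARQL protocol: the query string is passed URL-encoded in the "query" parameter.
        String url = endpoint + "?query=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = sparqlGet(
                "https://query.wikidata.org/sparql",
                "SELECT * { ?s ?p ?o } LIMIT 1",
                "example-bot/0.1 ApacheJena/5.5.0"); // hypothetical bot name, most specific first
        // Only builds and prints the request; sending it would need an HttpClient.
        System.out.println(request.uri());
        System.out.println(request.headers().map());
    }
}
```

A bot operator could add a URL or contact detail in parentheses to the agent string if they want to be reachable.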
The user agent for SERVICE is currently hard-coded. It could be turned into yet another context property, configurable at the request / endpoint / dataset / global ARQ level.
https://github.com/apache/jena/blob/c8a32ea1ed06fabd4fdc6ebf7d5ed65fafb1e7ab/jena-arq/src/main/java/org/apache/jena/sparql/exec/http/Service.java#L230-L233
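As a rough illustration of the idea (not Jena code), the lookup could walk the scopes from most to least specific and fall back to the current default. Everything below is an assumption made for the sketch: the key name `example:serviceUserAgent`, the class `UserAgentLookup`, and the use of plain maps in place of ARQ contexts.

```java
import java.util.Map;

/** Hypothetical sketch: resolves a user agent from layered settings, most specific scope first. */
final class UserAgentLookup {
    /** Made-up key name; a real context property would be defined by Jena itself. */
    static final String KEY = "example:serviceUserAgent";
    /** The value Jena 5.5.0 sends today, per the discussion above. */
    static final String DEFAULT = "ApacheJena/5.5.0";

    private UserAgentLookup() {}

    /** Scopes are passed in order of precedence: request, endpoint, dataset, global. */
    @SafeVarargs
    static String resolve(Map<String, String>... scopes) {
        for (Map<String, String> scope : scopes) {
            if (scope == null) {
                continue; // scope not configured at all
            }
            String value = scope.get(KEY);
            if (value != null && !value.isBlank()) {
                return value; // first scope that defines the key wins
            }
        }
        return DEFAULT;
    }

    public static void main(String[] args) {
        Map<String, String> request = Map.of();
        Map<String, String> endpoint = Map.of(KEY, "my-fuseki.example ApacheJena/5.5.0");
        Map<String, String> dataset = Map.of();
        Map<String, String> global = Map.of(KEY, "site-wide-agent ApacheJena/5.5.0");
        System.out.println(resolve(request, endpoint, dataset, global));
    }
}
```

The point of the ordering is that a server operator could set one identifying agent globally and still override it for a single endpoint or request.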
A more invasive approach would be a SPARQL syntax extension for service options. This would have to go to sparql-dev. Some related discussions have taken place, but AFAIK no actionable items have been derived yet.
- Query Level from https://github.com/w3c/sparql-dev/issues/10#issuecomment-892768925
SELECT ?id
FROM NAMED <http://www.openlinksw.com/dataspace/[email protected]/weblog/[email protected]%27s%20BLOG%20%5B127%5D/sioc.ttl>
OPTION (get:soft "soft", get:method "GET")
WHERE { GRAPH ?g { ?id a ?o } }
LIMIT 10
- Service level (further discussion would be whether options could originate from bindings):
SELECT * { SERVICE[userAgent: "my-bot"] <https://query.wikidata.org/sparql> { ?s ?p ?o } }