Empty response (JSONDecodeError) when sending many requests in a row

Open marccarre opened this issue 3 years ago • 3 comments

Version

0.7.0

Problem

When sending several (10 to 100) requests in a row, some requests fail nondeterministically with the following error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Upon closer investigation, the actual response is a 429,

  • with "reason": Too many requests. Please comply with the User-Agent policy to get a higher rate limit: https://meta.wikimedia.org/wiki/User-Agent_policy, and
  • with the following "body":
<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8">
<title>Wikimedia Error</title>
<style>
	* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { float: left; margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
.content-text { overflow: hidden; overflow-wrap: break-word; word-wrap: break-word; -webkit-hyphens: auto; -moz-hyphens: auto; -ms-hyphens: auto; hyphens: auto; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645ad; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
.text-muted { color: #777; }
</style>
<div class="content" role="main">
	<a href="https://www.wikimedia.org"><img src="https://www.wikimedia.org/static/images/wmf-logo.png" srcset="https://www.wikimedia.org/static/images/wmf-logo-2x.png 2x" alt="Wikimedia" width="135" height="101">
</a>
<h1>Error</h1>
<div class="content-text">
	<p>Our servers are currently under maintenance or experiencing a technical problem.

	Please <a href="" title="Reload this page" onclick="window.location.reload(false); return false">try again</a> in a few&nbsp;minutes.</p>

<p>See the error message at the bottom of this page for more&nbsp;information.</p>
</div>
</div>
<div class="footer"><p>If you report this error to the Wikimedia System Administrators, please include the details below.</p><p class="text-muted"><code>Request from 122.216.10.145 via cp5012 cp5012, Varnish XID 477962109<br>Upstream caches: cp5012 int<br>Error: 429, Too many requests. Please comply with the User-Agent policy to get a higher rate limit: https://meta.wikimedia.org/wiki/User-Agent_policy at Sun, 17 Jul 2022 22:28:20 GMT</code></p>
</div>
</html>
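
The JSONDecodeError arises because an HTML error page like the one above is handed to a JSON parser. Below is a minimal sketch of a guard that checks the status code and content type before decoding. The names `decode_json_response` and `RateLimitedError` are hypothetical and are not part of the wikidata library:

```python
import json


class RateLimitedError(Exception):
    """Hypothetical error raised when the server answers with HTTP 429."""


def decode_json_response(status: int, content_type: str, body: bytes):
    """Decode a response body as JSON, surfacing HTTP errors explicitly."""
    if status == 429:
        raise RateLimitedError("HTTP 429: too many requests")
    if status != 200:
        raise RuntimeError(f"unexpected HTTP status {status}")
    # An HTML error page would otherwise fail with
    # "Expecting value: line 1 column 1 (char 0)".
    if "json" not in content_type.lower():
        raise ValueError(f"expected a JSON response, got {content_type!r}")
    return json.loads(body.decode("utf-8"))
```

With a check like this, a rate-limited request fails with an error that names the real cause instead of a decoding error.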

Root cause

This library doesn't follow Wikimedia's user-agent policy, specifically:

<client name>/<version> (<contact information>) <library/framework name>/<version> [<library name>/<version> ...]. Parts that are not applicable can be omitted.

which leads to temporary rate limiting/blacklisting of the agent:

Requests from disallowed user agents may instead encounter a less helpful error message like this: Our servers are currently experiencing a technical problem. Please try again in a few minutes.

See also: https://meta.wikimedia.org/wiki/User-Agent_policy

Solution

Set a User-Agent header compliant with the above policy, e.g.:

>>> import urllib.request
>>> od = urllib.request.OpenerDirector()
>>> od.addheaders
[('User-agent', 'Python-urllib/3.9')]
>>> 
>>> import wikidata
>>> wikidata.__version__
'0.7.0'
>>> 
>>> import sys
>>> # addheaders must be a list of (name, value) tuples, not a dict
>>> od.addheaders = [
...     ("Accept", "application/sparql-results+json"),
...     ("User-Agent", "wikidata-based-bot/%s (https://github.com/dahlia/wikidata ; [email protected]) python/%s.%s.%s Wikidata/%s" % (wikidata.__version__, sys.version_info.major, sys.version_info.minor, sys.version_info.micro, wikidata.__version__)),
... ]
>>> 
>>> od.addheaders
[('Accept', 'application/sparql-results+json'), ('User-Agent', 'wikidata-based-bot/0.7.0 (https://github.com/dahlia/wikidata ; [email protected]) python/3.9.13 Wikidata/0.7.0')]
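
A bare `OpenerDirector()` has no handlers installed, so it cannot send requests on its own. A sketch of the same idea with `urllib.request.build_opener()` might look like the following. The helper `build_user_agent`, the client name `my-wikidata-bot`, and the contact address are placeholders for illustration only:

```python
import sys
import urllib.request


def build_user_agent(client: str, version: str, contact: str) -> str:
    """Format a User-Agent as <client>/<version> (<contact>) <library>/<version>."""
    python_version = "%d.%d.%d" % sys.version_info[:3]
    return f"{client}/{version} ({contact}) python/{python_version}"


# build_opener() installs the default HTTP(S) handlers,
# unlike a bare OpenerDirector().
opener = urllib.request.build_opener()
opener.addheaders = [
    ("Accept", "application/sparql-results+json"),
    # Placeholder client name and contact address; substitute your own.
    ("User-Agent", build_user_agent("my-wikidata-bot", "0.1", "bot-owner@example.org")),
]
```

Calling `opener.open(url)` then sends both headers with each request.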

marccarre avatar Jul 17 '22 23:07 marccarre

By the way, doesn't Wikidata provide rate limit header fields? If it does, we could intelligently control the request rate from the client side.

dahlia avatar Jul 18 '22 03:07 dahlia

I looked for this too but, unless I missed something, couldn't find any such header in the response:

{'accept-ch': 'Sec-CH-UA-Arch,Sec-CH-UA-Bitness,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Model,Sec-CH-UA-Platform-Version',
 'accept-ranges': 'bytes',
 'access-control-allow-origin': '*',
 'age': '1',
 'cache-control': 'public, max-age=300',
 'content-encoding': 'gzip',
 'content-type': 'application/sparql-results+json;charset=utf-8',
 'date': 'Fri, 22 Jul 2022 09:33:06 GMT',
 'nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, '
        '"success_fraction": 0.0}',
 'permissions-policy': 'interest-cohort=(),ch-ua-arch=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-bitness=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-full-version-list=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-model=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-platform-version=(self '
                       '"intake-analytics.wikimedia.org")',
 'report-to': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": '
              '"https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" '
              '}] }',
 'server': 'nginx/1.14.2',
 'server-timing': 'cache;desc="pass", host;desc="cp5008"',
 'set-cookie': 'WMF-Last-Access=22-Jul-2022;Path=/;HttpOnly;secure;Expires=Tue, '
               '23 Aug 2022 00:00:00 GMT, '
               'WMF-Last-Access-Global=22-Jul-2022;Path=/;Domain=.wikidata.org;HttpOnly;secure;Expires=Tue, '
               '23 Aug 2022 00:00:00 GMT',
 'strict-transport-security': 'max-age=106384710; includeSubDomains; preload',
 'transfer-encoding': 'chunked',
 'vary': 'Accept, Accept-Encoding',
 'x-cache': 'cp5009 miss, cp5008 pass',
 'x-cache-status': 'pass',
 'x-client-ip': '***.***.***.***',
 'x-first-solution-millis': '48',
 'x-served-by': 'wdqs2001'}

There seems to be only this recommendation on the request rate: https://wikitech.wikimedia.org/wiki/Robot_policy#Request_rate
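
Without documented rate limit headers, one client-side option is to retry on 429 with exponential backoff. The sketch below also honors a `Retry-After` value when one is supplied. Whether the 429 responses actually carry that header is an assumption, since the headers above come from a successful response. `backoff_delay` and `fetch_with_retry` are hypothetical names, and `fetch` stands in for whatever function performs the request:

```python
import time
from typing import Callable, Optional, Tuple

# A fetch function returns (status, Retry-After header value or None, body).
Fetch = Callable[[], Tuple[int, Optional[str], bytes]]


def backoff_delay(attempt: int, retry_after: Optional[str], base: float = 1.0) -> float:
    """Seconds to wait: Retry-After if given in seconds, else exponential backoff."""
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    return base * (2 ** attempt)


def fetch_with_retry(fetch: Fetch, max_attempts: int = 5,
                     sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Call fetch() until it stops returning HTTP 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, retry_after, body = fetch()
        if status != 429:
            return body
        # Wait before the next attempt, but not after the final one.
        if attempt + 1 < max_attempts:
            sleep(backoff_delay(attempt, retry_after))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting.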

marccarre avatar Jul 22 '22 09:07 marccarre

Thank you for letting me know that! 🙏🏼

dahlia avatar Jul 22 '22 12:07 dahlia