extraction-framework 429 Too many requests

Hi,

I have configured https://github.com/dbpedia/marvin-config to extract the German Wikipedia. A first run worked for the 20220401 dump.

Today I ran it again to extract the 20220601 dump, but the extraction framework only partly worked: after some time, https://de.wikipedia.org/w/api.php returned only HTTP 429.

Exception; de; Main Extraction at 00:00.957s for 62 datasets; Main Extraction failed for instance http://de.dbpedia.org/resource/Liste_von_Autoren/J: Server returned HTTP response code: 429 for URL: https://de.wikipedia.org/w/api.php
java.io.IOException: Server returned HTTP response code: 429 for URL: https://de.wikipedia.org/w/api.php
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1902)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
    at org.dbpedia.extraction.util.MediaWikiConnector$$anonfun$retrievePage$1.apply$mcVI$sp(MediaWikiConnector.scala:97)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
    ...

I used the following settings in extractionConfiguration/extraction.de.properties:

mwc-apiUrl=https://{{LANG}}.wikipedia.org/w/api.php
mwc-maxRetries=5
mwc-connectMs=4000
mwc-readMs=30000
mwc-sleepFactor=2000

It seems the extraction-framework does not handle this HTTP error properly. It would be great if the Retry-After HTTP header were used to handle such errors. Any suggestions on which properties to adjust for this problem?
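To illustrate what honoring Retry-After could look like, here is a minimal sketch. It is written in Python rather than the framework's Scala, and it is not the MediaWikiConnector code. The names retry_delay_seconds, fetch_with_retry and base_seconds are hypothetical and do not correspond to the mwc-* properties above. The idea is to wait as long as the server asks (Retry-After may be a number of seconds or an HTTP date) and to fall back to exponential backoff when the header is missing or unreadable.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(retry_after, attempt, base_seconds=2.0, now=None):
    """Return how many seconds to wait before the next attempt.

    retry_after is the raw Retry-After header value (or None).
    attempt is the zero-based index of the attempt that just failed.
    base_seconds is a hypothetical backoff base, used only as a fallback.
    """
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():
            # Form 1: delay in seconds, e.g. "Retry-After: 120"
            return float(value)
        try:
            # Form 2: HTTP date, e.g. "Wed, 08 Jun 2022 10:00:30 GMT"
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return max(0.0, (when - now).total_seconds())
    # No usable header: exponential backoff (base, 2*base, 4*base, ...)
    return base_seconds * (2 ** attempt)


def fetch_with_retry(url, max_retries=5, timeout=30):
    """Fetch url, retrying on HTTP 429/503 and honoring Retry-After."""
    for attempt in range(max_retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on other status codes or once retries are exhausted.
            if err.code not in (429, 503) or attempt == max_retries:
                raise
            delay = retry_delay_seconds(err.headers.get("Retry-After"), attempt)
            time.sleep(delay)
```

The delay calculation is kept separate from the network call so it can be checked without contacting the server.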

Jun 08 '22 10:06 uleodolter

Hi, we are currently reworking the abstract extraction.

Jun 22 '22 08:06 jlareck

Any updates on this, or a workaround? The extraction of the German Wikipedia worked only once, in April 2022.

Oct 20 '22 18:10 uleodolter

Hi, yes, we have some updates on text extraction. This summer we had a Google Summer of Code project in which a student upgraded the text extraction. It improved: the number of 429 errors went down, although the text extraction process still sometimes freezes partway through. All the related work is in this branch: https://github.com/dbpedia/extraction-framework/tree/celian-gsoc .

During this GSoC project, two new MediaWiki connectors were implemented, based on the previous one:

https://github.com/dbpedia/extraction-framework/blob/celian-gsoc/core/src/main/scala/org/dbpedia/extraction/util/MediawikiConnectorConfigured.scala - this connector uses the same MediaWiki API we have always used, but with some new configuration options added, which reduced the number of HTTP 429 errors. However, the extraction sometimes does not complete: after roughly 70-95% of the pages from the dump have been extracted, the process simply freezes. (I am not completely sure about these numbers, but when we tested it and compared against the datasets from previous releases, the number of extracted pages looked almost the same.) I recommend running the extraction for only one language per process (list just one language in the extraction.text.properties file).

https://github.com/dbpedia/extraction-framework/blob/celian-gsoc/core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnectorRest.scala - this connector uses the new MediaWiki REST API. It still has the same problem: the process freezes during extraction.

Oct 29 '22 18:10 jlareck