Wikipedia scrape dies because of upstream HTTP 504 errors (was: Timeout is reported as HTTP 504)

Open kelson42 opened this issue 3 years ago • 7 comments

$ cat articles
Список_угрожаемых_видов_цветковых_растений
$ mw --mwUrl="https://ru.wikipedia.org/" --articleList=articles
[log] [2021-08-12T07:13:55.452Z] Successfully logged in S3
[log] [2021-08-12T07:13:55.613Z] closing sanitize redis DB
[log] [2021-08-12T07:13:55.616Z] Starting mwoffliner v1.11.8...
[log] [2021-08-12T07:13:56.529Z] Successfully logged in S3
[log] [2021-08-12T07:13:56.533Z] Getting text direction...
[log] [2021-08-12T07:13:56.534Z] Getting site info...
[log] [2021-08-12T07:13:56.535Z] Getting sub-title...
(node:681934) Warning: Setting the NODE_TLS_REJECT_UNAUTHORIZED environment variable to '0' makes TLS connections and HTTPS requests insecure by disabling certificate verification.
[log] [2021-08-12T07:13:56.767Z] Text direction is [ltr]
[log] [2021-08-12T07:14:00.120Z] Base Url:  https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/
[log] [2021-08-12T07:14:00.120Z] Base Url for Main Page:  https://ru.wikipedia.org/api/rest_v1/page/html/
[log] [2021-08-12T07:14:00.127Z] Using output directory /home/kelson/code/mwoffliner/out
[log] [2021-08-12T07:14:00.129Z] Using temporary directory /dev/shm/mwoffliner-1628752440127
[log] [2021-08-12T07:14:00.358Z] Worker [0] getting article range [0-1] of [1] [100%]
[log] [2021-08-12T07:14:00.610Z] Total articles found in Redis: 1
[log] [2021-08-12T07:14:00.620Z] Doing dump
[log] [2021-08-12T07:14:00.621Z] Writing zim to [/home/kelson/code/mwoffliner/out/wikipedia_ru_articles_2021-08.zim]
[log] [2021-08-12T07:14:01.092Z] Found [3] stylesheets to download
[log] [2021-08-12T07:14:01.092Z] Downloading stylesheets and populating media queue
[log] [2021-08-12T07:14:01.340Z] Downloaded stylesheets
[log] [2021-08-12T07:14:01.341Z] Saving favicon.png...
[log] [2021-08-12T07:14:02.051Z] Getting Main Page
[log] [2021-08-12T07:14:02.051Z] Creating main page...
[log] [2021-08-12T07:14:02.055Z] Getting articles
	

^C[log] [2021-08-12T07:22:25.880Z] SIGINT
[log] [2021-08-12T07:22:25.881Z] Flushing Redis DBs
[log] [2021-08-12T07:22:25.882Z] Exiting with code [130]
[log] [2021-08-12T07:22:25.882Z] Deleting temporary directory [/dev/shm/mwoffliner-1628752440127]
kelson@camber:~/code/mwoffliner$ mw --mwUrl="https://ru.wikipedia.org/" --articleList=articles --verbose
[log] [2021-08-12T07:22:36.138Z] Successfully logged in S3
[log] [2021-08-12T07:22:36.321Z] closing sanitize redis DB
[log] [2021-08-12T07:22:36.324Z] Starting mwoffliner v1.11.8...
[info] [2021-08-12T07:22:36.327Z] Using custom flavour: no
[log] [2021-08-12T07:22:38.580Z] Successfully logged in S3
[log] [2021-08-12T07:22:38.584Z] Getting text direction...
[log] [2021-08-12T07:22:38.585Z] Getting site info...
[log] [2021-08-12T07:22:38.585Z] Getting sub-title...
[info] [2021-08-12T07:22:38.586Z] Downloading [https://ru.wikipedia.org/wiki/]
[info] [2021-08-12T07:22:38.587Z] Getting JSON from [https://ru.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json&siprop=general|namespaces|statistics|variables|category|wikidesc]
[info] [2021-08-12T07:22:38.587Z] Downloading [https://ru.wikipedia.org/wiki/]
(node:684743) Warning: Setting the NODE_TLS_REJECT_UNAUTHORIZED environment variable to '0' makes TLS connections and HTTPS requests insecure by disabling certificate verification.
[log] [2021-08-12T07:22:38.808Z] Text direction is [ltr]
[info] [2021-08-12T07:22:41.200Z] Getting JSON from [https://ru.wikipedia.org/w/api.php?action=query&format=json&prop=redirects%7Crevisions%7Ccoordinates&rdlimit=max&rdnamespace=]
[log] [2021-08-12T07:22:41.435Z] Base Url:  https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/
[log] [2021-08-12T07:22:41.435Z] Base Url for Main Page:  https://ru.wikipedia.org/api/rest_v1/page/html/
[log] [2021-08-12T07:22:41.442Z] Using output directory /home/kelson/code/mwoffliner/out
[info] [2021-08-12T07:22:41.442Z] Creating temporary directory [/dev/shm/mwoffliner-1628752961442]
[log] [2021-08-12T07:22:41.443Z] Using temporary directory /dev/shm/mwoffliner-1628752961442
[info] [2021-08-12T07:22:41.443Z] ArticleList has [1] items
[info] [2021-08-12T07:22:41.444Z] Getting JSON from [https://ru.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=json]
[info] [2021-08-12T07:22:41.668Z] Getting article ids
[log] [2021-08-12T07:22:41.670Z] Worker [0] getting article range [0-1] of [1] [100%]
[info] [2021-08-12T07:22:41.671Z] Getting JSON from [https://ru.wikipedia.org/w/api.php?action=query&format=json&prop=redirects%7Crevisions%7Cpageimages%7Ccoordinates&rdlimit=max&rdnamespace=0&redirects=true&titles=%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9&colimit=max]
[log] [2021-08-12T07:22:41.921Z] Total articles found in Redis: 1
[warn] [2021-08-12T07:22:41.922Z] Couldn't find strings file for [ru], falling back to [en]
[log] [2021-08-12T07:22:41.923Z] Doing dump
[log] [2021-08-12T07:22:41.924Z] Writing zim to [/home/kelson/code/mwoffliner/out/wikipedia_ru_articles_2021-08.zim]
[info] [2021-08-12T07:22:41.983Z] Copying Static Resource Files
[info] [2021-08-12T07:22:41.989Z] Finding stylesheets to download
[info] [2021-08-12T07:22:41.990Z] Downloading [https://ru.wikipedia.org/wiki/]
[log] [2021-08-12T07:22:42.424Z] Found [3] stylesheets to download
[log] [2021-08-12T07:22:42.424Z] Downloading stylesheets and populating media queue
[info] [2021-08-12T07:22:42.427Z] Downloading CSS from https://ru.wikipedia.org/w/load.php?lang=ru&modules=ext.flaggedRevs.basic%2Cicons|ext.uls.interlanguage|ext.visualEditor.desktopArticleTarget.noscript|ext.wikimediaBadges|jquery.makeCollapsible.styles|mediawiki.ui.button|mediawiki.widgets.styles|oojs-ui-core.icons%2Cstyles|oojs-ui.styles.indicators|skins.vector.styles.legacy&only=styles&skin=vector
[info] [2021-08-12T07:22:42.428Z] Downloading CSS from https://ru.wikipedia.org/w/load.php?lang=ru&modules=site.styles&only=styles&skin=vector
[info] [2021-08-12T07:22:42.428Z] Downloading CSS from https://ru.wikipedia.org/wiki/?title=Mediawiki%3Aoffline.css&action=raw
[info] [2021-08-12T07:22:42.428Z] Downloading [https://ru.wikipedia.org/w/load.php?lang=ru&modules=ext.flaggedRevs.basic%2Cicons%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cjquery.makeCollapsible.styles%7Cmediawiki.ui.button%7Cmediawiki.widgets.styles%7Coojs-ui-core.icons%2Cstyles%7Coojs-ui.styles.indicators%7Cskins.vector.styles.legacy&only=styles&skin=vector]
[info] [2021-08-12T07:22:42.429Z] Downloading [https://ru.wikipedia.org/w/load.php?lang=ru&modules=site.styles&only=styles&skin=vector]
[info] [2021-08-12T07:22:42.430Z] Downloading [https://ru.wikipedia.org/wiki/?title=Mediawiki%3Aoffline.css&action=raw]
[log] [2021-08-12T07:22:42.671Z] Downloaded stylesheets
[log] [2021-08-12T07:22:42.673Z] Saving favicon.png...
[info] [2021-08-12T07:22:42.674Z] Getting JSON from [https://ru.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json]
[info] [2021-08-12T07:22:42.924Z] Downloading [http://ru.wikipedia.org/static/images/project-logos/ruwiki.png]
[log] [2021-08-12T07:22:43.843Z] Getting Main Page
[log] [2021-08-12T07:22:43.844Z] Creating main page...
[log] [2021-08-12T07:22:43.847Z] Getting articles
[info] [2021-08-12T07:22:43.849Z] Worker [0] processing batch of article ids [["Список_угрожаемых_видов_цветковых_растений"]]
[info] [2021-08-12T07:22:43.849Z] Getting article [Список_угрожаемых_видов_цветковых_растений] from https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9
[info] [2021-08-12T07:22:43.849Z] Getting JSON from [https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9]
[info] [2021-08-12T07:24:01.540Z] [backoff] #0 after 100 ms
[info] [2021-08-12T07:24:01.643Z] Getting JSON from [https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9]
[info] [2021-08-12T07:25:06.838Z] [backoff] #1 after 200 ms
[info] [2021-08-12T07:25:07.039Z] Getting JSON from [https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9]
[info] [2021-08-12T07:26:12.631Z] [backoff] #2 after 400 ms
[info] [2021-08-12T07:26:13.031Z] Getting JSON from [https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9]
[info] [2021-08-12T07:27:18.224Z] [backoff] #3 after 800 ms
[info] [2021-08-12T07:27:19.025Z] Getting JSON from [https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9]
[info] [2021-08-12T07:28:24.217Z] [backoff] #4 after 1600 ms
[info] [2021-08-12T07:28:25.818Z] Getting JSON from [https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9]
[info] [2021-08-12T07:29:31.011Z] [backoff] #5 after 3200 ms
[info] [2021-08-12T07:29:34.211Z] Getting JSON from [https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9]
[info] [2021-08-12T07:30:39.409Z] [backoff] #6 after 6400 ms
[info] [2021-08-12T07:30:45.815Z] Getting JSON from [https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9]
[warn] [2021-08-12T07:31:51.013Z] Failed to get [https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9] [status=504]
[error] [2021-08-12T07:31:51.014Z] Error downloading article Список_угрожаемых_видов_цветковых_растений
Failed to run mwoffliner after [556s]: {
	"message": "Request failed with status code 504",
	"name": "Error",
	"stack": "Error: Request failed with status code 504\n    at createError (/home/kelson/code/mwoffliner/node_modules/axios/lib/core/createError.js:16:15)\n    at settle (/home/kelson/code/mwoffliner/node_modules/axios/lib/core/settle.js:17:12)\n    at IncomingMessage.handleStreamEnd (/home/kelson/code/mwoffliner/node_modules/axios/lib/adapters/http.js:260:11)\n    at IncomingMessage.emit (events.js:326:22)\n    at IncomingMessage.EventEmitter.emit (domain.js:483:12)\n    at endReadableNT (_stream_readable.js:1241:12)\n    at processTicksAndRejections (internal/process/task_queues.js:84:21)",
	"config": {
		"url": "https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9",
		"method": "get",
		"headers": {
			"Accept": "application/json",
			"cache-control": "public, max-stale=86400",
			"accept-encoding": "gzip, deflate",
			"user-agent": "MWOffliner/HEAD ([email protected])",
			"cookie": ""
		},
		"transformRequest": [
			null
		],
		"transformResponse": [
			null
		],
		"timeout": 120000,
		"responseType": "json",
		"xsrfCookieName": "XSRF-TOKEN",
		"xsrfHeaderName": "X-XSRF-TOKEN",
		"maxContentLength": -1,
		"maxBodyLength": -1
	}
}


**********

Request failed with status code 504

**********


[log] [2021-08-12T07:31:51.016Z] Exiting with code [2]
[log] [2021-08-12T07:31:51.016Z] Deleting temporary directory [/dev/shm/mwoffliner-1628752961442]

kelson42 avatar Aug 12 '21 07:08 kelson42
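
The `[backoff]` lines in the verbose log show the wait doubling from 100 ms to 6400 ms over seven retries. A minimal sketch of that schedule follows, assuming plain doubling from a 100 ms base as the log suggests. The function name and parameters are hypothetical and not taken from the mwoffliner source.

```python
from typing import List


def backoff_delays(base_ms: int = 100, retries: int = 7) -> List[int]:
    """Return the wait before each retry, doubling from base_ms.

    Hypothetical helper: the values mirror the '[backoff] #N after X ms'
    lines in the log, not the actual mwoffliner implementation.
    """
    return [base_ms * 2 ** attempt for attempt in range(retries)]


delays = backoff_delays()
total_ms = sum(delays)
print(delays)    # [100, 200, 400, 800, 1600, 3200, 6400]
print(total_ms)  # 12700
```

The retries add only about 12.7 s of waiting in total. Going by the log timestamps, each attempt itself comes back after roughly 65 to 78 seconds, which is well under the 120000 ms client timeout shown in the request config.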

@MananJethwani Any idea how an upstream timeout is transformed into an HTTP 504 error?

kelson42 avatar Aug 12 '21 07:08 kelson42

@kelson42 This looks like an upstream bug to me. Should I open a ticket? The connection succeeds on our side, but the REST API still fails to send the response. Should I handle it separately and let it pass for now?

MananJethwani avatar Aug 14 '21 09:08 MananJethwani

The bug report is not about the timeout itself; it is about a timeout being reported as an HTTP 504 response code. A timeout should be reported as a timeout. In my test, I get no response, so it is a connection timeout.

kelson42 avatar Aug 14 '21 11:08 kelson42

@kelson42 It's not a timeout; it's a 504 response from upstream. If you go to https://ru.wikipedia.org/api/rest_v1/ and enter Список угрожаемых видов цветковых растений as the title for mobile-sections, you will see that it returns 504 there as well.

MananJethwani avatar Aug 14 '21 13:08 MananJethwani
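
The two cases debated in this thread (the server answering with a 504 versus the client giving up without any answer) can be told apart in code. The sketch below uses only the Python standard library. The function names and message strings are hypothetical, and it does not reflect how mwoffliner or axios handle errors internally.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc: BaseException) -> str:
    """Label a failed fetch as an upstream HTTP error or a local timeout."""
    # HTTPError means the server did answer, just with an error status.
    # It must be checked first because it is a subclass of URLError.
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 504:
            return "upstream returned HTTP 504 (gateway timeout)"
        return f"upstream returned HTTP {exc.code}"
    # A timeout raised locally means no complete response arrived in time.
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "client-side timeout (no response received)"
    # URLError can wrap a timeout in its .reason attribute.
    if isinstance(exc, urllib.error.URLError) and isinstance(
        exc.reason, (socket.timeout, TimeoutError)
    ):
        return "client-side timeout (no response received)"
    return f"other failure: {exc!r}"


def fetch_status(url: str, timeout_s: float = 120.0) -> str:
    """Try a GET and describe the outcome.

    timeout_s defaults to 120 s to mirror the 120000 ms in the logged config.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return f"HTTP {resp.status}"
    except Exception as exc:  # broad on purpose: we only want a label
        return classify_failure(exc)
```

Calling `fetch_status` on the failing mobile-sections URL would be expected to report the 504 case for as long as upstream keeps answering that way. A genuine client-side timeout would only be reported if no response arrived within `timeout_s`.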

@MananJethwani You are right:

$ curl --connect-timeout 120 -I "https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9"
HTTP/2 504 
content-length: 24
content-type: text/plain
date: Sat, 14 Aug 2021 14:49:24 GMT
server: envoy
age: 67
x-cache: cp3054 miss, cp3054 miss
x-cache-status: miss
server-timing: cache;desc="miss", host;desc="cp3054"
strict-transport-security: max-age=106384710; includeSubDomains; preload
report-to: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
nel: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
permissions-policy: interest-cohort=()
set-cookie: WMF-Last-Access=14-Aug-2021;Path=/;HttpOnly;secure;Expires=Wed, 15 Sep 2021 12:00:00 GMT
set-cookie: WMF-Last-Access-Global=14-Aug-2021;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 15 Sep 2021 12:00:00 GMT
x-client-ip: 2a02:168:6008:0:1592:e08c:305a:2fd8
set-cookie: GeoIP=CH:ZH:Zurich:47.37:8.57:v4; Path=/; secure; Domain=.wikipedia.org

kelson42 avatar Aug 14 '21 14:08 kelson42

Here is the upstream ticket https://phabricator.wikimedia.org/T288889

kelson42 avatar Aug 14 '21 14:08 kelson42

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Nov 09 '21 20:11 stale[bot]