mwoffliner
Request failed with status code 400
For https://en.wikipedia.org/api/rest_v1/page/mobile-sections/I%25CC%2587znik
see https://farm.openzim.org/pipeline/7adfa713c4bff5e9ce378a06/debug
It seems we request a bad title; the JSON response says:
type | "https://mediawiki.org/wiki/HyperSwitch/errors/bad_request"
method | "get"
detail | "title-invalid-characters"
The problem is with the article title İznik. If this is put in an article list, then the scraper dies because a wrongly encoded string seems to be sent to the API. IMO this is not a regression and the problem has always been there, but in the past we were not checking the API HTTP response code properly and the article was simply not mirrored at all; it does indeed seem to be missing from the old Wikipedia 0.8 ZIM files. So I guess we are breaking the encoding of the title somewhere before requesting the HTML, possibly when we retrieve meta information such as redirects.
@MananJethwani Would you please have a look at this one as well? It is easy to reproduce and I'm sure you will quickly find out where the problem occurs. This is actually quite a serious problem, because more than one Zimfarm recipe dies because of it.
looks like we encode it twice!!
@kelson42 looks like we receive the articleIDs encoded from the MediaWiki side, so we don't need to encode them again while fetching.
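For illustration, here is a minimal TypeScript sketch of the double encoding. It assumes the title arrives in decomposed form (`I` followed by U+0307), which is what the failing URL above suggests. `REST_BASE` and `articleUrl` are hypothetical names for this example, not mwoffliner's actual code.

```typescript
// Hypothetical sketch, not mwoffliner's actual code.
const REST_BASE = 'https://en.wikipedia.org/api/rest_v1/page/mobile-sections/';

// "İznik" in decomposed form: capital I followed by U+0307 (combining dot above).
const decomposedTitle = 'I\u0307znik';

// Encoding once gives the form the REST API expects.
const encodedOnce = encodeURIComponent(decomposedTitle); // "I%CC%87znik"

// Encoding an already encoded ID escapes every "%" as "%25".
const encodedTwice = encodeURIComponent(encodedOnce); // "I%25CC%2587znik"

// Builds the request URL; the caller states whether the ID is already encoded,
// so the percent-encoding step is applied exactly once.
function articleUrl(articleId: string, alreadyEncoded: boolean): string {
  return REST_BASE + (alreadyEncoded ? articleId : encodeURIComponent(articleId));
}
```

`encodedTwice` matches the path segment of the URL that returned the 400, which is consistent with the ID being encoded a second time at fetch time.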
@MananJethwani Your fix has improved the situation, but I still have a failing scenario here: https://farm.openzim.org/pipeline/03264e29e2116ecec91f8f06/debug
@kelson42 this is strange: %C2%AD does not map to any visible character. Does this mean we are encoding some kind of empty line? And even if we are, why is it present in Wikipedia?
Most probably this is a problem on the MediaWiki side: the page exists at https://dty.wikipedia.org/wiki/%C2%AD but when we try to fetch it through the REST API using the URI https://dty.wikipedia.org/api/rest_v1/page/mobile-sections/%C2%AD we get a 400 response.
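For reference, `%C2%AD` is the UTF-8 percent-encoding of U+00AD (soft hyphen), an invisible character, which would explain why the title looks empty. Here is a small TypeScript sketch for making such characters visible when inspecting an article list. `codeUnits` is a hypothetical helper for this example, not part of mwoffliner.

```typescript
// Hypothetical helper: lists each UTF-16 code unit of a string as U+XXXX,
// so invisible characters such as the soft hyphen show up in logs.
function codeUnits(text: string): string[] {
  const out: string[] = [];
  for (let i = 0; i < text.length; i++) {
    const hex = text.charCodeAt(i).toString(16).toUpperCase();
    out.push('U+' + ('0000' + hex).slice(-4));
  }
  return out;
}

// Decode the title from the failing URL and inspect it.
const decodedTitle = decodeURIComponent('%C2%AD');
const units = codeUnits(decodedTitle); // ["U+00AD"]
```

The decoded title is a single code unit, so the URL itself is validly encoded; that the REST API rejects such a title is consistent with the 400 response, though the exact server-side rule is not confirmed here.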
WP0.8 is broken again after this https://github.com/openzim/mwoffliner/pull/1521/files#diff-9a83f0d6b6913493f3382285626a8799d767b06b0c309e56d611014e9d05eea4L121. We need to better understand what is going on here.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
It seems it was some kind of weird encoding in the article list. I have fixed it.