wikiteam
Error loop "XML for ... is wrong"
From [email protected] on July 12, 2011 20:21:47
Apparently this error is quite frequent with some characters. It starts a never-ending loop; see e.g. http://p.defau.lt/?vUHNXKoaCOfNkeor_0HmCg
I removed that title from the title list and resumed the dump; the following pages were not downloaded, perhaps because they were invalid: http://p.defau.lt/?KeDck2rQZqGlp9MWmYmB_Q
Could those (invalid?) titles be the actual cause of the error?
Original issue: http://code.google.com/p/wikiteam/issues/detail?id=26
From [email protected] on July 12, 2011 13:02:17
I don't think it is due to weird chars.
Look at the edit history for that page (is it a big history?) and try to export it by hand using Special:Export (and open the resulting XML file). Is everything OK?
Status: New
From [email protected] on July 13, 2011 12:17:23
I tried an export with "Être et Temps", then with "Être et Temps" plus all following titles, then with only the following titles. If "Être et Temps" is included, the XML is invalid. The history has only one big revision (1,389,912 bytes): http://www.wikinfo.org/index.php?title=%C3%8Atre_et_Temps&action=history Is that size really enough to cause problems?
Attachment: Wikinfo-20110713190814.xml Wikinfo-20110713190833.xml Wikinfo-20110713191257.xml
From [email protected] on July 14, 2011 11:52:24
If you have problems when trying to export that article by hand using Special:Export, then it is not a dumpgenerator.py problem. However, I have just exported it by hand without problems. Try again with --resume. The server may be overloaded from time to time (and the slowest ones have problems exporting huge revisions; PHP errors).
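In case it helps reproduce this outside a browser, a manual export like the one described above can be scripted roughly as follows (just a sketch, assuming the standard Special:Export POST interface with 'pages' and 'history' fields; the URL and title are the ones from this thread):

```python
import requests

# POST the page title to Special:Export and check whether the returned XML is
# complete: a valid dump ends with the closing </mediawiki> tag.
r = requests.post('http://www.wikinfo.org/index.php',
                  params={'title': 'Special:Export'},
                  data={'pages': u'Être et Temps', 'history': '1'})
print(r.status_code, len(r.content))
print(r.text.rstrip().endswith('</mediawiki>'))
```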
From [email protected] on May 28, 2012 01:29:09
So, changing this bug: the problem is the infinite loop. getXMLPageCore, after 5 retries, calls getXMLPageCore again for the last revision only, which will fail again and call getXMLPageCore again and again.
Summary: Error loop "XML for ... is wrong"
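For reference, a simplified sketch of the loop (the function name follows dumpgenerator.py, but the body is illustrative, not the actual code): once the retries are exhausted, the fallback call for the last revision recurses with nothing to stop it, so a page that never exports cleanly loops forever. A one-shot flag on the fallback would break the cycle.

```python
import time

MAXRETRIES = 5

def fetch_export_xml(params):
    """Placeholder for the Special:Export request; returns '' when the XML is wrong."""
    return ''

def getXMLPageCore(params, is_fallback=False):
    for c in range(1, MAXRETRIES + 1):
        xml = fetch_export_xml(params)
        if xml:
            return xml
        wait = 20 * c
        print('XML for "%s" is wrong. Waiting %d seconds and reloading...'
              % (params.get('pages', '?'), wait))
        time.sleep(wait)
    if not is_fallback:
        print('Trying to save only the last revision for this page...')
        # Without the is_fallback guard, this call fails again and recurses forever.
        return getXMLPageCore(dict(params, curonly=1), is_fallback=True)
    print('Saving in the errors log, and skipping...')
    return ''
```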
From [email protected] on May 28, 2012 02:36:52
Should be fixed by r675. Example: http://p.defau.lt/?3quwoS3nepAPnje3ro4WQw now gives http://p.defau.lt/?ZB8tXAcnI8c178eseWMGWA
Status: Fixed
From [email protected] on December 08, 2013 13:11:42
Still happening e.g. for http://www.editthis.info/4chan/api.php
From [email protected] on January 31, 2014 04:41:39
To forget about this very annoying issue, I just added sys.exit() right after "if c >= maxretries:" in dumpgenerator.py. Then launcher.py continues to the next wiki in the list. A brute-force workaround...
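For reference, the workaround amounts to something like this (a rough sketch; the variable names are approximate, not the exact dumpgenerator.py context):

```python
import sys

def abort_if_exhausted(c, maxretries=5):
    # Brute-force workaround: once the retry budget for a single page is used up,
    # exit the whole dump; launcher.py then moves on to the next wiki in its list.
    if c >= maxretries:
        print('We have retried %d times, aborting this wiki' % c)
        sys.exit(1)
```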
Recent example:
60987 page titles loaded
Titles saved at... astrocom_wiki_astro_databank-20140628-titles.txt
Retrieving the XML for every page from "start"
XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
We have retried 5 times
MediaWiki error for "Main_Page", network error or whatever...
Trying to save only the last revision for this page...
In this case the wiki needs login for export http://www.astro.com/astro-databank/Special:Export
Emilio J. Rodríguez-Posada, 28/06/2014 12:29:
In this case the wiki needs login for export http://www.astro.com/astro-databank/Special:Export
Reopening https://github.com/WikiTeam/wikiteam/issues/28
- OK: traceback fixed with #156
- but now it loops (see the comment in https://github.com/WikiTeam/wikiteam/pull/157)
- and when I start it asking for only the current version of every page:
python dumpgenerator.py --api=http://skilledtests.com/wiki/api.php --xml --images --curonly
it stops :-(
2526 page titles loaded
Titles saved at... skilledtestscom_wiki-20140711-titles.txt
Retrieving the XML for every page from "start"
XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
We have retried 5 times
MediaWiki error for "Main_Page", network error or whatever...
Saving in the errors log, and skipping...
XML export on this wiki is broken, quitting.
- I thought that if I removed these special chars (��) from MediaWiki:Sidebar (1), the 'XML for "Main_Page" is wrong' error might go away.
- but: NO
- still "python dumpgenerator.py --api=http://skilledtests.com/wiki/api.php --xml --images --curonly" exits
- still unable to use your tool successfully :-(
- nothing more done.
(1) http://skilledtests.com/wiki/index.php5?title=MediaWiki%3ASidebar&diff=12630&oldid=12202
I noticed something.
- When I start it with command (1) below, it starts to read. So the difference is that I have .php5 (not .php). But it crashed; see #158
- command (2) below was successful, and so was the dump verification :-)
(1) python dumpgenerator.py --index=http://www.skilledtests.com/wiki/index.php5 --xml --curonly
(2) python dumpgenerator.py --index=http://www.skilledtests.com/wiki/index.php5 --xml --images
Hello Erkan, I can run it with this:
python dumpgenerator.py --index=http://skilledtests.com/wiki/index.php5 --xml --images
As you said, it crashes, but can be resumed with:
python dumpgenerator.py --index=http://skilledtests.com/wiki/index.php5 --xml --images --resume --path=skilledtestscom_wiki5-20140711-wikidump/
The real issue here is a code regression: in the past we allowed the --api and --index parameters at the same time. Now they are mutually exclusive, so if you provide only --api, it fails to work out the index.php URL, because it is really index.php5. Currently the only solution is to use --index=...index.php5
I'm going to open an issue for the regression.
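To illustrate the regression, the difference between guessing the index from the API URL and asking the wiki itself looks roughly like this (a sketch only, not dumpgenerator.py's actual logic; 'server' and 'script' come from the standard siteinfo query):

```python
import requests

def guess_index_from_api(api_url):
    # Naive guess: just swap api.php for index.php. This is what breaks on
    # skilledtests.com, where the script is really index.php5.
    return api_url.rsplit('api.php', 1)[0] + 'index.php'

def index_from_siteinfo(api_url):
    # Ask the wiki itself: siteinfo 'server' + 'script' point at the real
    # index script, whatever its extension is.
    r = requests.get(api_url, params={'action': 'query', 'meta': 'siteinfo',
                                      'format': 'json'})
    general = r.json()['query']['general']
    return general['server'] + general['script']
```

For example, index_from_siteinfo('http://skilledtests.com/wiki/api.php') should return the index.php5 URL directly, assuming the wiki reports it in siteinfo.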
OK (thx for opening issue 160)
Can someone give an example of an URL that still produces this error?
Can someone give an example of an URL that still produces this error?
I have so many that I'm not able to track them (I'll soon reintroduce the abort hack above). Some candidates: http://en.starbricks.t15.org/wiki/api.php http://biou.net/api.php http://openttd.lachlanstevens.net/api.php http://wiki.toplist.cz/api.php http://cyber.law.harvard.edu/googlebooks/api.php http://en.cinoku.com/w/api.php http://albens73.fr/wiki/api.php http://wiki.mostlegit.com/api.php http://shafafsazi.com/w/api.php http://moodleforum.no/api.php http://wiki.lankou.org/w/api.php http://icube-ipp.unistra.fr/en/api.php http://icube-macepv.unistra.fr/en/api.php http://cyber.law.harvard.edu/hoap/api.php http://wiki.murrhardt.net/api.php http://www.runnersassari.it/wiki/api.php
Can someone give an example of an URL that still produces this error?
I have so many that I'm not able to track them (I'll soon reintroduce the abort hack above). Some candidates: http://en.starbricks.t15.org/wiki/api.php
This API is giving me a fatal error.
http://biou.net/api.php
The API seems fine, but index.php gives a fatal error: http://biou.net/index.php/Main_Page
http://openttd.lachlanstevens.net/api.php
I was able to complete a dump of this wiki successfully.
http://wiki.toplist.cz/api.php
Because the script is using http://wiki.toplist.cz/index.php instead of just http://wiki.toplist.cz/ as the index.
http://cyber.law.harvard.edu/googlebooks/api.php
Again, it's trying to use https://cyber.law.harvard.edu/googlebooks/index.php instead of the real index https://cyber.law.harvard.edu/googlebooks/
http://en.cinoku.com/w/api.php
You do not have permission to
You are not allowed to execute the action you have requested.
http://albens73.fr/wiki/api.php
http://albens73.fr/wiki/index.php/Sp%C3%A9cial:Exporter works
http://albens73.fr/wiki/index.php/Special:Export doesn't work
http://wiki.mostlegit.com/api.php
No error for me.
http://shafafsazi.com/w/api.php
Special:Export gives:
The action you have requested is limited to users in the group: Users.
http://moodleforum.no/api.php
Again, using the page "index.php".
http://wiki.lankou.org/w/api.php
Another wrong "index.php".
http://icube-ipp.unistra.fr/en/api.php
VERY strange, says you need to log in to use Special:Export, but other special pages work fine.
http://icube-macepv.unistra.fr/en/api.php
Same error as previous.
http://cyber.law.harvard.edu/hoap/api.php
Yet another wrong "index.php"
http://wiki.murrhardt.net/api.php
More incorrect "index.php" use.
http://www.runnersassari.it/wiki/api.php
Another broken wiki.
http://www.runnersassari.it/wiki/index.php?title=Speciale:Esporta/Discussioni_Runner_Sassari_Enciclopedia:Informazioni returns malformed XML, and http://www.runnersassari.it/wiki/index.php?title=Discussioni_Runner_Sassari_Enciclopedia:Informazioni gives a database error.
Thanks pir², those two fixed several. With https://github.com/WikiTeam/wikiteam/commit/bdc7c9bf069cf0341f86ea4d2549dfb08bc549f4 I fixed albens: https://archive.org/details/wiki-albens73fr_wiki
Now trying the others mentioned in the comment above.
http://en.starbricks.t15.org/wiki/index.php failed with a 502; maybe we're too fast? We should adjust the delay when we get certain errors (> 404 or >= 500).
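One way the delay adjustment could look (an assumption, not current dumpgenerator.py behaviour): back off whenever the status code suggests the server is overloaded.

```python
import time
import requests

def get_with_backoff(url, session=None, max_attempts=5, delay=10):
    # Retry with an increasing delay when the status code suggests the server
    # is struggling (here simply anything >= 500), instead of hitting it
    # again at the same rate.
    session = session or requests.Session()
    response = None
    for attempt in range(max_attempts):
        response = session.get(url)
        if response.status_code < 500:
            return response
        print('HTTP %d for %s, waiting %d seconds...'
              % (response.status_code, url, delay))
        time.sleep(delay)
        delay *= 2
    return response
```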
mostlegit still fails for me if I pass the API, probably because of its short URLs: http://www.mostlegit.com/wiki/?title=Special:Export/Main_Page
http://shafafsazi.com/w/api.php has no $wgServer:
File "./dumpgenerator.py", line 1434, in checkAPI
    index = result['query']['general']['server'] + \
KeyError: 'server'
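A possible defensive fix for that KeyError (a sketch; the fallback of deriving the index from the API URL is my assumption, not what checkAPI currently does):

```python
def index_from_general(general, api_url):
    # Prefer the values reported by the wiki itself.
    if 'server' in general and 'script' in general:
        return general['server'] + general['script']
    # Fallback when $wgServer is not exposed via siteinfo: derive the index
    # URL from the API URL instead of raising KeyError.
    return api_url.rsplit('api.php', 1)[0] + 'index.php'
```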
More examples, api.php URLs as provided by not-archived.py:
https://wiki.churchoffoxx.net/api.php http://mediawikibootstrapskin.co.uk/nexus/api.php http://jeffreylotterman.com/api.php http://holger-drefs.de/w/api.php http://planet417.com/api.php http://wiki.nbi.ku.dk/w/nanophys_cleanroom/api.php http://wiki.lignes-cardinales.fr/api.php http://www.elfhame.org/w/api.php http://wiki.openbroadcaster.com/api.php http://auth.unidog.de/api.php http://atheistsurvivalguide.org/wiki/api.php http://scoaa.net/api.php http://gridpack.org/wiki/api.php https://dbl.lmi.org/api.php https://www.semantic-apps.com/mediawiki/api.php
And probably: http://www.secure-abap.de/api.php http://www.suppfinderwiki.com/api.php http://www.aik-ev.de/api.php http://forge.openbravo.com/plugins/mwiki/api.php https://support.guest.it/wiki/api.php http://www.cosmiccuttlefish.org/api.php http://www.genderpedia.net/api.php https://debu-lab.info/wiki/api.php http://www.topgearit.net/api.php http://sco.uncyclopedia.org.uk/api.php https://restauth.net/api.php https://wiki.unitedplatform.com/api.php http://www.zauber-wiki.org/api.php http://www.informaticalessen.be/api.php https://www.cs.colostate.edu/wiki/mediawiki/api.php https://www.ubuntu.lt/wiki/api.php http://www.brandschutz-wiki.de/api.php http://2015.brucon.org/api.php http://cubscoutpack.com/api.php https://salamatechwiki.org/api.php https://fluswiki.hfwu.de/api.php
On a set of 8000 wikis, I've not yet needed to manually break any "XML is wrong" loop, so I guess our skipping is working at last. However, I'm also not seeing any wiki being archived which previously failed :[ so we didn't substantially increase our success rate. Not sure what to do.
What about http://neowiki.neooffice.org/api.php? I cannot get it working. :-(
I am still having this looping issue.
api.php for reproducibility testing: http://www.tanasinn.info/api.php
Operating System: Windows 7 Professional Version 6.1 (Build 7601: Service Pack 1) [64 bit]
Command Prompt output:
C:\Users\Patrick\Documents\WinPython-64bit-2.7.12.3Zero\scripts>cd C:\Users\Patrick\Documents\SBARG\Wikis\wikiteam
C:\Users\Patrick\Documents\SBARG\Wikis\wikiteam>dumpgenerator.py --api=http://www.tanasinn.info/api.php --xml --images
C:\Users\Patrick\Documents\SBARG\Wikis\wikiteam>python dumpgenerator.py --api=http://www.tanasinn.info/api.php --xml --images
Checking API... http://www.tanasinn.info/api.php
Checking API... http://tanasinn.info/api.php
API is OK: http://tanasinn.info/api.php
Checking index.php... http://www.tanasinn.info/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3) #
# More info at: https://github.com/WikiTeam/wikiteam #
#########################################################################
#########################################################################
# Copyright (C) 2011-2016 WikiTeam developers #
# This program is free software: you can redistribute it and/or modify #
# it under the terms of the GNU General Public License as published by #
# the Free Software Foundation, either version 3 of the License, or #
# (at your option) any later version. #
# #
# This program is distributed in the hope that it will be useful, #
# but WITHOUT ANY WARRANTY; without even the implied warranty of #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #
# GNU General Public License for more details. #
# #
# You should have received a copy of the GNU General Public License #
# along with this program. If not, see <http://www.gnu.org/licenses/>. #
#########################################################################
Analysing http://tanasinn.info/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
20 namespaces found
Retrieving titles in the namespace 0
... 1167 titles retrieved in the namespace 0
Retrieving titles in the namespace 1
. 96 titles retrieved in the namespace 1
Retrieving titles in the namespace 2
. 198 titles retrieved in the namespace 2
Retrieving titles in the namespace 3
. 61 titles retrieved in the namespace 3
Retrieving titles in the namespace 4
. 11 titles retrieved in the namespace 4
Retrieving titles in the namespace 5
. 0 titles retrieved in the namespace 5
Retrieving titles in the namespace 6
. 455 titles retrieved in the namespace 6
Retrieving titles in the namespace 7
. 22 titles retrieved in the namespace 7
Retrieving titles in the namespace 8
. 58 titles retrieved in the namespace 8
Retrieving titles in the namespace 9
. 1 titles retrieved in the namespace 9
Retrieving titles in the namespace 10
. 132 titles retrieved in the namespace 10
Retrieving titles in the namespace 103
. 0 titles retrieved in the namespace 103
Retrieving titles in the namespace 12
. 4 titles retrieved in the namespace 12
Retrieving titles in the namespace 13
. 1 titles retrieved in the namespace 13
Retrieving titles in the namespace 14
. 120 titles retrieved in the namespace 14
Retrieving titles in the namespace 15
. 6 titles retrieved in the namespace 15
Retrieving titles in the namespace 11
. 10 titles retrieved in the namespace 11
Retrieving titles in the namespace 100
. 39 titles retrieved in the namespace 100
Retrieving titles in the namespace 102
. 0 titles retrieved in the namespace 102
Retrieving titles in the namespace 101
. 1 titles retrieved in the namespace 101
Titles saved at... tanasinninfo-20161130-titles.txt
2382 page titles loaded
Retrieving the XML for every page from "start"
In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
Traceback (most recent call last):
File "dumpgenerator.py", line 2084, in <module>
main()
File "dumpgenerator.py", line 2076, in main
createNewDump(config=config, other=other)
File "dumpgenerator.py", line 1651, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "dumpgenerator.py", line 682, in generateXMLDump
header, config = getXMLHeader(config=config, session=session)
File "dumpgenerator.py", line 440, in getXMLHeader
xml = "".join([x for x in getXMLPage(config=config, title=randomtitle, verbo
se=False, session=session)])
File "dumpgenerator.py", line 589, in getXMLPage
xml = getXMLPageCore(params=params, config=config, session=session)
File "dumpgenerator.py", line 516, in getXMLPageCore
time.sleep(wait)
KeyboardInterrupt
C:\Users\Patrick\Documents\SBARG\Wikis\wikiteam>python dumpgenerator.py --api=http://www.tanasinn.info/api.php --xml --images > tanasinninfo-2016-dumplog.log
Traceback (most recent call last):
File "dumpgenerator.py", line 2084, in <module>
main()
C:\Users\Patrick\Documents\SBARG\Wikis\wikiteam>python dumpgenerator.py --api=http://www.tanasinn.info/api.php --xml --images
Checking API... http://www.tanasinn.info/api.php
Checking API... http://tanasinn.info/api.php
API is OK: http://tanasinn.info/api.php
Checking index.php... http://www.tanasinn.info/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3) #
# More info at: https://github.com/WikiTeam/wikiteam #
#########################################################################
#########################################################################
# Copyright (C) 2011-2016 WikiTeam developers #
# This program is free software: you can redistribute it and/or modify #
# it under the terms of the GNU General Public License as published by #
# the Free Software Foundation, either version 3 of the License, or #
# (at your option) any later version. #
# #
# This program is distributed in the hope that it will be useful, #
# but WITHOUT ANY WARRANTY; without even the implied warranty of #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #
# GNU General Public License for more details. #
# #
# You should have received a copy of the GNU General Public License #
# along with this program. If not, see <http://www.gnu.org/licenses/>. #
#########################################################################
Analysing http://tanasinn.info/api.php
Warning!: "./tanasinninfo-20161130-wikidump" path exists
There is a dump in "./tanasinninfo-20161130-wikidump", probably incomplete.
If you choose resume, to avoid conflicts, the parameters you have chosen in the current session will be ignored
and the parameters available in "./tanasinninfo-20161130-wikidump/config.txt" will be loaded.
Do you want to resume ([yes, y], [no, n])? y
You have selected: YES
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML is corrupt? Regenerating...
Retrieving the XML for every page from "start"
In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
We have retried 5 times
MediaWiki error for "Main_Page", network error or whatever...
Trying to save only the last revision for this page...
In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
We have retried 5 times
MediaWiki error for "Main_Page", network error or whatever...
Saving in the errors log, and skipping...
Trying the local name for the Special namespace instead
In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
We have retried 5 times
MediaWiki error for "Main_Page", network error or whatever...
Trying to save only the last revision for this page...
In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
We have retried 5 times
MediaWiki error for "Main_Page", network error or whatever...
Saving in the errors log, and skipping...
Traceback (most recent call last):
File "dumpgenerator.py", line 2084, in <module>
main()
File "dumpgenerator.py", line 2074, in main
resumePreviousDump(config=config, other=other)
File "dumpgenerator.py", line 1735, in resumePreviousDump
config=config, titles=titles, session=other['session'])
File "dumpgenerator.py", line 682, in generateXMLDump
header, config = getXMLHeader(config=config, session=session)
File "dumpgenerator.py", line 467, in getXMLHeader
header = xml.split('</mediawiki>')[0]
UnboundLocalError: local variable 'xml' referenced before assignment
C:\Users\Patrick\Documents\SBARG\Wikis\wikiteam>