
Error loop "XML for ... is wrong"

Open emijrp opened this issue 10 years ago • 27 comments

From [email protected] on July 12, 2011 20:21:47

Apparently this error is quite frequent with some characters. This starts a never-ending loop, see e.g. http://p.defau.lt/?vUHNXKoaCOfNkeor_0HmCg

I removed that title from the title list and resumed the dump; the following pages were not downloaded, perhaps because they were invalid: http://p.defau.lt/?KeDck2rQZqGlp9MWmYmB_Q

Could those (invalid?) titles be the actual problem behind the error?

Original issue: http://code.google.com/p/wikiteam/issues/detail?id=26

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on July 12, 2011 13:02:17

I don't think it is due to weird chars.

Look at the edit history for that page (is it a big history?) and try to export it manually using Special:Export (and open the resulting XML file). Is everything OK?
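For what it's worth, that manual check can also be scripted. A minimal sketch, assuming the standard Special:Export form fields (title=Special:Export, action=submit, and a pages field); only the wikinfo.org URL and page title are from this thread:

import requests
import xml.dom.minidom

# Ask Special:Export for the page's full history and check whether the
# returned XML parses at all, which is roughly what "XML ... is wrong" means.
r = requests.post('http://www.wikinfo.org/index.php',
                  params={'title': 'Special:Export', 'action': 'submit'},
                  data={'pages': u'Être et Temps'})
try:
    xml.dom.minidom.parseString(r.content)
    print('XML parses fine (%d bytes)' % len(r.content))
except Exception as e:
    print('Broken XML: %s' % e)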

Status: New

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on July 13, 2011 12:17:23

I tried an export with "Être et Temps", then with "Être et Temps" plus all the following titles, then with only the following titles. Whenever "Être et Temps" is included, the XML is invalid.

The history has only one big revision (1,389,912 bytes): http://www.wikinfo.org/index.php?title=%C3%8Atre_et_Temps&action=history

Is that size really enough to cause problems?

Attachment: Wikinfo-20110713190814.xml Wikinfo-20110713190833.xml Wikinfo-20110713191257.xml

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on July 14, 2011 11:52:24

If you have problems when manually exporting that article through Special:Export, then it is not a dumpgenerator.py problem. That said, I have just exported it manually without problems. Try again with --resume. The server may be overloaded from time to time (and the slowest ones have problems exporting huge revisions; PHP errors).

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on May 28, 2012 01:29:09

So, refocusing this bug: the problem is the infinite loop. getXMLPageCore, after 5 retries, calls getXMLPageCore again for the last revision only; that call fails the same way and calls getXMLPageCore again and again.
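A minimal sketch of that call pattern (the fetch and looks_valid helpers are hypothetical stand-ins for the real HTTP request and sanity check; running this against an always-failing server never returns):

import time

def fetch(params):
    # Hypothetical stand-in for the Special:Export HTTP request.
    return ''  # simulate a server that always returns broken XML

def looks_valid(xml):
    # Hypothetical stand-in for the sanity check on the response.
    return '</page>' in xml

def getXMLPageCore(params, maxretries=5):
    for c in range(1, maxretries):
        xml = fetch(params)
        if looks_valid(xml):
            return xml
        time.sleep(20 * c)  # 20, 40, 60, 80 seconds, as in the logs
    # Retries exhausted: fall back to requesting the last revision only...
    params['curonly'] = 1
    # ...by calling ourselves again, which fails identically. Infinite loop.
    return getXMLPageCore(params, maxretries)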

Summary: Error loop "XML for ... is wrong"

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on May 28, 2012 02:36:52

Should be fixed by r675. Example: http://p.defau.lt/?3quwoS3nepAPnje3ro4WQw now gives http://p.defau.lt/?ZB8tXAcnI8c178eseWMGWA

Status: Fixed

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on June 12, 2012 00:00:14

Issue 51 has been merged into this issue.

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on June 22, 2012 00:14:31

Reopened because "fix" was reverted.

Status: New

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on November 09, 2012 02:05:55

Blocking: wikiteam:44

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on December 08, 2013 13:11:42

Still happening e.g. for http://www.editthis.info/4chan/api.php

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on January 31, 2014 04:41:39

To forget about this very annoying issue, I just added sys.exit() right after "if c >= maxretries:" in dumpgenerator.py. launcher.py then continues to the next wiki in the list. A brute-force workaround...
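The workaround amounts to something like this (a sketch, not the exact patch; c and maxretries are the names quoted above, the wrapper function is hypothetical):

import sys

def abortIfExhausted(c, maxretries):
    # Right after "if c >= maxretries:", exit the whole process instead of
    # retrying further; launcher.py then continues with the next wiki.
    if c >= maxretries:
        print('We have retried %d times' % c)
        sys.exit()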

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on January 31, 2014 07:15:38

Blocking: wikiteam:33

emijrp avatar Jun 25 '14 10:06 emijrp

Recent example:

60987 page titles loaded
Titles saved at... astrocom_wiki_astro_databank-20140628-titles.txt
Retrieving the XML for every page from "start"
XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
We have retried 5 times
MediaWiki error for "Main_Page", network error or whatever...
Trying to save only the last revision for this page...

nemobis avatar Jun 28 '14 09:06 nemobis

In this case the wiki needs login for export http://www.astro.com/astro-databank/Special:Export

emijrp avatar Jun 28 '14 10:06 emijrp

Emilio J. Rodríguez-Posada, 28/06/2014 12:29:

In this case the wiki needs login for export http://www.astro.com/astro-databank/Special:Export

Reopening https://github.com/WikiTeam/wikiteam/issues/28

nemobis avatar Jun 28 '14 10:06 nemobis

  1. OK: traceback fixed with #156
  2. but now it loops (see comment in https://github.com/WikiTeam/wikiteam/pull/157 )
  3. and when I start it with (only current version of every page): python dumpgenerator.py --api=http://skilledtests.com/wiki/api.php --xml --images --curonly

it stops :-(

2526 page titles loaded
Titles saved at... skilledtestscom_wiki-20140711-titles.txt
Retrieving the XML for every page from "start"
XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
We have retried 5 times
MediaWiki error for "Main_Page", network error or whatever...
Saving in the errors log, and skipping...
XML export on this wiki is broken, quitting.

Erkan-Yilmaz avatar Jul 11 '14 08:07 Erkan-Yilmaz

  1. I thought: if I remove these special chars (��) from MediaWiki:Sidebar (1), the 'XML for "Main_Page" is wrong' error might be gone.
  2. but: NO
  3. "python dumpgenerator.py --api=http://skilledtests.com/wiki/api.php --xml --images --curonly" still exits
  4. still unable to use your tool successfully :-(
  5. nothing more done.

(1) http://skilledtests.com/wiki/index.php5?title=MediaWiki%3ASidebar&diff=12630&oldid=12202

Erkan-Yilmaz avatar Jul 11 '14 16:07 Erkan-Yilmaz

I noticed something.

  1. When I start it with command (1) it starts to read; so the difference is that I have .php5 (not .php). But it CRASHed, see #158
  2. command (2) was successful, including dump verification :-)

(1) python dumpgenerator.py --index=http://www.skilledtests.com/wiki/index.php5 --xml --curonly
(2) python dumpgenerator.py --index=http://www.skilledtests.com/wiki/index.php5 --xml --images

Erkan-Yilmaz avatar Jul 11 '14 16:07 Erkan-Yilmaz

Hello Erkan, I can run it with this:

python dumpgenerator.py --index=http://skilledtests.com/wiki/index.php5 --xml --images

As you said, it crashes, but can be resumed with:

python dumpgenerator.py --index=http://skilledtests.com/wiki/index.php5 --xml --images --resume --path=skilledtestscom_wiki5-20140711-wikidump/

The real issue here is a code regression: in the past we allowed the --api and --index parameters at the same time. Now they are mutually exclusive, so if you provide only --api, the script fails to calculate the index.php URL, because here it is really index.php5. Currently the only solution is to use --index=...index.php5
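The failing derivation amounts to something like this (a hypothetical sketch, not the actual code):

def guessIndex(apiurl):
    # Derive the index.php URL from the api.php URL; this is the step that
    # breaks on wikis whose real entry point is index.php5.
    return apiurl.split('api.php')[0] + 'index.php'

print(guessIndex('http://skilledtests.com/wiki/api.php'))
# prints http://skilledtests.com/wiki/index.php,
# but the real entry point is http://skilledtests.com/wiki/index.php5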

I'm going to open an issue for the regression.

emijrp avatar Jul 11 '14 17:07 emijrp

OK (thx for opening issue 160)

Erkan-Yilmaz avatar Jul 11 '14 20:07 Erkan-Yilmaz

Can someone give an example of a URL that still produces this error?

PiRSquared17 avatar Nov 12 '14 23:11 PiRSquared17

Can someone give an example of a URL that still produces this error?

I have so many that I'm not able to track them (I'll soon reintroduce the abort hack above). Some candidates:

http://en.starbricks.t15.org/wiki/api.php
http://biou.net/api.php
http://openttd.lachlanstevens.net/api.php
http://wiki.toplist.cz/api.php
http://cyber.law.harvard.edu/googlebooks/api.php
http://en.cinoku.com/w/api.php
http://albens73.fr/wiki/api.php
http://wiki.mostlegit.com/api.php
http://shafafsazi.com/w/api.php
http://moodleforum.no/api.php
http://wiki.lankou.org/w/api.php
http://icube-ipp.unistra.fr/en/api.php
http://icube-macepv.unistra.fr/en/api.php
http://cyber.law.harvard.edu/hoap/api.php
http://wiki.murrhardt.net/api.php
http://www.runnersassari.it/wiki/api.php

nemobis avatar Nov 26 '14 19:11 nemobis

Can someone give an example of a URL that still produces this error?

I have so many that I'm not able to track them (I'll soon reintroduce the abort hack above). Some candidates: http://en.starbricks.t15.org/wiki/api.php

This API is giving me a fatal error.

http://biou.net/api.php

The API seems fine, but index.php gives a fatal error: http://biou.net/index.php/Main_Page

http://openttd.lachlanstevens.net/api.php

I was able to complete a dump of this wiki successfully.

http://wiki.toplist.cz/api.php

Because the script is using http://wiki.toplist.cz/index.php instead of just http://wiki.toplist.cz/ as the index.

http://cyber.law.harvard.edu/googlebooks/api.php

Again, it's trying to use https://cyber.law.harvard.edu/googlebooks/index.php instead of the real index https://cyber.law.harvard.edu/googlebooks/

http://en.cinoku.com/w/api.php

You do not have permission to , for the following reason:

You are not allowed to execute the action you have requested.

http://albens73.fr/wiki/api.php

http://albens73.fr/wiki/index.php/Sp%C3%A9cial:Exporter works

http://albens73.fr/wiki/index.php/Special:Export doesn't work

http://wiki.mostlegit.com/api.php

No error for me.

http://shafafsazi.com/w/api.php

Special:Export gives:

The action you have requested is limited to users in the group: Users.

http://moodleforum.no/api.php

Again, using the page "index.php".

http://wiki.lankou.org/w/api.php

Another wrong "index.php".

http://icube-ipp.unistra.fr/en/api.php

VERY strange, says you need to log in to use Special:Export, but other special pages work fine.

http://icube-macepv.unistra.fr/en/api.php

Same error as previous.

http://cyber.law.harvard.edu/hoap/api.php

Yet another wrong "index.php"

http://wiki.murrhardt.net/api.php

More incorrect "index.php" use.

http://www.runnersassari.it/wiki/api.php

Another broken wiki.

http://www.runnersassari.it/wiki/index.php?title=Speciale:Esporta/Discussioni_Runner_Sassari_Enciclopedia:Informazioni returns malformed XML, and http://www.runnersassari.it/wiki/index.php?title=Discussioni_Runner_Sassari_Enciclopedia:Informazioni gives a database error.

PiRSquared17 avatar Feb 28 '15 14:02 PiRSquared17

Thanks pir², those two fixed several. With https://github.com/WikiTeam/wikiteam/commit/bdc7c9bf069cf0341f86ea4d2549dfb08bc549f4 I fixed albens: https://archive.org/details/wiki-albens73fr_wiki

Now trying the others mentioned in the comment above.

http://en.starbricks.t15.org/wiki/index.php failed with a 502; maybe we're too fast? We should adjust the delay when we get certain HTTP errors (> 404 or >= 500).
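Such an adjustment could look like this (a hypothetical helper using the requests library that dumpgenerator.py already depends on; not the project's actual code):

import time
import requests

def fetchWithBackoff(url, params=None, delay=10, maxdelay=120, maxretries=5):
    # Retry with a growing delay whenever the server answers with a status
    # above 404 (e.g. 502 Bad Gateway), instead of hitting it again at once.
    for attempt in range(maxretries):
        r = requests.get(url, params=params)
        if r.status_code <= 404:
            return r
        time.sleep(delay)
        delay = min(delay * 2, maxdelay)  # capped exponential backoff
    return r  # hand the caller the last failing response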

mostlegit still fails for me if I pass the api, probably short URL: http://www.mostlegit.com/wiki/?title=Special:Export/Main_Page

http://shafafsazi.com/w/api.php has no $wgServer:

  File "./dumpgenerator.py", line 1434, in checkAPI
    index = result['query']['general']['server'] + \
KeyError: 'server'

nemobis avatar Mar 08 '15 18:03 nemobis

More examples, api.php URLs as provided by not-archived.py:

https://wiki.churchoffoxx.net/api.php http://mediawikibootstrapskin.co.uk/nexus/api.php http://jeffreylotterman.com/api.php http://holger-drefs.de/w/api.php http://planet417.com/api.php http://wiki.nbi.ku.dk/w/nanophys_cleanroom/api.php http://wiki.lignes-cardinales.fr/api.php http://www.elfhame.org/w/api.php http://wiki.openbroadcaster.com/api.php http://auth.unidog.de/api.php http://atheistsurvivalguide.org/wiki/api.php http://scoaa.net/api.php http://gridpack.org/wiki/api.php https://dbl.lmi.org/api.php https://www.semantic-apps.com/mediawiki/api.php

And probably:

http://www.secure-abap.de/api.php
http://www.suppfinderwiki.com/api.php
http://www.aik-ev.de/api.php
http://forge.openbravo.com/plugins/mwiki/api.php
https://support.guest.it/wiki/api.php
http://www.cosmiccuttlefish.org/api.php
http://www.genderpedia.net/api.php
https://debu-lab.info/wiki/api.php
http://www.topgearit.net/api.php
http://sco.uncyclopedia.org.uk/api.php
https://restauth.net/api.php
https://wiki.unitedplatform.com/api.php
http://www.zauber-wiki.org/api.php
http://www.informaticalessen.be/api.php
https://www.cs.colostate.edu/wiki/mediawiki/api.php
https://www.ubuntu.lt/wiki/api.php
http://www.brandschutz-wiki.de/api.php
http://2015.brucon.org/api.php
http://cubscoutpack.com/api.php
https://salamatechwiki.org/api.php
https://fluswiki.hfwu.de/api.php

nemobis avatar Jul 05 '15 14:07 nemobis

On a set of 8000 wikis, I've so far not needed to manually break any "XML is wrong" loop, so I guess our skipping is finally working. However, I'm also not seeing any wiki being archived which previously failed :[ so we didn't substantially increase our success rate. Not sure what to do.

nemobis avatar Jul 06 '15 06:07 nemobis

What about http://neowiki.neooffice.org/api.php? I cannot get it working. :-(

dennisroczek avatar Nov 03 '15 23:11 dennisroczek

I am still having this looping issue.

api.php for reproducibility testing: http://www.tanasinn.info/api.php
Operating System: Windows 7 Professional, Version 6.1 (Build 7601: Service Pack 1) [64 bit]

Command Prompt output:

C:\Users\Patrick\Documents\WinPython-64bit-2.7.12.3Zero\scripts>cd C:\Users\Patr
ick\Documents\SBARG\Wikis\wikiteam

C:\Users\Patrick\Documents\SBARG\Wikis\wikiteam>dumpgenerator.py --api=http://ww
w.tanasinn.info/api.php --xml --images

C:\Users\Patrick\Documents\SBARG\Wikis\wikiteam>python dumpgenerator.py --api=ht
tp://www.tanasinn.info/api.php --xml --images
Checking API... http://www.tanasinn.info/api.php
Checking API... http://tanasinn.info/api.php
API is OK: http://tanasinn.info/api.php
Checking index.php... http://www.tanasinn.info/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2016 WikiTeam developers                           #

# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing http://tanasinn.info/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
20 namespaces found
    Retrieving titles in the namespace 0
...    1167 titles retrieved in the namespace 0
    Retrieving titles in the namespace 1
.    96 titles retrieved in the namespace 1
    Retrieving titles in the namespace 2
.    198 titles retrieved in the namespace 2
    Retrieving titles in the namespace 3
.    61 titles retrieved in the namespace 3
    Retrieving titles in the namespace 4
.    11 titles retrieved in the namespace 4
    Retrieving titles in the namespace 5
.    0 titles retrieved in the namespace 5
    Retrieving titles in the namespace 6
.    455 titles retrieved in the namespace 6
    Retrieving titles in the namespace 7
.    22 titles retrieved in the namespace 7
    Retrieving titles in the namespace 8
.    58 titles retrieved in the namespace 8
    Retrieving titles in the namespace 9
.    1 titles retrieved in the namespace 9
    Retrieving titles in the namespace 10
.    132 titles retrieved in the namespace 10
    Retrieving titles in the namespace 103
.    0 titles retrieved in the namespace 103
    Retrieving titles in the namespace 12
.    4 titles retrieved in the namespace 12
    Retrieving titles in the namespace 13
.    1 titles retrieved in the namespace 13
    Retrieving titles in the namespace 14
.    120 titles retrieved in the namespace 14
    Retrieving titles in the namespace 15
.    6 titles retrieved in the namespace 15
    Retrieving titles in the namespace 11
.    10 titles retrieved in the namespace 11
    Retrieving titles in the namespace 100
.    39 titles retrieved in the namespace 100
    Retrieving titles in the namespace 102
.    0 titles retrieved in the namespace 102
    Retrieving titles in the namespace 101
.    1 titles retrieved in the namespace 101
Titles saved at... tanasinninfo-20161130-titles.txt
2382 page titles loaded
Retrieving the XML for every page from "start"
    In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading
...
    In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading
...
    In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading
...
    In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading
...
Traceback (most recent call last):
  File "dumpgenerator.py", line 2084, in <module>
    main()
  File "dumpgenerator.py", line 2076, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1651, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "dumpgenerator.py", line 682, in generateXMLDump
    header, config = getXMLHeader(config=config, session=session)
  File "dumpgenerator.py", line 440, in getXMLHeader
    xml = "".join([x for x in getXMLPage(config=config, title=randomtitle, verbo
se=False, session=session)])
  File "dumpgenerator.py", line 589, in getXMLPage
    xml = getXMLPageCore(params=params, config=config, session=session)
  File "dumpgenerator.py", line 516, in getXMLPageCore
    time.sleep(wait)
KeyboardInterrupt

C:\Users\Patrick\Documents\SBARG\Wikis\wikiteam>python dumpgenerator.py --api=ht
tp://www.tanasinn.info/api.php --xml --images > tanasinninfo-2016-dumplog.log
Traceback (most recent call last):
  File "dumpgenerator.py", line 2084, in <module>
    main()


C:\Users\Patrick\Documents\SBARG\Wikis\wikiteam>python dumpgenerator.py --api=ht
tp://www.tanasinn.info/api.php --xml --images
Checking API... http://www.tanasinn.info/api.php
Checking API... http://tanasinn.info/api.php
API is OK: http://tanasinn.info/api.php
Checking index.php... http://www.tanasinn.info/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2016 WikiTeam developers                           #

# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing http://tanasinn.info/api.php

Warning!: "./tanasinninfo-20161130-wikidump" path exists
There is a dump in "./tanasinninfo-20161130-wikidump", probably incomplete.
If you choose resume, to avoid conflicts, the parameters you have chosen in the
current session will be ignored
and the parameters available in "./tanasinninfo-20161130-wikidump/config.txt" wi
ll be loaded.
Do you want to resume ([yes, y], [no, n])? y
You have selected: YES
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML is corrupt? Regenerating...
Retrieving the XML for every page from "start"
    In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading
...
    In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading
...
    In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading
...
    In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading
...
    We have retried 5 times
    MediaWiki error for "Main_Page", network error or whatever...
    Trying to save only the last revision for this page...
    In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading
...
    In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading
...
    In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading
...
    In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading
...
    We have retried 5 times
    MediaWiki error for "Main_Page", network error or whatever...
    Saving in the errors log, and skipping...
Trying the local name for the Special namespace instead
    In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading
...
    In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading
...
    In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading
...
    In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading
...
    We have retried 5 times
    MediaWiki error for "Main_Page", network error or whatever...
    Trying to save only the last revision for this page...
    In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading
...
    In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading
...
    In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading
...
    In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading
...
    We have retried 5 times
    MediaWiki error for "Main_Page", network error or whatever...
    Saving in the errors log, and skipping...
Traceback (most recent call last):
  File "dumpgenerator.py", line 2084, in <module>
    main()
  File "dumpgenerator.py", line 2074, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1735, in resumePreviousDump
    config=config, titles=titles, session=other['session'])
  File "dumpgenerator.py", line 682, in generateXMLDump
    header, config = getXMLHeader(config=config, session=session)
  File "dumpgenerator.py", line 467, in getXMLHeader
    header = xml.split('</mediawiki>')[0]
UnboundLocalError: local variable 'xml' referenced before assignment

C:\Users\Patrick\Documents\SBARG\Wikis\wikiteam>
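The final traceback shows why the resume attempt dies: when every retry fails, getXMLHeader reaches header = xml.split('</mediawiki>')[0] without xml ever having been assigned. A defensive sketch of a guard (not the project's actual fix; getXMLPage is the real function from the traceback, everything else here is hypothetical):

import sys

def getXMLHeader(config, session):
    randomtitle = 'Main_Page'  # the real code picks a title to export
    xml = ''  # bind the name up front so a total failure cannot crash here
    try:
        xml = ''.join(getXMLPage(config=config, title=randomtitle,
                                 verbose=False, session=session))
    except Exception as e:
        print('Could not retrieve the XML header: %s' % e)
    if '</mediawiki>' not in xml:
        # Fail loudly but cleanly instead of with an UnboundLocalError.
        sys.exit('XML export on this wiki is broken, quitting.')
    return xml.split('</mediawiki>')[0], config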

wertercatt avatar Nov 30 '16 18:11 wertercatt