Using dumpgenerator.py with Python 3

Open EndeavourAccuracy opened this issue 3 years ago • 29 comments

Python is currently at version 3.8.6 and your code requires 2.7.

As of 23 April, Ubuntu (Focal Fossa; 20.04) repos no longer carry kitchen for Python 2; only python3-kitchen. Similarly, as of 27 June, Mint (Ulyana; 20) can no longer access this because it relies on Ubuntu repos.

As a result, modern distros can no longer use dumpgenerator.py.

I'm not a Python programmer. Nevertheless, I've tried converting dumpgenerator.py from Python 2 to Python 3. This attempt was unsuccessful.

I've:

  • replaced print "" and print '' with print ("")
  • replaced ur'' with r'' (This is for Python 3. If this needs to work with both 2 and 3, we'd apparently have to use u'' and escape any backslashes in the strings.)
  • replaced cPickle with pickle, and cookielib with http.cookiejar

But then I ran into this error, and I could not continue: "RecursionError: maximum recursion depth exceeded while calling a Python object"
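For anyone following along, the Python 3 spellings of the replacements listed above look roughly like this. It is only an illustrative sketch: the regular expression, the settings dictionary and the choice of MozillaCookieJar are assumptions made for the example, not lines taken from dumpgenerator.py, and it does not address the RecursionError.

```python
import pickle                                  # Python 2: import cPickle
import re
from http.cookiejar import MozillaCookieJar    # Python 2: from cookielib import ...

print("Checking API...")                       # Python 2: print "Checking API..."

# Python 2 used ur'...' for a raw Unicode literal. In Python 3 every str is
# Unicode, so a plain raw string is enough.
title_pattern = re.compile(r"<title>([^<]+)</title>")
page_title = title_pattern.search("<title>Main Page</title>").group(1)

# pickle produces and consumes bytes in Python 3.
settings = {"api": "https://example.org/api.php", "delay": 0.5}  # made-up values
restored = pickle.loads(pickle.dumps(settings))

cookie_jar = MozillaCookieJar()                # empty jar, nothing loaded yet
```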

Also, I have my own (C and PHP/JavaScript) FOSS programming projects to work on.

Can you folks work on making a version of dumpgenerator.py that works with Python 3?

EndeavourAccuracy avatar Sep 29 '20 15:09 EndeavourAccuracy

EndeavourAccuracy, 29/09/20 18:30:

can no longer access this because it relies on Ubuntu repos.

I'm not sure what you mean. Does Mint no longer carry pip?

nemobis avatar Sep 29 '20 16:09 nemobis

For efficiently working with legacy versions of Python, it is recommended to use venv or (my personal preference) miniconda. Miniconda creates 'environments' which can contain any version of (for instance) python without affecting system python.

You can download miniconda:

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

And install it:

$ bash Miniconda3-latest-Linux-x86_64.sh

After restarting your shell window or SSH session, you can create a new conda environment 'wiki':

$ conda create -n wiki

Enter it by activating the environment:

$ conda activate wiki

Now you can install your special snowflake version of Python:

$ conda install python=2.7

This python version 2.7 is only accessible from within the conda environment 'wiki' and does not affect the system files. The binaries are stored in your home directory, and none of this requires root access as you are not changing system files.
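To confirm which interpreter is actually active once the environment is entered, a small check like the following can help. It is a sketch written to run on both Python 2.7 and Python 3; the function name describe_interpreter is made up for this example, and CONDA_DEFAULT_ENV is the variable conda sets to the name of the active environment.

```python
import os
import sys

def describe_interpreter():
    """Return basic facts about the Python interpreter running this code."""
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        # Set by conda to the active environment's name; None outside conda.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }

info = describe_interpreter()
print(info)
```

Inside the 'wiki' environment, the version should start with 2.7 and the executable should point into your home directory rather than a system path.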

OAHOR avatar Nov 08 '20 13:11 OAHOR

I was caught by this because Mint 20 does not have python2 or pip by default. It seems overkill to install them both just to get dumpgenerator.py working. Why can't it be upgraded to a modern version?

jrbray1 avatar Jan 01 '21 12:01 jrbray1

@EndeavourAccuracy do you have your modifications available somewhere to work against? So we could try to fix the errors you were running into?

tiefpunkt avatar Jan 09 '21 10:01 tiefpunkt

@EndeavourAccuracy do you have your modifications available somewhere to work against? So we could try to fix the errors you were running into?

Not any more, no. Part of my reasoning not to keep it was that I'm not a Python programmer, and I might therefore have accidentally introduced - perhaps hard to spot - errors in the code.

EndeavourAccuracy avatar Jan 09 '21 11:01 EndeavourAccuracy

I tried following the instructions in the main README on both macOS 11 and Debian 11, and in both cases it gave me the following:

$ ./dumpgenerator.py --help
Please install the kitchen module.
Please install or update the Requests module.

...even after running $ pip install --upgrade -r requirements.txt. I assume this is because Python2 is basically EOL, and it's increasingly difficult to set up a working Python2 environment.
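One possible cause (an assumption, not something confirmed in this thread) is that pip installed the packages for a different interpreter than the one that runs the script. A minimal sketch to check this, which should be run with the same interpreter as dumpgenerator.py, could look like the following. The helper name find_missing is hypothetical; the code works on both Python 2.7 and Python 3.

```python
import importlib
import sys

def find_missing(module_names):
    """Return the names that the running interpreter cannot import."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# "kitchen" and "requests" are the two modules named in the messages above.
print("Interpreter:", sys.executable)
print("Missing modules:", find_missing(["kitchen", "requests"]))
```

If the interpreter path printed here is not the one pip reports with pip --version, the packages went to the wrong place.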

Anyway, I'm muddling my way through @OAHOR's instructions for miniconda, and if it's the most reliable way of running dumpgenerator.py these days, then maybe it should be added to the instructions in the main README?

elsiehupp avatar May 27 '21 08:05 elsiehupp

I hope this ticket won't be closed referencing a work-around to get Python 2 working on modern systems. (Since it still won't allow using dumpgenerator.py with Python 3.)

EndeavourAccuracy avatar Jun 09 '21 15:06 EndeavourAccuracy

Here's a basic manual for creating backups without shell access, and without using dumpgenerator.py. This is for users who do have phpMyAdmin and FTP access, and want to distribute a backup without sensitive data. Use at your own risk.

Creating a MediaWiki Backup Without Sensitive Data, Using phpMyAdmin and FTP
Version 0.1 (June 9, 2021). Public domain.

This backup method does NOT require:
- shell access, or
- a Python 2 environment (e.g. for dumpgenerator.py).

USE THIS MANUAL AT YOUR OWN RISK

--------------------
[1/2] Database
--------------------
1. Launch phpMyAdmin.
2. On the left, click your MediaWiki database.
3. On the right, click tab "Export".
4. Select export method "Custom".
5. Optionally, unselect table "archive", which contains deleted edits. (How? Click "Select All", then Ctrl+click on "archive".)
6. Verify that section "Format-specific options" ends with "structure and data" selected.
7. Verify that section "Data dump options" uses "both of the above" as insert syntax.
8. Press the "Go" button, which will download the .sql file.
9. Remove private information from the .sql file:

Note: What you actually remove is your own decision. Below are suggestions.

Search: CREATE TABLE IF NOT EXISTS `user`
Remove: everything under "Dumping data for table `user`".
(That data could reveal the user_real_name, user_email, user_password, and user_newpassword. See, for example, "SELECT CONVERT(user_email USING utf8) FROM `user`;".)

Search: CREATE TABLE IF NOT EXISTS `watchlist`
Remove: everything under "Dumping data for table `watchlist`".
(That data could reveal which pages are watched/unwatched.)

Search: CREATE TABLE IF NOT EXISTS `recentchanges`
Remove: everything under "Dumping data for table `recentchanges`".
(That data could reveal rc_ip for each change.)

10. Done.

--------------------
[2/2] File system
--------------------
1. Download all files via FTP.
2. Remove private information:

Note: What you actually remove is your own decision. Below are suggestions.

Modify or delete LocalSettings.php.

Maybe delete directory images/archive/.

Maybe delete directory images/deleted/.

Maybe delete directory images/temp/.

Maybe delete cache/.

3. Done.
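Step 9 of the database part could be scripted instead of edited by hand. The following is a rough sketch under stated assumptions: it assumes the dump contains phpMyAdmin-style comment lines such as "-- Table structure for table `user`" and "-- Dumping data for table `user`", and that the tables have no prefix. The names strip_private_rows and PRIVATE_TABLES are made up for this example. Check the result by hand before distributing it.

```python
import re

# Tables suggested in the manual above; adjust to your own decision.
PRIVATE_TABLES = {"user", "watchlist", "recentchanges"}

# Assumed phpMyAdmin section headers, which are SQL comments.
SECTION_HEADER = re.compile(
    r"^-- (Table structure|Dumping data) for table `([^`]+)`"
)

def strip_private_rows(sql_lines, private_tables=PRIVATE_TABLES):
    """Yield dump lines, leaving out the data sections of private tables."""
    skipping = False
    for line in sql_lines:
        header = SECTION_HEADER.match(line)
        if header:
            kind, table = header.groups()
            # Skip from a private table's data header up to the next header.
            skipping = kind == "Dumping data" and table in private_tables
        if not skipping:
            yield line

sample = [
    "-- Table structure for table `user`\n",
    "CREATE TABLE IF NOT EXISTS `user` (id int);\n",
    "-- Dumping data for table `user`\n",
    "INSERT INTO `user` VALUES (1);\n",
    "-- Table structure for table `page`\n",
    "CREATE TABLE IF NOT EXISTS `page` (id int);\n",
    "-- Dumping data for table `page`\n",
    "INSERT INTO `page` VALUES (1);\n",
]
cleaned = list(strip_private_rows(sample))
```

For a real file, pass an open file object (it iterates line by line) and write the yielded lines to a new .sql file. The table definitions are kept, only the rows are dropped.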

EndeavourAccuracy avatar Jun 09 '21 15:06 EndeavourAccuracy

I just made it so that my existing pull request doesn’t auto-close this issue. I’m working on a Python 3 version right now.

elsiehupp avatar Jun 09 '21 17:06 elsiehupp

Can all y’all give https://github.com/WikiTeam/wikiteam/pull/409 a spin? Thanks!

elsiehupp avatar Jun 09 '21 20:06 elsiehupp

Can all y’all give #409 a spin? Thanks!

Personally, I've just moved to another method. I also lack the time to test-run the updated script, sorry. If this would've come just a bit earlier, I might have made different choices. I've been a bit surprised that so few users have made themselves heard here, even though this ticket has been open since September 2020. I'm guessing most MediaWiki admins have, and use, shell access for backups.

EndeavourAccuracy avatar Jun 09 '21 22:06 EndeavourAccuracy

That’s fine. I’ve since figured out how to run the CI tests locally, so I can do most of my testing myself.

elsiehupp avatar Jun 09 '21 22:06 elsiehupp

I tried your Python 3 version, unfortunately without luck. Trying to download the Vim Wikia, I get this error:

Can all y’all give #409 a spin? Thanks!

% python dumpgenerator.py https://vim.fandom.com/wiki/Vim_Tips_Wiki --xml --images
/home/oli/programs/wikiteam/dumpgenerator.py:1142: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if buffer[-1] is not '\n':
/home/oli/programs/wikiteam/dumpgenerator.py:1524: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if xmlfiledesc is not '' and not re.search(r'', xmlfiledesc):
Checking API... https://vim.fandom.com/api.php
API is OK: https://vim.fandom.com/api.php
Checking index.php... https://vim.fandom.com/index.php
index.php is OK

Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)

More info at: https://github.com/WikiTeam/wikiteam

Analysing https://vim.fandom.com/api.php
Traceback (most recent call last):
  File "/home/oli/programs/wikiteam/dumpgenerator.py", line 2555, in <module>
    main()
  File "/home/oli/programs/wikiteam/dumpgenerator.py", line 2542, in main
    saveConfig(config=config, configfilename=configfilename)
  File "/home/oli/programs/wikiteam/dumpgenerator.py", line 1594, in saveConfig
    pickle.dump(config, outfile)
TypeError: write() argument must be str, not bytes
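The final TypeError is what Python 3 raises when pickle.dump is handed a file opened in text mode, since pickle writes bytes. A minimal sketch of the binary-mode pattern follows. The names save_config and load_config and the demo values are made up for this example, and this is not necessarily how the port fixed saveConfig.

```python
import os
import pickle
import tempfile

def save_config(config, path):
    # pickle emits bytes in Python 3, so the file must be opened as "wb".
    with open(path, "wb") as outfile:
        pickle.dump(config, outfile)

def load_config(path):
    # Reading the pickle back needs binary mode as well.
    with open(path, "rb") as infile:
        return pickle.load(infile)

demo_path = os.path.join(tempfile.mkdtemp(), "config.txt")
demo_config = {"api": "https://vim.fandom.com/api.php", "xml": True, "images": True}
save_config(demo_config, demo_path)
loaded = load_config(demo_path)
```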

olinorwell avatar Jun 14 '21 21:06 olinorwell

Thanks. Could you pull the new changes and try it again? I didn’t finish the full dump myself, but it looks kind of like it should be working now. (As in, I think the problems I’m having at this point might be limited to the test script.)

FYI, to run, do the following:

$ pip install pipenv
$ pipenv run python dumpgenerator.py ...

Also, FYI, it’s easier to read if you wrap the output in three “tick” signs, like so:

```
[paste your output here]
```

elsiehupp avatar Jun 15 '21 13:06 elsiehupp

FYI, to run, do the following:

$ pip install pipenv
$ pipenv run python dumpgenerator.py ...

Even after successfully installing pipenv, when I try running the script (with pipenv run ...) I get asked once again to install pipenv. Am I missing something?

It looks like a dump was started and then immediately aborted, though. It created the folder and the config.txt file.

Sylphystia avatar Aug 18 '21 12:08 Sylphystia

It seems like the script is still relatively fussy. I was able to get this specific command to run on macOS:

% pipenv run python dumpgenerator.py https://vim.fandom.com/wiki/Vim_Tips_Wiki --xml --images

By contrast, the Wikiteam wiki wouldn’t download. For testing purposes, could you try exactly the same command with the Vim Wikia?

Even after successfully installing pipenv, when I try running the script (with pipenv run ...) I get asked once again to install pipenv. Am I missing something?

Can you try the instructions in the pipenv docs (a second time, if you’ve done so already); try running the above command; and then post exactly what the output is inside a pair of ``` (like the following) if it doesn’t work?

```
[paste your output here]
```

Also, please include what system you’re running, as well as the output of the commands $ which pipenv and $ python --version.

It looks like a dump was started and then immediately aborted, though. It created the folder and the config.txt file.

Weird. I mean, I’ve gotten failed dumps, too, but if pipenv itself is the problem, you shouldn’t be getting this far in the first place.

elsiehupp avatar Aug 22 '21 17:08 elsiehupp

Oh, and also the output of $ git status from inside the wikiteam directory.

elsiehupp avatar Aug 22 '21 17:08 elsiehupp

In a jail on FreeBSD 11.4-RELEASE-p9 amd64:

  • which pipenv -> /usr/local/bin/pipenv
  • python --version -> Python 3.8.12
  • git status -> On branch python3 / Your branch is up to date with 'origin/python3'

At first I had issues because I thought it was necessary to install poetry and wikiteams3 using pip, then the new dumpgenerator.py by itself. That didn't work out so well. It led to the error above; after adding exception logging around it, I got this:

Traceback (most recent call last):
  File "/home/wikifur/dumpgenerator.py", line 37, in <module>
    import mwclient
ModuleNotFoundError: No module named 'mwclient'

This was confusing because even when I installed mwclient via pip, it didn't work. I also had to use pipenv --python 3.8 run ./dumpgenerator.py to even get that far, possibly because I had 2.7 installed at the same time.


Once I actually cloned the whole repo and checked out python3, it worked more smoothly, except that it broke when saving files:

./dumpgenerator.py --xml --xmlrevisions https://furry.wiki.opencura.com
[...namespaces downloaded...]
Titles saved at... furrywikiopencuracom_w-20210924-titles.txt
253 page titles loaded
https://furry.wiki.opencura.com/w/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2839, in <module>
    main()
  File "./dumpgenerator.py", line 2830, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 2350, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other["session"])
  File "./dumpgenerator.py", line 822, in generateXMLDump
    xmlfile.write(header)
TypeError: a bytes-like object is required, not 'str'

My understanding of this is that "wb" worked when writing strings in 2.x but won't in 3.x because they're now Unicode. Instead, it has to be opened as "w" - anyway, I made the following change and it worked:

diff --git a/dumpgenerator.py b/dumpgenerator.py
index f68f190..663aad3 100755
--- a/dumpgenerator.py
+++ b/dumpgenerator.py
@@ -818,7 +818,7 @@ def generateXMLDump(config={}, titles=[], start=None, session=None):
             xmlfile = open("%s/%s" % (config["path"], xmlfilename), "a")
         else:
             print("Retrieving the XML for every page from the beginning")
-            xmlfile = open("%s/%s" % (config["path"], xmlfilename), "wb")
+            xmlfile = open("%s/%s" % (config["path"], xmlfilename), "w")
             xmlfile.write(header)
         try:
             r_timestamp = "<timestamp>([^<]+)</timestamp>"
@@ -2514,7 +2514,7 @@ def saveSpecialVersion(config={}, session=None):
         raw = r.text
         delay(config=config, session=session)
         raw = removeIP(raw=raw)
-        with open("%s/Special:Version.html" % (config["path"]), "wb") as outfile:
+        with open("%s/Special:Version.html" % (config["path"]), "w") as outfile:
             outfile.write(raw)


@@ -2529,7 +2529,7 @@ def saveIndexPHP(config={}, session=None):
         raw = r.text
         delay(config=config, session=session)
         raw = removeIP(raw=raw)
-        with open("%s/index.html" % (config["path"]), "wb") as outfile:
+        with open("%s/index.html" % (config["path"]), "w") as outfile:
             outfile.write(raw)


This appeared to fix --xml and --xml --xmlrevisions (FWIW, it's not immediately obvious that --xmlrevisions requires --xml). There may be other changes that need to be made, but I have not tested that (for example, "wb" is used for images, but maybe that is correct because they are bytes?).
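On the closing question: in Python 3, text mode ("w") accepts str and binary mode ("wb") accepts bytes, so "wb" is the right mode for image data. In the diff above, raw comes from r.text, which is a str, which is why "w" works there. A small sketch of the two cases follows. The helper names, file names and the explicit utf-8 encoding are assumptions for this example, not code from the port.

```python
import os
import tempfile

def write_text(path, text):
    # Text mode takes str and encodes it. Without an explicit encoding the
    # platform default is used, so utf-8 is spelled out here.
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(text)

def write_binary(path, payload):
    # Binary mode takes bytes and writes them unchanged, as for image data.
    with open(path, "wb") as outfile:
        outfile.write(payload)

workdir = tempfile.mkdtemp()
xml_path = os.path.join(workdir, "dump.xml")
image_path = os.path.join(workdir, "image.png")
write_text(xml_path, "<mediawiki>\u00e9</mediawiki>\n")
write_binary(image_path, b"\x89PNG\r\n\x1a\n")
```

Passing the wrong type to either helper reproduces the two TypeErrors reported in this thread.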

GreenReaper avatar Sep 24 '21 23:09 GreenReaper

Hi @GreenReaper it’s been a month or two since I last worked on this, so it may take me a little bit to catch up with what’s going on here. Thank you for the detailed information, though!

elsiehupp avatar Sep 28 '21 13:09 elsiehupp

Any news on this? I also get:

(.venv38) ixxx@devHost:~/workspace/wikitools3$ python dumpgenerator.py 
python: can't open file 'dumpgenerator.py': [Errno 2] No such file or directory
(.venv38) ixxx@devHost:~/workspace/wikitools3$ pip freeze
poster3==0.8.1
wikitools3==3.0.0

ImmoWetzel avatar Nov 15 '21 11:11 ImmoWetzel

Hi @ImmoWetzel—if you pop over to the pull request at https://github.com/WikiTeam/wikiteam/pull/409, the instructions for how to use the (still somewhat incomplete) Python 3 port are a bit more up-to-date there. I’ve added installation instructions at the top of the thread so you don’t have to read all the way through just to use the mostly working version of dumpgenerator.

Headline to grab people’s attention as necessary:

To use wikiteam3 visit https://github.com/WikiTeam/wikiteam/pull/409 and follow the instructions there.

elsiehupp avatar Nov 15 '21 14:11 elsiehupp

Can all y’all give #409 a spin? Thanks!

Personally, I've just moved to another method. I also lack the time to test-run the updated script, sorry. If this would've come just a bit earlier, I might have made different choices. I've been a bit surprised that so few users have made themselves heard here, even though this ticket has been open since September 2020. I'm guessing most MediaWiki admins have, and use, shell access for backups.

Makes very little sense, since you, in 2022, usually have shell access to a server if you have FTP or SQL access.

cooperdk avatar Jun 05 '22 22:06 cooperdk

https://github.com/WikiTeam/wikiteam/issues/433#issuecomment-1146897391

But #395 is two years old and, as it has not been fixed, it would not be illogical to renew it. The scripts should have been ported long before that report.

@OAHOR suggests using conda to install an environment, but it makes no sense because as I wrote, Python 2.7 is no longer safe to use.

I am contemplating whether or not to help @elsiehupp if time permits.

Up to you. We also have some slightly different approaches on which one could choose to base any further work: https://github.com/WikiTeam/wikiteam/pull/331 https://github.com/nemobis/wikiteam/tree/2to3

nemobis avatar Jun 05 '22 23:06 nemobis

@OAHOR It is still asking for the kitchen module?

Please install the kitchen module.
Please install or update the Requests module.

(wiki) C:\Users\karti>python --version

kwekewk avatar Jun 29 '22 08:06 kwekewk

@kwekewk I'm not exactly sure where you're running into problems, but I made a tidy version of the instructions for using miniconda for a pull request if you'd like to give them a try. (They're basically just @OAHOR's instructions, though.)

As an alternative, you can try the mostly functional Python 3 port I've been working on. There are other people helping me with the port, as well, so if you run into difficulties with it, you can feel free to open an Issue on that repository, and one or more of us can take a look.

elsiehupp avatar Jun 29 '22 11:06 elsiehupp

@kwekewk I'm not exactly sure where you're running into problems, but I made a tidy version of the instructions for using miniconda for a pull request if you'd like to give them a try. (They're basically just @OAHOR's instructions, though.)

As an alternative, you can try the mostly functional Python 3 port I've been working on. There are other people helping me with the port, as well, so if you run into difficulties with it, you can feel free to open an Issue on that repository, and one or more of us can take a look.

@elsiehupp Solved. Apparently I had to repeat the requirements command inside conda: pip install --user --upgrade -r requirements.txt. Also, why can the downloader only download 40-50 images per minute?

kwekewk avatar Jun 29 '22 15:06 kwekewk

Apparently I had to repeat the requirements command inside conda: pip install --user --upgrade -r requirements.txt. Also, why can the downloader only download 40-50 images per minute?

The delay functionality exists to help avoid getting temporarily blocked by a remote server for sending too many requests too quickly.

You should be able to specify the delay in seconds with a parameter. (You can get a list of available parameters with the --help parameter.) I vaguely remember finding that 0.5 seconds seemed to be just slow enough not to get blocked, but presumably it varies by server.
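To illustrate why a delay caps throughput, here is a generic sketch of pausing between requests. It is not the script's actual delay implementation; the function names and the numbers are hypothetical.

```python
import time

def throttled(items, delay_seconds, sleep=time.sleep):
    """Yield items one at a time, pausing delay_seconds between them."""
    for index, item in enumerate(items):
        if index:
            sleep(delay_seconds)   # no pause before the first item
        yield item

def max_items_per_minute(delay_seconds, seconds_per_request):
    """Upper bound on throughput when each item costs delay plus request time."""
    return 60.0 / (delay_seconds + seconds_per_request)

# Record the pauses instead of really sleeping, to keep the demo instant.
pauses = []
fetched = list(throttled(["a.png", "b.png", "c.png"], 0.5, sleep=pauses.append))
```

As a made-up example, a 1-second pause plus half a second per request would cap throughput at 40 items per minute, so a rate in the 40-50 range is plausible whenever the pause and the request time together come to a bit over a second.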

Obviously you shouldn't need the delay functionality if you're running the script locally, but if you're running the script locally you should also be able to initiate an export from within the MediaWiki admin interface itself.

elsiehupp avatar Jun 29 '22 16:06 elsiehupp

see also https://github.com/mediawiki-client-tools/mediawiki-scraper

via https://wiki.archiveteam.org/index.php?title=WikiTeam

milahu avatar Jun 30 '23 20:06 milahu

see also https://github.com/mediawiki-client-tools/mediawiki-scraper

via https://wiki.archiveteam.org/index.php?title=WikiTeam

https://github.com/WikiTeam/wikiteam/pull/409 😉

elsiehupp avatar Jul 02 '23 15:07 elsiehupp