Using dumpgenerator.py with Python 3
Python is currently at version 3.8.6 and your code requires 2.7.
As of 23 April, the Ubuntu 20.04 (Focal Fossa) repos no longer carry the kitchen module for Python 2, only python3-kitchen. Similarly, as of 27 June, Mint 20 (Ulyana) can no longer access it either, because Mint relies on the Ubuntu repos.
As a result, modern distros can no longer use dumpgenerator.py.
I'm not a Python programmer. Nevertheless, I've tried converting dumpgenerator.py from Python 2 to Python 3. This attempt was unsuccessful.
I've:
- replaced print "" and print '' with print("")
- replaced ur'' with r'' (this is for Python 3; if the code needs to work with both 2 and 3, we'd apparently have to use u'' and escape any backslashes in the strings)
- replaced cPickle with pickle, and cookielib with http.cookiejar

But then I ran into this error, and I could not continue: "RecursionError: maximum recursion depth exceeded while calling a Python object"
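For reference, here is a rough sketch of those substitutions written so it runs under both Python 2 and 3. This is illustrative only and is not taken from dumpgenerator.py itself:

```
# Sketch of the 2-to-3 substitutions described above (not from dumpgenerator.py).
from __future__ import print_function   # makes print("...") valid in Python 2 as well

try:
    import cPickle as pickle             # Python 2 name
    import cookielib as cookiejar        # Python 2 name
except ImportError:
    import pickle                        # Python 3: cPickle was folded into pickle
    import http.cookiejar as cookiejar   # Python 3 replacement for cookielib

# Python 2 allowed ur'...'; Python 3 does not. A plain raw string works in both,
# or use u'...' and escape the backslashes if 2/3 compatibility is needed.
pattern = r'<title>(.+?)</title>'

print("pickle module:", pickle.__name__, "| cookie jar module:", cookiejar.__name__)
```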
Also, I have my own (C and PHP/JavaScript) FOSS programming projects to work on.
Can you folks work on making a version of dumpgenerator.py that works with Python 3?
EndeavourAccuracy, 29/09/20 18:30:
can no longer access this because it relies on Ubuntu repos.
I'm not sure what you mean. Does Mint no longer carry pip?
For working efficiently with legacy versions of Python, it is recommended to use venv or (my personal preference) Miniconda. Miniconda creates 'environments' which can contain any version of, for instance, Python without affecting the system Python.
You can download miniconda:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
And install it:
bash Miniconda3-latest-Linux-x86_64.sh
After restarting your shell window or SSH session, you can create a new conda environment named 'wiki':
conda create -n wiki
Enter it by activating the environment:
conda activate wiki
Now you can install your special snowflake version of python:
conda install python=2.7
This Python 2.7 is only accessible from within the conda environment 'wiki' and does not affect system files. The binaries are stored in your home directory, and none of this requires root access, since you are not changing anything system-wide.
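Once the environment is active, you would presumably also install the script's dependencies with pip (assuming the repository's requirements.txt still lists what dumpgenerator.py needs):
pip install --upgrade -r requirements.txt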
I was caught out by this because Mint 20 does not have python2 or pip for Python 2 by default. It seems overkill to install them both just to get dumpgenerator.py working. Why can't it be upgraded to a modern version of Python?
@EndeavourAccuracy do you have your modifications available somewhere to work against? So we could try to fix the errors you were running into?
Not any more, no. Part of my reasoning not to keep it was that I'm not a Python programmer, and I might therefore have accidentally introduced - perhaps hard to spot - errors in the code.
I tried following the instructions in the main README on both macOS 11 and Debian 11, and in both cases it gave me the following:
$ ./dumpgenerator.py --help
Please install the kitchen module.
Please install or update the Requests module.
...even after running $ pip install --upgrade -r requirements.txt. I assume this is because Python 2 is basically EOL, and it's increasingly difficult to set up a working Python 2 environment.
Anyway, I'm muddling my way through @OAHOR's instructions for miniconda, and if it's the most reliable way of running dumpgenerator.py these days, then maybe it should be added to the instructions in the main README?
I hope this ticket won't be closed referencing a work-around to get Python 2 working on modern systems. (Since it still won't allow using dumpgenerator.py with Python 3.)
Here's a basic manual for creating backups without shell access, and without using dumpgenerator.py. This is for users who do have phpMyAdmin and FTP access, and want to distribute a backup without sensitive data. Use at your own risk.
Creating a MediaWiki Backup Without Sensitive Data, Using phpMyAdmin and FTP
Version 0.1 (June 9, 2021). Public domain.
This backup method does NOT require:
- shell access, or
- a Python 2 environment (e.g. for dumpgenerator.py).
USE THIS MANUAL AT YOUR OWN RISK
--------------------
[1/2] Database
--------------------
1. Launch phpMyAdmin.
2. On the left, click your MediaWiki database.
3. On the right, click tab "Export".
4. Select export method "Custom".
5. Optionally, unselect table "archive", which contains deleted edits. (How? Click "Select All", then Ctrl+click on "archive".)
6. Verify that section "Format-specific options" ends with "structure and data" selected.
7. Verify that section "Data dump options" uses "both of the above" as insert syntax.
8. Press the "Go" button, which will download the .sql file.
9. Remove private information from the .sql file:
Note: What you actually remove is your own decision. Below are suggestions.
Search: CREATE TABLE IF NOT EXISTS `user`
Remove: everything under "Dumping data for table `user`".
(That data could reveal the user_real_name, user_email, user_password, and user_newpassword. See, for example, "SELECT CONVERT(user_email USING utf8) FROM `user`;".)
Search: CREATE TABLE IF NOT EXISTS `watchlist`
Remove: everything under "Dumping data for table `watchlist`".
(That data could reveal which pages are watched/unwatched.)
Search: CREATE TABLE IF NOT EXISTS `recentchanges`
Remove: everything under "Dumping data for table `recentchanges`".
(That data could reveal rc_ip for each change.)
10. Done.
--------------------
[2/2] File system
--------------------
1. Download all files via FTP.
2. Remove private information:
Note: What you actually remove is your own decision. Below are suggestions.
Modify or delete LocalSettings.php.
Maybe delete directory images/archive/.
Maybe delete directory images/deleted/.
Maybe delete directory images/temp/.
Maybe delete cache/.
3. Done.
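As a rough illustration of step 2, here is a short Python sketch that removes the suggested directories from a local copy downloaded via FTP. The wiki-backup path is hypothetical, and this is only a convenience; LocalSettings.php still has to be edited or deleted by hand, since it contains at least the database credentials.

```
import shutil
from pathlib import Path

WIKI_ROOT = Path("./wiki-backup")  # hypothetical: wherever the FTP download was saved

# Directories suggested above that may contain deleted or temporary material
for relative in ("images/archive", "images/deleted", "images/temp", "cache"):
    target = WIKI_ROOT / relative
    if target.is_dir():
        shutil.rmtree(target)
        print("removed", target)
```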
I just made it so that my existing pull request doesn’t auto-close this issue. I’m working on a Python 3 version right now.
Can all y’all give https://github.com/WikiTeam/wikiteam/pull/409 a spin? Thanks!
Personally, I've just moved to another method. I also lack the time to test-run the updated script, sorry. If this had come just a bit earlier, I might have made different choices. I've been a bit surprised that so few users have made themselves heard here, even though this ticket has been open since September 2020. I'm guessing most MediaWiki admins have, and use, shell access for backups.
That’s fine. I’ve since figured out how to run the CI tests locally, so I can do most of my testing myself.
I tried your version for Python 3 without luck, unfortunately; trying to download the Vim Wikia, I get this error.
% python dumpgenerator.py https://vim.fandom.com/wiki/Vim_Tips_Wiki --xml --images
/home/oli/programs/wikiteam/dumpgenerator.py:1142: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if buffer[-1] is not '\n':
/home/oli/programs/wikiteam/dumpgenerator.py:1524: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if xmlfiledesc is not '' and not re.search(r'', xmlfiledesc):
Checking API... https://vim.fandom.com/api.php
API is OK: https://vim.fandom.com/api.php
Checking index.php... https://vim.fandom.com/index.php
index.php is OK
Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)
More info at: https://github.com/WikiTeam/wikiteam
Analysing https://vim.fandom.com/api.php
Traceback (most recent call last):
File "/home/oli/programs/wikiteam/dumpgenerator.py", line 2555, in
Thanks. Could you pull the new changes and try it again? I didn’t finish the full dump myself, but it looks kind of like it should be working now. (As in, I think the problems I’m having at this point might be limited to the test script.)
FYI, to run, do the following:
$ pip install pipenv
$ pipenv run python dumpgenerator.py ...
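If the environment hasn't been created yet, you may also need to run pipenv install from the repository root first (assuming the python3 branch ships a Pipfile), so that pipenv sets up the virtualenv and its dependencies before pipenv run can use them:
$ pipenv install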
Also, FYI, it’s easier to read if you wrap the output in three backticks, like so:
```
[paste your output here]
```
FYI, to run, do the following:
$ pip install pipenv
$ pipenv run python dumpgenerator.py ...
Even after successfully installing pipenv, when I try running the script (with pipenv run ...) I get asked once again to install pipenv. Am I missing something?
It looks like a dump was started and then immediately aborted, though. It created the folder and the config.txt file.
It seems like the script is still relatively fussy. I was able to get this specific command to run on macOS:
% pipenv run python dumpgenerator.py https://vim.fandom.com/wiki/Vim_Tips_Wiki --xml --images
By contrast, the Wikiteam wiki wouldn’t download. For testing purposes, could you try exactly the same command with the Vim Wikia?
Even after successfully installing pipenv, when I try running the script (with pipenv run ...) I get asked once again to install pipenv. Am I missing something?
Can you try the instructions in the pipenv docs (a second time, if you’ve done so already), then try running the above command, and if it doesn’t work, post exactly what the output is inside a pair of ``` (like the following)?
```
[paste your output here]
```
Also, please include what system you’re running, as well as the output of the commands $ which pipenv and $ python --version.
It looks like a dump was started and then immediately aborted, though. It created the folder and the config.txt file.
Weird. I mean, I’ve gotten failed dumps, too, but if pipenv itself is the problem, you shouldn’t be getting this far in the first place.
Oh, and also the output of $ git status from inside the wikiteam directory.
In a jail on FreeBSD 11.4-RELEASE-p9 amd64:
- which pipenv -> /usr/local/bin/pipenv
- python --version -> Python 3.8.12
- git status -> On branch python3 / Your branch is up to date with 'origin/python3'
At first I had issues because I thought it was necessary to install poetry and wikiteam3 using pip, and then run the new dumpgenerator.py by itself. That didn't work out so well. It led to the error above; I tried adding exception logging to it and got this:
Traceback (most recent call last):
File "/home/wikifur/dumpgenerator.py", line 37, in <module>
import mwclient
ModuleNotFoundError: No module named 'mwclient'
This was confusing, because even when I installed mwclient via pip, it didn't work. I also had to use pipenv --python 3.8 run ./dumpgenerator.py to even get that far, possibly because I had 2.7 installed at the same time.
Once I actually cloned the whole repo and checked out python3, it worked more smoothly, except that it broke when saving files:
./dumpgenerator.py --xml --xmlrevisions https://furry.wiki.opencura.com
[...namespaces downloaded...]
Titles saved at... furrywikiopencuracom_w-20210924-titles.txt
253 page titles loaded
https://furry.wiki.opencura.com/w/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Traceback (most recent call last):
File "./dumpgenerator.py", line 2839, in <module>
main()
File "./dumpgenerator.py", line 2830, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 2350, in createNewDump
generateXMLDump(config=config, titles=titles, session=other["session"])
File "./dumpgenerator.py", line 822, in generateXMLDump
xmlfile.write(header)
TypeError: a bytes-like object is required, not 'str'
My understanding of this is that "wb" worked when writing strings in 2.x but won't in 3.x, because strings are now Unicode text rather than bytes. Instead, the file has to be opened with "w". Anyway, I made the following change and it worked:
diff --git a/dumpgenerator.py b/dumpgenerator.py
index f68f190..663aad3 100755
--- a/dumpgenerator.py
+++ b/dumpgenerator.py
@@ -818,7 +818,7 @@ def generateXMLDump(config={}, titles=[], start=None, session=None):
xmlfile = open("%s/%s" % (config["path"], xmlfilename), "a")
else:
print("Retrieving the XML for every page from the beginning")
- xmlfile = open("%s/%s" % (config["path"], xmlfilename), "wb")
+ xmlfile = open("%s/%s" % (config["path"], xmlfilename), "w")
xmlfile.write(header)
try:
r_timestamp = "<timestamp>([^<]+)</timestamp>"
@@ -2514,7 +2514,7 @@ def saveSpecialVersion(config={}, session=None):
raw = r.text
delay(config=config, session=session)
raw = removeIP(raw=raw)
- with open("%s/Special:Version.html" % (config["path"]), "wb") as outfile:
+ with open("%s/Special:Version.html" % (config["path"]), "w") as outfile:
outfile.write(raw)
@@ -2529,7 +2529,7 @@ def saveIndexPHP(config={}, session=None):
raw = r.text
delay(config=config, session=session)
raw = removeIP(raw=raw)
- with open("%s/index.html" % (config["path"]), "wb") as outfile:
+ with open("%s/index.html" % (config["path"]), "w") as outfile:
outfile.write(raw)
This appeared to fix --xml and --xml --xmlrevisions (FWIW, it's not immediately obvious that --xmlrevisions requires --xml). There may be other changes that need to be made, but I have not tested that (for example, "wb" is used for images, but maybe that is correct because they are bytes?).
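To illustrate the str/bytes distinction, here is a minimal sketch (not code from dumpgenerator.py; the filenames and values are made up). With Requests, r.text is a str and r.content is bytes, and the open() mode has to match whichever one is being written:

```
# str (Unicode text), e.g. an XML header built from r.text
header = '<mediawiki xml:lang="en">\n'
with open("dump.xml", "w", encoding="utf-8") as xmlfile:
    xmlfile.write(header)        # text mode ("w") is required for str

# bytes, e.g. an image payload from r.content
image_bytes = b"\x89PNG\r\n\x1a\n"
with open("Example.png", "wb") as imgfile:
    imgfile.write(image_bytes)   # binary mode ("wb") is correct for bytes
```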
Hi @GreenReaper it’s been a month or two since I last worked on this, so it may take me a little bit to catch up with what’s going on here. Thank you for the detailed information, though!
Any news on it? I also get:
(.venv38) ixxx@devHost:~/workspace/wikitools3$ python dumpgenerator.py
python: can't open file 'dumpgenerator.py': [Errno 2] No such file or directory
(.venv38) ixxx@devHost:~/workspace/wikitools3$ pip freeze
poster3==0.8.1
wikitools3==3.0.0
Hi @ImmoWetzel—if you pop over to the pull request at https://github.com/WikiTeam/wikiteam/pull/409, the instructions for how to use the (still somewhat incomplete) Python 3 port are a bit more up-to-date there. I’ve added installation instructions at the top of the thread so you don’t have to read all the way through just to use the mostly working version of dumpgenerator.
Headline to grab people’s attention as necessary:
To use wikiteam3, visit https://github.com/WikiTeam/wikiteam/pull/409 and follow the instructions there.
Can all y’all give #409 a spin? Thanks!
Personally, I've just moved to another method. I also lack the time to test-run the updated script, sorry. If this had come just a bit earlier, I might have made different choices. I've been a bit surprised that so few users have made themselves heard here, even though this ticket has been open since September 2020. I'm guessing most MediaWiki admins have, and use, shell access for backups.
That makes very little sense, since in 2022 you usually have shell access to a server if you have FTP or SQL access.
https://github.com/WikiTeam/wikiteam/issues/433#issuecomment-1146897391
But #395 is two years old and, not having been fixed, it would not be illogical to reopen it. The scripts should have been ported long before that report.
@OAHOR suggests using conda to install an environment, but that makes no sense because, as I wrote, Python 2.7 is no longer safe to use.
I am contemplating whether or not to help @elsiehupp if time permits.
Up to you. We also have some slightly different approaches on which one could choose to base any further work: https://github.com/WikiTeam/wikiteam/pull/331 https://github.com/nemobis/wikiteam/tree/2to3
@OAHOR it's still asking for the kitchen module?
```
Please install the kitchen module.
Please install or update the Requests module.

(wiki) C:\Users\karti>python --version
```
@kwekewk I'm not exactly sure where you're running into problems, but I made a tidy version of the instructions for using miniconda for a pull request if you'd like to give them a try. (They're basically just @OAHOR's instructions, though.)
As an alternative, you can try the mostly functional Python 3 port I've been working on. There are other people helping me with the port, as well, so if you run into difficulties with it, you can feel free to open an Issue on that repository, and one or more of us can take a look.
@elsiehupp Solved; apparently I had to repeat the requirements command inside the conda environment: pip install --user --upgrade -r requirements.txt. Also, why can the downloader only download 40-50 images per minute?
Apparently I had to repeat the requirements command inside the conda environment: pip install --user --upgrade -r requirements.txt. And, why can the downloader only download 40-50 images per minute?
The delay functionality exists to help avoid getting temporarily blocked by a remote server for sending too many requests too quickly.
You should be able to specify the delay in seconds with a parameter. (You can get a list of available parameters with the --help parameter.) I vaguely remember finding that 0.5 seconds seemed to be just slow enough not to get blocked, but presumably it varies by server.
Obviously you shouldn't need the delay functionality if you're running the script locally, but if you're running the script locally you should also be able to initiate an export from within the MediaWiki admin interface itself.
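For example (assuming the Python 3 port keeps the upstream --delay parameter, which takes the number of seconds to wait between requests), a gentler image download might look like:
$ pipenv run python dumpgenerator.py https://vim.fandom.com/wiki/Vim_Tips_Wiki --xml --images --delay=0.5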
see also https://github.com/mediawiki-client-tools/mediawiki-scraper
via https://wiki.archiveteam.org/index.php?title=WikiTeam
https://github.com/WikiTeam/wikiteam/pull/409 😉