dumpgenerator.py does not properly name Fandom images
When I run python2 dumpgenerator.py --api=https://valve-cut-content.fandom.com/api.php --xml --images, the images folder is filled with misnamed images. This happens because Fandom image URLs look like this: https://vignette.wikia.nocookie.net/valve-cut-content/images/1/10/Ep2_outland_07_0405.png/revision/latest?cb=20170121045439 so the last path segment (latest?cb=...) ends up being used as the file name. Is there any way we could regex out the part after the actual File: name when writing the images? Example of the resulting images folder:
'latest?cb=20170120175610' 'latest?cb=20170120223427' 'latest?cb=20170121021057' 'latest?cb=20170121025324'
'latest?cb=20170120175610.desc' 'latest?cb=20170120223427.desc' 'latest?cb=20170121021057.desc' 'latest?cb=20170121025324.desc'
'latest?cb=20170120180010' 'latest?cb=20170120224629' 'latest?cb=20170121021507' 'latest?cb=20170121025433'
'latest?cb=20170120180010.desc' 'latest?cb=20170120224629.desc' 'latest?cb=20170121021507.desc' 'latest?cb=20170121025433.desc'
'latest?cb=20170120180059' 'latest?cb=20170120225823' 'latest?cb=20170121021604' 'latest?cb=20170121025450'
'latest?cb=20170120180059.desc' 'latest?cb=20170120225823.desc' 'latest?cb=20170121021604.desc' 'latest?cb=20170121025450.desc'
'latest?cb=20170120180125' 'latest?cb=20170120230920' 'latest?cb=20170121021619' 'latest?cb=20170121025529'
'latest?cb=20170120180125.desc' 'latest?cb=20170120230920.desc' 'latest?cb=20170121021619.desc' 'latest?cb=20170121025529.desc'
'latest?cb=20170120180136' 'latest?cb=20170121014405' 'latest?cb=20170121024249' 'latest?cb=20170121030023'
'latest?cb=20170120180136.desc' 'latest?cb=20170121014405.desc' 'latest?cb=20170121024249.desc' 'latest?cb=20170121030023.desc'
'latest?cb=20170120180314' 'latest?cb=20170121014613' 'latest?cb=20170121024304' 'latest?cb=20170121030049'
'latest?cb=20170120180314.desc' 'latest?cb=20170121014613.desc' 'latest?cb=20170121024304.desc' 'latest?cb=20170121030049.desc'
'latest?cb=20170120223006' 'latest?cb=20170121020850' 'latest?cb=20170121024323' 'latest?cb=20170121030549'
'latest?cb=20170120223006.desc' 'latest?cb=20170121020850.desc' 'latest?cb=20170121024323.desc'
Thanks for archiving Wikia wikis.
wertercatt, 19/03/19 06:33:
|Is there anyway we could regex out the part after the actual File: name when writing the images?|
We could, but then who is going to verify the effect on other wikis? These are valid characters for a title in MediaWiki: https://www.mediawiki.org/wiki/Manual:Page_title
A file could have title "File:A/revision/latest?foo=bar-thumb-800px.png" and there would be nothing strange about it.
Maybe it's better to introduce such a custom Wikia-specific regex with a local hack when archiving Wikia wikis.
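For illustration, a Wikia-only regex of the kind suggested above might look roughly like the sketch below. The helper name wikia_filename and the pattern are hypothetical and not part of dumpgenerator.py; the pattern assumes the URL layout shown in this report (host vignette.wikia.nocookie.net, a single wiki-name segment, then images/<h>/<hh>/<name>/revision/latest). Any URL that does not match returns None, so other wikis would keep the current naming.

```python
import re

try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2

# Assumed layout, taken from the single URL in this report:
#   https://vignette.wikia.nocookie.net/<wiki>/images/<h>/<hh>/<name>/revision/latest?cb=<ts>
VIGNETTE_RE = re.compile(
    r'^https?://vignette\.wikia\.nocookie\.net/'
    r'[^/]+/images/[0-9a-f]/[0-9a-f]{2}/'
    r'(?P<name>[^/]+)/revision/latest(?:\?.*)?$'
)


def wikia_filename(url):
    """Hypothetical helper: return the file name embedded in a vignette URL.

    Returns None when the URL does not match the assumed Wikia layout,
    so callers can fall back to their existing behaviour.
    """
    match = VIGNETTE_RE.match(url)
    if match is None:
        return None
    # The path segment may be percent-encoded; decode it for use on disk.
    return unquote(match.group('name'))


if __name__ == '__main__':
    print(wikia_filename(
        'https://vignette.wikia.nocookie.net/valve-cut-content/images/'
        '1/10/Ep2_outland_07_0405.png/revision/latest?cb=20170121045439'
    ))
```

Anchoring the pattern to the vignette host is what keeps a legitimate title such as "File:A/revision/latest?foo=bar-thumb-800px.png" on another wiki from being rewritten.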
Well, for Wikia images, we can just use the filename specified in the content-disposition header, which shouldn't negatively affect other wikis.
powerkitten@blaze-pc:~$ curl --head "https://vignette.wikia.nocookie.net/valve-cut-content/images/1/10/Ep2_outland_07_0405.png/revision/latest?cb=20170121045439"
HTTP/2 200
server: nginx
date: Tue, 19 Mar 2019 14:27:11 GMT
content-type: image/png
content-length: 485955
access-control-allow-origin: *
cache-control: public, max-age=31536000
content-disposition: inline; filename="Ep2_outland_07_0405.png"; filename*=UTF-8''Ep2_outland_07_0405.png
etag: ec6cdb86ea249a92a10e0ee13d5f16f0
surrogate-key: 0f0a01f752f58001f98f1b629b88c8736c02291c wiki-valve-cut-content thumblr original
x-thumbnailer: Thumblr
x-datacenter: SJC
x-cacheable: YES
age: 0
vary: Accept
x-cache: ORIGIN, MISS
timing-allow-origin: *
x-served-by: thumblr-676d848799-c2l2p, wk-cdn-r4
x-cache-hits: ORIGIN, 0
accept-ranges: bytes
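As a rough sketch of the content-disposition idea, the file name could be pulled out of a header value like the one in the curl output above. The function name filename_from_content_disposition is hypothetical and not part of dumpgenerator.py, and the parsing is deliberately minimal: it only handles the filename="..." and filename*=UTF-8''... forms seen here, not every variant a server might send.

```python
import re

try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2 (returns a byte string)

# Extended form, e.g. filename*=UTF-8''Ep2_outland_07_0405.png
FILENAME_STAR_RE = re.compile(r"filename\*\s*=\s*UTF-8''([^;]+)", re.IGNORECASE)
# Plain form, e.g. filename="Ep2_outland_07_0405.png"
FILENAME_RE = re.compile(r'filename\s*=\s*"?([^";]+)"?', re.IGNORECASE)


def filename_from_content_disposition(header):
    """Hypothetical helper: extract a file name from a Content-Disposition value.

    Prefers the percent-encoded filename* parameter, falls back to the
    plain filename parameter, and returns None if neither is present.
    """
    match = FILENAME_STAR_RE.search(header)
    if match:
        return unquote(match.group(1).strip())
    match = FILENAME_RE.search(header)
    if match:
        return match.group(1).strip()
    return None


if __name__ == '__main__':
    header = ('inline; filename="Ep2_outland_07_0405.png"; '
              "filename*=UTF-8''Ep2_outland_07_0405.png")
    print(filename_from_content_disposition(header))
```

A caller would still need to decide what to do when the function returns None, for example keeping the name it would have used otherwise.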
wertercatt, 19/03/19 16:28:
Well, for Wikia images, we can just use the filename specified in the content-disposition header,
That's going to be webserver-specific, no reason to think it would equal the MediaWiki title in general. Feel free to submit a patch for the Wikia domain only, though.
This should be fixed now.