
dumpgenerator.py does not properly name Fandom images

Open wertercatt opened this issue 6 years ago • 4 comments

When I run

python2 dumpgenerator.py --api=https://valve-cut-content.fandom.com/api.php --xml --images

the images folder is filled with misnamed images. This happens because Fandom image URLs look like this:

https://vignette.wikia.nocookie.net/valve-cut-content/images/1/10/Ep2_outland_07_0405.png/revision/latest?cb=20170121045439

Is there any way we could regex out the part after the actual File: name when writing the images? Example of the resulting file names:

'latest?cb=20170120175610'       'latest?cb=20170120223427'       'latest?cb=20170121021057'       'latest?cb=20170121025324'
'latest?cb=20170120175610.desc'  'latest?cb=20170120223427.desc'  'latest?cb=20170121021057.desc'  'latest?cb=20170121025324.desc'
'latest?cb=20170120180010'       'latest?cb=20170120224629'       'latest?cb=20170121021507'       'latest?cb=20170121025433'
'latest?cb=20170120180010.desc'  'latest?cb=20170120224629.desc'  'latest?cb=20170121021507.desc'  'latest?cb=20170121025433.desc'
'latest?cb=20170120180059'       'latest?cb=20170120225823'       'latest?cb=20170121021604'       'latest?cb=20170121025450'
'latest?cb=20170120180059.desc'  'latest?cb=20170120225823.desc'  'latest?cb=20170121021604.desc'  'latest?cb=20170121025450.desc'
'latest?cb=20170120180125'       'latest?cb=20170120230920'       'latest?cb=20170121021619'       'latest?cb=20170121025529'
'latest?cb=20170120180125.desc'  'latest?cb=20170120230920.desc'  'latest?cb=20170121021619.desc'  'latest?cb=20170121025529.desc'
'latest?cb=20170120180136'       'latest?cb=20170121014405'       'latest?cb=20170121024249'       'latest?cb=20170121030023'
'latest?cb=20170120180136.desc'  'latest?cb=20170121014405.desc'  'latest?cb=20170121024249.desc'  'latest?cb=20170121030023.desc'
'latest?cb=20170120180314'       'latest?cb=20170121014613'       'latest?cb=20170121024304'       'latest?cb=20170121030049'
'latest?cb=20170120180314.desc'  'latest?cb=20170121014613.desc'  'latest?cb=20170121024304.desc'  'latest?cb=20170121030049.desc'
'latest?cb=20170120223006'       'latest?cb=20170121020850'       'latest?cb=20170121024323'       'latest?cb=20170121030549'
'latest?cb=20170120223006.desc'  'latest?cb=20170121020850.desc'  'latest?cb=20170121024323.desc'

wertercatt avatar Mar 19 '19 04:03 wertercatt

Thanks for archiving Wikia wikis.

wertercatt, 19/03/19 06:33:

|Is there any way we could regex out the part after the actual File: name when writing the images?|

We could, but then who is going to verify what the effect is on other wikis? These are valid characters for a title in MediaWiki: https://www.mediawiki.org/wiki/Manual:Page_title

A file could have title "File:A/revision/latest?foo=bar-thumb-800px.png" and there would be nothing strange about it.

Maybe it's better to introduce such a custom Wikia-specific regex with a local hack when archiving Wikia wikis.
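A minimal sketch of what such a Wikia-only hack could look like, written for Python 3 rather than the python2 used in the report. The function name and the host list are hypothetical and are not taken from dumpgenerator.py; only the vignette host appears in this thread. The sketch also assumes the file name itself contains no unencoded slash, so it is not a general answer to the title problem described above.

```python
import re
from urllib.parse import unquote, urlsplit

# Hypothetical allow-list: only this host is mentioned in the thread.
WIKIA_IMAGE_HOSTS = ("vignette.wikia.nocookie.net",)

# Trailing path component that Wikia appends after the real file name.
WIKIA_REVISION_SUFFIX = re.compile(r"/revision/latest$")


def wikia_image_filename(url):
    """Return the file name embedded in a Wikia image URL.

    Returns None for any other host, so the caller can keep its
    existing naming logic for non-Wikia wikis.
    """
    parts = urlsplit(url)
    if parts.hostname not in WIKIA_IMAGE_HOSTS:
        return None
    # urlsplit has already moved "?cb=..." into parts.query,
    # so only the "/revision/latest" suffix is left to strip.
    path = WIKIA_REVISION_SUFFIX.sub("", parts.path)
    # The last remaining path segment is the (percent-encoded) file name.
    return unquote(path.rsplit("/", 1)[-1])
```

Because the function returns None for every other host, a file titled "File:A/revision/latest?foo=bar-thumb-800px.png" on a non-Wikia wiki would be left untouched.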

nemobis avatar Mar 19 '19 06:03 nemobis

Well, for Wikia images, we can just use the filename specified in the content-disposition header, which shouldn't negatively affect other wikis.

powerkitten@blaze-pc:~$ curl --head "https://vignette.wikia.nocookie.net/valve-cut-content/images/1/10/Ep2_outland_07_0405.png/revision/latest?cb=20170121045439"
HTTP/2 200 
server: nginx
date: Tue, 19 Mar 2019 14:27:11 GMT
content-type: image/png
content-length: 485955
access-control-allow-origin: *
cache-control: public, max-age=31536000
content-disposition: inline; filename="Ep2_outland_07_0405.png"; filename*=UTF-8''Ep2_outland_07_0405.png
etag: ec6cdb86ea249a92a10e0ee13d5f16f0
surrogate-key: 0f0a01f752f58001f98f1b629b88c8736c02291c wiki-valve-cut-content thumblr original
x-thumbnailer: Thumblr
x-datacenter: SJC
x-cacheable: YES
age: 0
vary: Accept
x-cache: ORIGIN, MISS
timing-allow-origin: *
x-served-by: thumblr-676d848799-c2l2p, wk-cdn-r4
x-cache-hits: ORIGIN, 0
accept-ranges: bytes
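
As a rough illustration of reading the name from that header, here is a small Python 3 sketch. The function name is hypothetical and this is not code from dumpgenerator.py; it only handles the two parameter forms visible in the response above (a quoted or bare filename, and a filename* value with a charset prefix), and it assumes the charset is one Python knows.

```python
import re
from urllib.parse import unquote

# filename*=charset'language'percent-encoded-value (extended form)
_EXTENDED = re.compile(r"filename\*\s*=\s*([^';\s]+)'[^']*'([^;\s]+)", re.I)
# filename="quoted value" or filename=bare-token (plain form)
_PLAIN = re.compile(r'filename\s*=\s*(?:"([^"]*)"|([^;\s]+))', re.I)


def filename_from_content_disposition(value):
    """Extract a file name from a Content-Disposition header value.

    The extended filename* form is preferred when present; returns
    None when the header carries no file name at all.
    """
    match = _EXTENDED.search(value)
    if match:
        charset, encoded = match.group(1), match.group(2)
        # Decode the percent-encoded bytes using the declared charset.
        return unquote(encoded, encoding=charset)
    match = _PLAIN.search(value)
    if match:
        quoted, bare = match.group(1), match.group(2)
        return quoted if quoted is not None else bare
    return None
```

The header value itself would come from whatever HTTP client the script already uses for downloading images; a caller would fall back to the existing naming whenever this returns None.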

wertercatt avatar Mar 19 '19 14:03 wertercatt

wertercatt, 19/03/19 16:28:

|Well, for Wikia images, we can just use the filename specified in the content-disposition header,|

That's going to be webserver-specific; there is no reason to think it would equal the MediaWiki title in general. Feel free to submit a patch for the Wikia domain only, though.

nemobis avatar Mar 19 '19 14:03 nemobis

This should be fixed now.

nemobis avatar Mar 07 '20 21:03 nemobis