zim-requests
New ZIM request: scp-wiki.wikidot.com
- Website URL: https://scp-wiki.wikidot.com/
- License: Creative Commons Share-Alike 3.0 License (Important note: contains a single image not released under CC, but the creator 'allow[s] the use of the image of "Untitled 2004" by the SCP Foundation and its fanbase for non-commercial purposes only.', more information here and here (near bottom)). When using zimit, it may be necessary to specify something like --exclude "SCP-173\\.jpg".
- Desired ZIM Title: SCP-Wiki
- Desired ZIM Description: The SCP-Wiki is a collaborative creative writing website about the fictional SCP Foundation
- Desired ZIM Icon –png (URL or attach one): https://scp-wiki.wdfiles.com/local--files/main/logo_white.png
- Language (ISO 639-3): eng
- Is this a MediaWiki?: no
As the majority of this website (the exception being the 'random article'-buttons, login functionality and search) does not seem to need any backend whatsoever, downloading it via zimit seems like a viable option. Nearly everything on the website is CC-SA, the only exception (?) being the image of SCP-173, but excluding that one should be easy when using zimit. I am not even sure if it even needs to be excluded.
+1 for this. I actually made a clone of this website using httrack about a year ago and it was an ORDEAL! Would much rather this be in a zim file for my kiwix server. On a side-note the image of 173 is going to get a redesign in the near future to avoid this issue.
downloading it via zimit seems like a viable option. Nearly everything on the website is CC-SA, the only exception (?) being the image of SCP-173, but excluding that one should be easy when using zimit
@IMayBeABitShy Have you tried using the limited version of zimit already? did it work?
I had a cursory look but cannot see whether this is mediawiki-based or not.
Following @Popolechien's suggestion, I've used youzim.it to create a limited ZIM of the site. It seems like the website works (obviously some stuff like the search doesn't, but zim files have their own search functionality anyway). I did, however, notice that a lot of junk javascript has been included (e.g. cookie confirmation, ...).
I suggest also excluding the following sites:
- ad-delivery.net
- ad.doubleclick.net
- btloader.com (not sure, couldn't find anything about this)
- consent.nit.ro
- s.nitropay.com
- stats.g.doubleclick.net
- tracker.nitropay.com
This list is probably incomplete, but these should be the most important ones on the main page.
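As a rough sketch of how the host list above could be turned into a single exclusion pattern: the helper names (`build_exclude_pattern`, `is_excluded`) are made up for illustration, and whether zimit's --exclude option applies to third-party resources rather than only to crawled pages is an assumption that would need checking against the zimit documentation.

```python
import re

# Hosts from the list above (the last doubleclick entry is read as ".net").
BLOCKED_HOSTS = [
    "ad-delivery.net",
    "ad.doubleclick.net",
    "btloader.com",
    "consent.nit.ro",
    "s.nitropay.com",
    "stats.g.doubleclick.net",
    "tracker.nitropay.com",
]


def build_exclude_pattern(hosts):
    """Return a regex string matching http(s) URLs whose host is in `hosts`."""
    escaped = "|".join(re.escape(host) for host in hosts)
    # The trailing group stops "btloader.com.example" from matching.
    return rf"^https?://({escaped})(/|:|$)"


EXCLUDE_PATTERN = build_exclude_pattern(BLOCKED_HOSTS)


def is_excluded(url):
    """True if `url` points at one of the blocked hosts."""
    return re.search(EXCLUDE_PATTERN, url) is not None
```

The pattern is written for Python's `re` module; it may need adjusting before being pasted into a recipe, since the crawler's regex flavour can differ.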
I had a cursory look but cannot see whether this is mediawiki-based or not.
I don't think it is. There is a wikidot -> mediawiki conversion tool, which also indicates that it's not a media wiki. Still, I only have superficial knowledge of wiki software, so I may be wrong.
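One quick heuristic for the MediaWiki question: MediaWiki sites normally emit a `<meta name="generator" content="MediaWiki ...">` tag in the page head. The sketch below checks an HTML string for that tag. The class and function names are invented for this example, and a missing tag is only a hint, not proof, since a site could strip it.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Records the content of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.generator is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "generator":
            self.generator = attr.get("content") or ""


def looks_like_mediawiki(html):
    """Heuristic: True if the page declares MediaWiki as its generator."""
    finder = GeneratorFinder()
    finder.feed(html)
    generator = finder.generator
    return generator is not None and generator.lower().startswith("mediawiki")
```

Feeding it the downloaded homepage HTML would give a first indication of whether MWoffliner or zimit is the right tool.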
@IMayBeABitShy Awesome, I've started a recipe. Let us see what happens.
I think this one failed. I checked the log a couple of times and zimit seemed to spend a lot of time parsing some background pages (workbench, I think they were called). The last time I checked, the job had finally been interrupted.
Looks like the favicon URL has changed. New URL: https://scp-wiki.wikidot.com/local--favicon/favicon.gif Also, the recipe log is flooded with these errors. I unfortunately am not familiar enough with zimit to know what this means.
[2023-07-02 17:23:23,192: WARNING] failed to load progress details: Expecting value: line 1 column 1 (char 0)
We can also omit the copyright concern with scp-173 image as this has been removed from the site to adhere to CC BY-SA license.
Another update to this request, the attempt on December 29, 2023 was successful! The resulting ZIM was usable, however, it looks like the depth needs to be increased by at least one.
https://farm.openzim.org/pipeline/6cc5755f-e0de-4a4c-a22f-fa9e43a0603f
Articles listed on the homepage are indexed, but the majority of articles are only linked from the series pages, which are one level too deep for the current crawl.
https://scp-wiki.wikidot.com/scp-series
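To illustrate why one extra level of depth matters, here is a toy breadth-first crawl over a hypothetical, heavily simplified link graph. The `LINKS` table is invented for the example and is not taken from the real site; it only mirrors the structure described above (homepage, then series page, then articles).

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/": ["/scp-series", "/scp-173"],
    "/scp-series": ["/scp-002", "/scp-2998"],
    "/scp-173": [],
    "/scp-002": [],
    "/scp-2998": ["/scp-2998/offset/1"],
    "/scp-2998/offset/1": [],
}


def crawl_depths(links, seed, max_depth):
    """Breadth-first crawl; returns {page: depth} for pages within max_depth."""
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if depths[page] >= max_depth:
            continue  # do not follow links out of pages at the depth limit
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```

In this toy graph, a depth limit of 1 reaches the series page but none of the articles listed on it; a limit of 2 reaches those articles, and the offset page needs 3.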
I noticed something very strange ...... all the offset pages are not being crawled correctly. Also, since the site uses Crom search, I think *.crom.avn.sh should be added to the exclusion list as well.
@Popolechien can you reopen this issue or update the recipe for this?
Just so everyone is on the same page the latest version available is at https://dev.library.kiwix.org/viewer#scp-wiki_en_all
As far as poking at the zimit recipe goes I'll defer to @benoit74
@lbrunkho @MCSeekeri I'm sorry but I don't get what your issues are.
Can you please provide a link to a page with a non-working link (and details about this non-working link, e.g. position on the screen, text, screenshot, ...) so that I can understand what you are speaking about?
SCP-2998: the "Next iteration" link at the bottom jumps to /offset/1. Zimit is not crawling it correctly, seemingly because the page returns 503.
{"timestamp":"2024-09-25T13:49:45.047Z","logLevel":"error","context":"general","message":"Page Crashed on Load","details":{"status":503,"page":"https://scp-wiki.wikidot.com/scp-2998/offset/1","workerid":0}}
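For anyone wanting to see how many pages are affected, a small sketch that scans crawler log lines of the shape shown above and collects the failing pages. It assumes one JSON object per line with the same field names as that sample (`logLevel`, `message`, `details.page`, `details.status`); the function name `failed_pages` is made up here, and non-JSON lines are simply skipped.

```python
import json


def failed_pages(log_lines, message="Page Crashed on Load"):
    """Return (page, status) tuples for error entries with the given message."""
    failures = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text log lines are ignored
        if not isinstance(entry, dict):
            continue
        if entry.get("logLevel") == "error" and entry.get("message") == message:
            details = entry.get("details") or {}
            failures.append((details.get("page"), details.get("status")))
    return failures


# The log line quoted above, used as sample input.
SAMPLE = (
    '{"timestamp":"2024-09-25T13:49:45.047Z","logLevel":"error",'
    '"context":"general","message":"Page Crashed on Load",'
    '"details":{"status":503,'
    '"page":"https://scp-wiki.wikidot.com/scp-2998/offset/1","workerid":0}}'
)
```

Running `failed_pages` over a full log would show whether only /offset/ pages are hit or whether the failures are spread more widely.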
There are also some issues that don't exist in the current zim file; I found them while crawling SCP-CN. SVG and MathJax: the crawled version doesn't render SVGs correctly and doesn't display math formulas correctly, which is probably due to Wikidot's weird front-end implementation, so both of these issues can be left alone for the time being.
If the page returns a 503, unfortunately there is nothing we can do ... But here the message says "Page Crashed on Load", so I suspect there is another issue. I will have a look when time is available to work on this ZIM request.
The strange thing is that the page doesn't actually return 503, the content is normal, I'm not sure why there is this output ......
I might be doing something wrong but I can't download the zim from https://farm.openzim.org/pipeline/ec07e544-a23f-4977-9a1f-64c8ee8cd174 in the files section because I get a 404 :( is there another one available?
@leberschnitzel Nothing wrong - the file was in dev and got deleted when we got the Hetzner Incident. I'll restart the recipe.
Edit: created a new recipe using MWoffliner 1.14.2 at https://farm.openzim.org/recipes/scp_wiki_en_all/edit
@benoit74 : should I delete the current zimit recipe already?
Am I missing something, or is this not a MediaWiki?
Damn, I got fooled by the wiki part and thought no further /o\
now I'm confused?
@leberschnitzel We use different tools depending on the target website (a Mediawiki-based website would use MWoffliner; others would need zimit). I assigned the wrong tool to the task, believing that SCP wiki was mediawiki-based (there are other wiki softwares, much like you can edit documents on Word, LibreOffice or Google Docs). Long story short, I've restarted the previous recipe and you can follow its progress at https://farm.openzim.org/pipeline/0b1f37ad-fd7d-4501-a787-1a5e40c54926
I'd say we'll find out if it worked sometime tonight / tomorrow morning.
I cancelled the task and disabled the recipe; all calls were ending in timeouts. Unfortunately, if the website is not reliable, it is hard to hope to be able to create a ZIM ...
I've been working on using zimit to scrape scp-wiki-cn, and in zimit version 2.1.3 it successfully generated working ZIM files. However, it seems to have failed for various reasons since then, which is quite strange...
I've used the recipe now to try and download a zim but I get the same problem: It runs well at first but after some time every request is a timeout :(
@leberschnitzel Wikidot is unreliable and times out regularly even under normal use, unfortunately. You'll need to tolerate occasional failures and retry. Note that the non-CC image for SCP-173 is gone now so that's no longer a concern. (PS, be careful not to flood on IRC; you got auto-kicked.)
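"Tolerate occasional failures and retry" could look roughly like the sketch below: a generic retry wrapper with exponential backoff. This is not how zimit or browsertrix-crawler handle retries internally; `fetch_with_retry` and its parameters are hypothetical, and `fetch` stands for whatever function actually downloads a page and raises `TimeoutError` on a timeout.

```python
import time


def fetch_with_retry(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on TimeoutError with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last TimeoutError if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except TimeoutError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The `sleep` parameter is injectable only so the backoff can be exercised without actually waiting; in real use the default is fine. Spacing out retries also avoids hammering a site that is already struggling.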
not just kicked, also banned... I didn't know it counts new lines in the same message as new message :(
I cancelled the task and disabled the recipe; all calls were ending in timeouts. Unfortunately, if the website is not reliable, it is hard to hope to be able to create a ZIM ...
I've been working on using zimit to scrape scp-wiki-cn, and in zimit version 2.1.3 it successfully generated working ZIM files. However, it seems to have failed for various reasons since then, which is quite strange...
You mentioned also that it worked for you in 2.1.4 here so I'm going back to that version and see if I can make something with this.
Creating the zim with 2.1.4 worked. It has the same problem with SCP-2998 that MCSeekeri mentioned, but it's already far better than anything I've tried with 3.0.1.
It's good to know that it's not that I'm using Zimit incorrectly, but that there is indeed something difficult to track down. @ikreymer Can you check the differences between the two versions? I suspect that there are some changes here that prevent browsertrix-crawler from crawling Wikidot pages properly.
I'm looking at the latest run from 10 days ago and it looks pretty good to me - but then I don't know what I should be looking for either. Can someone here please check it out and give feedback? https://dev.library.kiwix.org/#lang=eng&q=scp&category=
Ninja edit: fixed link
The much smaller size than my copy (2 vs 20 GB) makes me think that there's lots missing. If you go to "SCP by Series" on the left side and open any of the roman numerals, and in there open any SCP, it seems to open an outside link.