Coding for Tomorrow incomplete file
The Coding for Tomorrow recipe ran successfully, but the file in the dev library is incomplete: none of the internal links are clickable. https://farm.openzim.org/recipes/codingfortomorrow_de_all
https://dev.library.kiwix.org/viewer#codingfortomorrow_de_all_2023-08/A/coding-for-tomorrow.de/downloads/
Can you check please?
Clicking on a download link, you get an error message:
Sorry, the url https://coding-for-tomorrow.de/wp-content/uploads/2020/11/Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf is not found on this server
As you can see, this URL doesn't share the same prefix as the URL of the recipe (https://coding-for-tomorrow.de/wp-content is not within https://coding-for-tomorrow.de/downloads/). You need to change the scope to allow scraping such URLs.
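To make the prefix mismatch concrete, here is a minimal Python sketch of a prefix check. It is a simplified model and not the crawler's actual code: the helper names `prefix_scope` and `in_prefix_scope` are hypothetical, and it assumes the prefix is the seed URL cut after the last "/" of its path.

```python
from urllib.parse import urlsplit


def prefix_scope(seed_url: str) -> str:
    """Return the seed URL truncated after the last '/' of its path (assumed model)."""
    parts = urlsplit(seed_url)
    path = parts.path or "/"
    directory = path[: path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc}{directory}"


def in_prefix_scope(seed_url: str, candidate_url: str) -> bool:
    """A candidate is in scope only if it starts with the seed's prefix."""
    return candidate_url.startswith(prefix_scope(seed_url))


SEED = "https://coding-for-tomorrow.de/downloads/"
PDF = (
    "https://coding-for-tomorrow.de/wp-content/uploads/2020/11/"
    "Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf"
)

# The PDF lives under /wp-content/, not under /downloads/, so it is out of scope.
print(in_prefix_scope(SEED, PDF))
```

Under this model the PDF is rejected because its path starts with /wp-content/ rather than /downloads/, which matches the "not found" error reported above.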
@rgaudin I tried changing the scope to Any, page, and prefix, and the resulting file is still the same. Could you please let me know which value I should use for the scope parameter?
That's exactly why some documentation is needed. All those scopes have different effects.
You haven't tested Any and that's good. I'd strongly advise against it as it would crawl anything. prefix is the default and page is somewhat similar.
I advise you to try host (it will grab anything under coding-for-tomorrow.de) and see how that goes. I think custom is often appropriate, but it requires specifying includes and excludes, which is very tedious.
There's no documentation on those scopes; the code is at https://github.com/webrecorder/browsertrix-crawler/blob/165a9787af8a7dce6b0acb5f91e6803ef525fd5b/util/seeds.js#L75
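As a rough illustration of what a custom scope with includes and excludes amounts to, here is a small Python sketch. The regular expressions, the `?replytocom=` exclusion, and the function `in_custom_scope` are all hypothetical, and the rule that an exclude wins over an include is an assumption of this sketch, not something confirmed in this thread.

```python
import re

# Hypothetical include patterns: the downloads pages plus the uploads folder
# where the PDF from the error message is stored.
INCLUDE = [
    re.compile(r"^https?://coding-for-tomorrow\.de/downloads/"),
    re.compile(r"^https?://coding-for-tomorrow\.de/wp-content/uploads/"),
]

# Hypothetical exclude pattern, purely for illustration.
EXCLUDE = [re.compile(r"\?replytocom=")]


def in_custom_scope(url: str) -> bool:
    """Assumed semantics: excluded URLs are dropped, otherwise any include must match."""
    if any(pattern.search(url) for pattern in EXCLUDE):
        return False
    return any(pattern.search(url) for pattern in INCLUDE)


for url in (
    "https://coding-for-tomorrow.de/wp-content/uploads/2020/11/"
    "Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf",
    "https://coding-for-tomorrow.de/unterrichtsmaterial/",
    "https://coding-for-tomorrow.de/downloads/?replytocom=3",
):
    print(url, in_custom_scope(url))
```

Every folder that should be crawled needs its own pattern, which is why this approach gets tedious on a site whose assets are spread over many paths.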
I tried changing the scope. With host, the website was scraped, but the needed projects are missing. Can you check please? https://farm.openzim.org/recipes/codingfortomorrow_de_all
I disabled the recipe and marked the resulting file for deletion.
Now that the configured URL is https://coding-for-tomorrow.de, what did you expect from changing the scope from the default (prefix) to host?
- prefix scope will retrieve everything in the same directory, so everything in https://coding-for-tomorrow.de
- host scope will retrieve everything on the same host, so everything in https://coding-for-tomorrow.de
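The equivalence of the two scopes for a root URL can be sketched in Python as follows. This is a simplified model under stated assumptions, not the crawler's implementation: `scope_root` is a hypothetical helper, prefix is modeled as "seed URL up to the last '/' of its path", and host as "scheme and host followed by '/'".

```python
from urllib.parse import urlsplit


def scope_root(seed_url: str, scope: str) -> str:
    """Return the URL prefix that a candidate must start with (assumed model)."""
    parts = urlsplit(seed_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    if scope == "host":
        return origin + "/"
    if scope == "prefix":
        path = parts.path or "/"
        return origin + path[: path.rfind("/") + 1]
    raise ValueError(f"unsupported scope in this sketch: {scope}")


for seed in (
    "https://coding-for-tomorrow.de",
    "https://coding-for-tomorrow.de/downloads/",
):
    print(seed)
    print("  prefix:", scope_root(seed, "prefix"))
    print("  host:  ", scope_root(seed, "host"))
```

In this model the two scopes only differ when the seed URL points into a subfolder such as /downloads/; with the site root as seed they cover the same set of URLs.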
I don't get what you expected by making this change.
That being said, I analyzed a bit the issue:
- the new URL setting and the scope (prefix or host would have worked the same) made it possible to retrieve all content in the "downloads" subfolder (which was your initial problem and is now solved)
- you have a second issue (which has most probably always been present), which you mentioned last Thursday: all projects are missing, for instance on https://dev.library.kiwix.org/content/codingfortomorrow_de_all_2023-11/A/coding-for-tomorrow.de/unterrichtsmaterial/. When you click on "Projektideen anzeigen" you see the list of projects, but you cannot open any of them. This might be solved by a custom behavior, as suggested by the Browsertrix team in https://github.com/openzim/zimit/issues/247
- I also noticed a third issue on the same page: the icon disappears when you hover over "Projektideen anzeigen"; this is probably solvable by a custom behavior as well
- I also noticed a fourth issue: most (all?) YouTube videos embedded in projects are not accessible
- E.g. in https://dev.library.kiwix.org/content/codingfortomorrow_de_all_2023-11/A/coding-for-tomorrow.de/fake-news-erkennen/ the video is not accessible while it is accessible on https://coding-for-tomorrow.de/fake-news-erkennen/
- I'm not sure we can solve this
All that being said, as you can see, a developer would need to put significant effort into improving the scraping of this website, and I'm not even sure it would succeed (there is a significant chance that things like the YouTube videos will still not be available).
@Popolechien what are your views on this? Do you think this is worth the effort?
It's in German, not a core target audience. We can drop it I think.
The related issue is marked as upstream: https://github.com/openzim/zim-requests/issues/460
Let's keep this issue open. I doubt we will make any progress in the coming months due to lack of resources, but the ZIM request is legitimate, I've identified a potential solution, and we should fix this at some point. It is neither impossible nor an immense effort, just not a priority for now.