zim-requests
New request: shamela.ws المكتبة الشاملة
- Website URL: https://shamela.ws/
- License: Open free content
- Desired ZIM Title: المكتبة الشاملة
- Desired ZIM Description: مشروع يهدف لجمع ما يحتاجه طالب العلم من كتب وبحوث
- Desired ZIM Icon –png (URL or attach one):
- Language (ISO 639-3): ara
- Is this a MediaWiki?: no
Recipe created at https://farm.openzim.org/recipes/shamela.ws_ar_all and I'll update the library link once ready. I have already sent them an email to double-check whether any of the books in the library are under copyright.
We've received an answer from the team: all the books on the website are more than 100 years old. In 20 years of operation, they have received only 2 copyright claims on books, and they deleted those books immediately, as per their website policy.
Does that mean it won't get crawled?! I have requested a ZIM file related to this website here [#986]. I think it's public domain. Could this be made?
@hamoudak no no, this means we're good. You can follow the task on the link given above.
Thank you, it's a valuable website for reading and studying. Sorry, I was confused; so their website policy is for the content to be free.
After 3 days, crawler progress is 3% (100753 / 2859505). 2.8 million links to explore is way too much. I cancelled the task and disabled the recipe. We need to find another way of ZIMing this website, this is not feasible with zimit, at least as-is.
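To put the figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes the crawl rate stays constant and that the link total does not grow further, neither of which is guaranteed; the function name is purely illustrative.

```python
def estimated_total_days(done: int, total: int, elapsed_days: float) -> float:
    """Linear extrapolation of the total crawl duration (assumes a constant rate)."""
    if done <= 0:
        raise ValueError("done must be positive")
    return elapsed_days * total / done


# Figures reported in the comment above: 100753 of 2859505 links after 3 days.
progress = 100_753 / 2_859_505
total_days = estimated_total_days(done=100_753, total=2_859_505, elapsed_days=3)
print(f"progress: {progress:.1%}, projected total: {total_days:.0f} days")
```

Under these assumptions the full crawl would need on the order of 85 days, which is consistent with the decision to cancel the task.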
@benoit74 could my request [#986] be created? It's one of five archives related to this domain. Or do I have to wait for some reason?
Also, may I suggest that the library continues to be scraped with Zimit, but divided into 40 categories as it is on the website, or divided in another way on your side.
The idea of dividing the ZIM per category as on the website is a good one.
And looking a bit more into it, I don't get why we ended up with 2M links.
Anyway, I've started a first sub-recipe of category 34: https://farm.openzim.org/recipes/shamela.ws_ar_34
In this new recipe, ZIM name, title and description are very bad, this will have to be fixed, but at least let's see how it goes.
I can give you the names of these categories in Arabic; I know Arabic very well. This category is called [al-shir-wa-dawawinu], poetry diwans. In Arabic: الشعر ودواوينه
Why it ended up with millions of pages: because some books have many volumes, like dictionaries, literature, explanations and interpretations of the Quran. In addition, there is a category called Sunna books (category/6); most of its books have many links, because each page may hold just one or two lines of text from a big book, so it eventually ends up with many links.
> Why it ended up with millions of pages: because some books have many volumes, like dictionaries, literature, explanations and interpretations of the Quran. In addition, there is a category called Sunna books (category/6); most of its books have many links, because each page may hold just one or two lines of text from a big book, so it eventually ends up with many links.
OK, so what I did for [al-shir-wa-dawawinu] poetry diwans is not going to work for all categories. I basically asked to explore only links listed in the category page, so if I understand you well, it will explore only the books but not their volumes.
If I get you correctly, what we would like is to tell the scraper to:
- find all book URLs listed on the category page
- for all these URLs, explore the book URL and all its sub-URLs
- also explore all author URLs (these pages will probably contain some "external links", because some authors will probably have books in another category, which will hence not end up inside the ZIM).
For instance, for https://shamela.ws/category/4, we want the book https://shamela.ws/book/23622 but also https://shamela.ws/book/23622/1, https://shamela.ws/book/23622/2 and so on, and also https://shamela.ws/author/263 ; and so on for all other books of the category.
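The three rules above can be sketched as a URL scope filter. This is only an illustration of the intended scope, not how zimit is actually configured: `build_scope` is a hypothetical helper, and the ID lists are assumed to have been collected from the category page beforehand.

```python
import re


def build_scope(book_ids, author_ids):
    """Compile a regex accepting the listed books (with sub-URLs) and authors."""
    alternatives = []
    if book_ids:
        ids = "|".join(str(int(i)) for i in sorted(book_ids))
        # The book page itself, plus anything below it (/book/<id>/1, /2, ...).
        alternatives.append(rf"book/(?:{ids})(?:/.*)?")
    if author_ids:
        ids = "|".join(str(int(i)) for i in sorted(author_ids))
        # Author pages only, without following their links to other books.
        alternatives.append(rf"author/(?:{ids})")
    if not alternatives:
        raise ValueError("at least one book or author ID is required")
    return re.compile(r"^https://shamela\.ws/(?:" + "|".join(alternatives) + r")$")


# IDs taken from the example above (category 4).
scope = build_scope(book_ids=[23622], author_ids=[263])

for url in (
    "https://shamela.ws/book/23622",
    "https://shamela.ws/book/23622/2",
    "https://shamela.ws/author/263",
    "https://shamela.ws/book/99999",
):
    print(url, "in scope" if scope.match(url) else "out of scope")
```

Anchoring the pattern at both ends keeps a book such as `/book/236221` from being matched by the ID `23622`.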
Is this correct? Do we have other links / pages which would be needed in each ZIM per category?
All that being said, I don't know yet how to do it with zimit, but at least it is important to understand what we would like to achieve ^^
> I can give you the names of these categories in Arabic; I know Arabic very well. This category is called [al-shir-wa-dawawinu], poetry diwans. In Arabic: الشعر ودواوينه
Glad you can help on this, thanks a lot. Once we have a working plan, I will come back to you about what we need precisely.
First, it will only explore the links but not their sub-pages [the books themselves are volumes]. And you are absolutely right on all three points you gave with examples. I have made over 200 (highly important) books of this domain with youzimit. When I did a basic crawl, I got just the titles (the contents of the book), not the sub-pages. So I went to the custom scope and gave it the right parameters for the sub-page links, and then I got it to work.
- No, there are no other links or pages needed.
I think I've managed to build a pretty good ZIM of category 34. You can see a preview at https://dev.library.kiwix.org/#lang=&q=34 (this is not the final URL, and never guaranteed to work, this is just dev server).
I'm currently running again the recipe to update the icon (which is blank) and to update the CSS of HTML pages inside the ZIM (to hide useless things when offline). What do you think?
My main concern is that it took 7 hours, which is not that bad given the scraper had to explore 7966 links, which gives us an average of 3 secs per link, but this was for only 25 books. I don't know what this will mean for a huge category like category 6 with 1227 books and very huge ones like https://shamela.ws/book/13174.
For the new task which is currently processing, I've increased the number of parallel workers to 4, let's hope it will not trigger something bad on the upstream server.
I will also retrieve all main book pages to count the number of links per category and have an estimate of total time per ZIM.
Do you have any idea of how often we should update the ZIM? E.g. how often are they adding new books, or how many books are they adding per month / quarter / year?
Since it looks like we will finally have a plan, it is now time to ask for your help regarding ZIM metadata.
For every ZIM (and hence category for now), we will need:
- a `selection`: this is what will go into the ZIM name, which will be named `shamela.ws_ar_<selection>` (without the `<` and `>`). Since we are doing one ZIM per category, the `selection` should be more or less the category. It should be as short as possible, but also as expressive as possible. It can contain only alphanumeric characters and the dash. If I understand you well, I imagine that for category 34 it should be `34-al-shir-wa-dawawinu` (I'm not sure adding the category number will help ... it would help us to maintain the ZIM at least ^^)
- a `title`: this is the ZIM title, displayed in all readers. It is limited to 30 characters. It should help to identify which ZIM the user is going to open. It must be in ZIM native language, so Arabic, and user-friendly (i.e. it is not used by machines). If it was in English, I would for instance consider using `shamela.ws books: category 34`, since it is difficult to be more expressive in only 30 characters. I don't really like it to be honest, it is a bit ugly, but at the same time I don't see how to fit more in 30 chars. Maybe `shamela poetry diwans`? Not sure it will be possible for all categories, and it is less precise than the first alternative I proposed
- a `description`: this is the ZIM description, displayed in all readers. It is limited to 80 characters. It should help understand what is inside the ZIM, as a complement to the title. It must be in ZIM native language, so Arabic, and user-friendly (i.e. it is not used by machines). If it was in English and I understand you well, I would probably use something like `Books of shamela.ws collection, category 34 of poetry diwans` (maybe adding something about what these books are would be interesting: are they about art, religion, daily life, law, mechanics, technology, ... I haven't understood this so far)
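The three constraints above are easy to check mechanically before a recipe is launched. The following is a minimal sketch, not the validation the farm itself performs; `check_metadata` is a hypothetical helper, and `len()` counts Unicode code points, which is only an approximation if the real limits are counted differently.

```python
import re

SELECTION_RE = re.compile(r"[A-Za-z0-9-]+")  # alphanumeric characters and the dash
TITLE_MAX = 30
DESCRIPTION_MAX = 80


def check_metadata(selection: str, title: str, description: str) -> list[str]:
    """Return a list of human-readable problems (empty when everything fits)."""
    problems = []
    if not SELECTION_RE.fullmatch(selection):
        problems.append(f"selection {selection!r} has characters other than a-z, 0-9 and '-'")
    if len(title) > TITLE_MAX:
        problems.append(f"title is {len(title)} characters, limit is {TITLE_MAX}")
    if len(description) > DESCRIPTION_MAX:
        problems.append(f"description is {len(description)} characters, limit is {DESCRIPTION_MAX}")
    return problems


# The English placeholders suggested above for category 34.
problems = check_metadata(
    selection="34-al-shir-wa-dawawinu",
    title="shamela.ws books: category 34",
    description="Books of shamela.ws collection, category 34 of poetry diwans",
)
print(problems or "metadata fits the limits")
```

Returning a list of problems rather than raising on the first one makes it possible to review all 40 categories in a single pass.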
Could you propose something for category 34 first? Please do not hesitate to ask friends for feedback as well on these, it is hard work and often good ideas might come from interactions with others.
You actually made it very good and complete. I downloaded it from the farm before you posted this comment. Everything works as intended. As for the things you'll hide, I don't know much about that, but the ZIM file as it is shown on the website is good enough.
Having watched the website for a long time, I have noticed that they add only a few books, about 7 books in six weeks. Their uploading of new books is irregular; one time I saw just two new books after nearly two months. They don't add much, but they are working hard to add new ones.
Now I think it will be better if you leave the categories without numbers; their names would define them. I am going to work on the "metadata" tomorrow; I'll update you with anything I see. It's my pleasure to help you as much as I can.
Note: I find the first ZIM better than the updated one; the updated one is not as lively as the original, and I hope it can be kept as it was for offline use.
> Having watched the website for a long time, I have noticed that they add only a few books, about 7 books in six weeks. Their uploading of new books is irregular; one time I saw just two new books after nearly two months. They don't add much, but they are working hard to add new ones.
OK, thank you. Updating the ZIM once per semester is hence probably a bare minimum. But I would like to avoid creating multiple ZIMs in parallel to avoid overwhelming their server, so I doubt we can update once per quarter or it would mean mostly always having a ZIM update running.
> Now I think it will be better if you leave the categories without numbers; their names would define them. I am going to work on the "metadata" tomorrow; I'll update you with anything I see. It's my pleasure to help you as much as I can.
Thank you!
> Note: I find the first ZIM better than the updated one; the updated one is not as lively as the original, and I hope it can be kept as it was for offline use.
I'm not sure I get you here. Do you mean that the links I hid in the latest version made the ZIM worse? I'm very interested to have your feedback on this, we always considered that it is better to hide as many "broken" external links as possible, since they usually do not work in an offline scenario and we've considered so far that they will bring only frustration.
> I will also retrieve all main book pages to count the number of links per category and have an estimate of total time per ZIM.
On this estimate, as expected category 6 is the biggest one with approximately 1.1 million links. Given the speed of last task with 4 parallel workers, it should complete in about 10 days which is OK.
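The 10-day figure can be roughly reproduced from the numbers given earlier in the thread. This sketch assumes that the first category 34 run (7 hours for 7966 links) used a single worker and that 4 workers scale perfectly linearly; both are assumptions, since the actual speed of the 4-worker task is not stated here.

```python
SECONDS_PER_DAY = 86_400

# First category 34 run: 7 hours for 7966 links (assumed single worker).
seconds_per_link = 7 * 3600 / 7_966


def projected_days(links: int, workers: int, seconds_per_link: float) -> float:
    """Crawl duration in days, assuming perfect scaling across workers."""
    return links * seconds_per_link / workers / SECONDS_PER_DAY


# Category 6: approximately 1.1 million links, 4 parallel workers.
cat6_days = projected_days(links=1_100_000, workers=4, seconds_per_link=seconds_per_link)
print(f"{seconds_per_link:.2f} s per link, category 6: about {cat6_days:.1f} days")
```

In practice the upstream server may throttle parallel requests, so the real duration could well be longer than this idealized estimate.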
That's good news about making cat 6. And yes, I find the latest changes on cat 34 not suitable for reading books; the white theme makes it uncomfortable on the eyes for reading anything. I do know that you make these changes for a better offline experience (like for stackoverflow and others), but this domain is simple and needs no changes, especially to styles or themes. I would suggest that you keep both the original and the customized one, but in the latter keep the styles and themes; you can remove only the external links (if you must change something). Overall, the website has few external links, not enough to warrant changes; besides, it's Arabic, which is a different culture.
Sorry, I hadn't read carefully that you need the title and description in Arabic; I was talking to my family at the time. I have done most of the plan.
Edit: I am working on the descriptions.
> Yes, I find the latest changes on cat 34 not suitable for reading books; the white theme makes it uncomfortable on the eyes for reading anything.
I did not intentionally change anything regarding themes, I just removed external links. And I always had a white theme, with a toggle (which is still there) to enable dark theme. So I don't get your point. Can you share a screenshot of the previous ZIM and the new one, or small videos?
There's a simple issue (related to theme and links) in the old ZIM you created: when I am browsing, it shows the links and themes as they are online, then when I go forward or back while browsing a book, they disappear completely!
This is: shamela.ws_ar_34_2024-10, size 24.4 MB, page 80
Page 81 of the same file.
As for the theme you removed, it was the light blue one, and you're right, there is no difference between the two ZIMs. So it's the light blue; could you keep it, please :)
I'm sorry but I still don't get it. What do you want to keep? The light blue header on the top of the ZIM? Isn't this just a zone with a link to contribute to shamela.ws, which is not going to work offline? Do you mean that I should keep the blue zone and just remove the link instead? (I feel like this blue zone was useless if empty, but I can easily add it back if you feel like it is useful even if empty)
Yes, you got my point. It will relieve tired eyes a little when reading for a long time. You'll see a big difference.
Edit: there is coloured text; I think it will fit with it.
I am almost done with the descriptions. I have also translated all category names into English; it was my first time translating religious idioms, so it took a lot of time to be sure of the translated text, besides reading and searching related articles for context.
> Edit: there is coloured text; I think it will fit with it.
Do you mean the text in yellow leading to a contribute page? I can keep it as well, no strong opinion other than this link will be outside the ZIM, needing an online connection.
No, I didn't mean that; I meant the text within the book, which is coloured in many colours for study and memorization. So the light blue will fit with those coloured words. Remove that yellow.
I think I have finished the metadata file. You can have a look now and tell me what you think, or whether anything should be edited, etc.
Edit: this is an updated one; you can work on this:
There was a simple typo in "dawawin": in Arabic it is دواوين, not دوواوين. Is that OK?
> No, I didn't mean that; I meant the text within the book, which is coloured in many colours for study and memorization. So the light blue will fit with those coloured words. Remove that yellow.
OK, I just relaunched category 34 with that change (and fixed ZIM name, title and description). Note that recipe is now at https://farm.openzim.org/recipes/shamela.ws_ar_alshir-wa-dawawinu for consistency with ZIM name.
> I think I have finished the metadata file. You can have a look now and tell me what you think, or whether anything should be edited, etc.
Thanks a lot for this metadata file! It looks OK at first sight. I might have a few more questions as I dive into details category per category, but it is sufficient to get me started.
I think with the light blue it has become very good. Please work on the updated metadata above; I fixed some typos.
Do you copy book IDs manually for custom scraping, or the URLs included in a category? If so, I can help you with the categories where you need them.
Just one remark: when you go to an author page, you can't read the whole author biography; that happens in the ZIM file.
@Popolechien : ZIM for category 34 is ready to move to prod: https://dev.library.kiwix.org/#q=%D8%A7%D9%84%D9%85%D8%AC%D9%85%D9%88%D8%B9%D8%A9 ; can you please have a look before I publish it?
@benoit74 Not 100% sure, but I suspect there is a formatting issue with that black bar over here: https://dev.library.kiwix.org/viewer#shamela.ws_ar_alshir-wa-dawawinu_2024-10/shamela.ws/author/3009
The top link also redirects to shamela.ws, which is blocked as it is considered an external link - any chance we can have it redirect to the ZIM's home page?
@Popolechien that would be good, to be redirected to the home page; then you can go to the category from there. Before this ZIM goes to prod, one little thing: this sign ( ; ) for separating sentences is for English, not Arabic. In Arabic it must be ( ؛ ). I've edited them all here
So this ZIM's metadata is:
- the Arabic in English letters: alshir-wa-dawawinu
- title: دواوين الشعر؛ المجموعة رقم 34
- description: دواوين الشعر العربي في الجاهلية وصدر الإسلام، وبعض الشروحات عليها
Sorry to mention this, but now everything is complete from my side.
Edit: I have finished the category 4 URLs for the custom scope: category_4.txt
OK, sorry for the broken layout, I broke things with the last CSS change. I've fixed it and am running the recipe again to update the ZIM ATM
> The top link also redirects to shamela.ws, which is blocked as it is considered an external link - any chance we can have it redirect to the ZIM's home page?
Nope
> I've edited them all here
OK, I'm using your last file now.
> Edit: I have finished the category 4 URLs for the custom scope: category_4.txt
Thank you, but I have already built the list of all books per category with a script, no need to do that for other categories manually.
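The script itself is not shown in this thread. As an illustration only, collecting book IDs from a category page could look like the sketch below; the HTML sample is invented, and the real markup of shamela.ws category pages may differ.

```python
import re
from html.parser import HTMLParser

# Matches a link to a book's main page, absolute or relative, but not its sub-pages.
BOOK_HREF_RE = re.compile(r"^(?:https://shamela\.ws)?/book/(\d+)/?$")


class BookLinkCollector(HTMLParser):
    """Collect the numeric IDs of all book links found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.book_ids = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = BOOK_HREF_RE.match(href)
        if match:
            self.book_ids.add(int(match.group(1)))


def book_ids_from_html(html: str) -> list[int]:
    collector = BookLinkCollector()
    collector.feed(html)
    return sorted(collector.book_ids)


# Invented sample standing in for a downloaded category page.
sample = (
    '<a href="https://shamela.ws/book/23622">book</a>'
    '<a href="/book/23622/1">first page</a>'
    '<a href="/author/263">author</a>'
)
print(book_ids_from_html(sample))
```

Sub-page links such as `/book/23622/1` are deliberately ignored here, since only the list of books per category is needed at this stage.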
:) That's cool. I've checked the ZIM; everything is working fine on https://dev.library.kiwix.org/viewer#shamela.ws_ar_alshir-wa-dawawinu_2024-10 but on Kiwix JS PWA there is the same problem with the author biography. I think the first ZIM you created, before removing external links or editing the CSS, was working with no issues.
- It's working fine now on "kiwix js pwa". Thank you.