Special handling for known websites (WP, youtube, ted, etc)

Open Popolechien opened this issue 4 years ago • 7 comments

I see that almost every day (and certainly several times a week) people are running requests for Wikipedia, Wikibooks or even YouTube. Zimit should be able to either a) switch gears to run the corresponding scraper (YouTube), or b) directly offer the latest ZIM available (Wikipedia, Wikibooks).

Popolechien avatar Sep 28 '21 07:09 Popolechien

No, we discussed that a while back and apparently we did not create a ticket, but the idea was to have a list of known websites for which we refuse requests and display a message explaining where to find already existing ZIMs.

Switching scrapers is not practical for many reasons, mainly because we have no limit on those other scrapers.

rgaudin avatar Sep 28 '21 08:09 rgaudin

> display a message explaining where to find already existing ZIMs.

Sounds good to me and was the main point, but then the response message should identify the target and the corresponding ZIM (e.g. "here is the link to en.wikipedia.org's latest ZIM available" and not "go to download.kiwix.org/zim and figure it out").

Popolechien avatar Sep 28 '21 08:09 Popolechien

Ideally, yes. It can probably be implemented in two steps so that this gets a chance to be done.

At first, we can redirect to the Wiki where files are listed. Or maybe the library with new kiwix-serve is considered easy enough?

The first thing you can do is list the domains and where to point to. It's easy for those we have a category for. YouTube will require special treatment anyway as we don't have ready-made ZIMs for all. I see two options:

  • we keep it as it is, but add a message on request saying this is probably not what they want, and link both to the scraper and to the contact form to request a custom ZIM.
  • or we block the request and show a similar message
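
To make the "list the domains and where to point to" step concrete, here is a minimal sketch of such a list in Python. Everything in it is an assumption for illustration only: the `KnownSite` type, the `KNOWN_SITES` entries, the message wording and the `lookup_known_site` helper are hypothetical and not part of zimit. The `action` field simply encodes the two options above (keep the request but warn, or block it).

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class KnownSite:
    suffix: str   # hostname suffix to match, e.g. "wikipedia.org"
    action: str   # "block" or "warn" (the two options discussed above)
    message: str  # text shown to the requester


# Hypothetical entries; the real list and wording would need to be agreed on.
KNOWN_SITES = [
    KnownSite("wikipedia.org", "block",
              "Ready-made ZIMs already exist for this site, see library.kiwix.org"),
    KnownSite("wikibooks.org", "block",
              "Ready-made ZIMs already exist for this site, see library.kiwix.org"),
    KnownSite("youtube.com", "warn",
              "This is probably not what you want: use the dedicated scraper "
              "or the contact form to request a custom ZIM"),
]


def lookup_known_site(url: str) -> Optional[KnownSite]:
    """Return the matching KnownSite for a requested URL, or None."""
    host = (urlsplit(url).hostname or "").lower()
    for site in KNOWN_SITES:
        # Match the bare domain or any subdomain, but not look-alike hosts.
        if host == site.suffix or host.endswith("." + site.suffix):
            return site
    return None
```

A request handler could call `lookup_known_site(url)` before queuing a task and, depending on `action`, either refuse the request or attach the message to it.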

rgaudin avatar Sep 28 '21 08:09 rgaudin

> Or maybe the library with new kiwix-serve is considered easy enough?

This would have my preference by far, but when I look at domains, based on the past three months (and this doc) I think we can simply send them to wikipedia_en_all.zim

Popolechien avatar Sep 30 '21 13:09 Popolechien

We could have a ZIM metadata "source_url" and then allow library.kiwix.org to filter on it?

kelson42 avatar Sep 30 '21 13:09 kelson42

> We could have a ZIM metadata "source_url" and then allow library.kiwix.org to filter on it?

Yes, that's an interesting feature for which the default behavior might be tricky: how much matching do you want? Domain? Netloc? Path? Scheme? But yeah, that would be best for us.
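
To illustrate why the default is tricky, here is a rough Python sketch of the four matching levels, each stricter than the previous one. The function names and the level semantics are assumptions made for this example, not an existing library.kiwix.org feature. In particular, "domain" is approximated naively as the last two hostname labels, which is wrong for suffixes such as `co.uk`.

```python
from urllib.parse import urlsplit

LEVELS = ("domain", "netloc", "path", "scheme")


def naive_domain(host: str) -> str:
    """Last two hostname labels; an approximation, not a public-suffix lookup."""
    return ".".join(host.lower().split(".")[-2:])


def source_url_matches(source_url: str, requested_url: str,
                       level: str = "netloc") -> bool:
    """Compare a ZIM's (hypothetical) source_url with a requested URL."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    src, req = urlsplit(source_url), urlsplit(requested_url)
    if not src.hostname or not req.hostname:
        return False
    if level == "domain":
        return naive_domain(src.hostname) == naive_domain(req.hostname)
    # From here on the full network location must be identical.
    if src.netloc.lower() != req.netloc.lower():
        return False
    if level == "netloc":
        return True
    # "path": the requested path must sit under the source path.
    if not req.path.startswith(src.path):
        return False
    if level == "path":
        return True
    # "scheme": everything above, plus the same scheme.
    return src.scheme == req.scheme
```

With a source of `https://en.wikipedia.org/wiki/`, a request for `http://fr.wikipedia.org/wiki/Kiwix` matches at the "domain" level only, which shows how much the chosen default changes what users would be shown.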

rgaudin avatar Sep 30 '21 13:09 rgaudin

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Mar 02 '22 11:03 stale[bot]