Support Wikisource EPUB import

Open kelson42 opened this issue 6 years ago • 10 comments

... through OPDS feed https://tools.wmflabs.org/wsexport/wikisource-fr-good.atom

kelson42 avatar Nov 12 '19 13:11 kelson42
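
For context on the request above: an OPDS 1.x feed is an Atom document whose entries carry acquisition links, so an importer mostly needs to pick out the links whose type is EPUB. The following is only a minimal sketch under that assumption, not the scraper's actual code. The function name `epub_links` and the `SAMPLE_FEED` document are hypothetical, and the layout of the real Wikisource feed has not been checked here.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
EPUB_MIME = "application/epub+zip"
# OPDS acquisition relations all start with this prefix
# (for example ".../acquisition/open-access").
ACQUISITION_REL_PREFIX = "http://opds-spec.org/acquisition"

# Hypothetical feed used only for illustration; not copied from Wikisource.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example Book</title>
    <link rel="http://opds-spec.org/acquisition"
          type="application/epub+zip"
          href="https://example.org/books/example.epub"/>
    <link rel="alternate" type="text/html"
          href="https://example.org/books/example"/>
  </entry>
  <entry>
    <title>No EPUB Here</title>
    <link rel="alternate" type="text/html"
          href="https://example.org/books/other"/>
  </entry>
</feed>"""


def epub_links(feed_xml):
    """Return (title, href) pairs for entries exposing an EPUB acquisition link."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.iter(ATOM_NS + "entry"):
        title = entry.findtext(ATOM_NS + "title", default="").strip()
        for link in entry.findall(ATOM_NS + "link"):
            rel = link.get("rel", "")
            if rel.startswith(ACQUISITION_REL_PREFIX) and link.get("type") == EPUB_MIME:
                results.append((title, link.get("href")))
    return results


if __name__ == "__main__":
    for title, href in epub_links(SAMPLE_FEED):
        print(title, href)
```

For a real feed, the downloaded bytes could be passed to `epub_links` in place of `SAMPLE_FEED`; pagination through `rel="next"` links is left out of this sketch.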

@Tpt Maybe you can help here?

kelson42 avatar Nov 12 '19 13:11 kelson42

@eshellman If Project Gutenberg provided an OPDS stream as well, that would make things much easier and quicker to run.

kelson42 avatar Nov 12 '19 16:11 kelson42

Gutenberg's OPDS feed dates from the early days of OPDS and it shows - we'll probably jump to v2 instead of changing it. http://www.gutenberg.org/ebooks.opds/

eshellman avatar Nov 13 '19 04:11 eshellman

@rgaudin Hmmm... any reason you can remember why we didn't use it 5 years ago, at the time we created gutemberg2zim?

kelson42 avatar Nov 13 '19 16:11 kelson42

I don't recall. Did it exist back then? Source says:

DON'T USE THIS PAGE FOR SCRAPING.

Seriously. You'll only get your IP blocked.

Download https://www.gutenberg.org/feeds/catalog.rdf.bz2 instead,
which contains *all* Project Gutenberg metadata in one RDF/XML file.

This catalog file (272MiB) looks like a good base for metadata but it only contains IDs, not links to the contents. I think that's why we had to rsync stuff.

rgaudin avatar Nov 14 '19 14:11 rgaudin

@rgaudin Thank you very much for this quick but insightful analysis. @eshellman Any chance we can (1) use it for scraping (2) get the important information (links) within the OPDS stream?

kelson42 avatar Nov 14 '19 14:11 kelson42

The nasty note is an artifact of the templating system; it can be ignored. Is (2) referring to the RDF dump? Because every file should be listed there. Maybe not the easiest format, but I have scripts to do the conversion. Based on our conversation, I had assumed that adding this would be a relatively easy way to improve the scraper. I'll ask the students today if they want to tackle it, otherwise I'll put it on my own list.

eshellman avatar Nov 14 '19 15:11 eshellman
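
One way to settle whether the RDF dump lists file links is to stream through the compressed file and collect every `rdf:about` or `rdf:resource` value that looks like an EPUB URL. The sketch below assumes only that the dump is RDF/XML, as the quoted note says. The function name `file_urls`, the `.epub` marker, and the element names in `SAMPLE` are hypothetical, and the actual vocabulary of `catalog.rdf.bz2` is not assumed.

```python
import bz2
import io
import xml.etree.ElementTree as ET

RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Hypothetical miniature document; the element names are made up.
SAMPLE = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms#">
  <ex:book rdf:about="ebooks/1"/>
  <ex:file rdf:about="https://example.org/files/1.epub"/>
</rdf:RDF>"""


def file_urls(stream, markers=(".epub",)):
    """Yield rdf:about / rdf:resource values containing one of `markers`.

    `stream` is a binary file-like object. Elements are cleared after
    inspection so a large dump does not have to be held in memory.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        for attr in ("about", "resource"):
            value = elem.get(RDF_NS + attr)
            if value and any(marker in value for marker in markers):
                yield value
        elem.clear()


if __name__ == "__main__":
    # Round-trip the sample through bz2 to mimic reading a .bz2 dump.
    compressed = io.BytesIO(bz2.compress(SAMPLE))
    with bz2.open(compressed, "rb") as fh:
        for url in file_urls(fh):
            print(url)
```

Against the real dump, `bz2.open("catalog.rdf.bz2", "rb")` would replace the in-memory sample; an empty result would support the "IDs only" reading, while any hits would support the claim that every file is listed.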

@eshellman Everything is feasible, and probably easy. I am just trying to figure out what the best approach would be. I will move the discussion of simplifying Gutenberg scraping to another ticket (this ticket is primarily about Wikisource). If you have other sources of ebooks (which you have), it would be great if you could open one ticket per source and give a few details about these new sources, in particular the format of the catalog.

kelson42 avatar Nov 14 '19 15:11 kelson42

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Aug 22 '20 18:08 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar May 26 '23 16:05 stale[bot]