WebToEpub
Improvements to Default Parser
- [x] User can use CSS selectors to specify Title and Content elements.
- [ ] Allow user to specify multiple elements to include.
- [x] User can use CSS selectors to specify elements to remove.
- [ ] (Maybe) CSS selectors to specify where to get Chapter URLs from.
- [x] Needs to have a preview mode, so user can check if the CSS selectors work.
@typhoon71, @3dycosmo
If you're still around, could you give the latest experimental tab mode branch build a try? I've made changes to the default parser and I'd really like to get feedback. You could try it against http://www.ironteethserial.com/table-of-contents/
Thanks.
I'm around, but how am I supposed to test it? How is this feature supposed to work?
@typhoon71
My thought was, open http://www.ironteethserial.com/table-of-contents/, then open WebToEpub and see if you can figure out how to use it. If you can't then that tells me more work is required. The basic ideas are:
- The default parser now uses Cascading Style Sheets (CSS) selectors to specify the elements to fetch/remove.
- Default parser can now have options for fetching a chapter title, and removing unwanted element(s) as well as getting the wanted content.
- There's a button to allow you to see how the selectors work.
I'm thinking I probably need to provide a "help" button that goes to some sort of help page. But maybe you've got some better suggestions? Because people usually refuse to read the help.
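To make the three points above concrete, here is a rough sketch of how selector-driven extraction can work using the standard DOM calls `querySelector`, `querySelectorAll` and `remove`. It is an illustration only, not WebToEpub's actual code: the function name `extractChapter` and the option names `contentSelector`, `titleSelector` and `removeSelector` are made up for this example.

```javascript
// Illustrative sketch only; all names here are hypothetical, not WebToEpub's API.
// `doc` is any DOM Document (or element) for a fetched chapter page.
function extractChapter(doc, { contentSelector, titleSelector, removeSelector }) {
    // The content selector picks the one element that holds the chapter text.
    const content = doc.querySelector(contentSelector);
    if (content === null) {
        return null; // selector matched nothing, so there is nothing to keep
    }
    // The title selector is optional; fall back to null when it is absent
    // or matches nothing.
    const titleElement = titleSelector ? doc.querySelector(titleSelector) : null;
    const title = titleElement ? titleElement.textContent.trim() : null;
    // The remove selector may be a comma-separated list such as "script, nav".
    if (removeSelector) {
        for (const unwanted of content.querySelectorAll(removeSelector)) {
            unwanted.remove();
        }
    }
    return { title, content };
}
```

In a browser console this could be tried as `extractChapter(document, { contentSelector: "div.entry-content", titleSelector: "h1", removeSelector: "script, nav, footer" })`. Those selectors are placeholders; the right ones depend on the site's markup.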
Mmm, yes, it wasn't super evident to me, but now with your explanation I understand. Basically it's like a pre-processor where you can specify what to get/remove? A dynamically built site plugin?
I couldn't get the title bit to work, or more like understand what it needs; it seems to grab it anyway. Removing stuff like script, nav, ul, style, noscript, mc-field-group, footer, widget_text, comments, widget seemed to work as I'd expect. So one writes a comma-separated list of elements to remove and the junk will be cleaned out automatically. Nice. I suppose I can stop working on my Python script (used to remove script, style, noscript, nav tagged content).
Some help or examples would really... help.
Side note: Sigil tells me the xhtml is malformed, while calibre editor lists a bunch of errors, mostly unknown properties and unreferenced images.
@typhoon71
Thanks for the feedback. I'm working on the "unknown properties" issues.
@typhoon71 Here's my first attempt at writing help for the Default Parser, https://dteviot.github.io/Projects/webToEpub_DefaultParser.html Please tell me what you think of it. (When you have time.)
> I suppose I can stop working on my python script (used to remove script, style, noscript, nav tagged content).
You might like to have a look at this other project of mine. https://github.com/dteviot/EpubEditor
The help button is a good idea. The help is clear and complete, and the link to the Mozilla CSS selector page is helpful; I was just wondering why putting a complex tag didn't work.
Some minor things I noticed:
- I'd find it more "natural" for the buttons to be in test/finished/help order, because one should need the help button only once; after that it'll be "in the way".
- Since I'm talking about layout, I suggest putting the 5 buttons on the main interface on the same "line" if possible (cleaner UI).
- To show the DOM Inspector you can use F12 too.
Nice to see that the parser remembers the settings: is that on a per-domain or per-link basis? Is it easy to save or export? Thinking of sharing it between browsers or just a backup of stuff one doesn't want to redo from scratch.
The epub editor/cleaner is nice, it just appeared late! XD Anyway, it has issues with unrecognized CSS properties too: I ran an epub through it that has a couple of those and they were still in the output, and the calibre editor complained.
@typhoon71
Thank you.
> Nice to see that the parser remembers the settings: is that on a per-domain or per-link basis?
They're remembered "per hostname".
> Is it easy to save or export? Thinking of sharing it between browsers or just a backup of stuff one doesn't want to redo from scratch.
That's a good idea. I'll create a new incident for that.
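As a rough idea of what "per hostname" plus export could look like, here is a minimal sketch. It is not how WebToEpub actually stores its settings: the class `SelectorSettingsStore` and its methods are hypothetical, and the only real API used is the standard `URL` object, whose `hostname` property supplies the key.

```javascript
// Hypothetical sketch: selector settings keyed by hostname, with JSON
// export/import so they can be backed up or moved to another browser.
class SelectorSettingsStore {
    constructor(entries = []) {
        this.byHostname = new Map(entries);
    }
    // Every page on the same host shares one settings entry.
    save(pageUrl, settings) {
        this.byHostname.set(new URL(pageUrl).hostname, settings);
    }
    load(pageUrl) {
        return this.byHostname.get(new URL(pageUrl).hostname) ?? null;
    }
    // Serialise as an array of [hostname, settings] pairs.
    exportJson() {
        return JSON.stringify([...this.byHostname.entries()], null, 2);
    }
    static importJson(text) {
        return new SelectorSettingsStore(JSON.parse(text));
    }
}
```

Settings saved for `http://www.ironteethserial.com/table-of-contents/` would then be found again for any other page under `www.ironteethserial.com`, and the string from `exportJson()` could be pasted into `importJson()` elsewhere.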
> Anyway, it has issues with unrecognized CSS properties too: I ran an epub through it that has a couple of those and they were still in the output, and the calibre editor complained.
The "sanitize XHTML" option is supposed to fix that. However, the cleaner is still a work in progress. Any chance you can tell me where you got the test epub from?
I tested the epub editor with this: https://docs.google.com/document/d/1jclG56IcF6oSKmKOwBpHPJE6bGGRf8JGYgNRoNyWPSA/edit. File, save as, epub, and then use the generated epub.
More improvements:
- [x] Put box around the test output box (and put some space between it and the buttons)
- [x] Start with instructions in the test output box.
- [x] Remove (required) from Hostname
- [x] Put "URL of test chapter" after the content CSS selector, i.e. make it the 3rd input.
- [x] In Help, describe how to find selector for element
I was thinking: why not add the exclusion bit of this to the default parsers too? In particular, removing scripts and comments could be handy. Or is that stuff already taken care of in them?
The problem with removing scripts in the default parser is there's at least one site that puts the content into scripts. And I leave the comments in because WebToEpub generates comments (Mostly for tracking the origin of content.)
Which site is doing that? I'd like to check it.
I think it was a site holding "Sword Art Online". But I don't have the URL.
Alright, noob question: in the help page, the example uses a div with a class; how about an id? For example, for div id="chaptercontent" I put div.chaptercontent and it wasn't working. Perhaps add selection by body, div, class, id like the previous default parser.
Also how to include multiple elements to remove? Do we have to divide the elements with commas or something? I can't find it in the help page.
@diablo348
> how about id?
Here's the link for CSS selectors. https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors So, to answer your question: #toc will match the element that has the ID "toc".
> Also how to include multiple elements to remove?
Use commas. See https://developer.mozilla.org/en-US/docs/Learn/CSS/Introduction_to_CSS/Combinators_and_multiple_selectors
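To show why `div.chaptercontent` finds nothing for an element written as div id="chaptercontent", here is a toy matcher. It is a deliberately simplified model, not the browser's selector engine: it only understands an optional tag name followed by `#id` and `.class` parts, plus comma-separated lists. The names `matchesSimple` and `matchesList` and the plain-object element shape are invented for this example.

```javascript
// Simplified model of selector matching, NOT the browser's real engine.
// Supports only: optional tag name, then any number of #id / .class parts.
function matchesSimple(element, selector) {
    const parsed = /^([a-zA-Z][\w-]*)?((?:[#.][\w-]+)*)$/.exec(selector.trim());
    if (!parsed || parsed[0] === "") {
        throw new Error(`Unsupported selector: ${selector}`);
    }
    const [, tag, rest] = parsed;
    if (tag && tag.toLowerCase() !== element.tag.toLowerCase()) {
        return false;
    }
    for (const part of rest.match(/[#.][\w-]+/g) ?? []) {
        const name = part.slice(1);
        // "#name" must equal the element's id; ".name" must be one of its classes.
        if (part[0] === "#" && element.id !== name) return false;
        if (part[0] === "." && !element.classes.includes(name)) return false;
    }
    return true;
}

// A comma-separated list matches when any one of its selectors matches.
function matchesList(element, selectorList) {
    return selectorList.split(",").some((s) => matchesSimple(element, s));
}

// Stand-in for <div id="chaptercontent">
const chapter = { tag: "div", id: "chaptercontent", classes: [] };
console.log(matchesSimple(chapter, "div.chaptercontent")); // false: "." means class
console.log(matchesSimple(chapter, "div#chaptercontent")); // true: "#" means id
console.log(matchesList(chapter, "script, nav, #chaptercontent")); // true
```

So for that example the content selector would be `#chaptercontent` (or `div#chaptercontent`), and a remove list is just several selectors joined with commas.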
While the UI is becoming a bit "messy", it has all that's needed now ;) The link to the help is really useful (in particular for people like myself who don't know much about CSS selectors). Thanks.
@typhoon71
Thanks. If you can think of any way to make it less "messy" please let me know. Note, I don't promise to implement it.
It would mostly be some rearranging and aesthetic fixes. I'll do a couple of mock-ups in the next week and post them; it's easier to show/see than to explain. No problem if they don't get implemented, they're suggestions after all ;)
I would like to ask if there is a way to obtain/display all of the TOC's chapters. Some sites don't show all of the chapters on one page and break the list down into several pages, so I can't grab all the chapters in one go and need to edit and paste them on top of each other.
It's a lot of work when they have 2k+ pages. Thanks.
@Rainiu Please provide URLs for a couple of sites with this problem.
My example sites are in Vietnamese: Truyenyy or truyenfull. Basically, page 1 would only display chapters 1-100, page 2 chapters 101-200, and so on. There's no option to show all chapters.
@Rainiu Unfortunately, when the TOC is spread across multiple web pages, you need to write code to fetch all the TOC pages and extract the chapter URLs from them. In other words, you need to write a parser for the sites. Assuming you've got some skills with HTML and JavaScript, writing a parser isn't that hard.
- https://github.com/dteviot/WebToEpub has instructions on how to get the source so you can add a parser.
- https://dteviot.github.io/Projects/webToEpub_FAQ.html#write-parser is the starting point for how to build a parser.
Unfortunately, the "build a parser" instructions don't include how to fetch multiple TOC pages. However, an example of an existing parser that does this is the ZenithNovelsParser. (see below) I think this is similar to the logic for the truyenyy.com site. (truyenfull.vn is a bit more complicated. That looks like it's doing AJAX calls to get the chapter lists.)
If you'd like to try writing a parser yourself, feel free to contact me if you have any problems. (Just raise a new issue or email me directly.) If you don't think you've got the skills to write a parser, please create a new issue giving the URLs for the sites you'd like done.
https://github.com/dteviot/WebToEpub/blob/887f109d64bd6b5ead7e31a0da11c2efc9ea8942/plugin/js/parsers/ZenithNovelsParser.js#L10-L34
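The ZenithNovelsParser linked above is the real reference. Independently of it, the general idea of walking a TOC that is split across pages can be sketched as follows. This is a minimal sketch, assuming each TOC page can be reduced to a chapter list plus a link to the next page. `collectChapters` and `fetchPage` are hypothetical names; how `fetchPage` downloads a page and pulls out the chapters and the "next" link depends entirely on the site.

```javascript
// Hypothetical sketch: gather chapter links from a TOC split across pages.
// fetchPage(url) is an assumed, caller-supplied async function that resolves to
//   { chapters: [{ title, url }], nextPageUrl: string | null }
// for a single TOC page.
async function collectChapters(firstPageUrl, fetchPage, maxPages = 500) {
    const chapters = [];
    const seenPages = new Set();    // guards against "next" links that loop
    const seenChapters = new Set(); // guards against chapters listed twice
    let url = firstPageUrl;
    while (url && !seenPages.has(url) && seenPages.size < maxPages) {
        seenPages.add(url);
        const page = await fetchPage(url);
        for (const chapter of page.chapters) {
            if (!seenChapters.has(chapter.url)) {
                seenChapters.add(chapter.url);
                chapters.push(chapter);
            }
        }
        url = page.nextPageUrl; // null on the last page ends the loop
    }
    return chapters;
}
```

Pages are fetched one at a time here, which is slow for a TOC with many pages but keeps the chapters in order and is gentle on the site. A site that loads its chapter list via AJAX would need a different `fetchPage`.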