
Improvements to Default Parser

Open dteviot opened this issue 6 years ago • 23 comments

  • [x] User can use CSS to specify Title and Content elements.
  • [ ] Allow user to specify multiple elements to include.
  • [x] User can use CSS to specify elements to remove.
  • [ ] (Maybe) CSS to specify where to get Chapter URLs from.
  • [x] Needs to have a preview mode, so user can check if the CSS selectors work.

dteviot avatar Oct 02 '18 18:10 dteviot

@typhoon71, @3dycosmo

If you're still around, could you give the latest experimental tab mode branch build a try? I've made changes to the default parser and I'd really like to get feedback. You could try it against http://www.ironteethserial.com/table-of-contents/

Thanks.

dteviot avatar Jan 02 '19 07:01 dteviot

I'm around, but how am I supposed to test it? How is this feature supposed to work?

typhoon71 avatar Jan 06 '19 21:01 typhoon71

@typhoon71

My thought was, open http://www.ironteethserial.com/table-of-contents/, then open WebToEpub and see if you can figure out how to use it. If you can't then that tells me more work is required. The basic ideas are:

  1. The default parser is now using Cascading Style Sheet selectors to specify the elements to fetch/remove.
  2. Default parser can now have options for fetching a chapter title, and removing unwanted element(s) as well as getting the wanted content.
  3. There's a button to allow you to see how the selectors work.
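The three ideas above can be sketched roughly as follows. This is a minimal illustration, not WebToEpub's actual code: the function name `extractChapter` and the shape of the `selectors` object are assumptions made for the example. It relies only on the standard DOM calls `querySelector`, `querySelectorAll` and `remove`.

```javascript
// Hypothetical helper (not from WebToEpub): applies three user-supplied
// CSS selectors to a parsed page. `dom` is any object exposing the standard
// querySelector API, for example a browser Document.
function extractChapter(dom, selectors) {
    const content = dom.querySelector(selectors.content);
    if (content === null) {
        return null; // the content selector matched nothing on this page
    }

    // The title selector is optional; fall back to an empty title.
    const titleElement = selectors.title ? dom.querySelector(selectors.title) : null;
    const title = titleElement ? titleElement.textContent.trim() : "";

    // Strip unwanted elements from inside the selected content.
    if (selectors.remove) {
        for (const unwanted of content.querySelectorAll(selectors.remove)) {
            unwanted.remove();
        }
    }
    return { title, content };
}
```

In a browser this might be called as `extractChapter(document, { content: "div.entry-content", title: "h1.entry-title", remove: "script, nav" })`, where the selector values are placeholders that depend entirely on the site being parsed.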

I'm thinking I probably need to provide a "help" button that goes to some sort of help page. But maybe you've got some better suggestions? Because people usually refuse to read the help.

dteviot avatar Jan 07 '19 19:01 dteviot

Mmm, yes, it wasn't super evident to me, but now with your explanation I understand. Basically it's like a pre-processor where you can specify what to get/remove? A dynamically built site plugin?

I couldn't get the title bit to work, or more like understand what it needs; it seems to grab it anyway. Removing stuff like script, nav, ul, style, noscript, mc-field-group, footer, widget_text, comments, widget seemed to work as I'd expect. So one writes a comma separated list of elements to remove and the junk will be cleaned out automatically. Nice. I suppose I can stop working on my python script (used to remove script, style, noscript, nav tagged content).

Some help or examples would really... help.

Side note: Sigil tells me the xhtml is malformed, while calibre editor lists a bunch of errors, mostly unknown properties and unreferenced images.

typhoon71 avatar Jan 07 '19 20:01 typhoon71

@typhoon71

Thanks for the feedback. I'm working on the "unknown properties" issues.

dteviot avatar Jan 09 '19 22:01 dteviot

@typhoon71 Here's my first attempt at writing help for the Default Parser, https://dteviot.github.io/Projects/webToEpub_DefaultParser.html Please tell me what you think of it. (When you have time.)

I suppose I can stop working on my python script (used to remove script, style, noscript, nav tagged content).

You might like to have a look at this other project of mine. https://github.com/dteviot/EpubEditor

dteviot avatar Jan 10 '19 22:01 dteviot

The help button is a good idea. The help is clear and complete, and the link to the Mozilla CSS selector page is helpful; I was just wondering why putting a complex tag didn't work.

Some minor things I noticed:

  1. I'd find it more "natural" for the buttons to be in test/finished/help order, because one should only need the help button once; after that it'll be "in the way".
  2. Since I'm talking about layout, I suggest putting the 5 buttons on the main interface on the same "line" if possible (cleaner UI).
  3. To show the DOM Inspector you can use F12 too.

Nice to see that the parser remembers the settings: is that on a per domain basis or a per link basis? Is it easy to save or export? Thinking of sharing it between browsers or just a backup of stuff one doesn't want to redo from scratch.

The epub editor/cleaner is nice, it just appeared late! XD Anyway it has issues with unrecognized CSS properties too: I fed it an epub that has a couple of those and they were still in the output, and calibre editor complained.

typhoon71 avatar Jan 11 '19 08:01 typhoon71

@typhoon71

Thank you.

Nice to see that the parser remembers the settings: is that on a per domain basis or a per link basis?

They're remembered "per hostname".
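A rough sketch of what "per hostname" means in practice, assuming a simple in-memory map. The class name `SelectorSettings` and its methods are hypothetical, and how the extension really stores its settings is not shown in this thread. Only the standard `URL` API is used to extract the hostname.

```javascript
// Hypothetical in-memory store (not WebToEpub's real storage code).
// Settings are keyed by hostname, so every page on the same host shares them.
class SelectorSettings {
    constructor() {
        this.byHostname = new Map();
    }

    // Remember the selectors under the hostname of the given page.
    save(pageUrl, selectors) {
        this.byHostname.set(new URL(pageUrl).hostname, selectors);
    }

    // Return the selectors saved for this page's hostname, or null if none.
    load(pageUrl) {
        const hostname = new URL(pageUrl).hostname;
        return this.byHostname.has(hostname) ? this.byHostname.get(hostname) : null;
    }
}
```

With this keying, settings saved while viewing the table of contents page would also be found for any chapter page on the same host, but not for a different host.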

Is it easy to save or export? Thinking of sharing it between browsers or just a backup of stuff one doesn't want to redo from scratch.

That's a good idea. I'll create a new incident for that.

Anyway it has issues with unrecognized CSS properties too: I fed it an epub that has a couple of those and they were still in the output, and calibre editor complained.

The "sanitize XHTML" option is supposed to fix that. However, the cleaner is still a work in progress. Any chance you can tell me where you got the test epub from?

dteviot avatar Jan 12 '19 19:01 dteviot

I tested the epub editor with this: https://docs.google.com/document/d/1jclG56IcF6oSKmKOwBpHPJE6bGGRf8JGYgNRoNyWPSA/edit. File, save as, epub, and then use the generated epub.

typhoon71 avatar Jan 12 '19 20:01 typhoon71

More improvements:

  • [x] Put box around the test output box (and put some space between it and the buttons)
  • [x] Start with instructions in the test output box.
  • [x] Remove (required) from Hostname
  • [x] Put "Url of test chapter" after content CSS Selector. i.e. Make it 3rd input.
  • [x] In Help, describe how to find selector for element

dteviot avatar Jan 15 '19 18:01 dteviot

I was thinking: why not add the exclusion bit of this to the default parsers too? In particular, removing scripts and comments could be handy. Or is that stuff already taken care of in them?

typhoon71 avatar Jan 20 '19 15:01 typhoon71

The problem with removing scripts in the default parser is there's at least one site that puts the content into scripts. And I leave the comments in because WebToEpub generates comments (Mostly for tracking the origin of content.)

dteviot avatar Jan 20 '19 18:01 dteviot

Which site is doing that? I'd like to check it.

typhoon71 avatar Jan 20 '19 18:01 typhoon71

I think it was a site holding "Sword Art Online". But I don't have the URL.

dteviot avatar Jan 20 '19 20:01 dteviot

Alright noob question, in the help page, the example is using div with class, how about id? Like for example div id="chaptercontent", I did put div.chaptercontent and it wasn't working. Perhaps add selection with body, div, class, id like previous default parser.

Also how to include multiple elements to remove? Do we have to divide the elements with commas or something? I can't find it in help page.

diablo348 avatar Feb 05 '19 20:02 diablo348

@diablo348

how about id?

Here's the link for CSS selectors. https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors So, to answer your question: #toc will match the element that has the ID "toc".

Also how to include multiple elements to remove?

Use commas. See https://developer.mozilla.org/en-US/docs/Learn/CSS/Introduction_to_CSS/Combinators_and_multiple_selectors
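To make the two answers concrete, here is a small illustration of ID versus class selectors and of a comma-separated selector list. The element names, the class and the ID are examples only, and `toSelectorList` is a hypothetical convenience function, not part of WebToEpub.

```javascript
// Illustrative selector strings; element names, class and ID are examples only.
const byId = "div#chaptercontent";    // a <div> whose id attribute is "chaptercontent"
const byClass = "div.chaptercontent"; // a <div> whose class list contains "chaptercontent"

// A comma-separated selector list matches an element if any one entry matches it.
// This hypothetical helper builds such a list from an array, skipping blank entries.
function toSelectorList(selectors) {
    return selectors
        .map(selector => selector.trim())
        .filter(selector => selector.length > 0)
        .join(", ");
}

const removeSelector = toSelectorList(["script", "style", "noscript", "nav"]);
```

So for markup like `<div id="chaptercontent">`, `div.chaptercontent` matches nothing because it looks for a class, while `div#chaptercontent` (or just `#chaptercontent`) matches the element.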

dteviot avatar Feb 05 '19 20:02 dteviot

While the UI is becoming a bit "messy", it has all that's needed now ;) The link to help is useful, really (in particular for people like myself who don't know much about CSS selectors). Thanks.

typhoon71 avatar Feb 17 '19 12:02 typhoon71

@typhoon71

Thanks. If you can think of any way to make it less "messy" please let me know. Note, I don't promise to implement it.

dteviot avatar Feb 17 '19 18:02 dteviot

It would mostly be some rearranging and aesthetic fixing. I'll do a couple of mock-ups in the next week and post them; easier to show/see than to explain. No problem if they don't get implemented, they're suggestions after all ;)

typhoon71 avatar Feb 17 '19 19:02 typhoon71

I would like to ask if there is a way to obtain/display all of the TOC's chapters. Some sites don't show all of the chapters on one page and break the list down into several pages, so I can't grab all the chapters in one go and need to edit and paste them on top of each other.

It's a lot of work when they have 2k+ pages. Thanks.

Rainiu avatar Apr 23 '19 02:04 Rainiu

@Rainiu Please provide URLs for a couple of sites with this problem.

dteviot avatar Apr 23 '19 03:04 dteviot

My example sites are in Vietnamese: Truyenyy or truyenfull. Basically, page 1 would only display chapters 1-100, page 2 chapters 101-200, and so on. There's no option to show all chapters.

Rainiu avatar Apr 24 '19 00:04 Rainiu

@Rainiu Unfortunately, when the TOC is spread across multiple web pages, you need to write code to fetch all the TOC pages and extract the chapter URLs from them. In other words, you need to write a parser for the sites. Assuming you've got some skills with HTML and JavaScript, writing a parser isn't that hard.

  • https://github.com/dteviot/WebToEpub has instructions on how to get the source so you can add a parser.
  • https://dteviot.github.io/Projects/webToEpub_FAQ.html#write-parser is the starting point for how to build a parser.

Unfortunately, the "build a parser" instructions don't include how to fetch multiple TOC pages. However, an example of an existing parser that does this is the ZenithNovelsParser. (see below) I think this is similar to the logic for the truyenyy.com site. (truyenfull.vn is a bit more complicated. That looks like it's doing AJAX calls to get the chapter lists.)

If you'd like to try writing a parser yourself, feel free to contact me if you have any problems. (Just raise a new issue or email me directly.) If you don't think you've got the skills to write a parser, please create a new issue giving the URLs for the sites you'd like done.

https://github.com/dteviot/WebToEpub/blob/887f109d64bd6b5ead7e31a0da11c2efc9ea8942/plugin/js/parsers/ZenithNovelsParser.js#L10-L34
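For readers who want the general shape of "fetch all the TOC pages and extract the chapter URLs", here is a generic sketch. It is not the code of ZenithNovelsParser. It assumes, purely for illustration, that TOC pages are numbered from 1 and selected with a `page` query parameter. The function names and the `fetchChapterUrls` callback are hypothetical; a real parser would supply a callback that downloads a TOC page and pulls the chapter links out of it.

```javascript
// Assumption for this sketch: TOC pages are numbered 1..lastPage and are
// addressed by a "page" query parameter. Real sites may use another scheme.
function tocPageUrls(firstPageUrl, lastPage) {
    const urls = [];
    for (let page = 1; page <= lastPage; ++page) {
        const url = new URL(firstPageUrl);
        url.searchParams.set("page", String(page));
        urls.push(url.href);
    }
    return urls;
}

// fetchChapterUrls is a caller-supplied async function (hypothetical):
// given one TOC page URL, it resolves to an array of chapter URLs.
async function collectChapterUrls(firstPageUrl, lastPage, fetchChapterUrls) {
    const chapters = [];
    for (const pageUrl of tocPageUrls(firstPageUrl, lastPage)) {
        // Fetch pages one at a time so chapters stay in TOC order.
        const chapterUrls = await fetchChapterUrls(pageUrl);
        chapters.push(...chapterUrls);
    }
    return chapters;
}
```

Fetching sequentially is slower than fetching in parallel, but it keeps the chapter order trivially correct and is gentler on the site.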

dteviot avatar Apr 24 '19 04:04 dteviot