Add an option to change < and > to &lt; and &gt; respectively.

Open Saladitas opened this issue 1 year ago • 2 comments

I would like to see an option that lets you change < and > into &lt; and &gt;. Sometimes when scraping I get bad formatting because of this.

Example 1: (random text btw)

	<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sed lectus vestibulum mattis ullamcorper velit sed ullamcorper <morbi 
	
	tincidunt>. Mollis aliquam ut porttitor leo. Eros donec ac odio tempor orci. Nulla pellentesque dignissim enim sit amet venenatis urna cursus. Adipiscing diam donec adipiscing tristique. In eu mi bibendum neque egestas congue quisque egestas. Sit amet dictum sit amet justo donec enim diam vulputate. Sit amet dictum sit amet justo donec enim diam vulputate. Aenean pharetra magna ac placerat. Phasellus egestas tellus rutrum tellus pellentesque eu tincidunt tortor.</morbi></p>

It would have a break and not render the other half.

Example 2:

	<p>Convallis posuere morbi leo urna. Magna fermentum iaculis eu non diam phasellus. Eget nunc scelerisque viverra mauris in. Malesuada fames ac turpis egestas. 
	<Cras semper auctor neque>, 
	<vitae tempus>, 
	<quam pellentesque>, 
	<nec nam>. Tempor orci eu lobortis elementum nibh tellus molestie. Condimentum lacinia quis vel eros donec. Vitae semper quis lectus nulla. In hendrerit gravida rutrum quisque non tellus orci ac auctor. Elit ullamcorper dignissim cras tincidunt lobortis. Et ligula ullamcorper malesuada proin.</Cras></vitae></quam></nec></p>

It would sometimes not render anything within the < > but the paragraph would be fine, if incomplete.

There are other errors too, though they are less common. Let's say that in an RPG there is a character trait that is labeled as <strong>. It would sometimes be marked as bad formatting because there is no </strong>, even though one is not needed. Even now, while writing this, I needed to add an escape character to the text above for it to render properly (good thing I checked the preview lol).

I end up having to open Sigil and fix these errors manually, which can be difficult if it's a large novel. So an option like that would be appreciated.
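As a rough sketch of what such an option could do (this is not WebToEpub's code, and the allowlist and function name below are made up for illustration), one heuristic is to escape any tag-like text whose name is not a known HTML element before the page is parsed:

```python
import re

# Hypothetical allowlist for illustration; a real one would need to cover
# every HTML element that should survive into the EPUB.
KNOWN_TAGS = {"p", "br", "strong", "em", "i", "b", "a", "div", "span"}

# Anything that looks like a tag: "<", an optional "/", then text up to ">".
TAG_LIKE = re.compile(r"<(/?)([^<>]*)>")


def escape_unknown_tags(html_text: str) -> str:
    """Escape the angle brackets of tag-like text whose name is not in KNOWN_TAGS."""

    def replace(match: re.Match) -> str:
        slash, inner = match.group(1), match.group(2)
        name_match = re.match(r"[A-Za-z][A-Za-z0-9]*", inner.strip())
        name = name_match.group(0).lower() if name_match else ""
        if name in KNOWN_TAGS:
            return match.group(0)  # looks like real markup, leave it alone
        return "&lt;" + slash + inner + "&gt;"

    return TAG_LIKE.sub(replace, html_text)


print(escape_unknown_tags("<p>velit sed ullamcorper <morbi tincidunt>. Mollis</p>"))
```

Note that this cannot help with the <strong> case above: text that happens to match a real element name is indistinguishable from markup, so it is left untouched.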

Saladitas, Nov 25 '23 03:11

@Saladitas

Please provide a URL to a page with the issue and I'll have a look at it. But I suspect I'm not going to be able to do much, because I've seen this in the past. The problem is likely that you've got a web page with incorrectly formed HTML. (The writer has probably typed "<" when they should have written "&lt;".) You could look at the source of the page itself to see if this is what has happened. This isn't much of a problem for web browsers, because when this happens they're supposed to take their best guess at showing it. Unfortunately, EPUB uses XHTML, not HTML, and XHTML is much stricter: if the XHTML is not well formed, it is rejected. So WebToEpub has to convert the HTML into XHTML. And when the HTML is not well formed, figuring out what is wrong is hard (for a machine; if you're interested in details, google "halting problem"). The library I'm using to do the conversion assumes the "<" is the opening of a tag, rather than a "<" embedded in the text.
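To illustrate the strictness difference, here is a minimal Python sketch. It uses the standard library's XML parser as a stand-in for a strict XHTML consumer (an assumption for illustration; it is not the library WebToEpub uses) and reuses a shortened version of Example 1:

```python
import xml.etree.ElementTree as ET
from html import escape


def is_well_formed(xhtml_fragment: str) -> bool:
    """Return True if a strict XML parser accepts the fragment."""
    try:
        ET.fromstring(xhtml_fragment)
        return True
    except ET.ParseError:
        return False


# The author typed "<" and ">" literally, so the text looks like an unknown tag.
broken = "<p>velit sed ullamcorper <morbi tincidunt>. Mollis aliquam.</p>"

# The same paragraph with the literal angle brackets escaped as entities.
fixed = "<p>velit sed ullamcorper " + escape("<morbi tincidunt>") + ". Mollis aliquam.</p>"

print(is_well_formed(broken))
print(is_well_formed(fixed))
```

A browser would make a best guess at the first string; a strict parser refuses it outright, which is why the escaping has to happen before the text reaches the EPUB.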

For my notes: 18 minutes work

dteviot, Nov 26 '23 20:11

This is where I'm scraping from. Infinite Mana In The Apocalypse

This is the original source. WebNovel - Infinite Mana In The Apocalypse

That's the series that has caused those problems the most. And it's just like you said, the site and the author are using < and > instead of &lt; and &gt;.

Edit: I had to change the URL of the site I scrape from, since I pasted in the wrong one. The URL from before (bednovel.com) actually works the best I've seen so far. When I inspected the code I saw < and > instead of &lt; and &gt;, but when I opened the document in Sigil everything was formatted just fine.

This may be a case of freewebnovel.com not having the best formatting and not WebToEpub being the issue.

Saladitas, Nov 26 '23 21:11