Provide an option for charset
Is your feature request related to a problem? Please describe. Unfortunately, not all websites use UTF-8; for example, Zeno is still written in ISO-8859-1. The same problem has already been mentioned in several earlier issues (such as #129, #1317).
Describe the solution you'd like Detecting the charset of the page, or just providing an option to specify it manually, would solve the problem permanently.
@fluviusmagnus
I am aware of the problem.
Detect the charset of the page
Unfortunately, this is not an easy thing to do. Yes, I know Browsers do this. But, last I looked, they don't provide an API or similar to access this functionality. Note, if you can find a way to do this, I will be very happy to add this ability.
an option to specify it manually
I have considered this as well. The problem is, most users seem unable to grasp CSS. I suspect trying to explain charsets to them to be impossible.
That said, I'm prepared for you to convince me that I'm wrong.
@dteviot
Yes, I agree that it could be sometimes very confusing to many.
But as a compromise, maybe hiding this option under 'Advanced Options' would be more acceptable (at the expense of a working 'Test' workflow)?
@fluviusmagnus
at the expense of a working 'Test' workflow
I don't understand. Can you expand on this?
I'll add
- WebToEpub DOES handle sites that don't use UTF-8. (Mostly the assorted Chinese charsets) It's just I have to add some code to the parser for each site.
- The Advanced Settings apply to All Sites. You'd really want to set the charset on a per-site basis. So it would be a field on the default parser.
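
To illustrate what per-site charset handling comes down to, here is a minimal sketch that decodes raw bytes with an explicitly chosen charset. It assumes the standard `TextDecoder` API; `decodeWithCharset` is a hypothetical helper for this example, not a function from WebToEpub.

```javascript
// Hypothetical helper (not part of WebToEpub): decode raw response bytes
// with an explicitly chosen charset instead of assuming UTF-8.
function decodeWithCharset(bytes, charset = "utf-8") {
    // TextDecoder throws a RangeError for labels it does not recognise.
    return new TextDecoder(charset).decode(bytes);
}

// "Bücher" encoded as ISO-8859-1: the "ü" is the single byte 0xFC.
const latin1Bytes = new Uint8Array([0x42, 0xFC, 0x63, 0x68, 0x65, 0x72]);

// Decoded as UTF-8, 0xFC is invalid and becomes the replacement character.
const wrong = decodeWithCharset(latin1Bytes);
// Decoded with the page's real charset, the text comes out intact.
const right = decodeWithCharset(latin1Bytes, "iso-8859-1");
```

A per-site charset field would, in effect, supply the second argument.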
@dteviot
Sorry for the ambiguity. I WAS talking about the default parser. But all I thought then was to find a place to show this option exclusively to advanced users. If it’s not on the default parser page, one must move on to the next step, even if the testing result seems weird.
But realizing that the default parser is already prepared for advanced users, now I think a field on the default parser page would be great, and quite logical. Thank you for mentioning that.
I also think an option that lets the user set the charset manually would be very useful. Without it, the epub cannot be packed with the correct encoding for some pages, e.g. http://boruo.goodweb.net.cn/article2/1677.htm
@dteviot
Unfortunately, this is not an easy thing to do. Yes, I know Browsers do this. But, last I looked, they don't provide an API or similar to access this functionality. Note, if you can find a way to do this, I will be very happy to add this ability.
What about document.characterSet? (reference https://github.com/dteviot/WebToEpub/issues/1704#issuecomment-2711432780) (https://developer.mozilla.org/en-US/docs/Web/API/Document/characterSet)
As an idea: create a new Parser variable and set the charset in getChapterUrls() with this.charset = dom.characterSet;
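
A rough sketch of that idea, under assumptions: `CharsetAwareParser` and `decodeChapter` are hypothetical names made up for this example, and WebToEpub's real Parser class may be structured differently. It also only helps if `dom` is a document whose `characterSet` reflects the original page; a document built by `DOMParser.parseFromString` from an already decoded string would most likely just report UTF-8.

```javascript
// Hypothetical stand-in for a parser; not WebToEpub's actual Parser class.
class CharsetAwareParser {
    constructor() {
        // Assumed default until a page reports something else.
        this.charset = "utf-8";
    }

    // Remember the charset reported for the table-of-contents page.
    getChapterUrls(dom) {
        this.charset = dom.characterSet || this.charset;
        return []; // chapter discovery omitted in this sketch
    }

    // Decode later chapter responses with the remembered charset.
    decodeChapter(bytes) {
        return new TextDecoder(this.charset).decode(bytes);
    }
}
```

Note that browsers report ISO-8859-1 pages as "windows-1252" in `document.characterSet`, which `TextDecoder` accepts as a label.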
D'oh!