Provide an option for charset

Open fluviusmagnus opened this issue 1 year ago • 4 comments

Is your feature request related to a problem? Please describe.

Unfortunately, not all websites use UTF-8; for example, Zeno is still written in ISO-8859-1. The same problem has already been mentioned in several earlier issues (such as #129 and #1317).

Describe the solution you'd like

Detecting the charset of the page, or just providing an option to specify it manually, would solve the problem permanently.

fluviusmagnus avatar Nov 13 '24 03:11 fluviusmagnus

@fluviusmagnus

I am aware of the problem.

> Detect the charset of the page

Unfortunately, this is not an easy thing to do. Yes, I know Browsers do this. But, last I looked, they don't provide an API or similar to access this functionality. Note, if you can find a way to do this, I will be very happy to add this ability.
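For readers wondering what manual detection would involve, here is a minimal sketch of one small part of it: reading a declared charset from the Content-Type header or from a meta tag. The function name sniffCharset and the UTF-8 fallback are assumptions made for illustration; this is not WebToEpub code, and real browsers apply considerably more rules than this.

```javascript
// Hypothetical helper, not part of WebToEpub: guess a charset label from
// the Content-Type header, then from a <meta> tag near the top of the page.
function sniffCharset(contentType, bytes) {
    // 1. Header form, e.g. "text/html; charset=iso-8859-1"
    const fromHeader = /charset\s*=\s*["']?([^"';\s]+)/i.exec(contentType || "");
    if (fromHeader) {
        return fromHeader[1].toLowerCase();
    }
    // 2. Markup form: <meta charset="..."> or
    //    <meta http-equiv="Content-Type" content="text/html; charset=...">
    //    Map each byte to one character so the ASCII markup is readable
    //    whatever the real encoding turns out to be.
    const head = String.fromCharCode(...bytes.slice(0, 1024));
    const fromMeta = /<meta[^>]*charset\s*=\s*["']?([^"'>;\s\/]+)/i.exec(head);
    if (fromMeta) {
        return fromMeta[1].toLowerCase();
    }
    // 3. Assumed fallback when nothing is declared.
    return "utf-8";
}
```

A sketch like this only reads what the page declares; it cannot tell when a declaration is missing or wrong, which is part of why detection is harder than it looks.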

> an option to specify it manually

I have considered this as well. The problem is, most users seem unable to grasp CSS. I suspect explaining charsets to them would be impossible.

That said, I'm prepared for you to convince me that I'm wrong.

dteviot avatar Nov 13 '24 18:11 dteviot

@dteviot

Yes, I agree that it could sometimes be very confusing to many users.

But as a compromise, maybe hiding this option under 'Advanced Options' would be more acceptable (at the expense of a working 'Test' workflow)?

fluviusmagnus avatar Nov 13 '24 21:11 fluviusmagnus

@fluviusmagnus

> at the expense of a working 'Test' workflow

I don't understand. Can you expand on this?

I'll add

  1. WebToEpub DOES handle sites that don't use UTF-8 (mostly the assorted Chinese charsets). It's just that I have to add some code to the parser for each site.
  2. The Advanced Settings apply to ALL sites. You'd really want to set the charset on a per-site basis, so it would be a field on the default parser.
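
To illustrate the per-site idea in point 2, here is a minimal sketch of a charset lookup keyed by hostname. The table, the hostnames, and the function name charsetForUrl are all hypothetical and chosen only for illustration; they are not WebToEpub's actual configuration or API.

```javascript
// Hypothetical per-site charset table. The hostnames and the table itself
// are illustrative assumptions, not WebToEpub's real settings.
const charsetBySite = new Map([
    ["www.example-latin1.test", "iso-8859-1"],
    ["www.example-gbk.test", "gbk"],
]);

// Return the charset configured for the URL's host, or the fallback
// (assumed here to be UTF-8) when the site has no entry.
function charsetForUrl(url, fallback = "utf-8") {
    const hostname = new URL(url).hostname;
    return charsetBySite.get(hostname) ?? fallback;
}
```

Keying on the hostname keeps the setting scoped to one site, so a charset chosen for one source does not affect any other.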

dteviot avatar Nov 14 '24 00:11 dteviot

@dteviot

Sorry for the ambiguity. I WAS talking about the default parser. All I had in mind then was finding a place to show this option only to advanced users. If it's not on the default parser page, one must move on to the next step, even if the test result looks wrong.

But now that I realize the default parser is already aimed at advanced users, I think a field on the default parser page would be great, and quite logical. Thank you for mentioning that.

fluviusmagnus avatar Nov 14 '24 01:11 fluviusmagnus

I also think an option that lets the user set the charset manually would be very useful. At the moment WebToEpub cannot pack the EPUB with the correct encoding for some sites, e.g. http://boruo.goodweb.net.cn/article2/1677.htm

jack6th avatar Mar 09 '25 11:03 jack6th

@dteviot

> Unfortunately, this is not an easy thing to do. Yes, I know Browsers do this. But, last I looked, they don't provide an API or similar to access this functionality. Note, if you can find a way to do this, I will be very happy to add this ability.

What about document.characterSet? (References: https://github.com/dteviot/WebToEpub/issues/1704#issuecomment-2711432780 and https://developer.mozilla.org/en-US/docs/Web/API/Document/characterSet) As an idea: add a new Parser variable and set the charset in getChapterUrls() with this.charset = dom.characterSet;
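
A minimal sketch of how that idea could look, assuming the parser later decodes chapter bytes itself. The class name CharsetAwareParser and the method decodeChapter() are hypothetical and only for illustration; only getChapterUrls(), this.charset, and dom.characterSet come from the suggestion above, and the link-collecting body is a placeholder.

```javascript
// Hypothetical parser sketch. The class name and decodeChapter() are
// assumptions for illustration, not WebToEpub's real Parser API.
class CharsetAwareParser {
    constructor() {
        this.charset = "utf-8"; // assumed default until a page is seen
    }

    // Remember the charset the browser detected for the page that
    // lists the chapters, as suggested above.
    getChapterUrls(dom) {
        this.charset = dom.characterSet;
        // Placeholder: a real parser would select chapter links more carefully.
        return [...dom.querySelectorAll("a")].map(a => a.href);
    }

    // Decode raw chapter bytes (for example from response.arrayBuffer())
    // with the remembered charset instead of assuming UTF-8.
    decodeChapter(buffer) {
        return new TextDecoder(this.charset).decode(buffer);
    }
}
```

This relies on the chapter pages using the same encoding as the page passed to getChapterUrls(), which seems likely for a single site but is an assumption.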

gamebeaker avatar Mar 10 '25 18:03 gamebeaker

D'oh!

dteviot avatar Mar 10 '25 19:03 dteviot