zero-epwing

Encoding Issue with some Entries

Open CalculusAce opened this issue 3 years ago • 1 comments

I've been using zero-epwing to convert a number of old EPWING dictionaries over to yomichan, and I have run into an issue that I haven't seen before. Some of these dictionaries seem to have characters (like �) in their entries that cannot be encoded in EUC-JP. As a result, I think zero-epwing is unable to convert the text of these entries to UTF-8, and it ends up skipping the definition text for the affected headwords. Most entries are valid EUC-JP, so their definitions are collected as expected.

I verified this by looking at the JSON output from zero-epwing for certain headwords whose definitions contain � when viewed in an EPWING file reader: the JSON data had no text key for those headwords. I am trying to figure out whether a regex could be added to the zero-epwing code to remove characters like � before the conversion to UTF-8. If those characters could be removed, more entry data could be collected when moving the EPWING data over to yomichan.
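To make the idea concrete, here is a minimal Python sketch of dropping (or replacing) undecodable bytes during the EUC-JP to UTF-8 conversion instead of discarding the whole entry. It is not zero-epwing's actual code: the function name `decode_entry` and its `drop_invalid` parameter are made up for illustration, and it relies on the decoder's error handling rather than a regex.

```python
def decode_entry(raw: bytes, drop_invalid: bool = True) -> str:
    """Decode EUC-JP entry bytes without failing on invalid sequences.

    Hypothetical helper for illustration only.
    drop_invalid=True  -> silently drop bytes that are not valid EUC-JP
    drop_invalid=False -> substitute U+FFFD for each invalid sequence
    """
    errors = "ignore" if drop_invalid else "replace"
    return raw.decode("euc_jp", errors=errors)


# An entry whose text contains one stray byte (0xFF is never valid in EUC-JP).
entry = "辞書".encode("euc_jp") + b"\xff" + "です".encode("euc_jp")

# Strict decoding fails on the whole entry, which matches the symptom
# of the definition text going missing.
try:
    entry.decode("euc_jp")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

cleaned = decode_entry(entry)               # stray byte dropped
utf8_bytes = cleaned.encode("utf-8")        # ready to write out as UTF-8
marked = decode_entry(entry, drop_invalid=False)  # stray byte kept as U+FFFD
```

Replacing with U+FFFD rather than dropping keeps a visible marker of where the source data was damaged, which may be preferable when checking the converted dictionary afterwards.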

CalculusAce • Oct 30 '20 17:10

I've noticed this as well. It doesn't seem to be an issue in the new version of yomichan import, which I believe uses https://github.com/FooSoft/zero-epwing-go. However, that version seems to have an issue that this version doesn't (number 3 here; the others are dictionary-specific).

Thermospore • Mar 08 '21 11:03