zero-epwing
Encoding Issue with some Entries
I've been using zero-epwing to convert a number of old EPWING dictionaries for use with Yomichan, and I have run into an issue that I haven't seen before. Some of these dictionaries have characters (like �) in their entries that cannot be encoded in EUC-JP. As a result, I think zero-epwing fails to convert the text of those entries to UTF-8 and ends up skipping the definition text for the affected headwords. Most entries are valid EUC-JP, so their definitions are collected as expected.
I verified this by looking at zero-epwing's JSON output for headwords whose definitions contain � when viewed in an EPWING reader: the JSON data has no text key for those headwords. I am trying to figure out whether a regex could be added to the zero-epwing code to strip characters like � before the conversion to UTF-8. If those characters could be removed, more entry data could be collected when moving the EPWING data over to Yomichan.
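A minimal sketch of the idea, written in Python rather than zero-epwing's C and not taken from its source. It assumes the raw entry text is available as bytes before conversion; the function name `decode_eucjp_lenient` is hypothetical. Instead of a regex, it relies on the decoder's error handler to drop bytes that are not valid EUC-JP, and it reports whether anything was dropped so lossy entries can be reviewed later.

```python
def decode_eucjp_lenient(raw):
    """Decode EUC-JP bytes to str, dropping bytes that are not valid EUC-JP.

    Returns (text, lossy), where lossy is True if anything had to be dropped.
    Hypothetical helper for illustration; not part of zero-epwing.
    """
    try:
        # Strict decode first: the common case where the entry is valid.
        return raw.decode("euc_jp"), False
    except UnicodeDecodeError:
        # Fall back to skipping the offending bytes instead of
        # discarding the whole entry.
        return raw.decode("euc_jp", errors="ignore"), True


if __name__ == "__main__":
    good = "辞書です".encode("euc_jp")
    # 0xFF is not a valid EUC-JP lead byte, so it stands in for a bad character.
    bad = "辞書".encode("euc_jp") + b"\xff" + "です".encode("euc_jp")
    print(decode_eucjp_lenient(good))
    print(decode_eucjp_lenient(bad))
```

Dropping bytes silently can hide real data loss, which is why the sketch returns a flag alongside the text. Passing `errors="replace"` instead would keep a visible U+FFFD marker where the bad bytes were.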
I've noticed this as well. It doesn't seem to be an issue in the new version of Yomichan Import, which uses https://github.com/FooSoft/zero-epwing-go. However, that version seems to have an issue that this version doesn't (number 3 here; the others are dictionary-specific).