
Is it able to decode other languages such as Chinese?

Open l1lsl0th opened this issue 3 years ago • 11 comments

I'm new to Python, but I think it's having an issue decoding Chinese. Maybe it needs encoding="utf-8"?

Error:

    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 121: character maps to <undefined>

l1lsl0th avatar Mar 22 '21 19:03 l1lsl0th

Did the error say which line was causing the problems? I don't think I've ever tried it on Chinese characters / Kanji etc.

Also, which python version are you using?

robertmartin8 avatar Mar 23 '21 13:03 robertmartin8

I am running Python 3.8 but have tried 3.9 too. Thanks a bunch

l1lsl0th avatar Mar 27 '21 14:03 l1lsl0th

Hi! Same problem trying to use your script, Python 3.7.6, and books in English and Spanish:

    Traceback (most recent call last):
      File "KindleClippings.py", line 116, in <module>
        parse_clippings(source_file, destination)
      File "KindleClippings.py", line 57, in parse_clippings
        for highlight in f.read().split("=========="):
      File "d:\Miniconda3\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1589: character maps to <undefined>

Thanks

aiturri avatar May 12 '21 18:05 aiturri

Hi @aiturri,

Thanks for raising this. Would it be possible for you to share the part of the clipping file that is causing the errors? I'd love to try and fix this but can't reproduce the error.

Best, Robert

robertmartin8 avatar May 12 '21 18:05 robertmartin8

Sure! Thanks in advance!

aiturri avatar May 12 '21 19:05 aiturri

Hi @aiturri,

I've just pushed a potential fix. Can you download the script again and try?

Otherwise you can manually modify line 55 to specify an encoding.

    with open(source_file, "r", encoding="utf8") as f:

Let me know if it does or doesn't work. For the record, the original script worked fine on my machine with your clippings file so I couldn't verify the issue
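As an illustration of why the explicit encoding matters, here is a minimal sketch (not the script's actual code; `read_clippings` and the sample text are made up for the example). It writes a UTF-8 file containing a curly quote, then reads it back once as UTF-8 and once as cp1252, which is the default codec the tracebacks in this thread show on Windows.

```python
import os
import tempfile


def read_clippings(path, encoding="utf-8"):
    """Read the whole clippings file using an explicit encoding (hypothetical helper)."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# A closing curly quote (U+201D) is stored as the bytes E2 80 9D in UTF-8.
# Byte 0x9D has no mapping in cp1252, which matches the error reported above.
sample = "He said \u201chello\u201d\n==========\n"

with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

text = read_clippings(tmp_path)  # decodes cleanly as UTF-8

try:
    read_clippings(tmp_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True  # 0x9D cannot be decoded by cp1252

os.remove(tmp_path)
```

On a machine whose default encoding is already UTF-8, the original script would not hit this error, which may explain why it could not be reproduced.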

Best, Robert

robertmartin8 avatar May 12 '21 19:05 robertmartin8

Hi @robertmartin8, I tried again, and it's still not working:

    Traceback (most recent call last):
      File "KindleClippings.py", line 116, in <module>
        parse_clippings(source_file, destination)
      File "KindleClippings.py", line 88, in parse_clippings
        outfile.write(clipping_text + "\n\n...\n\n")
      File "d:\Miniconda3\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u03c3' in position 142: character maps to <undefined>

I will attach my original clippings file here so you can try it, but I will delete it as soon as you download it. Please let me know when you have it so I can delete it (for privacy reasons!).

Thanks again!

aiturri avatar May 12 '21 20:05 aiturri

@aiturri OK, I've downloaded it. Feel free to remove

robertmartin8 avatar May 12 '21 20:05 robertmartin8

@aiturri I still can't reproduce it – I can parse your file, accents and all. I think it's a Mac/Windows issue.

Can you try again? I forgot to add encoding="utf8" to a couple of the file opens.
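For illustration, a sketch of the write side under the same assumption (`append_clipping` is a hypothetical helper; only the `outfile.write(...)` expression comes from the traceback above). When `encoding` is omitted, `open()` in the Python versions mentioned in this thread falls back to a platform-dependent default, roughly what `locale.getpreferredencoding(False)` reports, so the same code can behave differently on Mac and Windows.

```python
import locale
import os
import tempfile

# Platform-dependent default used by open() when no encoding is passed,
# for example cp1252 on many Windows installs and UTF-8 on macOS.
default_encoding = locale.getpreferredencoding(False)


def append_clipping(path, clipping_text, encoding="utf-8"):
    """Append one clipping to an output file with an explicit encoding."""
    with open(path, "a", encoding=encoding) as outfile:
        outfile.write(clipping_text + "\n\n...\n\n")


out_path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# '\u03c3' (Greek sigma) is the character from the UnicodeEncodeError above;
# cp1252 cannot represent it, but UTF-8 can.
append_clipping(out_path, "\u03c3 is the Greek letter sigma")

with open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

Every `open()` that touches the clippings or the output files needs the same treatment, for both reading and writing.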

robertmartin8 avatar May 12 '21 20:05 robertmartin8

@robertmartin8

    Traceback (most recent call last):
      File "KindleClippings.py", line 117, in <module>
        parse_clippings(source_file, destination)
      File "KindleClippings.py", line 82, in parse_clippings
        current_text = textfile.read()
      File "d:\Miniconda3\lib\codecs.py", line 322, in decode
        (result, consumed) = self.buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 472: invalid start byte

aiturri avatar May 12 '21 20:05 aiturri

@aiturri OK, it seems this is related to a particular Windows encoding. Other people seem to have had the same issue.

(Please save your clippings file beforehand just in case)

I've put two fixes: the first just ignores the errors – have a go and see whether it works (the output might be garbled).
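To show what ignoring errors does in practice, here is a small standalone sketch (not the script's code; the byte string is invented). It decodes bytes containing a stray 0x92, the byte from the last traceback, with the standard `errors` handlers. The same `errors=` argument can be passed to `open()`.

```python
# Valid UTF-8 ("café") followed by a lone 0x92 byte, which is not valid UTF-8.
raw = b"caf\xc3\xa9 \x92 end"

# errors="ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" keeps a visible marker (U+FFFD) where the byte was.
replaced = raw.decode("utf-8", errors="replace")

# In cp1252, 0x92 is a right single quotation mark (U+2019), which suggests
# the existing output file was written in cp1252 rather than UTF-8.
as_cp1252 = b"\x92".decode("cp1252")
```

Dropping bytes loses characters such as apostrophes, so `errors="replace"` can be easier to audit afterwards than `errors="ignore"`.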

The second is a new argument to specify the encoding:

    python KindleClippings.py -encoding=cp1252

It might solve your problem?
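One way such an option could be wired up with `argparse` is sketched below. This is an assumption about the approach, not the script's real implementation: `build_parser` and `resolve_encoding` are hypothetical names, and only the `-encoding=cp1252` spelling is taken from the command above.

```python
import argparse
import codecs


def build_parser():
    """Build a parser with a hypothetical -encoding option (default utf8)."""
    parser = argparse.ArgumentParser(description="Parse a Kindle clippings file.")
    parser.add_argument(
        "-encoding",
        default="utf8",
        help="text encoding used for reading and writing files",
    )
    return parser


def resolve_encoding(name):
    """Return the canonical codec name; raises LookupError if it is unknown."""
    return codecs.lookup(name).name


# Simulate the command line shown above.
args = build_parser().parse_args(["-encoding=cp1252"])
encoding = resolve_encoding(args.encoding)

# The resolved name would then be passed to every open() call, e.g.
# open(source_file, "r", encoding=encoding).
```

Validating the name up front with `codecs.lookup` means a typo in the option fails immediately instead of partway through parsing.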

robertmartin8 avatar May 12 '21 20:05 robertmartin8