
Russian text Unicode error

Open · marianasignal opened this issue 5 months ago · 1 comment

Great work, but I hit an error when extracting entities from mixed Russian and English text. The core of the traceback is:

File "D:\project\07-multilan\langextract_example.py", line 43, in
    lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl")
File "D:\project\07-multilan.venv\lib\site-packages\langextract\io.py", line 123, in save_annotated_documents
    f.write(json.dumps(doc_dict, ensure_ascii=False) + '\n')
UnicodeEncodeError: 'gbk' codec can't encode character '\u0301' in position 12984: illegal multibyte sequence
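The traceback suggests the output file was opened with the platform's default encoding (GBK on this machine), which has no mapping for U+0301, the combining acute accent often used as a stress mark in Russian text. Until the library passes an explicit encoding, one possible workaround is to write the JSONL yourself as UTF-8. The sketch below assumes the results are already available as plain dicts; `write_jsonl_utf8` and `read_jsonl_utf8` are hypothetical helpers, not part of langextract.

```python
import json


def write_jsonl_utf8(records, path):
    """Write an iterable of dicts as JSON Lines, always encoded as UTF-8.

    Hypothetical helper for illustration; not a langextract API.
    """
    # An explicit encoding avoids falling back to the locale default
    # (for example 'gbk' on some Windows installations).
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_utf8(path):
    """Read a UTF-8 JSON Lines file back into a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Another option that needs no code changes is to run Python in UTF-8 mode (`python -X utf8 script.py`, or set the `PYTHONUTF8=1` environment variable), which makes `open()` default to UTF-8.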

marianasignal · Aug 14 '25 05:08

Could you share a small example of text that triggers this error? Ideally one line, just a few words. That would make it much easier to reproduce on dev machines and fix.
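One way to narrow a long input down to a short reproducer is to list the characters that the failing codec cannot encode. The sketch below is illustrative only: `unencodable_chars` is a hypothetical helper, and the sample string is an assumed stand-in for the reporter's text, which is not shown in this thread.

```python
def unencodable_chars(text, codec="gbk"):
    """Return (index, code point) pairs for characters `codec` cannot encode."""
    bad = []
    for index, ch in enumerate(text):
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            bad.append((index, f"U+{ord(ch):04X}"))
    return bad


# Assumed sample: a Russian word with a combining stress mark (U+0301),
# followed by English text. Plain Cyrillic letters are encodable in GBK;
# the combining accent is not.
sample = "\u041c\u043e\u0441\u043a\u0432\u0430\u0301 and Moscow"
print(unencodable_chars(sample))
```

Pasting a few words around each reported index is usually enough for a one-line test case.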

mrzasa · Sep 29 '25 07:09