
Set encoding of reading text files default to UTF-8

Open richarddwang opened this issue 2 years ago • 1 comment

When processing non-English text, especially on Windows, we often encounter text encoding problems and error messages such as UnicodeDecodeError: 'cp950' codec can't decode byte 0xe6 in position 2: illegal multibyte sequence. This PR makes most text-based document loaders (found by searching for f.read) read text with the universal UTF-8 encoding.
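
For illustration only, here is a minimal sketch of the kind of change described above. It is not the actual diff from this PR; the helper name read_text_utf8 and the sample file are hypothetical. It assumes the loaders call open() without an encoding argument, in which case Python falls back to the locale's preferred encoding (for example cp950 on a Traditional Chinese Windows install).

```python
import tempfile
from pathlib import Path


def read_text_utf8(path: str) -> str:
    """Read a text file as UTF-8 rather than the platform default encoding.

    Hypothetical helper for illustration; not part of langchain.
    """
    # Passing encoding="utf-8" explicitly makes the result independent of
    # the locale's preferred encoding, which open() would otherwise use.
    with open(path, encoding="utf-8") as f:
        return f.read()


# Small demo: write UTF-8 bytes to a temporary file and read them back.
sample = "測試 text"
with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "sample.txt"
    file_path.write_bytes(sample.encode("utf-8"))
    loaded = read_text_utf8(str(file_path))
```

With the explicit encoding, the same file reads identically on Windows, Linux, and macOS, provided the file really is UTF-8 encoded. Files saved in another encoding would still need that encoding passed in.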

richarddwang avatar May 12 '23 09:05 richarddwang

@richarddwang thanks for the contribution! Would you mind only keeping the encoding changes for loaders that deal with a single file?

eyurtsev avatar May 12 '23 16:05 eyurtsev

stale

baskaryan avatar Aug 11 '23 22:08 baskaryan