kotaemon [BUG] - 'gbk' codec can't decode byte 0x8c in position 2: illegal multibyte sequence when using GraphIndex

Description

I uploaded a PDF file and started building a graph index. During indexing, the following error occurs:

Indexing [1/1]: small_test.pdf
 => Converting small_test.pdf to text
 => Converted small_test.pdf to text
 => [small_test.pdf] Processed 2 chunks
 => Finished indexing small_test.pdf
[GraphRAG] Creating index... This can take a long time.
Logging enabled at c:\Users\**\Desktop\small\remote\kotaemon\ktem_app_data\user_data\files\graphrag\8ebbc1ff-2bef-49aa-803a-c72ffcbeb476\output\20240909-162212\reports\indexing-engine.log

Error: 'gbk' codec can't decode byte 0x8c in position 2: illegal multibyte sequence

Are there any constraints or limitations on the uploaded PDF document?
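
For context, this error message usually means some text was decoded with the Windows locale default (GBK / cp936 on Chinese-language systems) when the bytes were actually UTF-8. Whether that is what happens inside the GraphRAG indexing step is not confirmed in this thread, so treat the following as a minimal sketch of the general failure mode only. The sample string and the helper names `decodes_as` and `read_text_utf8` are made up for illustration and are not part of kotaemon or GraphRAG.

```python
import locale

# UTF-8 bytes for a CJK character followed by a space.
# Illustrative sample only; these are not the bytes from the issue.
sample = "中 ".encode("utf-8")


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly with `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def read_text_utf8(path: str) -> str:
    """Read a text file as UTF-8 regardless of the system locale (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    # On a Chinese-locale Windows install this typically reports 'cp936' (GBK).
    print("default text encoding:", locale.getpreferredencoding(False))
    print("decodes as utf-8:", decodes_as(sample, "utf-8"))
    print("decodes as gbk:", decodes_as(sample, "gbk"))
```

If the cause is indeed the locale default, two common workarounds are passing `encoding="utf-8"` explicitly wherever the text is opened, or enabling Python's UTF-8 mode by setting the environment variable `PYTHONUTF8=1` before starting the app.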

Sep 09 '24 08:09 flyboyer

I'm encountering the same issue with GraphRAG indexing. The UI doesn't provide enough information to debug it, and I can't find any output for it in the console or any log file for the GraphRAG indexing process.

The same PDF indexes fine with the normal indexing process.

2013 Reinforcement Learning in Robotics - A Survey.pdf
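
On the logging question: the console output in the original post prints a "Logging enabled at ..." path that ends in `reports\indexing-engine.log`. Assuming that file exists on your machine too, a small sketch like the one below can print its last lines without hitting another decode error. `tail_log` and `LOG_PATH` are hypothetical names, and the path is a placeholder to be replaced with the one printed in your own console.

```python
from pathlib import Path

# Placeholder: replace with the path printed after "Logging enabled at".
LOG_PATH = r"path\to\reports\indexing-engine.log"


def tail_log(path, n=50):
    """Return the last `n` lines of a log file.

    The file is decoded as UTF-8 and undecodable bytes are replaced,
    so the read itself cannot raise a UnicodeDecodeError.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    lines = text.splitlines()
    return lines[-n:] if n > 0 else []


if __name__ == "__main__":
    log_file = Path(LOG_PATH)
    if log_file.exists():
        for line in tail_log(log_file, n=50):
            print(line)
    else:
        print("Log file not found:", log_file)
```

The `errors="replace"` argument is a deliberate choice here: for debugging, seeing a few replacement characters is more useful than failing to read the log at all.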

Sep 10 '24 01:09 RealmX1

Have you solved this yet, cin-jimmy?

Sep 28 '24 18:09 zjiang4