[Bug]: parse pdf file error

Open ben-qiao opened this issue 1 year ago • 5 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch name

main

Commit ID

fjoiesjf0923iur092jdpo2

Other environment information

Linux
RAGFlow installed with Docker
Copied the deepdoc model manually because of the error "No such file or directory: '/ragflow/rag/res/deepdoc/ocr.res'be0c1e50eef6047b412d1800aa89aba4d275f997/ocr.res"

Actual behavior

RAGFlow starts normally, but importing a PDF file reports the following error, repeated four times: Chunkking Java开发手册(黄山版).pdf/Java开发手册(黄山版).pdf: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

Expected behavior

The PDF file should be parsed normally.

Steps to reproduce

Create a knowledge base, then import a PDF file.

Additional information

No response

ben-qiao avatar Apr 19 '24 01:04 ben-qiao

I tested with the PDF file on my server, and it works fine.

Screenshot 2024-04-19 at 10 10 49

However, I checked the path in my Docker container. The ocr.res file is located in /ragflow/rag/res, that is, /ragflow/rag/res/ocr.res. This differs from the path you mentioned.

@ben-qiao

Screenshot 2024-04-19 at 10 13 27

Jiafan avatar Apr 19 '24 02:04 Jiafan

I checked my path; the ocr.res file is at the same path: image

ben-qiao avatar Apr 19 '24 03:04 ben-qiao

  1. Manually download the resource files from huggingface.co/InfiniFlow/deepdoc to your local folder ~/deepdoc.
  2. Add a volume to docker-compose.yml, for example:
  • ~/deepdoc:/ragflow/rag/res/deepdoc
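
Before restarting the containers, it may help to confirm that the host side of that mapping really exists and contains the downloaded files. The following is a minimal sketch, not part of RAGFlow: the helper names `split_volume` and `host_dir_ready` are hypothetical, and it assumes the short `HOST:CONTAINER[:MODE]` volume syntax shown above.

```python
import os


def split_volume(spec: str) -> tuple[str, str]:
    """Split a short-syntax volume entry 'HOST:CONTAINER[:MODE]' into (host, container)."""
    parts = spec.split(":")
    if len(parts) < 2:
        raise ValueError(f"not a HOST:CONTAINER mapping: {spec!r}")
    return parts[0], parts[1]


def host_dir_ready(spec: str) -> bool:
    """Return True if the host side of the mapping is an existing, non-empty directory."""
    host, _ = split_volume(spec)
    # '~' is expanded for the current user. Relative paths are resolved against
    # the current working directory, so run this from the folder that holds
    # the compose file.
    host = os.path.abspath(os.path.expanduser(host))
    return os.path.isdir(host) and len(os.listdir(host)) > 0


if __name__ == "__main__":
    spec = "~/deepdoc:/ragflow/rag/res/deepdoc"
    host, container = split_volume(spec)
    print(f"{host} -> {container}: ready={host_dir_ready(spec)}")
```

If the check reports `ready=False`, the container path would likely be mounted over an empty or missing directory, which is consistent with the "No such file or directory" errors in this thread.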

KevinHuSh avatar Apr 19 '24 03:04 KevinHuSh

If this happened on the demo website, please delete the file and upload it again. If it is a local deployment, check the status of MinIO.

KevinHuSh avatar Apr 19 '24 03:04 KevinHuSh

I downloaded deepdoc from Hugging Face and added a volume to docker-compose.yml. After RAGFlow started, I imported a PDF file into the knowledge base and got a new error:

'''
[WARNING] [2024-04-19 16:22:04,334] [synonym.__init__] [line:24]: Realtime synonym is disabled, since no redis connection.
[WARNING] Load term.freq FAIL!
[WARNING] Load term.freq FAIL!
Traceback (most recent call last):
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 42, in __init__
    self.updown_cnt_mdl.load_model(os.path.join(
  File "/root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/core.py", line 2588, in load_model
    _check_call(_LIB.XGBoosterLoadModel(self.handle, c_str(fname)))
  File "/root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/core.py", line 282, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [16:22:07] /workspace/dmlc-core/src/io/local_filesys.cc:209: Check failed: allow_null: LocalFileSystem::Open "/ragflow/rag/res/deepdoc/updown_concat_xgb.model": No such file or directory
Stack trace:
  [bt] (0) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f002eec424e]
  [bt] (1) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0xcc9637) [0x7f002f9d3637]
  [bt] (2) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0xcb54ce) [0x7f002f9bf4ce]
  [bt] (3) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0x18e) [0x7f002ee78ace]
  [bt] (4) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/../../libffi.so.8(+0xa052) [0x7f0157371052]
  [bt] (5) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/../../libffi.so.8(+0x8925) [0x7f015736f925]
  [bt] (6) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/../../libffi.so.8(ffi_call+0xde) [0x7f015737006e]
  [bt] (7) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x92ba) [0x7f01573812ba]
  [bt] (8) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x87e3) [0x7f01573807e3]
'''
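
The traceback above fails on a file that is missing under /ragflow/rag/res/deepdoc. A quick way to narrow this down might be to list which expected files are absent from that directory inside the container. The sketch below is an assumption-laden illustration: `missing_resources` is a hypothetical helper, and the `REQUIRED` list contains only the two file names that appear in this thread, not an authoritative manifest of the deepdoc resources.

```python
from pathlib import Path

# File names taken from the error messages in this thread; the real
# deepdoc download may contain more files than these two.
REQUIRED = ("ocr.res", "updown_concat_xgb.model")


def missing_resources(res_dir, required=REQUIRED):
    """Return the names in `required` that are not regular files in `res_dir`."""
    base = Path(res_dir)
    return [name for name in required if not (base / name).is_file()]


if __name__ == "__main__":
    # Run inside the container, where the volume is mounted.
    missing = missing_resources("/ragflow/rag/res/deepdoc")
    print("missing:", missing if missing else "none")
```

An empty result only shows that the named files are present; it does not verify that they are complete or match the expected revision.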

ben-qiao avatar Apr 19 '24 08:04 ben-qiao

It was a configuration problem. I pulled the new version 0.3.0 and ran Docker with docker-compose-CN.yml (previously I had run docker-compose.yml). After that, I imported the PDF and the file was parsed successfully. I added this new volume entry to docker-compose-CN.yml:

  • ./deepdoc:/ragflow/rag/res/deepdoc

ben-qiao avatar Apr 22 '24 02:04 ben-qiao