[Bug]: parse pdf file error
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch name
main
Commit ID
fjoiesjf0923iur092jdpo2
Other environment information
Linux
RAGFlow installed with Docker
Copied the deepdoc model manually because of the error "No such file or directory: '/ragflow/rag/res/deepdoc/ocr.res'be0c1e50eef6047b412d1800aa89aba4d275f997/ocr.res"
Actual behavior
RAGFlow starts normally, but importing a PDF file reports the following error (the same message appears four times in the output): Chunkking Java开发手册(黄山版).pdf/Java开发手册(黄山版).pdf: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.
Expected behavior
The PDF file is parsed normally.
Steps to reproduce
Create a knowledge base, then import a PDF file.
Additional information
No response
I tested with the PDF file on my server, and it works fine.
However, I checked my Docker path: the ocr.res file is located in /ragflow/rag/res, that is, at /ragflow/rag/res/ocr.res.
That differs from the path you mentioned.
@ben-qiao
I checked my path, and the ocr.res file is at the same path.
- Manually download the resource files from huggingface.co/InfiniFlow/deepdoc to your local folder ~/deepdoc.
- Add a volume entry to docker-compose.yml, for example:
  - ~/deepdoc:/ragflow/rag/res/deepdoc
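The two steps above could also be scripted. This is only a rough sketch, assuming the `huggingface_hub` package is installed and the machine can reach huggingface.co. The helper names `download_deepdoc` and `volume_entry` are made up for illustration and are not part of RAGFlow.

```python
from pathlib import Path

REPO_ID = "InfiniFlow/deepdoc"              # repository named in the steps above
CONTAINER_DIR = "/ragflow/rag/res/deepdoc"  # mount target from the steps above


def download_deepdoc(local_dir: Path) -> Path:
    """Download every file of the deepdoc repository into local_dir."""
    # Imported lazily so volume_entry() still works without the package.
    from huggingface_hub import snapshot_download

    local_dir.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=REPO_ID, local_dir=str(local_dir))
    return local_dir


def volume_entry(local_dir: Path) -> str:
    """Build the host:container string for the compose `volumes` list."""
    return f"{local_dir}:{CONTAINER_DIR}"


# Example use (needs network access, so it is left commented out):
# target = download_deepdoc(Path.home() / "deepdoc")
# print(f"- {volume_entry(target)}")
```

The printed line is what would go under `volumes:` for the RAGFlow service in docker-compose.yml. Restart the containers afterwards so the mount takes effect.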
If this happened on the demo website, please delete the file and upload it again. If it is a local deployment, check the status of MinIO.
I downloaded deepdoc from Hugging Face and added a volume to docker-compose.yml. After RAGFlow started, I imported a PDF file into the knowledge base and got a new error:
```
[WARNING] [2024-04-19 16:22:04,334] [synonym.__init__] [line:24]: Realtime synonym is disabled, since no redis connection.
[WARNING] Load term.freq FAIL!
[WARNING] Load term.freq FAIL!
Traceback (most recent call last):
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 42, in __init__
    self.updown_cnt_mdl.load_model(os.path.join(
  File "/root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/core.py", line 2588, in load_model
    _check_call(_LIB.XGBoosterLoadModel(self.handle, c_str(fname)))
  File "/root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/core.py", line 282, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [16:22:07] /workspace/dmlc-core/src/io/local_filesys.cc:209: Check failed: allow_null: LocalFileSystem::Open "/ragflow/rag/res/deepdoc/updown_concat_xgb.model": No such file or directory
Stack trace:
  [bt] (0) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f002eec424e]
  [bt] (1) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0xcc9637) [0x7f002f9d3637]
  [bt] (2) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0xcb54ce) [0x7f002f9bf4ce]
  [bt] (3) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0x18e) [0x7f002ee78ace]
  [bt] (4) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/../../libffi.so.8(+0xa052) [0x7f0157371052]
  [bt] (5) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/../../libffi.so.8(+0x8925) [0x7f015736f925]
  [bt] (6) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/../../libffi.so.8(ffi_call+0xde) [0x7f015737006e]
  [bt] (7) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x92ba) [0x7f01573812ba]
  [bt] (8) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x87e3) [0x7f01573807e3]
```
It was a configuration problem. I pulled the new version 0.3.0 and ran Docker with docker-compose-CN.yml (previously I had run docker-compose.yml). After that, the imported PDF file was parsed successfully. I added this new volume entry to docker-compose-CN.yml:
- ./deepdoc:/ragflow/rag/res/deepdoc
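
For anyone hitting the same errors, a quick check of the mounted directory before starting the containers may save a round trip. This is a minimal sketch: the two file names are simply the ones mentioned in the errors in this thread (the full deepdoc directory may need more), and `missing_resources` is a hypothetical helper, not a RAGFlow function.

```python
from pathlib import Path

# File names taken from the two errors reported in this thread; the real
# deepdoc directory may require additional files beyond these.
REQUIRED_FILES = ("ocr.res", "updown_concat_xgb.model")


def missing_resources(res_dir, required=REQUIRED_FILES):
    """Return the names in `required` that are not regular files in res_dir."""
    base = Path(res_dir)
    return [name for name in required if not (base / name).is_file()]


# Example use against the host folder mounted in the compose file:
# print(missing_resources("./deepdoc"))
```

An empty list means both files are present on the host side. If a file is listed, the volume would mount a directory that still lacks it, and the "No such file or directory" error would be expected to reappear.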



