[BUG] PDFs cannot be parsed in the latest Python version, even though the PDF model files have been downloaded

Open changqingla opened this issue 6 months ago • 2 comments

Is there an existing issue / discussion for this?

  • [X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • [X] I have searched FAQ

Current Behavior

I followed the instruction "download the relevant parsing models from ModelScope and place them under qanything_kernel/utils/loader/pdf_to_markdown/checkpoints/ in the root directory". Inside the qanything_kernel/utils/loader/pdf_to_markdown/checkpoints/ directory I ran git clone https://www.modelscope.cn/netease-youdao/QAnything-pdf-parser.git. However, PDFs still cannot be parsed:

2024-08-22 12:00:02,808 split error: Traceback (most recent call last):
  File "/data/ht/rag/qanything_kernel/core/local_doc_qa.py", line 98, in insert_files_to_faiss
    local_file.split_file_to_docs(self.get_ocr_result)
  File "/data/ht/rag/qanything_kernel/utils/general_utils.py", line 73, in inner
    res = func(*arg, **kwargs)
  File "/data/ht/rag/qanything_kernel/core/local_file.py", line 169, in split_file_to_docs
    docs = loader.load_and_split(texts_splitter)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 63, in load_and_split
    docs = self.load()
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 29, in load
    return list(self.lazy_load())
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/langchain_community/document_loaders/unstructured.py", line 88, in lazy_load
    elements = self._get_elements()
  File "/data/ht/rag/qanything_kernel/utils/loader/pdf_loader.py", line 57, in _get_elements
    return partition_text(filename=txt_file_path, **self.unstructured_kwargs)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/partition/text.py", line 93, in partition_text
    return _partition_text(
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/documents/elements.py", line 526, in wrapper
    elements = func(*args, **kwargs)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 619, in wrapper
    elements = func(*args, **kwargs)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 574, in wrapper
    elements = func(*args, **kwargs)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/chunking/__init__.py", line 69, in wrapper
    elements = func(*args, **kwargs)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/partition/text.py", line 169, in _partition_text
    file_content = _split_by_paragraph(
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/partition/text.py", line 301, in _split_by_paragraph
    _split_content_to_fit_max(
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/partition/text.py", line 333, in _split_content_to_fit_max
    sentences = sent_tokenize(content)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/nlp/tokenize.py", line 30, in sent_tokenize
    return _sent_tokenize(text)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 119, in sent_tokenize
    tokenizer = _get_punkt_tokenizer(language)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
    return PunktTokenizer(language)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
    self.load_lang(lang)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/nltk/data.py", line 579, in find
    raise LookupError(resource_not_found)
LookupError:


Resource punkt_tab not found. Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt_tab')

For more information see: https://www.nltk.org/data.html

Attempted to load tokenizers/punkt_tab/english/

Searched in:
  - '/home/ubuntu/nltk_data'
  - '/home/ubuntu/.conda/envs/rag-XY/nltk_data'
  - '/home/ubuntu/.conda/envs/rag-XY/share/nltk_data'
  - '/home/ubuntu/.conda/envs/rag-XY/lib/nltk_data'
  - '/usr/share/nltk_data'
  - '/usr/local/share/nltk_data'
  - '/usr/lib/nltk_data'
  - '/usr/local/lib/nltk_data'
  - '/data/ht/rag/qanything_kernel/nltk_data'


2024-08-22 12:00:02,809 insert_to_faiss: success num: 0, failed num: 1
2024-08-22 12:00:03,438 list_docs zzp
2024-08-22 12:00:03,439 kb_id: KB68e60de6f07d47daab54fd0bc673aa83
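
The LookupError in the log points at a missing NLTK resource (punkt_tab) rather than at the PDF parser checkpoints themselves. The following is a minimal sketch of how one might check for that resource and fetch it if absent. It assumes the machine has network access and that NLTK is installed in the same environment; `find_resource` and `ensure_punkt_tab` are hypothetical helper names, not part of QAnything or NLTK, and the directory list is copied from the "Searched in" output of the log.

```python
from pathlib import Path

# Directories copied from the "Searched in" list in the error message.
SEARCH_DIRS = [
    "/home/ubuntu/nltk_data",
    "/home/ubuntu/.conda/envs/rag-XY/nltk_data",
    "/home/ubuntu/.conda/envs/rag-XY/share/nltk_data",
    "/home/ubuntu/.conda/envs/rag-XY/lib/nltk_data",
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
    "/data/ht/rag/qanything_kernel/nltk_data",
]


def find_resource(relative_path, search_dirs):
    """Hypothetical helper: return the first existing path for the resource, else None."""
    for base in search_dirs:
        candidate = Path(base) / relative_path
        if candidate.exists():
            return candidate
    return None


def ensure_punkt_tab(download_dir=None):
    """Hypothetical helper: download punkt_tab through NLTK if NLTK cannot find it."""
    import nltk  # imported lazily so find_resource works without NLTK installed

    try:
        nltk.data.find("tokenizers/punkt_tab/english/")
        return True
    except LookupError:
        # download_dir=None lets NLTK pick its default data directory.
        return nltk.download("punkt_tab", download_dir=download_dir)


if __name__ == "__main__":
    print("found on disk:", find_resource("tokenizers/punkt_tab/english", SEARCH_DIRS))
    print("available after check:", ensure_punkt_tab())
```

If the server has no network access, the same resource could presumably be downloaded on another machine and copied into one of the searched directories under tokenizers/punkt_tab/.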

Expected Behavior

No response

Environment

- OS:
- NVIDIA Driver:
- CUDA:
- docker:
- docker-compose:
- NVIDIA GPU:
- NVIDIA GPU Memory:

QAnything logs

No response

Steps To Reproduce

No response

Anything else?

No response

changqingla, Aug 22 '24 04:08