Langchain-Chatchat [BUG] PDFs fail to load after installing detectron2

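Problem Description: Describe the problem in a clear and concise manner.

On CentOS, detectron2 was not installed at first. An error was reported, but PDF files still loaded normally. After installing detectron2, PDFs no longer load, and the following error is shown:

Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.
/home/admin/lcchglm/langchain-ChatGLM/content/oil/1.pdf failed to load.
None of the files were loaded successfully; please check the dependency packages or replace the file with another one and upload again.
The file was not loaded successfully; please upload the file again.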

Steps to Reproduce

Run '...'
Click '...'
Scroll to '...'
Problem occurs

Expected Result: Describe the expected result.

Actual Result: Describe the actual result.

Environment Information

langchain-ChatGLM version/commit: master, latest
Deployed with Docker (yes/no): no
Model used: ChatGLM-6B
Embedding model used: GanymedeNil/text2vec-large-chinese
Operating system and version: CentOS 7
Python version: 3.8
Other relevant environment information:

Additional Information: Add any other information related to the issue.

May 08 '23 08:05 wjlszhang

I suggest debugging with cli_demo.py first, then switching to the webui for actual use once debugging is done.


May 08 '23 11:05 imClumsyPanda

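Our current guess is that detectron is optimized only for reading English PDFs and may be unable to recognize Chinese text in PDF files. The development team is currently writing a ChinesePDFLoader class, which will later replace langchain's built-in PDF reading method.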
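One way to test a guess like this is to measure how much Chinese text a loader actually returns for a known Chinese PDF. The following is only a minimal sketch: `cjk_ratio` and `loader_output` are hypothetical names introduced for illustration, and the check covers only the basic CJK Unified Ideographs block.

```python
import re

# Basic CJK Unified Ideographs block; other CJK blocks are ignored in this sketch.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def cjk_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are CJK ideographs."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if CJK_PATTERN.match(ch)) / len(chars)


# `loader_output` is a hypothetical stand-in for the text returned by a PDF loader.
loader_output = "石油 oil 2023"
print(f"CJK ratio: {cjk_ratio(loader_output):.2f}")
```

A ratio near zero on a document known to be mostly Chinese would suggest that the loader is dropping or garbling the Chinese text.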

May 10 '23 05:05 imClumsyPanda

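I tried it, and detectron2 can read Chinese. On the contrary, without detectron2 the unstructured PDF parser is used. It converts the PDF to images and runs OCR on them, and because an image may contain several languages mixed together, parsing errors can result. I wrote a simple PDF parser; adapt it to the layout of your PDFs and it should roughly work.

Risks: the code includes a crude approach to table handling, but it breaks down when a single line of the PDF layout contains both a table and a paragraph. Also, depending on how the PDF was produced, pdfplumber may be unable to read the table information, which can cause errors.

Suggestion: follow detectron2's approach and use CV to detect tables before processing them.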
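The first risk can be made concrete with a small sketch. It is a simplified restatement of the prefix matching that `del_table_contents` in this post performs, not the parser itself: `replace_table_lines` is a hypothetical helper, and the page text and cell values are made up.

```python
def replace_table_lines(lines, first_cells, placeholder):
    """Replace every line that starts with a known table cell by one placeholder."""
    out = []
    for line in lines:
        if line.startswith(tuple(first_cells)):
            # The whole line is treated as table content, whatever follows the cell.
            if placeholder not in out:
                out.append(placeholder)
        else:
            out.append(line)
    return out


# Made-up page text: the second line mixes a table row with body text.
lines = [
    "Quarterly summary",
    "Region Sales This sentence shares a line with the table header.",
    "North 120",
    "Closing remarks",
]
first_cells = ["Region", "North"]  # first cell of each table row

result = replace_table_lines(lines, first_cells, "table-1-0")
print(result)
# ['Quarterly summary', 'table-1-0', 'Closing remarks']
```

The sentence that shared a line with the table header is gone from the output, which is the failure mode described under "Risks".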

    import os
    import re

    import pdfplumber

    # Note: lines are split on os.linesep, which is '\n' on Linux (the reporter's CentOS).


    def get_table_elemens_del(tables, page_number):
        """Map table cell text to a table key, and table keys to the raw tables."""
        first_cell = {}
        table_dict = {}
        for table_index, table in enumerate(tables):
            table_key = f'table-{page_number}-{table_index}'
            table_dict[table_key] = table
            for row in table:
                # Record the first usable single-line cell of the row.
                for r in row:
                    if r is not None:
                        if r != '' and os.linesep not in r:
                            first_cell[r] = table_key
                            break
                # For multi-line cells, record every non-empty line.
                for cell in row:
                    if cell is not None and os.linesep in cell:
                        for c in cell.split(os.linesep):
                            if c != '':
                                first_cell[c] = table_key
        return first_cell, table_dict


    def del_table_contents(text, tables, page_number):
        """Replace the text lines of each table with a single table-key placeholder."""
        first_cell, table_dict = get_table_elemens_del(tables, page_number)
        new_text = []
        first_cell_lst = tuple(first_cell.keys())
        # str.startswith(()) is False, so with no usable cells every line is kept.
        for t in text.split(os.linesep):
            if t.startswith(first_cell_lst):
                table_index = None
                _key = t.split(' ')[0]
                for x in first_cell:
                    if x.startswith(_key):
                        table_index = first_cell[x] + os.linesep
                        break
                if table_index is None:
                    new_text.append(t + os.linesep)
                elif table_index not in new_text:
                    new_text.append(table_index)
            else:
                new_text.append(t + os.linesep)
        return new_text, table_dict


    def to_original_format_table(table):
        """Render a table as plain text, one row per line, cells separated by spaces."""
        new_table = ""
        for row in table:
            row_text = " ".join(str(item).replace(os.linesep, '') for item in row).replace('None', '')
            new_table += row_text + "\n"
        return new_table


    def new_table_dict(table_dict):
        """Convert every raw table in table_dict to its plain-text form."""
        return {k: to_original_format_table(v) for k, v in table_dict.items()}


    def to_original_format(content, table_dict):
        """Put tables back in place of their placeholders and merge hard-wrapped lines."""
        new_content = []
        for text in content:
            if text.strip(os.linesep) in table_dict:
                new_content.append(table_dict[text.strip(os.linesep)])
            else:
                _content = []
                for sentence in text.split(os.linesep):
                    if sentence.strip(os.linesep).endswith(('。', '：', ':')):
                        # A sentence end keeps its line break.
                        _content.append(sentence + os.linesep)
                    elif len(re.sub(r"[a-zA-Z0-9\s\W]+", "", sentence)) < 15 and len(sentence) > 5:
                        # Short lines (e.g. headings) keep their line break.
                        _content.append(sentence + os.linesep)
                    else:
                        # Otherwise the line is joined with the one that follows.
                        _content.append(sentence.strip(os.linesep))
                new_content.append(''.join(_content))
        return ''.join(new_content).split(os.linesep)


    def process_pdf(file_path):
        """Extract text and tables from every page and return a list of text lines."""
        with pdfplumber.open(file_path) as pdf:
            extracted_content = []
            table_dict = {}
            for page in pdf.pages:
                # Guard against pages with no extractable text.
                text = page.extract_text() or ''
                tables = page.extract_tables()

                if len(tables) > 0:
                    text, _table_dict = del_table_contents(text, tables, page.page_number)
                    extracted_content.extend(text)
                    table_dict.update(_table_dict)
                else:
                    extracted_content.extend(text.split(os.linesep))

        table_dict = new_table_dict(table_dict)

        return to_original_format(extracted_content, table_dict)


    if __name__ == "__main__":
        path = 'xxx.pdf'
        elements = process_pdf(path)
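The line-merging rule inside `to_original_format` is easier to follow on a small input. The sketch below restates that rule in simplified form on a list of lines; `merge_wrapped_lines` is a hypothetical name, and the sample text is made up.

```python
import re

SENTENCE_END = ("。", "：", ":")


def merge_wrapped_lines(lines):
    """Join hard-wrapped lines, keeping a break after sentence ends and short lines."""
    paragraphs, buffer = [], ""
    for line in lines:
        buffer += line
        # What survives after removing ASCII letters, digits, whitespace, and
        # punctuation; for Chinese text this is essentially the CJK characters.
        core = re.sub(r"[a-zA-Z0-9\s\W]+", "", line)
        ends_sentence = line.endswith(SENTENCE_END)
        looks_short = len(core) < 15 and len(line) > 5
        if ends_sentence or looks_short:
            paragraphs.append(buffer)
            buffer = ""
    if buffer:
        paragraphs.append(buffer)
    return paragraphs


# Made-up sample: a heading followed by one sentence wrapped across two lines.
sample = [
    "第一章 项目背景与目标",
    "本项目旨在评估油田开发过程中各类设备的运行",
    "状况。",
]
merged = merge_wrapped_lines(sample)
for paragraph in merged:
    print(paragraph)
```

Because the filter removes ASCII letters and digits before counting, the rule is tuned for Chinese text: an English-only line longer than five characters always counts as a short line and keeps its break.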

May 11 '23 06:05 TerrenceVarada
