MineContext
[BUG]: With the current logic, a Markdown file that contains images may trigger an error
🐛 Bug description [Please make everyone to understand it]
In opencontext/context_processing/processor/document_processor.py, _process_document_page_by_page supports .md documents, but _extract_vlm_pages has no logic for handling the images inside an .md document. This needs to be fixed.
The relevant code is:
def _process_document_page_by_page(self, raw_context: RawContextProperties, file_path: str, file_ext: str) -> List[ProcessedContext]:
    .......
    # 1. Analyze pages
    if file_ext == ".pdf":
        page_infos = self._document_converter.analyze_pdf_pages(file_path, self._text_threshold)
    elif file_ext in [".docx", ".doc"]:
        page_infos = self._document_converter.analyze_docx_pages(file_path)
    elif file_ext == ".md":
        page_infos = self._document_converter.analyze_markdown_pages(file_path)
    elif file_ext == ".txt":
        return self._process_txt_file(raw_context, file_path)
    else:
        raise ValueError(f"Unsupported file type for page-by-page: {file_ext}")
    # 2. Classify pages
    text_pages = [p for p in page_infos if not p.has_visual_elements]
    vlm_pages = [p for p in page_infos if p.has_visual_elements]
    logger.info(f"Document analysis: {len(text_pages)} text pages, {len(vlm_pages)} visual pages")
    # 3. Process visual pages (extract text)
    vlm_texts = {}  # dict: page_number -> extracted_text
    if vlm_pages:
        vlm_text_list = self._extract_vlm_pages(file_path, vlm_pages)
        # Associate extracted text with page numbers
        for page_info, text in zip(vlm_pages, vlm_text_list):
            vlm_texts[page_info.page_number] = text
However, in the _extract_vlm_pages method:
def _extract_vlm_pages(self, file_path: str, page_infos: List[PageInfo]) -> List[str]:
    """Extract text from visual pages using VLM, returns extracted text list (in page order)"""
    file_ext = Path(file_path).suffix.lower()
    if file_ext in [".docx", ".doc"]:
        return self._process_vlm_pages_with_doc_images(page_infos)
    # For PDF and other formats, convert pages to images
    # Convert document to images
    all_images = self._document_converter.convert_to_images(file_path)
In other words, for an .md file execution falls through to self._document_converter.convert_to_images(file_path).
The logic inside self._document_converter.convert_to_images(file_path) is as follows:
def convert_to_images(self, file_path: str) -> List[Image.Image]:
    """Convert document to image list"""
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")
    file_ext = Path(file_path).suffix.lower()
    logger.info(f"Converting document to images: {file_path} (type: {file_ext})")
    if file_ext == ".pdf":
        return self._convert_pdf_to_images(file_path)
    elif file_ext in [".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp"]:
        return self._load_image(file_path)
    elif file_ext in [".pptx", ".ppt"]:
        return self._convert_pptx_to_images(file_path)
    else:
        raise ValueError(f"Unsupported file format: {file_ext}")
So for an .md document, self._document_converter.convert_to_images(file_path) raises the error: Unsupported file format: .md
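The failure path can be reproduced in isolation. The sketch below is not the MineContext code: `route_vlm_extraction` and `convert_to_images_kind` are hypothetical stand-ins that copy only the extension checks quoted in this issue and return labels instead of converting anything.

```python
from pathlib import Path


def convert_to_images_kind(file_path: str) -> str:
    """Hypothetical stand-in: mirrors only the extension dispatch of convert_to_images."""
    file_ext = Path(file_path).suffix.lower()
    if file_ext == ".pdf":
        return "pdf_images"
    elif file_ext in [".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp"]:
        return "single_image"
    elif file_ext in [".pptx", ".ppt"]:
        return "pptx_images"
    else:
        raise ValueError(f"Unsupported file format: {file_ext}")


def route_vlm_extraction(file_path: str) -> str:
    """Hypothetical stand-in: mirrors the branch order of the unpatched _extract_vlm_pages."""
    file_ext = Path(file_path).suffix.lower()
    if file_ext in [".docx", ".doc"]:
        return "doc_images"
    # Everything else, including ".md", falls through to the converter.
    return convert_to_images_kind(file_path)


if __name__ == "__main__":
    for name in ["report.docx", "slides.pdf", "notes.md"]:
        try:
            print(name, "->", route_vlm_extraction(name))
        except ValueError as exc:
            print(name, "-> ValueError:", exc)
```

With these stand-ins, only the .md file reaches the final `else` branch; the .docx and .pdf files are routed without error.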
The proposed fix for this bug is as follows.
Change the _extract_vlm_pages method to:
def _extract_vlm_pages(self, file_path: str, page_infos: List[PageInfo]) -> List[str]:
    """Extract text from visual pages using VLM, returns extracted text list (in page order)"""
    file_ext = Path(file_path).suffix.lower()
    if file_ext in [".docx", ".doc", ".md"]:
        return self._process_vlm_pages_with_doc_images(page_infos)
This hands .md documents to the self._process_vlm_pages_with_doc_images(page_infos) method instead of the image converter.
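One way to check that the change closes the gap is to compare the two dispatch tables directly. This is a rough sketch under stated assumptions: the extension sets are copied from the snippets in this issue, and `unrouted_extensions` is a hypothetical helper, not part of MineContext. It also assumes that the PageInfo objects produced by analyze_markdown_pages carry what _process_vlm_pages_with_doc_images needs, which the snippets here do not show.

```python
# Extensions that _process_document_page_by_page analyzes into PageInfo objects.
# ".txt" is left out because it returns early and never reaches VLM extraction.
PAGE_ANALYZED_EXTS = {".pdf", ".docx", ".doc", ".md"}

# Extensions accepted by convert_to_images, per the snippet quoted in this issue.
CONVERTIBLE_EXTS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp", ".pptx", ".ppt",
}


def unrouted_extensions(doc_image_exts):
    """Hypothetical helper: analyzed extensions that would hit 'Unsupported file format'."""
    return {
        ext
        for ext in PAGE_ANALYZED_EXTS
        if ext not in doc_image_exts and ext not in CONVERTIBLE_EXTS
    }


before_fix = unrouted_extensions({".docx", ".doc"})
after_fix = unrouted_extensions({".docx", ".doc", ".md"})

if __name__ == "__main__":
    print("before fix:", sorted(before_fix))
    print("after fix:", sorted(after_fix))
```

Before the change the only unrouted extension is `.md`; after it the set is empty, so every analyzed format has a VLM extraction route.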
🧑💻 Step to reproduce
In theory, any local Markdown document that contains images will trigger the error.
👾 Expected result
No error is raised and the VLM logic runs normally.
🚑 Any additional information
No response
🛠️ MineContext Version
Latest version
💻 Platform Details
As above