Error parsing DOCX file
Question
...
    drawing_blip = element.xpath(".//a:blip")
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/etree.pyx", line 1600, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Undefined namespace prefix
Same issue here.
It crashes in walk_linear. This looks to be related to this change in 2.6.0: https://github.com/DS4SD/docling/commit/8533039b0c0eff131b524da765f15c3279b554c5
If I stop at this point in the debugger, there is no mapping for the 'a' prefix in element.nsmap.
Code to reproduce was:
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    # PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline

source = Path("path/to/my/example/icantsharesorry.docx")

doc_converter = (
    DocumentConverter(  # all of the below is optional, has internal defaults.
        allowed_formats=[
            # InputFormat.PDF,
            # InputFormat.IMAGE,
            InputFormat.DOCX,
            # InputFormat.HTML,
            # InputFormat.PPTX,
            # InputFormat.ASCIIDOC,
            # InputFormat.MD,
        ],  # whitelist formats, non-matching files are ignored.
        format_options={
            # InputFormat.PDF: PdfFormatOption(
            #     pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend
            # ),
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline  # , backend=MsWordDocumentBackend
            ),
        },
    )
)

result = doc_converter.convert(source)
Downgrading to 2.5.2 allows the document to be converted.
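For anyone trying to understand the traceback, the "Undefined namespace prefix" error can be reproduced in isolation with lxml. The sketch below is only an illustration, not the docling code and not necessarily how the fix works. The XML fragment and the find_blips helper are made up for the example. The namespace URI is the standard DrawingML one that the 'a' prefix conventionally refers to.

```python
from lxml import etree

# Standard DrawingML namespace; "a" is only the conventional prefix for it.
DRAWINGML_NS = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Made-up fragment: blip is in the DrawingML namespace, but it is declared as
# a default namespace, so no "a" prefix is defined anywhere in the tree.
XML = (
    b"<drawing>"
    b'<blip xmlns="http://schemas.openxmlformats.org/drawingml/2006/main" '
    b'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" '
    b'r:embed="rId5"/>'
    b"</drawing>"
)


def find_blips(element):
    """Hypothetical helper: resolve the "a" prefix through an explicit mapping."""
    return element.xpath(".//a:blip", namespaces=DRAWINGML_NS)


root = etree.fromstring(XML)

# Without a namespaces mapping, lxml cannot resolve the "a" prefix.
try:
    root.xpath(".//a:blip")
    raised = False
    error_message = ""
except etree.XPathEvalError as exc:
    raised = True
    error_message = str(exc)

# With an explicit mapping, the prefix used in the XPath no longer depends on
# which prefixes the document itself happens to declare.
blips = find_blips(root)
embed_id = blips[0].get("{%s}embed" % REL_NS)
print(raised, error_message, len(blips), embed_id)
```

The point of the sketch is that an XPath prefix is matched by namespace URI through the mapping passed to xpath(), so the query works whatever prefix (or none) the document uses.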
I have the same issue and had to downgrade to 2.5.2.
@huihuiliu74 @melvinebenezer, I’m looking into a solution, but I would appreciate it if you could provide an example of a docx file that triggers the issue, if possible.
@maxmnemonic here it is ... example.docx
@melvinebenezer @huihuiliu74, for now you can use the raises_on_error=False parameter of convert_all, like this:
conv_results = doc_converter.convert_all(input_paths, raises_on_error=False)
This will skip over problematic files if you wish to continue.
I’m working on fixing the issue properly, though.
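A rough sketch of how that workaround could be wrapped so failed files are reported rather than silently dropped. The convert_tolerantly helper is hypothetical. It assumes each result exposes a status enum with a SUCCESS member and an input.file path, which is how I understand docling's ConversionResult, so please check against your installed version.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def convert_tolerantly(converter, paths: Iterable[Path]) -> Tuple[list, List[Path]]:
    """Convert every path and collect failures instead of stopping at the first.

    `converter` is expected to behave like docling's DocumentConverter, i.e. to
    offer convert_all(paths, raises_on_error=False).
    """
    converted, failed = [], []
    for result in converter.convert_all(paths, raises_on_error=False):
        # Assumption: status is an enum whose success member is named SUCCESS.
        if result.status.name == "SUCCESS":
            converted.append(result)
        else:
            # Assumption: result.input.file holds the path of the source file.
            failed.append(result.input.file)
    return converted, failed


# Usage with docling (not executed here):
#   from docling.document_converter import DocumentConverter
#   ok, bad = convert_tolerantly(DocumentConverter(), input_paths)
#   print(f"{len(ok)} converted, {len(bad)} failed: {bad}")
```

Keeping the list of failed paths makes it easy to retry just those files after upgrading to a release that contains the fix.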
@maxmnemonic I'm on macOS (Sonoma 14.6.1 (23G93)).
Here is a work-in-progress PR to fix this: https://github.com/DS4SD/docling/pull/432
@maxmnemonic Thanks for the fix. When are releases normally planned?
@melvinebenezer @mattmalcher @huihuiliu74, we just released docling 2.7.1, and the fix is part of it. Hope it helps!