parse docx file error :

Open huihuiliu74 opened this issue 1 year ago • 8 comments

Question

...
    drawing_blip = element.xpath(".//a:blip")
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/etree.pyx", line 1600, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Undefined namespace prefix

huihuiliu74 avatar Nov 23 '24 09:11 huihuiliu74

I have the same problem.

FengCeUp avatar Nov 25 '24 05:11 FengCeUp

It crashes in walk_linear. This looks to be related to this change in 2.6.0: https://github.com/DS4SD/docling/commit/8533039b0c0eff131b524da765f15c3279b554c5

If I stop at this point in the debugger, there is no mapping for the 'a' prefix in element.nsmap.

Code to reproduce was:

from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    # PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline

source = Path("path/to/my/example/icantsharesorry.docx")
doc_converter = (
    DocumentConverter(  # all of the below is optional, has internal defaults.
        allowed_formats=[
            # InputFormat.PDF,
            # InputFormat.IMAGE,
            InputFormat.DOCX,
            # InputFormat.HTML,
            # InputFormat.PPTX,
            # InputFormat.ASCIIDOC,
            # InputFormat.MD,
        ],  # whitelist formats, non-matching files are ignored.
        format_options={
            # InputFormat.PDF: PdfFormatOption(
            #     pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend
            # ),
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline  # , backend=MsWordDocumentBackend
            ),
        },
    )
)

result = doc_converter.convert(source)

Downgrading to 2.5.2 allows the document to be converted.

mattmalcher avatar Nov 25 '24 12:11 mattmalcher
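To illustrate the missing 'a' prefix mapping described above, here is a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml, so the exception type differs from the one in the traceback. The XML fragment and the find_blips helper are hypothetical and are not taken from docling. The point is only that a prefixed path such as ".//a:blip" needs an explicit prefix-to-URI mapping at lookup time.

```python
import xml.etree.ElementTree as ET

# DrawingML namespace URI that OOXML documents conventionally bind to "a".
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Hypothetical fragment for illustration: a picture element containing a blip.
XML = f'<drawing><pic xmlns:a="{A_NS}"><a:blip/></pic></drawing>'
root = ET.fromstring(XML)


def find_blips(element, namespaces=None):
    """Return all a:blip descendants; the prefix is resolved via `namespaces`."""
    return element.findall(".//a:blip", namespaces)


try:
    # No mapping supplied: the "a" prefix cannot be resolved.
    find_blips(root)
except SyntaxError as exc:
    print("lookup failed:", exc)

# Supplying the mapping explicitly makes the same path work.
blips = find_blips(root, {"a": A_NS})
print(len(blips), "blip element(s) found")
```

lxml's xpath() method accepts a namespaces= keyword argument in the same spirit, so a prefix used in the expression can be bound explicitly instead of relying on what happens to be declared on the element.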

I have the same issue and had to downgrade to 2.5.2.

melvinebenezer avatar Nov 25 '24 12:11 melvinebenezer

@huihuiliu74 @melvinebenezer, I’m looking into a solution, but I would appreciate it if you could provide an example docx file that triggers the issue, if possible.

maxmnemonic avatar Nov 25 '24 12:11 maxmnemonic

@maxmnemonic here it is ... example.docx

melvinebenezer avatar Nov 25 '24 13:11 melvinebenezer

@melvinebenezer @huihuiliu74, for now you can pass the raises_on_error=False parameter to convert_all, like this:

conv_results = doc_converter.convert_all(input_paths, raises_on_error=False)

This will skip over problematic files if you wish to continue with the rest. In the meantime, I’m working on fixing the issue properly.

maxmnemonic avatar Nov 25 '24 14:11 maxmnemonic
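The workaround above can be wrapped in a small helper that separates converted documents from skipped ones. This is only a sketch: split_results is a hypothetical name, and it assumes each result yielded by convert_all exposes a status enum with a SUCCESS member and the source path as input.file. Check both against the docling version you have installed.

```python
def split_results(converter, input_paths):
    """Convert every path and return (converted_results, failed_paths).

    `converter` is assumed to be a docling DocumentConverter, although any
    object with a compatible convert_all method would work here.
    """
    converted, failed = [], []
    # raises_on_error=False keeps the batch going after a problematic file.
    for result in converter.convert_all(input_paths, raises_on_error=False):
        # Assumption: `status` is an enum and `input.file` is the source path.
        if result.status.name == "SUCCESS":
            converted.append(result)
        else:
            failed.append(result.input.file)
    return converted, failed
```

With the doc_converter from the reproduction code, this would be called as converted, failed = split_results(doc_converter, input_paths), after which failed lists the files to retry once a fixed release is installed.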

@maxmnemonic I'm on macOS (Sonoma 14.6.1 (23G93)).

melvinebenezer avatar Nov 25 '24 14:11 melvinebenezer

Here is a work-in-progress PR to fix this: https://github.com/DS4SD/docling/pull/432

maxmnemonic avatar Nov 25 '24 15:11 maxmnemonic

@maxmnemonic Thanks for the fix. When are releases normally planned?

melvinebenezer avatar Nov 26 '24 14:11 melvinebenezer

@melvinebenezer @mattmalcher @huihuiliu74, we just released docling 2.7.1, and the fix is part of it. Hope it helps!

maxmnemonic avatar Nov 26 '24 15:11 maxmnemonic
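To confirm that an environment has picked up the release mentioned above, a quick check along these lines may help. It is a sketch using only the standard library. The helper names are hypothetical, and the parsing is deliberately simple: it compares only the first three numeric components of the version string.

```python
import re
from importlib import metadata

# First release reported in this thread to contain the fix.
FIXED_IN = (2, 7, 1)


def parse_version(text):
    """Return the first three numeric components of a version string."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def has_docx_fix(version_text):
    """True if the given docling version is at least 2.7.1."""
    return parse_version(version_text) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = metadata.version("docling")
    except metadata.PackageNotFoundError:
        print("docling is not installed in this environment")
    else:
        state = "includes" if has_docx_fix(installed) else "predates"
        print(f"docling {installed} {state} the 2.7.1 fix")
```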