UnboundLocalError and Loss of Data from Multiple Documents

Open imene-swaan opened this issue 1 year ago • 1 comment

Description:

The current implementation of the Export multimodal Docling Example (examples/export_multimodal.py) has two issues:

  1. UnboundLocalError: When no documents are successfully converted, the rows list is never assigned (it is only initialized inside the loop), so normalizing the data into a DataFrame raises an UnboundLocalError.
  2. Loss of data from multiple documents: The rows list is reinitialized inside the loop that processes each document. This causes the data from previous documents to be discarded, keeping only the data from the last converted document.
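Both symptoms can be reproduced without Docling. The following is a minimal sketch, not the example's actual code: `collect_rows_buggy` and the `(ok, pages)` tuples are hypothetical stand-ins for the conversion loop and the converted documents.

```python
def collect_rows_buggy(docs):
    """Mimic the reported pattern: `rows` is created inside the document loop.

    `docs` is a list of (ok, pages) tuples, a hypothetical stand-in for
    converted documents and their conversion status.
    """
    for ok, pages in docs:
        if not ok:
            continue  # failed conversion: `rows` is never assigned here
        rows = []  # re-created for every successful document
        for page in pages:
            rows.append(page)
    return rows  # unbound if no document was converted successfully


# Issue 2: two successful documents go in, only the last one's pages come out.
last_only = collect_rows_buggy([(True, ["a1", "a2"]), (True, ["b1"])])

# Issue 1: no successful document, so `rows` was never assigned.
try:
    collect_rows_buggy([(False, ["x1"])])
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__
```

Note that the error is an UnboundLocalError only because the loop lives inside a function; the same pattern at module level would raise a NameError instead.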

Expected Behavior:

  • The rows list should accumulate the data from all successfully converted documents.
  • If no documents are successfully converted, the script should handle this gracefully and not raise an UnboundLocalError.

Suggested Fix:

  • Move the initialization of the rows list outside the loop so that it collects data from all documents.
  • Add a check before normalizing the rows into a DataFrame to ensure that the list is not empty.

Original code:

rows = []  # This is inside the document loop

for (
    content_text,
    content_md,
    content_dt,
    page_cells,
    page_segments,
    page,
) in generate_multimodal_pages(doc):
    # Rows are appended here, but this only keeps data for the current document
    ...

Suggested fix (code):

# Initialize rows before the loop
rows = []

for doc in converted_docs:
    if doc.status != ConversionStatus.SUCCESS:
        continue  # Skip failed conversions; log them here if needed
    for (
        content_text,
        content_md,
        content_dt,
        page_cells,
        page_segments,
        page,
    ) in generate_multimodal_pages(doc):
        rows.append(...)  # rows now accumulates data from all documents
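
The snippet above covers only the first half of the suggested fix. A self-contained sketch of both halves (accumulating across documents, then guarding against an empty list) might look like the following. `Status`, `collect_rows`, and the `(status, pages)` pairs are hypothetical stand-ins for `ConversionStatus`, the conversion loop, and the converted documents, so this runs without Docling or pandas.

```python
from enum import Enum, auto


class Status(Enum):
    """Hypothetical stand-in for ConversionStatus."""
    SUCCESS = auto()
    FAILURE = auto()


def collect_rows(docs):
    """Accumulate rows from every successfully converted document.

    `docs` is an iterable of (status, pages) pairs, a stand-in for the
    converted documents and their multimodal pages.
    """
    rows = []  # initialized once, before the document loop
    failed = 0
    for status, pages in docs:
        if status is not Status.SUCCESS:
            failed += 1  # count failures so they can be reported
            continue
        for page in pages:
            rows.append({"page": page})
    return rows, failed


rows, failed = collect_rows(
    [
        (Status.SUCCESS, ["a1", "a2"]),
        (Status.FAILURE, []),
        (Status.SUCCESS, ["b1"]),
    ]
)

# Guard before normalizing: only build the DataFrame when there is data.
if not rows:
    summary = "no documents converted, nothing to export"
else:
    summary = f"{len(rows)} rows collected, {failed} document(s) failed"
```

In the real script, the `else` branch is where the rows would be normalized into a DataFrame; the empty case can return early or log a warning instead.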

imene-swaan, Sep 19 '24 09:09

This seems to be outdated by now. The given example only demonstrates how to convert a single file, so no document loop is necessary. Closing.

cau-git, Jan 31 '25 09:01