
bug/text-as-html-missing-content

Open mpolomdeepsense opened this issue 7 months ago • 9 comments

Describe the bug

Sometimes when using chunking, the text_as_html for Table elements is missing some of the content compared to the text property. Reasoning:

  • Text for a table can only come from within the cells of the table.
  • Therefore, if a Table element has text, it must have come from one or more of the table cells.
  • Therefore the text_as_html table should be populated with text in those same cells.
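The reasoning above can be turned into a rough check. The following is only a sketch under stated assumptions: `CellTextExtractor` and `missing_words` are hypothetical helper names (not part of unstructured), the comparison is word-level on whitespace-split tokens, and it uses only the Python standard library. It reports words that appear in an element's text but in no `<td>`/`<th>` cell of its text_as_html.

```python
from collections import Counter
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collect whitespace-separated words found inside <td> and <th> cells."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # greater than 0 while inside a td/th cell
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.words.extend(data.split())


def missing_words(text, text_as_html):
    """Return the words of `text` that have no counterpart in any table cell.

    Hypothetical helper: each cell word can account for one occurrence in `text`.
    """
    parser = CellTextExtractor()
    parser.feed(text_as_html or "")
    parser.close()  # flush any buffered character data
    remaining = Counter(parser.words)
    missing = []
    for word in text.split():
        if remaining[word] > 0:
            remaining[word] -= 1
        else:
            missing.append(word)
    return missing


if __name__ == "__main__":
    html = "<table><tr><td>Name</td><td>Age</td></tr><tr><td>Ann</td><td></td></tr></table>"
    print(missing_words("Name Age Ann 31", html))
```

If the expectation in this issue holds, calling `missing_words(table.text, table.metadata.text_as_html)` on each Table element should return an empty list. Tokenization differences between the two fields could produce false positives, so a non-empty result is a hint to inspect the element rather than proof of the bug.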

To Reproduce

import pandas as pd
import unstructured_client
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError
from unstructured.staging.base import elements_from_dicts

client = unstructured_client.UnstructuredClient(
    api_key_auth="...",
    server_url=" ...",
)

filename_a = r"doc.pdf"

with open(filename_a, "rb") as f:
    data = f.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename_a,
        ),
        strategy="hi_res",
        coordinates=True,
        hi_res_model_name="yolox",
        chunking_strategy="by_page",
        split_pdf_page=False,
        include_page_breaks=True,
        output_format="application/json",
        languages=["eng"],
    ),
)

resp = client.general.partition(req)

elements = elements_from_dicts(resp.elements)
tables = [e for e in elements if e.category == "Table"]
for table in tables:
    # read_html returns a list of DataFrames, one per <table> found in the HTML
    dataframes = pd.read_html(table.metadata.text_as_html)
    print(dataframes)

Expected behavior

For chunked Table elements, text and text_as_html contain the same content (in text_as_html, that content is parsed into an HTML table).

mpolomdeepsense, Jul 08 '24 11:07