unstructured
bug/text-as-html-missing-content
Describe the bug
Sometimes when using chunking, the text_as_html for Table elements is missing some of the content that is present in the text property.
Reasoning:
- Text for a table can only come from within the cells of the table.
- Therefore, if a Table element has text, it must have come from one or more of the table cells.
- Therefore the text_as_html table should be populated with text in those same cells.
To Reproduce
from io import StringIO

import pandas as pd
import unstructured_client
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError
from unstructured.staging.base import elements_from_dicts

client = unstructured_client.UnstructuredClient(
    api_key_auth="...",
    server_url=" ...",
)

filename_a = r"doc.pdf"
with open(filename_a, "rb") as f:
    data = f.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename_a,
        ),
        strategy="hi_res",
        coordinates=True,
        hi_res_model_name="yolox",
        chunking_strategy="by_page",
        split_pdf_page=False,
        include_page_breaks=True,
        output_format="application/json",
        languages=["eng"],
    ),
)

resp = client.general.partition(req)
elements = elements_from_dicts(resp.elements)

# Parse the HTML of every Table element so it can be compared with its text.
tables = [e for e in elements if e.category == "Table"]
for table in tables:
    # read_html returns a list of DataFrames, one per <table> in the markup.
    dataframes = pd.read_html(StringIO(table.metadata.text_as_html))
    print(dataframes)
Expected behavior
For chunked elements, text and text_as_html contain the same content (text_as_html has that content parsed to an HTML table).
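
One way to state this expectation as a check is sketched below. It is only an illustration, not part of the library: the class CellTextCollector and the function words_missing_from_html are hypothetical names. It assumes that whitespace-separated words are a good enough unit of comparison, and that every word in text should appear inside some td or th cell of text_as_html. It uses only the Python standard library.

```python
from collections import Counter
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Hypothetical helper: collects words found inside <td>/<th> cells."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # greater than 0 while inside a table cell
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.words.extend(data.split())


def words_missing_from_html(text, text_as_html):
    """Return the words of `text` that appear in no cell of `text_as_html`."""
    collector = CellTextCollector()
    collector.feed(text_as_html or "")
    collector.close()
    remaining = Counter(collector.words)
    missing = []
    for word in text.split():
        if remaining[word] > 0:
            remaining[word] -= 1  # consume one matching occurrence
        else:
            missing.append(word)
    return missing


# Made-up example: the cell that should hold "3" is empty in the HTML.
html = (
    "<table><tr><td>Name</td><td>Qty</td></tr>"
    "<tr><td>Apple</td><td></td></tr></table>"
)
print(words_missing_from_html("Name Qty Apple 3", html))  # → ['3']
```

With the elements from the reproduction script, calling words_missing_from_html(table.text, table.metadata.text_as_html) for each table would be expected to return an empty list. A non-empty result points to the words that were dropped from the HTML. Because the comparison is word-based, it can report false positives if the two properties tokenize a cell differently.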