bug/text-as-html-missing-content
Describe the bug
Sometimes when using chunking, the text_as_html for Table elements is missing some of the content that appears in the text property.
Reasoning:
- Text for a table can only come from within the cells of the table.
- Therefore, if a Table element has text, it must have come from one or more of the table cells.
- Therefore the text_as_html table should be populated with the text of those same cells.
To Reproduce
from io import StringIO

import pandas as pd
import unstructured_client
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError
from unstructured.staging.base import elements_from_dicts

client = unstructured_client.UnstructuredClient(
    api_key_auth="...",
    server_url="...",
)

filename_a = r"doc.pdf"
with open(filename_a, "rb") as f:
    data = f.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename_a,
        ),
        strategy="hi_res",
        coordinates=True,
        hi_res_model_name="yolox",
        chunking_strategy="by_page",
        split_pdf_page=False,
        include_page_breaks=True,
        output_format="application/json",
        languages=["eng"],
    ),
)

resp = client.general.partition(req)
elements = elements_from_dicts(resp.elements)
tables = [e for e in elements if e.category == "Table"]
for table in tables:
    # Parse the HTML representation of each table into a DataFrame.
    dataframe = pd.read_html(StringIO(table.metadata.text_as_html))
    print(dataframe)
Expected behavior
A chunked element's text and text_as_html contain the same content (text_as_html has that content rendered as an HTML table).
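A quick way to flag affected elements is to strip the tags from text_as_html and check that every whitespace-separated token of an element's text still appears. This is only a minimal sketch; html_text is an illustrative helper, not part of the unstructured API, and it reuses the tables list from the snippet above.

import re
from html import unescape

def html_text(text_as_html):
    # Strip tags, unescape entities, and collapse whitespace.
    no_tags = re.sub(r"<[^>]+>", " ", text_as_html or "")
    return re.sub(r"\s+", " ", unescape(no_tags)).strip()

for table in tables:
    flat = html_text(table.metadata.text_as_html)
    missing = [tok for tok in table.text.split() if tok not in flat]
    if missing:
        print("Tokens in text but missing from text_as_html:", missing)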
@christinestraub
@mpolomdeepsense Can you please share the PDF document that you're testing with?
I have been encountering the same issue with a test PDF I created. The first row of the table is within elements[0].text but not elements[0].metadata.text_as_html. I was using this PDF, test_pdf_table.pdf, and the following code.
>>> from unstructured.partition.pdf import partition_pdf
>>> elements = partition_pdf(
... filename="test_pdf_table.pdf",
... url=None,
... infer_table_structure=True,
... strategy="hi_res",
... )
>>> elements[0].text
'Header 1 Text 1.1 Text 1.2 Header 2 Text 2.1 Text 2.2 Header 3 Text 3.1 Text 3.2'
>>> elements[0].metadata.text_as_html
'<table><tbody><tr><td>Text 1.1</td><td>Text 2.1</td><td>Text 3.1</td></tr><tr><td>Text 1.2</td><td>Text 2.2</td><td>Text 3.2</td></tr></tbody></table>'
Output of collect_env.py
OS version: Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python version: 3.11.4
unstructured version: 0.14.10
unstructured-inference version: 0.7.36
pytesseract version: 0.3.10
Torch version: 2.2.0
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice version: LibreOffice 7.3.7.2 30(Build:2)
As far as I can tell, after digging into the code a bit, the issue comes from the cropping of the image in unstructured.partition.pdf_image.ocr.supplement_element_with_table_extraction, which cuts off the top border of the table. This means the tables_agent is not able to detect the top row as a row, only identifying the second row onwards. Changing it to crop one pixel higher seems to fix the issue.
Hi @alastairmarchant, can you tell me how you solved the problem you described? What does "Changing it to crop one pixel higher seems to fix the issue" mean, and how do I do it? Thanks!
Hi @huangpan2507, it means adjusting the environment variable TABLE_IMAGE_CROP_PAD, e.g.
os.environ["TABLE_IMAGE_CROP_PAD"] = "1"
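For example, here is a minimal sketch of that workaround, reusing the test_pdf_table.pdf file from the earlier comment (the variable must be set before partitioning):

import os

# Assumption: padding the table crop by one extra pixel keeps the table's top
# border, so the first row is detected. Set the variable before partitioning.
os.environ["TABLE_IMAGE_CROP_PAD"] = "1"

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="test_pdf_table.pdf",
    infer_table_structure=True,
    strategy="hi_res",
)
print(elements[0].metadata.text_as_html)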
If you need more accurate table processing results, consider using our API. The document parsing model available through the API is more accurate, and incremental improvements to the model will be deployed there. That model is not available in open source. CC: @alastairmarchant
Hi @christinestraub, thanks for your kind help. I have another issue with the results for a PDF that contains both English and Chinese text. I use the same code, but the results for the Chinese characters differ between runs: one run is very good, but another is very bad, especially when the Chinese characters are on the first line of a page or at the edge of a page, and also when complex Chinese characters are present. I'm not sure whether the environment is the same when running the same code twice. Which module could cause this effect, and which version of that module is better at recognizing Chinese and English characters? Can you help me?
@huangpan2507 Can you please provide a pdf document that we could use to reproduce?
Finance-policy.pdf Hi @christinestraub, here's the document after I desensitized the data. It is a little different from the one I ran before, but some of the issues still occur, especially in the results for page 1 and page 5.
Some of the results for page 1 (English headings) look like this:
page_content='OVerVvieW .ee 2 1 费用 分 类 Payment Categories .4 2 2 请 款 对 象 Persons that Request the Payments .es 3 3 付款 对 象 信息 维护 Recipient information maintenance .pp 3 4 所 需 文件 _ Required Documents .4 3 5 付款 方式 Payment Method .4 4 6 付款 期 限 Payment Terms .4 4 7 报销 期 限 Reimbursement 攻 me limit 4 之 票 Invoice (FaPiag) 5 1 有 效 发 票 Official Invoice: 5 2 发 票 遗 失 Invoice LOSt 6'
The relevant original text in the PDF is:
Overview ................................................................................................................................2 1 费用分类 Payment Categories.............................................................................................2 2 请款对象 Persons that Request the Payments.....................................................................3 3 付款对象信息维护 Recipient information maintenance......................................................3 4 所需文件 Required Documents..........................................................................................3 5 付款方式 Payment Method .................................................................................................4 6 付款期限 Payment Terms....................................................................................................4 7 报销期限 Reimbursement time limit....................................................................................4 发票 Invoice (FaPiao)..............................................................................................................5 1 有效发票 Official Invoice: ....................................................................................................5 2 发票遗失 Invoice Lost..........................................................................................................6
Some of the results for page 5 (Chinese characters) look like this:
page_content='Company Name 公司 名 称 : BESCD IRA (kM) APRASI'
page_content='Company Name AS) ZAR: BE CD FA (ACM) AMATAMNDAS'
page_content='Company Name 公司 名 称 : 填 登 CD 技术 (北京 ) 有 限 公 司 天 津 分 公司 Taxpayer ID 44#t AiR SIS: 1234567889D'
The relevant original text in the PDF is:
Beijing: Company Name 公司名称: 叠登 CD 技术(北京)有限公司 Taxpayer ID 纳税人识别号:1234567889
Wuhan: Company Name 公司名称: 叠登 CD 技术(北京)有限公司武汉分公司 Taxpayer ID 纳税人识别号:1234567889
Tianjing: Company Name 公司名称: 叠登 CD 技术(北京)有限公司天津分公司 Taxpayer ID 纳税人识别号:1234567889D
I'm encountering the missing-content bug when using the API; for a few Table elements the "text" field has more text than "text_as_html". Is there any way to solve this?
I noticed this happening when there is a <br /> inside a table cell. It keeps the <br />, but removes any text that comes after it.
I'm getting the same problem with <br /> or '\n'. Unstructured does not treat it as a table, but divides it into two different elements. For example, if a cell contains the name "Jean-Claude Van Damme" in two parts ("Jean-Claude" and "Van Damme"), Unstructured produces two different elements.