unstructured
bug/Inferred Table Data -- info in text_as_html far less than text (cropped?)
Describe the bug
I am parsing a German-language PDF that contains text and tables. It has a complex layout of many smaller tables, uses umlauts (ä, ö, ü), and so on.
I am inferring tables and noticed that in the returned elements (type: Table) the information in "text_as_html" is sometimes far less than in "text" or in the original PDF.
I wonder whether this case is simply too complex to be parsed well, or whether the result could be improved with prior preprocessing/transcoding, a different configuration, or a model other than the default hi_res model.
Any feedback or pointers on how to improve the result would be appreciated. Thanks!
To Reproduce
The way I call my unstructured service (hosted on Azure) is, I think, straightforward:
from unstructured.partition.api import partition_via_api

# `file` is the opened PDF and `file_name` its name; both are defined elsewhere.
elements = partition_via_api(
    api_url="http://***/general/v0/general",
    api_key="***",
    file=file,
    metadata_filename=file_name,
    strategy="hi_res",
    pdf_infer_table_structure=True,
    skip_infer_table_types="[]",
    chunking_strategy="by_title",
    max_characters="4000",
    new_after_n_chars="3800",
)
Here's one of the extracted elements, which is faulty...
{
"element_id": "0da0d11164d4c4876aa721503d395782",
"metadata": {
"filename": "lampe02.pdf",
"filetype": "application/pdf",
"page_number": 1,
"text_as_html": "<table><tr><td>e \u2014</td><td></td><td>E</td><td></td></tr><tr><td></td><td></td><td>E==\u2014\u20141</td><td></td></tr><tr><td>Transportkarton/Abmessun gen</td><td>L=562 B=531 H=245 mm</td><td>Enthalt verbaute LED.</td><td>nein</td></tr><tr><td></td><td></td><td>1P-Schutzart</td><td>20</td></tr><tr><td rowspan=\"2\">Triman-Kennzeichen</td><td></td><td>Kabelende</td><td>Direktanschiuf</td></tr><tr><td></td><td></td><td>Schutzklasse</td><td></td></tr></table>"
},
"text": "Artikel Elektrische Daten Lichttechnische Daten Produktma\u00dfe + Gewicht Artikelvariante 20100302C Dimmbar mit externem Dimmer nein Farbkonsistenz initial < 5 L\u00e4nge/Tiefe 507 mm Barcode Verpackungseinheit 4004894534999 Farbwertanteil X 0,459 Breite 40 mm Elektrischer Leistungsfaktor > 0,90 Farbwertanteil Y 0,413 H\u00f6he 60 mm Hersteller M\u00fcller-Licht Energieeffizienzklasse enthaltene Lichtquelle Lichtquelle mit EEK: F Farbtemperatur 2700 K Gewicht 301,00 g Zolltarifnummer 94051040900 Farbwiedergabeeigenschaft R9 \u22655 Produktdaten Gewichteter Verbrauch 8 kWh/1000h Verpackung Austausch stromlos nein Lebensdauer Nominalwert 25000 h Colorbox/Barcode 1 4004894534999 Farbwiedergabeeigenschaft Ra \u226580 Beleuchtungstechnologie LED Leistungsaufnahme Nominalwert 8 W Colorbox/Inhalt (St\u00fcck) 1 Bel\u00fcftung erforderlich nein Lichtausbeute Nominalwert 88 lm/W Colorbox/Gewicht 79,00 g Nom. Stromst\u00e4rke 70 mA Frostung Chemisch Lichtfarbe warmwhite Colorbox/Abmessungen L=65 B=43 H=545 mm Spannung Nominalwert 230 V Modell (Technisch) LED-R\u00f6hre Lichtstrom enthaltene Lichtquelle 700 lm Innerbox/Barcode 1 4004894852802 Stromart AC Nicht in Reflektoren betreiben nein Innerbox/Inhalt (St\u00fcck) 2 Frequenz Nominalwert 50/60 Hz Stroboskopeffekt 0,9 Innerbox/Gewicht 85,00 g Sockel S14s Verschiebungsfaktor (cos \u03c6) 0,62 Spektrumbild Innerbox/Abmessungen L=550 B=102 H=75 mm Inverkehrbringer M\u00fcller-Licht Transportkarton/Barcode 1 4004894852819 Marke M\u00fcller-Licht Transportkarton/Inhalt (St\u00fcck) 30 CE Kennzeichnung ja Garantieprodukt 5 Jahre Garantiebedingungen Transportkarton/Gewicht 1050,00 g Leuchtendaten Transportkarton/Abmessun gen L=562 B=531 H=245 mm Enth\u00e4lt verbaute LED nein IP-Schutzart 20 Umwelteigenschaften Kabelende Direktanschlu\u00df Triman-Kennzeichen Schutzklasse II",
"type": "Table"
},
Expected behavior
More of the data from the PDF ends up in text_as_html.
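To put a number on the gap between "text" and "text_as_html" for an element like the one above, a rough check could look like the following. This is only a sketch, not part of unstructured: the helper names html_coverage and _TextCollector are made up for illustration, and it assumes the element is available as a plain dict shaped like the JSON shown earlier. It uses only the Python standard library.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects every non-empty text node from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def html_coverage(element: dict) -> float:
    """Fraction of whitespace-separated tokens in element["text"]
    that also occur somewhere in metadata["text_as_html"].

    Hypothetical helper: exact token matching only, so OCR variants
    such as "1P-Schutzart" vs "IP-Schutzart" count as misses.
    """
    collector = _TextCollector()
    collector.feed(element.get("metadata", {}).get("text_as_html", ""))
    collector.close()  # flush any buffered text

    html_tokens = set(" ".join(collector.chunks).split())
    text_tokens = element.get("text", "").split()
    if not text_tokens:
        return 1.0  # nothing to cover
    hits = sum(1 for token in text_tokens if token in html_tokens)
    return hits / len(text_tokens)
```

Running this over all elements of type Table and listing those with a low ratio would show which tables lost the most content. Where to draw the line is a judgment call.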
Screenshots
This is the corresponding section in the PDF...
And this is what's left in 'text_as_html'...
Environment Info
Running the unstructured-api image on an Azure VM
Additional context
I'm also seeing this with partition_xlsx. In a pretty small 100-row sheet, text_as_html only returns the first ~30 rows.
@isaacna Can you check your unstructured version and update to the latest? There was a bug fix related to missing items in XLSX recently.
@scanny That fixed the issue, thanks! We were previously on 0.12.4
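For anyone else checking their version: one way to do it from Python is sketched below, using only the standard library. The helper names installed_version and release_tuple are hypothetical. The thread only says that 0.12.4 showed the XLSX problem and that the latest release fixed it, not which release introduced the fix, so the comparison is against 0.12.4 only.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def installed_version(package: str = "unstructured"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def release_tuple(version_string: str) -> tuple:
    """Turn '0.12.4' into (0, 12, 4); stops at the first non-numeric piece."""
    parts = []
    for piece in version_string.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("unstructured is not installed in this environment")
    elif release_tuple(found) <= release_tuple("0.12.4"):
        print(f"unstructured {found}: at or below 0.12.4, consider upgrading")
    else:
        print(f"unstructured {found}: newer than 0.12.4")
```

Upgrading itself is the usual pip install --upgrade unstructured.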
@isaacna Can you share the PDF document you're trying?
@christinestraub This was for an Excel spreadsheet (just some filler dummy data), not a PDF. We didn't see this issue for tables nested in PDFs specifically
@hschmied Can you share the PDF document you're trying?
@hschmied We've made some updates to table extraction recently. Although it's not perfect yet for your PDF, I can confirm that there are a few improvements. Have you tried your code recently? You'll need to pass languages=["deu"] to improve text accuracy. We'll consider this case for further improvement.
I have not checked recently, but will. thank you!
quick update -- I looked into it and still got the old result, but I suspect the issue is that the hosted image of the unstructured-api on azure isn't running on the latest api-version, unless I do something. currently figuring out what needs to happen to get my azure-service up-to-date.
just tested it -- it's a great improvement! I tested with the same config as before and looked at the original section --> Screenshot: left-most = original, second = same settings w/ new api-version
then I tried it with the setting "languages: ['deu', 'eng']" and finally just with "languages: ['deu']"...
it's not perfect yet, but a lot better. thank you!
@christinestraub Hi, I'm also facing the same issue. I am using yolox, and the model picks up the table but only the body and not the header. In addition to that the text_as_html cropped the body of the table leaving out the last row entirely.
This is my partition_pdf call. Unfortunately I cannot share the table or the PDF, but it is a very small table and the PDF is not complex at all. I am using version 0.13.6 of unstructured.
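from unstructured.partition.pdf import partition_pdf

# `filename` is the path to the PDF, defined elsewhere.
elements = partition_pdf(
    filename=filename,
    strategy="hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,
    languages=["eng"],
)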
If anyone has any advice, I would appreciate it. Thanks
Closing this one. If you need to process pages fast, our recommendation is to use the unstructured-python-client library with our SaaS API. That will split up the PDF and distribute the workload across multiple workers.