unstructured
yolox outputs incorrect `text_as_html`
The API with yolox outputs the correct HTML table structure and text, but incorrect text_as_html.
Yokebe.pdf
- The unit of measurement 'g' is extracted as the number 8 in the row for "Fett". Output:
<tr><td>Fett</td><td>7,78</td><td>-</td><td>6,48</td><td>-</td></tr>
- "1,8 g" was extracted as "18¢g" in the row for "-davon gesättigte Fettsäuren".
- Commas are omitted. The row for "Kohlenhydrate" was extracted as
<tr><td>Kohlenhydrate</td><td>133g</td><td>-</td><td>17,7g</td><td>-</td></tr>
and the row for "- davon Zucker" was extracted as
<tr><td>- davon Zucker</td><td>71g</td><td>-</td><td>16,2 g</td><td>-</td></tr>
- Text next to the barcode is extracted as gibberish. The row for "GTIN" is extracted as
<tr><td>GTIN</td><td>aeosssrises 11N NN RO</td></tr>
- Space omitted. The row for "PPN" is extracted as
<tr><td>PPN</td><td>1118294344 89</td></tr>
- Incorrect character extracted. The row for "GEWICHT BRUTTO" is extracted as
<tr><td>GEWICHT BRUTTO</td><td>62,5¢</td></tr>
- Comma omitted. The row for "GEWICHT NETTO" is extracted as
<tr><td>GEWICHT NETTO</td><td>383g</td></tr>
- Incorrect character extracted. The row for "MASSE..." is extracted as
<tr><td>MASSE (XH; CM)</td><td>4,8%9,3</td></tr>
- Two rows extracted as one. In the table in the right column on page 2, the rows for "davon Polysaccharide" and "Hericium-Extrakt" are extracted as
<tr><td>- davon Polysaccharide Hericium-Extrakt (6:1)</td><td>7,5mg 40 mg</td><td></td></tr>
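Rows like the ones listed above are easier to review once they are split into cells. The following is a minimal sketch using only the Python standard library. `RowParser`, `parse_rows`, and `find_suspect_cells` are hypothetical helper names, not part of unstructured. The check for "¢" is only a heuristic that catches one of the error classes reported here. It will not detect omitted commas or merged rows.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(table_html):
    """Return the table rows as lists of cell strings."""
    parser = RowParser()
    parser.feed(table_html)
    parser.close()
    return parser.rows


def find_suspect_cells(rows, suspect_chars="¢"):
    """Return (row_label, cell) pairs whose value cell contains a suspect character."""
    hits = []
    for row in rows:
        for cell in row[1:]:  # the first cell is treated as the row label
            if any(ch in suspect_chars for ch in cell):
                hits.append((row[0], cell))
    return hits


# Two rows copied from the report above.
sample = (
    "<tr><td>GEWICHT BRUTTO</td><td>62,5¢</td></tr>"
    "<tr><td>GEWICHT NETTO</td><td>383g</td></tr>"
)
rows = parse_rows(sample)
print(find_suspect_cells(rows))  # [('GEWICHT BRUTTO', '62,5¢')]
```

The set of suspect characters is passed as a parameter because which characters count as implausible depends on the document. A "%" sign, for example, is legitimate in many nutrition tables.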
Steps to reproduce
See the attached documents. A user called the hosted API with the hi_res strategy and the yolox model. They also tried setting "languages" to "['deu']" and "OCR_AGENT" to "paddle" but noticed no difference. Here is their code:
import requests

unstructured_api_key = '.............'
unstructured_api_headers = {
    "accept": "application/json",
    "unstructured-api-key": unstructured_api_key,
}
unstructured_api_url = "https://api.unstructured.io/general/v0/general"

# Form fields sent alongside the file.
data = {
    "strategy": "hi_res",
    "pdf_infer_table_structure": "true",
    "hi_res_model_name": "yolox",
    "languages": "['eng']",
}

file_path = "..............."

# Open the PDF in a context manager so the handle is closed after the upload.
with open(file_path, 'rb') as pdf_file:
    response = requests.post(url=unstructured_api_url,
                             files={'files': pdf_file},
                             data=data,
                             headers=unstructured_api_headers)
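To look at the `text_as_html` values this report is about, the table elements have to be picked out of the response. The sketch below assumes that the response body is a JSON list of element dictionaries and that table elements have `"type": "Table"` with the HTML stored under `metadata.text_as_html`. That layout is an assumption, not something the report states. `table_html_from_elements` is a hypothetical helper, and the `elements` list is a hand-written stand-in for `response.json()`, not real API output.

```python
def table_html_from_elements(elements):
    """Return the text_as_html string of every Table element, in order."""
    tables = []
    for element in elements:
        if element.get("type") != "Table":
            continue
        # Assumed layout: the HTML rendering sits under metadata.text_as_html.
        html = element.get("metadata", {}).get("text_as_html")
        if html:
            tables.append(html)
    return tables


# Hand-written stand-in for response.json(); the table row is taken from this report,
# the surrounding fields are illustrative only.
elements = [
    {"type": "Title", "text": "Yokebe", "metadata": {}},
    {
        "type": "Table",
        "text": "GEWICHT NETTO 383g",
        "metadata": {
            "text_as_html": "<table><tr><td>GEWICHT NETTO</td><td>383g</td></tr></table>"
        },
    },
]

for html in table_html_from_elements(elements):
    print(html)
```

With a real request, the same function would be called as `table_html_from_elements(response.json())` after checking that the request succeeded.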
More details: using the API and following the commands documented in this PR, I see the following tables for Yokebe.pdf
Closing since this is a table extraction model update, and updated models will only be available via the API.