yolox outputs incorrect `text_as_html`

Open six5532one opened this issue 1 year ago • 1 comments

The API with yolox outputs the correct HTML table structure and text, but incorrect text_as_html.

Yokebe.pdf

  • The unit of measurement 'g' is extracted as the number 8 in the row for "Fett". Output: <tr><td>Fett</td><td>7,78</td><td>-</td><td>6,48</td><td>-</td></tr>
  • "1,8 g" was extracted as "18¢g" in the row for "-davon gesättigte Fettsäuren"
  • Commas are omitted. The row for "Kohlenhydrate" was extracted as <tr><td>Kohlenhydrate</td><td>133g</td><td>-</td><td>17,7g</td><td>-</td></tr> and the row for "- davon Zucker" was extracted as <tr><td>- davon Zucker</td><td>71g</td><td>-</td><td>16,2 g</td><td>-</td></tr>.

Mykoforte.pdf

  • Text next to barcode is extracted as gibberish. The row for "GTIN" is extracted as <tr><td>GTIN</td><td>aeosssrises 11N NN RO</td></tr>.
  • Space omitted. The row for "PPN" is extracted as <tr><td>PPN</td><td>1118294344 89</td></tr>
  • Incorrect character extracted. The row for "GEWICHT BRUTTO" is extracted as <tr><td>GEWICHT BRUTTO</td><td>62,5¢</td></tr>.
  • Comma omitted. The row for "GEWICHT NETTO" is extracted as <tr><td>GEWICHT NETTO</td><td>383g</td></tr>.
  • Incorrect character extracted. The row for "MASSE..." is extracted as <tr><td>MASSE (XH; CM)</td><td>4,8%9,3</td></tr>.
  • Two rows extracted as one. In the table in the right column on page 2, the rows for "davon Polysaccharide" and "Hericium-Extrakt" are extracted as <tr><td>- davon Polysaccharide Hericium-Extrakt (6:1)</td><td>7,5mg 40 mg</td><td></td></tr>.
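Cell-level errors like the ones listed above are easier to track when the extracted rows are compared against hand-checked values. The following is a minimal sketch using only the Python standard library; the names CellCollector, parse_rows and diff_cells are hypothetical helpers, not part of unstructured, and the sample row is assembled from the "18¢g" example in this report rather than copied from real API output.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_rows(html):
    """Parse an HTML table fragment into a list of rows of cell strings."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


def diff_cells(extracted_html, expected_rows):
    """Return (row, col, extracted, expected) for every mismatching cell."""
    mismatches = []
    for r, (got, want) in enumerate(zip(parse_rows(extracted_html), expected_rows)):
        for c, (g, w) in enumerate(zip(got, want)):
            if g.strip() != w.strip():
                mismatches.append((r, c, g, w))
    return mismatches


# Hypothetical row: only the label and the two cell values come from the report.
extracted = "<tr><td>-davon gesättigte Fettsäuren</td><td>18¢g</td></tr>"
expected = [["-davon gesättigte Fettsäuren", "1,8 g"]]
print(diff_cells(extracted, expected))
```

The comparison is positional, so a merged row such as the "davon Polysaccharide" case would show up as a run of mismatches rather than as a single flagged cell.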

Steps to reproduce

See attached documents. A user used the hosted API with the yolox strategy. They also tried setting "languages" to "['deu']" and "OCR_AGENT" to "paddle" but noticed no difference. Here is their code:

import requests

unstructured_api_key = '.............' 
unstructured_api_headers = {
    "accept": "application/json",
    "unstructured-api-key": unstructured_api_key
}

unstructured_api_url = "https://api.unstructured.io/general/v0/general"

data = {
    "strategy": "hi_res",
    "pdf_infer_table_structure": "true",
    "hi_res_model_name": "yolox",
    "languages": "['eng']"
}

file_path = "..............."

# Open the PDF in a context manager so the handle is closed after the upload.
with open(file_path, 'rb') as f:
    response = requests.post(url=unstructured_api_url,
                             files={'files': f},
                             data=data,
                             headers=unstructured_api_headers)
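To inspect the table HTML returned by the request above, the sketch below pulls text_as_html out of the parsed response. It assumes that response.json() is a list of element dicts in which table elements carry "type": "Table" and a "metadata" dict with a "text_as_html" key; the helper name table_html_from_elements and the sample elements are made up for illustration.

```python
def table_html_from_elements(elements):
    """Return the text_as_html string of every Table element that has one."""
    tables = []
    for element in elements:
        if element.get("type") != "Table":
            continue
        html = element.get("metadata", {}).get("text_as_html")
        if html:
            tables.append(html)
    return tables


# Stand-in for response.json(); the element shape is an assumption (see above).
sample_elements = [
    {"type": "Title", "text": "Mykoforte", "metadata": {}},
    {
        "type": "Table",
        "text": "PPN 1118294344 89",
        "metadata": {
            "text_as_html": "<table><tr><td>PPN</td><td>1118294344 89</td></tr></table>"
        },
    },
]

for html in table_html_from_elements(sample_elements):
    print(html)
```

With a real response, `table_html_from_elements(response.json())` would take the place of the sample list.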

six5532one, Dec 04 '23 19:12

More details: using the API and following the commands documented in this PR, I see the following tables for Yokebe.pdf (two images attached).

cragwolfe, Dec 06 '23 19:12

Closing since this is a table extraction model update, and updated models will only be available via the API.

MthwRobinson, Jun 13 '24 13:06