docling: seeing numbers as text in converted "tables" JSON

Question

I want to convert the JSON output into separate CSV files, one per table. I wrote code for it, but the numbers come out as text:

    {
      "bbox": { "l": 364.781005859375, "t": 337.5539855957031, "r": 377.7409973144531, "b": 328.56201171875, "coord_origin": "BOTTOMLEFT" },
      "row_span": 1,
      "col_span": 1,
      "start_row_offset_idx": 4,
      "end_row_offset_idx": 5,
      "start_col_offset_idx": 2,
      "end_col_offset_idx": 3,
      "text": "/five.lt/period.tab/eight.lt",
      "column_header": false,
      "row_header": false,
      "row_section": false
    },
    {
      "bbox": { "l": 420.2619934082031, "t": 337.5539855957031, "r": 433.22198486328125, "b": 328.56201171875, "coord_origin": "BOTTOMLEFT" },
      "row_span": 1,
      "col_span": 1,
      "start_row_offset_idx": 4,
      "end_row_offset_idx": 5,
      "start_col_offset_idx": 3,
      "end_col_offset_idx": 4,
      "text": "/six.lt/period.tab/seven.lt",
      "column_header": false,
      "row_header": false,
      "row_section": false
    },

The value 5.8 shows up as "/five.lt/period.tab/eight.lt".

In one of the other tables, the cells read "/three.osf_tab./zero.osf_tab/zero.osf_tab% " and "-/zero.osf_tab./two.osf_tab/three.osf_tab%".

I think this is due to OTSL, the work from https://arxiv.org/abs/2305.03393 that is used in the TSR model. How can I convert these strings back to normal numbers with post-processing? Is there any utility code available for this already? Any other help would be appreciated.

Dec 13 '24 07:12 mllife

@mllife this is a matter of how the PDF encodes the text; you get out whatever the PDF has encoded in it. So this is not a TableFormer issue but one of the PDF backend and its string sanitization.

Dec 13 '24 13:12 cau-git

@cau-git UPDATE: I tried the other backend, "pypdfium2", and the output is correct now. The docling_v2 parser had a text encoding issue.
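For anyone landing here later, the backend switch can be sketched roughly like this. It is a minimal sketch, assuming a docling release that exposes `PyPdfiumDocumentBackend`, `PdfFormatOption`, and `TableItem.export_to_dataframe()` (which needs pandas) as the docling examples of that time did; newer releases may differ. The helper names `table_csv_path` and `convert_tables_to_csv` and the output file naming are my own, not part of docling.

```python
from pathlib import Path


def table_csv_path(out_dir, stem, index):
    """Hypothetical naming scheme: <out_dir>/<stem>-table-<index>.csv"""
    return Path(out_dir) / f"{stem}-table-{index}.csv"


def convert_tables_to_csv(pdf_path, out_dir):
    """Convert a PDF with the pypdfium2 backend and write one CSV per table."""
    # Imported inside the function so this module still loads
    # on machines where docling is not installed.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Select the pypdfium2 backend for PDF input instead of the default parser.
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
        }
    )
    result = converter.convert(pdf_path)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for index, table in enumerate(result.document.tables, start=1):
        path = table_csv_path(out_dir, Path(pdf_path).stem, index)
        # export_to_dataframe() returns a pandas DataFrame for the table.
        table.export_to_dataframe().to_csv(path, index=False)
        written.append(path)
    return written
```

Calling `convert_tables_to_csv("report.pdf", "tables")` (both arguments are placeholders) should write one CSV file per detected table and return the list of paths.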

Dec 14 '24 16:12 mllife
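
If re-running the conversion with a different backend is not possible, the strings shown in the question can probably be cleaned up in post-processing. The following is only a sketch and not a docling utility: it assumes every glyph token has the shape `/<name>.<variant>` seen above (for example `/five.lt` or `/zero.osf_tab`), and the `GLYPH_NAMES` table and the function `decode_glyph_names` are hypothetical names covering just digits and a few punctuation marks. Unknown tokens are left untouched.

```python
import re

# Assumed mapping from glyph base names to characters; extend as needed.
GLYPH_NAMES = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "period": ".", "comma": ",", "percent": "%",
    "hyphen": "-", "minus": "-", "plus": "+",
}

# A glyph token: slash, base name, dot, variant suffix (e.g. "lt", "osf_tab").
GLYPH_TOKEN = re.compile(r"/([A-Za-z]+)\.[A-Za-z_]+")


def decode_glyph_names(text):
    """Replace known glyph tokens with their characters; keep everything else."""
    def replace(match):
        # Fall back to the original token when the base name is not mapped.
        return GLYPH_NAMES.get(match.group(1), match.group(0))

    return GLYPH_TOKEN.sub(replace, text)


if __name__ == "__main__":
    for raw in (
        "/five.lt/period.tab/eight.lt",
        "-/zero.osf_tab./two.osf_tab/three.osf_tab%",
    ):
        print(repr(raw), "->", repr(decode_glyph_names(raw)))
```

This could be applied to the "text" field of each table cell before writing the CSV. Literal characters that are already correct, such as the "." and "%" in the second example, pass through unchanged.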