table-transformer
Error in Postprocess.py file
Hello team, first of all, I highly appreciate the work you are doing for the open source community by providing such robust, well-designed models, and I am very thankful for it.
Coming to my issue: I was trying to produce a CSV file using the postprocessing function after running OCR on the table and supplying the bounding boxes of the text as well as the text itself. I passed the tokens to the function "objects_to_structures(objects, tokens, str_class_thresholds)", and from there execution went into the postprocess file, where I encountered the following error: KeyError: 'span_num'. It is raised at line 332, which reads:
spans_copy.sort(key=lambda span: span['span_num'])
The error comes from the lambda function, and I am having a hard time understanding it. Can you please help me resolve this? Thank you in advance.
Tagging just for visibility of the post. @bsmock
Hi,
That code was written with PyMuPDF in mind, which is a Python library that can be used to extract words from digital-born PDF documents. Every word extracted by PyMuPDF has a block_num, line_num, and span_num property.
The code in postprocess.py sorts the words by these fields in order to put the words in each table cell into reading order.
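To illustrate the sorting described above, here is a minimal sketch. It assumes the three sorting lines are successive stable sorts on 'span_num', 'line_num', and 'block_num' (only the 'span_num' line is quoted in this thread; the other two are an assumption), and the sample word dictionaries are made up for illustration. It also shows why a token that lacks the field raises the KeyError reported above.

```python
# Hypothetical words, deliberately out of reading order.
words = [
    {"text": "world", "block_num": 0, "line_num": 1, "span_num": 0},
    {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
    {"text": "again", "block_num": 1, "line_num": 0, "span_num": 0},
    {"text": "there", "block_num": 0, "line_num": 0, "span_num": 1},
]

# Three stable sorts, least significant key first (assumed to mirror
# the three sorting lines in postprocess.py).
three_pass = words[:]
three_pass.sort(key=lambda w: w["span_num"])
three_pass.sort(key=lambda w: w["line_num"])
three_pass.sort(key=lambda w: w["block_num"])

# Because list.sort is stable, this is equivalent to one sort on a tuple key.
one_pass = sorted(
    words, key=lambda w: (w["block_num"], w["line_num"], w["span_num"])
)

reading_order = [w["text"] for w in three_pass]

# OCR tokens that only carry text and a bounding box have no 'span_num',
# so the key function fails with the KeyError from the original report.
ocr_tokens = [
    {"text": "cell", "bbox": [0, 0, 10, 10]},
    {"text": "value", "bbox": [12, 0, 30, 10]},
]
try:
    ocr_tokens.sort(key=lambda w: w["span_num"])
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]
```

Block number is the most significant key, then line, then span, so words end up grouped by block, then by line within a block, then in sequence within a line.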
If your text is already in reading order, the simplest thing you can do to fix the issue is to comment out or remove the 3 lines of code that do the sorting.
Otherwise, for each word you can set the 'span_num' field to have a number (0, 1, 2...) that indicates its sequential reading order. If you do that, 'line_num' and 'block_num' can all be 0.
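As a rough sketch of that second option: the helper below, `add_reading_order`, is a hypothetical name and not part of table-transformer. It assumes each token is a dict with at least 'text' and 'bbox' keys and that the list is already in the reading order you want; it only fills in the three fields that the sorting code looks up.

```python
def add_reading_order(tokens):
    """Return copies of the tokens with the sorting fields filled in.

    Assumes `tokens` is already in the desired reading order, so the
    list index can serve as 'span_num'.
    """
    ordered = []
    for index, token in enumerate(tokens):
        token = dict(token)       # shallow copy; the input list is left unchanged
        token["span_num"] = index  # sequential reading order: 0, 1, 2, ...
        token["line_num"] = 0      # constant, so it does not affect the sort
        token["block_num"] = 0     # constant, so it does not affect the sort
        ordered.append(token)
    return ordered


# Made-up OCR output for illustration.
tokens = [
    {"text": "Total", "bbox": [10, 10, 50, 20]},
    {"text": "42", "bbox": [60, 10, 80, 20]},
]
tokens = add_reading_order(tokens)
```

The resulting list can then be passed as the `tokens` argument in place of the raw OCR tokens. If the OCR output is not in reading order, it would need to be ordered first (for example by bounding-box position) before calling the helper.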
Best, Brandon
Really happy to see your reply, and at the same time very thankful to you for providing such a good model. I was able to extract the text and its bounding boxes using pytesseract, and I get great results when tables are in a normal format. However, the dataframe is not correct when some cells are nested. Can you suggest some ideas for how I can handle spanning cells or nested rows/columns? Thank you very much for replying to my queries.