table-transformer
Question on post-processing table structure with text bounding boxes
Hello,
I am working with the table structure detection model, running it on table images. I extract both the structure and the text, using CRAFT to detect the text bounding boxes and the table-transformer model to detect the table structure. To post-process the table structure prediction I use the text bounding boxes with the postprocess functions.
I encounter the following problem with this approach. For some table images in which the text in a cell is a single character, CRAFT commonly merges those individual characters into one detection, producing large text bounding boxes like in the image below (second column).
The issue is that when I use these bounding boxes, some of the predicted rows are enlarged so that they contain these large OCR bounding boxes. In the image below you can see the raw predicted rows, without any postprocessing.
As you can see, the predicted rows are accurate. But when I take the predicted table structure and combine it with the OCR bounding boxes, using the postprocess module and the objects_to_cells function, the rows transform to this:
I hope it is visible that there is a green dotted row that stretches from the B character down to the H character, enclosing exactly that text bounding box. I have been looking at this problem and it seems to be produced in the table_structure_to_cells function, in lines 810-844 of the postprocess module.
I was wondering if you could suggest a way to improve the postprocessing operations so this does not occur, perhaps by adding a further step of postprocessing or by modifying those lines of code (a rough sketch of what I mean is below). Or, if you know of an algorithm that works better than CRAFT for detecting text, I would also be interested.
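Something along these lines is what I had in mind; this is only a sketch under my own assumptions (the token format with a `bbox` key and the height-ratio threshold are placeholders, not anything from the repo):

```python
# Hypothetical pre-filter: drop OCR token boxes that look like several
# stacked characters merged into one detection, before handing the tokens
# to the postprocessing functions.
# Assumes each token is a dict with "bbox" = [xmin, ymin, xmax, ymax];
# the 3.0 threshold is an arbitrary placeholder.

def filter_merged_tokens(tokens, max_height_ratio=3.0):
    heights = [t["bbox"][3] - t["bbox"][1] for t in tokens]
    if not heights:
        return tokens
    median_height = sorted(heights)[len(heights) // 2]

    kept = []
    for token, height in zip(tokens, heights):
        # A box several times taller than the median is likely a column of
        # single characters that the OCR glued together; skip it so it
        # cannot stretch a predicted row during cell assignment.
        if height > max_height_ratio * median_height:
            continue
        kept.append(token)
    return kept
```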
Many thanks in advance.
First of all, congrats on integrating OCR with the model code. This looks very well done and we hope it inspires others to do the same!
As far as your problem with the OCR is concerned, I don't see any easy way to overcome it using post-processing. If OCR does not give you a bounding box for B and C separately, you have no way to split that large text bounding box and know where B is and where C is within the box. So then you have no way to slot B and C into their correct cells using the model output.
One thing you could do is tell the post-processing code to ignore the word bounding boxes and keep its cell bounding boxes as-is. Then you could crop your input image at each cell bounding box and pass each individually to an OCR function to get the text of each cell. It sounds like a painful solution to me but could get the job done.
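Roughly, an untested sketch of that approach could look like this (the cell dict keys and the `ocr_fn` callback are placeholders, not the actual postprocess output format):

```python
# Untested sketch: crop the image at each predicted cell box and OCR the
# crops one at a time, instead of relying on word boxes for assignment.
# Assumes each cell has "bbox" = [xmin, ymin, xmax, ymax] in image
# coordinates; adapt the keys to whatever your cell structures look like.
from PIL import Image

def ocr_cells_individually(image_path, cells, ocr_fn):
    image = Image.open(image_path)
    for cell in cells:
        xmin, ymin, xmax, ymax = cell["bbox"]
        crop = image.crop((xmin, ymin, xmax, ymax))
        # ocr_fn is any function that maps a PIL image to a text string,
        # e.g. a wrapper around PyTesseract or an Azure OCR call.
        cell["text"] = ocr_fn(crop)
    return cells
```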
In my view, the best solution would be to get better OCR. Your case is a tricky one; it's easy to understand why the OCR naively assumes that characters stacked vertically on top of each other belong together as a word.
You could try PyTesseract as an open source solution. I've also been very impressed with OCR from Azure Cognitive Services. I suggest giving these a try.
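For instance, a minimal PyTesseract sketch for getting word-level boxes (assuming the Tesseract binary is installed; the image path is a placeholder) would be:

```python
# Minimal PyTesseract sketch: get word-level text and bounding boxes for a
# table image. Requires the Tesseract binary to be installed on the system.
import pytesseract
from pytesseract import Output
from PIL import Image

image = Image.open("table.png")  # placeholder path
data = pytesseract.image_to_data(image, output_type=Output.DICT)

tokens = []
for i, text in enumerate(data["text"]):
    if not text.strip():
        continue  # skip empty detections
    x, y = data["left"][i], data["top"][i]
    w, h = data["width"][i], data["height"][i]
    tokens.append({"bbox": [x, y, x + w, y + h], "text": text})
```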
Cheers, Brandon
Thanks for the answer.
The thing is that the post-processing is quite useful in some other cases, so I'd prefer to keep this step. I will try to find a way to improve the OCR bounding boxes.
Cheers, Roberto
Hello @RobAcc22, can you share the inference code for TSR? Thanks in advance.