Process text in images of tables

Open alexdewar opened this issue 9 months ago • 6 comments

It might be useful to process text in tables which are represented as images. The code in main currently contains a file (tableimage.py) which ostensibly can do this processing, although there's no way for users to access this functionality via the command-line interface and it's slated for removal in #151. As with #169, if anyone wants this functionality they can retrieve the old version of the code from the git history, fix it up and resubmit.

The old version of the code is available in the v1.1.0 release (commit 2fefd955e07e42253e41ca167ef60a2dc8da100f).

alexdewar avatar Mar 25 '25 13:03 alexdewar

Can this be linked to the automated file extension detection, so that if a file is an image it is then processed? This is how it was done before: the filename had to have the same ID as the main paper with an added suffix, for example _table1. All tables were then combined into one output file, as is done for inline tables.
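As a rough illustration of the naming convention described above, here is a minimal sketch that groups table images by paper ID. It is not the original Auto-CORPus code: the function name `group_table_images`, the list of image extensions and the exact `<paper ID>_table<N>` pattern are all assumptions made for the example.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed set of image extensions; the real detection logic may differ.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

# Assumed pattern: "<paper ID>_table<N>", e.g. "PMC123_table1".
TABLE_STEM = re.compile(r"^(?P<paper_id>.+)_table(?P<index>\d+)$")


def group_table_images(paths):
    """Map each paper ID to its table image paths, ordered by table number.

    Hypothetical helper: files that are not images, or whose names do not
    follow the assumed pattern, are ignored.
    """
    grouped = defaultdict(list)
    for path in map(Path, paths):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # not an image file
        match = TABLE_STEM.match(path.stem)
        if match is None:
            continue  # image, but not named like a table image
        grouped[match["paper_id"]].append((int(match["index"]), path))
    # Sort by table number so tables can be combined in a stable order.
    return {
        paper_id: [p for _, p in sorted(items, key=lambda item: item[0])]
        for paper_id, items in grouped.items()
    }
```

Each list in the result could then be passed to whatever OCR step is chosen and written to a single tables output file per paper, mirroring what is done for inline tables.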

jmp111 avatar Mar 25 '25 15:03 jmp111

I've had a look at the code and it seems like it wouldn't be too hard to plumb it in... It already made a list of table image files but then didn't use it anywhere, so the code for actually processing those files was never run. I'm not sure if this is just because adding this feature was low priority or if there were problems with the processing code. Maybe @Thomas-Rowlands knows?

Perhaps I was a bit too hasty in removing it -- sorry! We can always revert some of my surgery if we do want this functionality. It might not be a bad thing for it to be submitted as a new PR anyway so we can review it properly and add tests etc. (like the DAG code it might well not work anyway if it hasn't been used in a while).

alexdewar avatar Mar 25 '25 17:03 alexdewar

Ohhhhh I see what's happened. The functionality was accidentally removed in #149 and then I later noticed, assumed it was never used, and ripped out all the leftover bits. Oops. My bad.

I can open a PR to put it back, but I suppose while we're at it, it might not be a bad point to check that the code actually works. Does anyone happen to have any sample table images we can run it over? I could try to grab something off Google, but it would be nice if it was actual data from a paper.

alexdewar avatar Mar 25 '25 17:03 alexdewar

Do you mean the part Thomas removed because it wasn't implemented? See his comment

AdrianDAlessandro avatar Mar 26 '25 09:03 AdrianDAlessandro

Pretty much all OCR code is unused right now. We used to use Tesseract years ago, but it was experimental as far as I know and never fully implemented as a "polished" product. It is something I believe we are going to revisit in future, after checking the SOTA and what we want out of it, etc.

Thomas-Rowlands avatar Mar 26 '25 12:03 Thomas-Rowlands

Ah ok. Phew! Glad I didn't just delete something people are actually using 😆.

If we decide to implement it, we should do #79 at the same time, so that people who install AC via pip or whatever can use this functionality out of the box.

alexdewar avatar Mar 27 '25 09:03 alexdewar