Helping tabula find the top of a table - column heading cribs?
I think my question belongs more here than in tabula-extractor; see https://github.com/tabulapdf/tabula-extractor/issues/112
Original comment by psychemedia:
When parsing large documents with tables placed in arbitrary locations on a page, I wonder if it would be useful to help Tabula get its eye in as to the location of a table by giving it one or more keywords that you expect to see, or require, in the table column headings?
So for example we might provide a set of required heading tokens (Date, Region) that must appear in a tokenised set generated from words in guessed at column headings to help identify a particular table or sort of table, or a set of possible heading tokens that we know often appear in the headings of tables we want to extract, though we're also open to Tabula extracting other things it thinks are tables?
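A minimal sketch of that required/possible token check, in plain Python. This is not anything Tabula provides: `tokenise` and `heading_score` are hypothetical names, and the input is assumed to be the list of cell strings from a guessed-at heading row.

```python
import re


def tokenise(cells):
    """Collect lower-cased word tokens from a list of heading cell strings."""
    tokens = set()
    for cell in cells:
        tokens.update(re.findall(r"[a-z0-9]+", cell.lower()))
    return tokens


def heading_score(cells, required, optional=()):
    """Score a candidate heading row (hypothetical helper).

    Returns None when any required token is missing, otherwise the number
    of optional tokens that appear in the heading.
    """
    tokens = tokenise(cells)
    if not {t.lower() for t in required} <= tokens:
        return None
    return len({t.lower() for t in optional} & tokens)
```

A row that returns None would be rejected outright; among the rows that pass, a higher score suggests a more likely match for the table being sought.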
My comment: I wonder if this has gotten anywhere? I'm writing a bank-statement parser, but the table detection can be very fickle. In essence I extract the whole page, and unfortunately Tabula doesn't always find the tables: depending on the contents, it will group one or more columns together, making it really difficult to work with the data.
I was just thinking of doing something very similar in my app to what is suggested above:
- extract everything from the page
- find the keywords that mark the start/end of a table
- rerun the table extraction process using just those coordinates as start/end (hoping that it will now work). As I only have tables that span the whole page I will only be using the Y-coordinates
I have empirically verified that this works with a few examples using the Tabula UI, so I think I will give it a try. However, if it already exists, or people have better ideas, I would be delighted to hear.
@Darkvater you should talk with @dbangera23 and his team, who are implementing something like this in Tabula UI and Tabula-Java (their fork). I think they're calling it "Regex Search".
See this as well: https://github.com/dbangera23/Tabula_Senior_Design_Project/blob/490633b2e51da86cb02fc4217d2077c2661fe1bd/lib/tabula_job_executor/jobs/regex_search.rb
Hey guys,
Our team has been working on a similar feature. We have been calling it regex searching. While it isn't true regex-based searching currently, we hope later changes could implement that.
Currently we have implemented 2-string and 4-string searching. Basically, given 2 strings we can specify the top and bottom of a table within a page, and given 4 strings we can do a 4-corner-based search.
Our team has been working on this as part of a senior design project. The latest version of our code is currently in a private repo. We hope to start incorporating it into, or creating a new fork of, the original tabula project soon.
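To make the idea concrete, here is a rough sketch of how 2-string and 4-string searches could map to an extraction area. It is a guess at the semantics, not our actual code: the `Word` type, the helper names, the plain substring matching, and the reading of "4 corner" as "bounding box of four marker words" are all assumptions. Coordinates are assumed to have the origin at the top-left of the page, with y increasing downwards.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A piece of page text with its bounding box (hypothetical type)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def find_first(words, needle, below=None):
    """First word in reading order containing `needle`, optionally below a y value."""
    for w in sorted(words, key=lambda w: (w.top, w.left)):
        if needle in w.text and (below is None or w.top >= below):
            return w
    return None


def area_from_two(words, top_text, bottom_text, page_width):
    """2-string search: full-width band from the top marker to the bottom marker."""
    start = find_first(words, top_text)
    if start is None:
        return None
    end = find_first(words, bottom_text, below=start.bottom)
    if end is None:
        return None
    return (start.top, 0.0, end.bottom, page_width)


def area_from_four(words, corner_texts):
    """4-string search: bounding box around four corner marker words."""
    hits = [find_first(words, t) for t in corner_texts]
    if any(h is None for h in hits):
        return None
    return (
        min(h.top for h in hits),
        min(h.left for h in hits),
        max(h.bottom for h in hits),
        max(h.right for h in hits),
    )
```

Both functions return the area as (top, left, bottom, right), or None when a marker string is not found on the page.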
Hi, that is good news!
What I have done for now is a two-tiered scan of a page. First I look for the table start based on (regex) keywords, then for the bottom, again based on a regex. Each regex matches the whole line and checks whether the keywords are present.
I then take the top-left and bottom-right coordinates including some padding (in my case 10px) and have tabula re-evaluate based on this bounding box.
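Roughly, the scan looks like the sketch below. It is a simplified Python illustration, not my actual code: the `Line` type and `find_table_box` are hypothetical names, the text lines with their coordinates are assumed to come from a first whole-page extraction, and taking the left/right edges from every line between the two matches is a choice made for this sketch.

```python
import re
from typing import NamedTuple


class Line(NamedTuple):
    """One text line from the first pass, with its bounding box (hypothetical type)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def find_table_box(lines, start_re, end_re, page_width, page_height, pad=10.0):
    """Return a padded (top, left, bottom, right) box, or None if no table is found."""
    ordered = sorted(lines, key=lambda l: l.top)
    # Tier 1: the first line matching the start pattern marks the table top.
    start = next((i for i, l in enumerate(ordered) if start_re.search(l.text)), None)
    if start is None:
        return None
    # Tier 2: the first later line matching the end pattern marks the bottom.
    end = next(
        (i for i in range(start + 1, len(ordered)) if end_re.search(ordered[i].text)),
        None,
    )
    if end is None:
        return None
    span = ordered[start:end + 1]
    # Pad the box, then clamp it to the page.
    top = max(0.0, span[0].top - pad)
    left = max(0.0, min(l.left for l in span) - pad)
    bottom = min(page_height, span[-1].bottom + pad)
    right = min(page_width, max(l.right for l in span) + pad)
    return (top, left, bottom, right)


# Whole-line pattern that requires every keyword to be present (example keywords).
START = re.compile(r"^(?=.*\bDate\b)(?=.*\bDescription\b)(?=.*\bBalance\b).*$", re.I)
END = re.compile(r"closing balance", re.I)
```

The resulting box is in the top, left, bottom, right order that tabula-java's `--area` option expects, so it can be fed to a second extraction run restricted to that region.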
Very similar to your way, but obviously not generic and outside of tabula. It works quite well: I get proper column boundaries more consistently, whereas before tabula would sometimes get very confused if a whole page was parsed for tables.
I'll keep an eye on your project!
Did any of that forked work get merged back into this project?
@Darkvater can you share the code you worked on please?
It looks like there was a PR that allowed some sort of regex based filtering that was approved but not merged? https://github.com/tabulapdf/tabula-java/pull/217