Tables on multiple pages

Open lucmartinon opened this issue 5 years ago • 4 comments

Hey there! This is more of a question than an issue, sorry! I am using Camelot to extract data from PDFs, some of which are large. I have a lot of cases where a table spans more than one page. In some cases, like this one: https://snipboard.io/dMEuF7.jpg, the table on the first page has 14 columns as expected, but the one on the second page has only 13: the leftmost column disappears because there is no line (it is actually one merged cell that runs from page 1 to page 6).

Is there a way to

  • force camelot to extract only one table, or
  • extract the column positions from the table extracted from the first page? This way I could use the "columns" parameter from the docs: https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-column-separators
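
For the second option, here is a minimal sketch of what this could look like. It assumes that the `cols` attribute of a Camelot `Table` holds one (left, right) x-coordinate pair per column, that the stream parser finds the page-1 table acceptably (`columns` only applies to `flavor="stream"`), and that each later page holds a single table. The helper names are hypothetical and not part of Camelot.

```python
def separators_from_cols(cols):
    """Build a Camelot `columns` string from (x_left, x_right) column spans.

    The separators are the interior boundaries, i.e. the right edge of
    every column except the last one.
    """
    spans = sorted(cols)  # left-to-right, whatever order they arrive in
    return ",".join(f"{right:.1f}" for _, right in spans[:-1])


def read_with_first_page_columns(pdf_path, first_page="1", other_pages="2-end"):
    """Reuse the page-1 column positions on later pages (hypothetical helper)."""
    import camelot  # imported lazily so the helper above works without it

    first = camelot.read_pdf(pdf_path, pages=first_page, flavor="stream")
    separators = separators_from_cols(first[0].cols)
    # Assumption: one table per later page, so one separator string is enough.
    rest = camelot.read_pdf(
        pdf_path, pages=other_pages, flavor="stream", columns=[separators]
    )
    return first[0], list(rest)
```

Whether the separators found on page 1 line up with the later pages depends on the PDF, so it is worth checking the result visually on a page or two.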

Thanks a lot! Luc

lucmartinon avatar Apr 18 '20 16:04 lucmartinon

Hi, did you find a solution for tables on multiple pages? I am getting the same issue: when reading such tables, it treats each page as a new table.
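
If one table per page is all you can get, a possible workaround is to stitch the pages back together after extraction. This is only a sketch: it works on plain lists of rows, and it assumes that a page with fewer columns lost them on the left, as in the screenshot above. `merge_page_tables` is a hypothetical helper, not a Camelot function.

```python
def merge_page_tables(page_tables, expected_cols):
    """Stack per-page tables (lists of rows) into a single list of rows.

    Assumption: a row narrower than `expected_cols` lost columns on the
    left, so it is left-padded with empty strings.
    """
    merged = []
    for rows in page_tables:
        for row in rows:
            missing = expected_cols - len(row)
            if missing < 0:
                raise ValueError(
                    f"row has {len(row)} cells, expected at most {expected_cols}"
                )
            merged.append([""] * missing + list(row))
    return merged
```

With Camelot, the input could be built with something like `[t.df.values.tolist() for t in tables]`, with `expected_cols=14` for the table described in the question.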

idea1002 avatar Jun 12 '20 18:06 idea1002

Hey,

I didn't find any solution within Camelot, no. Instead, I switched to first converting the PDFs to XLSX using commercial products, then importing the data. It was simply much faster and easier. Converters tested:

  • smallPDF (10 €/month): works well, also on locked PDFs, but generates an Excel file with one tab per PDF table, so you may have to harmonize the tabs manually. If it really is the same table running over many pages, though, it will likely end up in one tab.
  • Adobe (17 € annually, for the online converter): always converts to a single tab in Excel, and works better overall, in my case at least. Doesn't work with locked PDFs.

Both have a free trial that allows testing. There may be more than these, I stopped searching because I was happy with the result.

lucmartinon avatar Jun 15 '20 09:06 lucmartinon

You can implement a method that accepts templates as parameters for each page, something like the tabula.io.read_pdf_with_template() method. You can find more about it here and here

P.S. - Tabula doesn't have it properly implemented. It could be a great addition to Camelot, and it is something I am actively looking for!
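
To make the idea concrete, here is a rough sketch of a per-page template wrapper around `camelot.read_pdf`. The template format (a dict mapping page numbers to optional `table_areas` and `columns` lists) and both function names are assumptions made up for illustration. Camelot does not ship anything like this; only `read_pdf` and its `pages`, `flavor`, `table_areas` and `columns` arguments are real.

```python
def kwargs_for_page(page, templates, default=None):
    """Return the read_pdf keyword arguments for one page.

    `templates` maps a page number to a dict with optional "table_areas"
    and "columns" lists; pages without an entry fall back to `default`.
    """
    template = templates.get(page, default or {})
    kwargs = {"pages": str(page), "flavor": "stream"}
    for key in ("table_areas", "columns"):
        if template.get(key):
            kwargs[key] = list(template[key])
    return kwargs


def read_pdf_with_templates(pdf_path, pages, templates, default=None):
    """Read each page with its own template (hypothetical helper)."""
    import camelot  # imported lazily so kwargs_for_page works without it

    tables = []
    for page in pages:
        tables.extend(
            camelot.read_pdf(pdf_path, **kwargs_for_page(page, templates, default))
        )
    return tables
```

Reading page by page is slower than one call over a page range, but it keeps each page's areas and separators independent.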

KunalGehlot avatar Jul 19 '21 15:07 KunalGehlot

image

qianxuanyon avatar Aug 20 '21 04:08 qianxuanyon