Same page exported multiple times

Open dml5 opened this issue 5 years ago • 1 comments

I am extracting tables from a large PDF and exporting them to CSV, but I am getting multiple files per page. If the PDF is 1000 pages long, the expected output is 1000 CSV files, one for each page. The original PDF does not have duplicate tables or more than one table on a page. How can I stop this? I can delete the extra files, but I don't want to delete them if they are required, and I can't go through 1000 pages manually every time I run it to check.

from_pdf-page-46-table-1.csv
from_pdf-page-46-table-2.csv
from_pdf-page-46-table-3.csv
from_pdf-page-46-table-4.csv

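import pandas as pd
import glob
import camelot

# Extract tables from pages 1-1000 with the stream parser
tables = camelot.read_pdf('C:\\temp\\to_csv.pdf', pages='1-1000', row_tol=4, flavor='stream')
# Write one CSV per detected table (from_pdf-page-N-table-M.csv)
tables.export('c:\\temp\\from_pdf.csv', f='csv', compress=False)

# Combine every CSV in the folder into a single Excel file
filepaths = glob.glob('C:\\temp\\*.csv')
df = pd.concat(map(pd.read_csv, filepaths))
df.to_excel("c:\\temp\\from_pdf.xlsx")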

Edit: I also tried flavor='lattice' and got this error: error: C:\ci\opencv_1512688052760\work\modules\core\src\matrix.cpp:436: error: (-215) u != 0 in function cv::Mat::create

I don't have a c:\ci directory on my computer.
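
To avoid checking 1000 pages by hand, the exported file names themselves can be used to find pages that produced more than one CSV. The following is a minimal sketch, assuming the files follow the `-page-N-table-M.csv` naming shown above; `tables_per_page` and `pages_with_extra_tables` are hypothetical helper names, not part of camelot.

```python
import re
from collections import Counter

# Pattern based on the file names listed in the issue,
# e.g. from_pdf-page-46-table-1.csv (an assumption about the naming scheme).
NAME_RE = re.compile(r"-page-(\d+)-table-(\d+)\.csv$")


def tables_per_page(filenames):
    """Return a Counter mapping page number -> number of exported CSV files."""
    counts = Counter()
    for name in filenames:
        match = NAME_RE.search(name)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def pages_with_extra_tables(filenames, expected=1):
    """Return {page: count} for pages with more than `expected` CSV files."""
    counts = tables_per_page(filenames)
    return {page: n for page, n in sorted(counts.items()) if n > expected}


if __name__ == "__main__":
    import glob

    # Same output folder as in the script above.
    files = glob.glob("C:\\temp\\from_pdf-page-*-table-*.csv")
    for page, n in pages_with_extra_tables(files).items():
        print(f"page {page}: {n} tables")
```

This only reports which pages to look at; it does not decide whether the extra tables are duplicates or real detections.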

Nov 14 '20 22:11 dml5

Same problem encountered here.

On multiple documents, the tables on specific pages are detected twice. On the duplicate, the page attribute is wrong (wrong page number).

EDIT: Found the problem. It seems that the pages argument of camelot.read_pdf() is 1-based, not 0-based. Passing pages="0-end" therefore leads to the last page of the document being read twice. Changing it to "1-end" fixes the issue.
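
A small guard can catch a 0-based page string before it reaches camelot. This is only a sketch of the fix described above: `check_pages_spec` and `read_all_pages` are hypothetical helpers, and the check assumes the comma-and-dash page syntax used in this thread (for example "1-end" or "1,3,5-7").

```python
def check_pages_spec(pages):
    """Raise ValueError if a camelot-style pages string mentions page 0.

    Non-numeric tokens such as "end" or "all" are passed through unchanged.
    """
    for part in pages.split(","):
        for token in part.strip().split("-"):
            if token.isdigit() and int(token) < 1:
                raise ValueError(
                    f"pages={pages!r} contains page {token}; page numbers start at 1"
                )
    return pages


def read_all_pages(pdf_path, flavor="stream"):
    """Read every page of pdf_path, starting from page 1."""
    import camelot  # imported here so the check above works without camelot installed

    return camelot.read_pdf(pdf_path, pages=check_pages_spec("1-end"), flavor=flavor)
```

With this in place, `check_pages_spec("0-end")` fails immediately instead of silently producing a duplicated last page.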

Jul 12 '24 08:07 MathieuCiancone