Same page exported multiple times
I am extracting the tables from a large PDF and exporting them to CSV, but I am getting several CSV files for the same page. If the PDF is 1000 pages long, the expected output is 1000 CSV files, one for each page. The original PDF does not have duplicate tables or more than one table on a page. How can I stop this? I could delete the extra files, but I don't want to delete them if they are needed, and I can't check 1000 pages manually every time I run it.
from_pdf-page-46-table-1.csv
from_pdf-page-46-table-2.csv
from_pdf-page-46-table-3.csv
from_pdf-page-46-table-4.csv
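import glob

import camelot
import pandas as pd

# Extract tables from pages 1-1000 with the stream parser.
tables = camelot.read_pdf(r'C:\temp\to_csv.pdf', pages='1-1000', row_tol=4, flavor='stream')
# Writes one file per table: from_pdf-page-<N>-table-<M>.csv
tables.export(r'C:\temp\from_pdf.csv', f='csv', compress=False)
# Combine only the files exported above, not every CSV in the folder.
filepaths = glob.glob(r'C:\temp\from_pdf-page-*.csv')
df = pd.concat(map(pd.read_csv, filepaths))
df.to_excel(r'C:\temp\from_pdf.xlsx')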
Edit: I also tried flavor='lattice' and got this error:
error: C:\ci\opencv_1512688052760\work\modules\core\src\matrix.cpp:436: error: (-215) u != 0 in function cv::Mat::create
I don't have a c:\ci directory on my computer.
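For the question of checking the output without opening 1000 pages by hand, here is a minimal sketch that groups the exported files by page number and flags pages with more than one table. It assumes the file naming shown above (from_pdf-page-46-table-1.csv); the function names are made up for illustration and are not part of camelot. It only marks files that are byte-for-byte identical as duplicates, so a page that the stream parser split into several different tables would still need a manual look.

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path

# Matches names like "from_pdf-page-46-table-1.csv" (naming taken from the post).
NAME_RE = re.compile(r"-page-(\d+)-table-(\d+)\.csv$")


def tables_by_page(folder):
    """Map page number -> list of CSV paths exported for that page."""
    pages = defaultdict(list)
    for path in sorted(Path(folder).glob("*.csv")):
        match = NAME_RE.search(path.name)
        if match:  # ignore CSV files that do not follow the export naming
            pages[int(match.group(1))].append(path)
    return dict(pages)


def exact_duplicates(paths):
    """Return the paths whose bytes are identical to an earlier file in `paths`."""
    seen = set()
    dupes = []
    for path in paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            dupes.append(path)
        else:
            seen.add(digest)
    return dupes


def pages_with_extra_tables(folder):
    """Return {page: {"files": [...], "duplicates": [...]}} for pages with >1 table."""
    result = {}
    for page, paths in sorted(tables_by_page(folder).items()):
        if len(paths) > 1:
            result[page] = {"files": paths, "duplicates": exact_duplicates(paths)}
    return result
```

Calling pages_with_extra_tables(r'C:\temp') after the export would give a short list of pages to inspect; files listed under "duplicates" are candidates for deletion, while the rest differ in content and are worth a look first.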
Same problem encountered here.
On multiple documents, the tables on specific pages are detected twice.
On the duplicate, the page attribute is wrong (wrong page number).
EDIT: Found the problem. It seems that the pages argument of camelot.read_pdf() is 1-based, not 0-based. Passing pages="0-end" therefore leads to the last page of the document being read twice. Changing it to "1-end" fixes the issue.
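To catch this before a long extraction run, a small guard can reject a pages string that contains page 0. This is only a sketch: check_pages is a hypothetical helper, not something camelot provides, and it assumes the pages string uses the comma-separated form with ranges, "end", and "all" (for example "1,3,4" or "1,4-end").

```python
def check_pages(pages):
    """Return `pages` unchanged, or raise ValueError if any page number is below 1.

    Hypothetical helper: camelot itself does not ship this function.
    """
    if pages == "all":
        return pages
    for part in pages.split(","):
        for bound in part.strip().split("-"):
            if bound == "end":
                continue
            # Anything that is not a positive integer (including "0") is rejected.
            if not bound.isdigit() or int(bound) < 1:
                raise ValueError(
                    f"pages must be 1-based; got {part.strip()!r} in {pages!r}"
                )
    return pages
```

It could then wrap the argument at the call site, e.g. camelot.read_pdf(path, pages=check_pages("1-end")), so that "0-end" fails immediately instead of silently producing a duplicated last page.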