camelot
Not able to extract a DataFrame from bank statement PDFs
Neither tabula nor camelot is able to extract tables from bank statement PDFs like the attached sample.
- The table area is not fixed, i.e. its coordinates change with every month's statement.
- Neither lattice nor stream mode works; both always return an empty DataFrame that contains only the column names:
  C:\Users\vikas\Desktop\GreenariaSociety\Tools>python sample.py
  <class 'pandas.core.frame.DataFrame'>
  Empty DataFrame
  Columns: [DATE, MODE, PARTICULARS, DEPOSITS, WITHDRAWALS, BALANCE]
  Index: []
- When a column has values that span multiple lines (e.g. PARTICULARS/DESCRIPTIONS in bank statements), the cell data is not extracted correctly and is spread across other cells/rows.
Sample code used:
import tabula
# pdf_path is the path to the bank statement PDF
df = tabula.read_pdf(pdf_path, pages="1", stream=True, multiple_tables=True)[0]  # also tried lattice, pages='all', etc.
print(type(df))
print(df)
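Since the table coordinates move from month to month, one option that avoids fixed areas is to extract the whole page and then find the table by its header text. The following is only a rough sketch of that idea: it assumes the extracted rows are available as plain lists of strings (for example from `df.values.tolist()`), the helper names `find_header_row` and `rows_below_header` are hypothetical, and the sample rows are made up for illustration.

```python
# Column names taken from the empty DataFrame output above.
EXPECTED = ["DATE", "MODE", "PARTICULARS", "DEPOSITS", "WITHDRAWALS", "BALANCE"]


def find_header_row(rows, expected=EXPECTED):
    """Return the index of the first row containing every expected name, or -1."""
    wanted = {name.upper() for name in expected}
    for i, row in enumerate(rows):
        cells = {str(cell).strip().upper() for cell in row}
        if wanted <= cells:
            return i
    return -1


def rows_below_header(rows, expected=EXPECTED):
    """Return the rows that follow the header row, or [] if no header is found."""
    i = find_header_row(rows, expected)
    return rows[i + 1:] if i >= 0 else []


# Hypothetical rows, shaped like the result of df.values.tolist()
sample = [
    ["Statement of account", "", "", "", "", ""],
    ["DATE", "MODE", "PARTICULARS", "DEPOSITS", "WITHDRAWALS", "BALANCE"],
    ["01-01-2020", "", "B/F", "", "", "1,000.00"],
]
print(find_header_row(sample))
print(rows_below_header(sample))
```

This only helps once the extraction step returns some text for the page; it does not by itself fix the empty DataFrame.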
Hello @arreyaar,
I tried extracting the same table with Camelot; the output is attached for your reference: sample_output.csv
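import camelot
inputpdf = r'D:\Personal\ope_source\sample_bank_statement.pdf'
# Stream flavor; edge_tol controls how far text edges are extended vertically
tables = camelot.read_pdf(inputpdf, pages='1', flavor='stream', edge_tol=500)
# DataFrame of the second table detected on the page
tables[1].df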
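For the multi-line PARTICULARS values, stream-style extraction tends to put each wrapped line in its own row. A possible post-processing step is to fold those continuation rows back into the row above. This is a minimal sketch, not part of Camelot: `merge_continuation_rows` is a hypothetical helper, it assumes a continuation row is one where only the text column is filled, and the sample rows are invented for illustration.

```python
def merge_continuation_rows(rows, text_col=2):
    """Fold rows that only carry text in text_col into the preceding row."""
    merged = []
    for row in rows:
        row = [str(cell).strip() for cell in row]
        # Assumption: a wrapped line has text in text_col and nothing elsewhere.
        others_empty = all(c == "" for j, c in enumerate(row) if j != text_col)
        if merged and row[text_col] != "" and others_empty:
            merged[-1][text_col] = (merged[-1][text_col] + " " + row[text_col]).strip()
        else:
            merged.append(row)
    return merged


# Hypothetical rows: DATE, MODE, PARTICULARS, DEPOSITS, WITHDRAWALS, BALANCE
rows = [
    ["01-01-2020", "", "UPI/12345/PAYMENT TO", "", "500.00", "9,500.00"],
    ["", "", "SOME MERCHANT", "", "", ""],
    ["02-01-2020", "", "CASH DEPOSIT", "1,000.00", "", "10,500.00"],
]
for row in merge_continuation_rows(rows):
    print(row)
```

With the Camelot result above, the input could be built with `tables[1].df.values.tolist()`, and `text_col` would need to match whichever column holds the descriptions in your statement.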