Unable to extract a DataFrame from bank statement PDFs

Open arreyaar opened this issue 2 years ago • 1 comments

tabula and camelot are both unable to extract tables from bank statement PDFs such as the attached sample.

The table area is not fixed, i.e. the coordinates change with every month's statement.

Neither lattice nor stream mode works; both always return an empty DataFrame that contains only the column names:

C:\Users\vikas\Desktop\GreenariaSociety\Tools>python sample.py
<class 'pandas.core.frame.DataFrame'>
Empty DataFrame
Columns: [DATE, MODE, PARTICULARS, DEPOSITS, WITHDRAWALS, BALANCE]
Index: []

Also, when a column has values that span multiple lines (e.g. PARTICULARS/DESCRIPTIONS in bank statements), the cell data is not extracted correctly and is spread across other cells/rows.

Sample code used:

import tabula

# also tried lattice, pages='all', etc.
df = tabula.read_pdf(pdf_path, pages="1", stream=True, multiple_tables=True)[0]
print(type(df))
print(df)

sample_bank_statement.pdf

Mar 04 '23 08:03 arreyaar

Hello @arreyaar,

I tried extracting the same table using Camelot. The output is attached for your reference: sample_output.csv

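import camelot

# Raw string so the backslashes in the Windows path are not treated as escapes
inputpdf = r'D:\Personal\ope_source\sample_bank_statement.pdf'
# Stream flavor; a large edge_tol extends text edges vertically when detecting the table area
tables = camelot.read_pdf(inputpdf, pages='1', flavor='stream', edge_tol=500)
# tables[1] is the second table detected on the page
print(tables[1].df)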
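
If the extracted table still splits multi-line PARTICULARS values across several rows, a pandas post-processing step along these lines might help. This is only a sketch: the helper name merge_continuation_rows is hypothetical, the sample rows are made up, and it assumes the header row has already been promoted to column names and that a continuation row can be recognised by an empty DATE cell, which may not hold for every bank's layout.

```python
import pandas as pd


def merge_continuation_rows(df, key_col="DATE", text_col="PARTICULARS"):
    """Fold rows whose key_col is empty into the previous row's text_col.

    Hypothetical helper: assumes a row with an empty key_col is the
    continuation of the transaction on the row above it.
    """
    merged = []
    for _, row in df.iterrows():
        record = row.to_dict()
        key = record[key_col]
        is_blank = pd.isna(key) or str(key).strip() == ""
        if is_blank and merged:
            extra = record[text_col]
            if not pd.isna(extra) and str(extra).strip():
                prev = merged[-1][text_col]
                prev = "" if pd.isna(prev) else str(prev)
                # Append the continuation text to the previous transaction
                merged[-1][text_col] = f"{prev} {str(extra).strip()}".strip()
        else:
            merged.append(record)
    return pd.DataFrame(merged, columns=df.columns)


# Made-up rows shaped like a stream-mode extraction with a wrapped description
raw = pd.DataFrame(
    {
        "DATE": ["01-02-2023", "", "02-02-2023"],
        "PARTICULARS": ["UPI/12345/GROCERY", "STORE PAYMENT", "ATM WITHDRAWAL"],
        "BALANCE": ["1000.00", "", "500.00"],
    }
)
clean = merge_continuation_rows(raw)
print(clean)
```

Only the text column is merged here; any amounts that happen to sit on a continuation row would be dropped, so it is worth checking the output against the PDF.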

Oct 10 '23 18:10 kdshreyas