
Unable to extract a DataFrame from bank statement PDFs

Open • arreyaar opened this issue 2 years ago • 1 comment

Neither tabula nor Camelot is able to extract tables from bank statement PDFs like the attached sample:

  1. The table area is not fixed, i.e. its coordinates change with every month's statement.
  2. Neither lattice nor stream mode works; both always return an empty DataFrame that contains only the column names:

       C:\Users\vikas\Desktop\GreenariaSociety\Tools>python sample.py
       <class 'pandas.core.frame.DataFrame'>
       Empty DataFrame
       Columns: [DATE, MODE, PARTICULARS, DEPOSITS, WITHDRAWALS, BALANCE]
       Index: []

  3. When a column has values that span multiple lines (e.g. PARTICULARS/DESCRIPTIONS in bank statements), the cell data is not extracted correctly and is spread across other cells/rows.

Sample code used:

import tabula

# also tried lattice, pages='all', etc.
df = tabula.read_pdf(pdf_path, pages="1", stream=True, multiple_tables=True)[0]
print(type(df))
print(df)

sample_bank_statement.pdf

arreyaar, Mar 04 '23 08:03

Hello @arreyaar,

I tried extracting the same table using Camelot. The output is attached for your reference: sample_output.csv

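import camelot

inputpdf = r'D:\Personal\ope_source\sample_bank_statement.pdf'
# Stream flavor; edge_tol (default 50) is the tolerance for extending text edges vertically, raised here to 500
tables = camelot.read_pdf(inputpdf, pages='1', flavor='stream', edge_tol=500)
# Second table detected on the page, as a pandas DataFrame
tables[1].df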
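
Since the table position changes every month and long PARTICULARS values wrap onto extra rows, some post-processing of the extracted rows may help. Below is a minimal sketch, not part of the run above: it looks for the header row by its column names instead of relying on a fixed index, then folds rows with an empty DATE cell into the row before them. The helper names (`find_header_index`, `merge_continuation_rows`, `clean_statement_rows`) and the sample rows are made up for illustration, and it assumes wrapped lines come out as rows whose DATE cell is empty, which may not hold for every statement layout.

```python
# Hypothetical helpers operating on plain lists of cell strings.
# With Camelot, such rows could be obtained via tables[i].df.values.tolist().

EXPECTED = ["DATE", "MODE", "PARTICULARS", "DEPOSITS", "WITHDRAWALS", "BALANCE"]


def find_header_index(rows, expected=EXPECTED):
    """Return the index of the first row containing every expected header, or None."""
    wanted = {name.upper() for name in expected}
    for i, row in enumerate(rows):
        cells = {str(cell).strip().upper() for cell in row}
        if wanted <= cells:
            return i
    return None


def merge_continuation_rows(rows, key_col=0):
    """Fold rows whose key cell (DATE by default) is empty into the previous row."""
    merged = []
    for row in rows:
        cells = [str(cell).strip() for cell in row]
        if merged and not cells[key_col]:
            previous = merged[-1]
            for j, text in enumerate(cells):
                if text:
                    # Append the wrapped text to the same column of the previous row
                    previous[j] = (previous[j] + " " + text).strip()
        else:
            merged.append(cells)
    return merged


def clean_statement_rows(rows, expected=EXPECTED):
    """Return (header, body) for the table found in rows, or (None, []) if no header matches."""
    start = find_header_index(rows, expected)
    if start is None:
        return None, []
    header = [str(cell).strip() for cell in rows[start]]
    body = merge_continuation_rows(rows[start + 1:])
    return header, body


# Made-up rows imitating stream-mode output with one wrapped PARTICULARS value
sample_rows = [
    ["", "", "Statement of account", "", "", ""],
    ["DATE", "MODE", "PARTICULARS", "DEPOSITS", "WITHDRAWALS", "BALANCE"],
    ["01-01-2023", "", "B/F", "", "", "1000.00"],
    ["02-01-2023", "UPI", "UPI/12345/payment to", "", "200.00", "800.00"],
    ["", "", "example store", "", "", ""],
]

header, body = clean_statement_rows(sample_rows)
for row in body:
    print(row)
```

To pick the right table without hard-coding `tables[1]`, the same `find_header_index` check could be run over every table returned for a page, keeping the first one where it does not return `None`.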

kdshreyas, Oct 10 '23 18:10