excalibur
Having a hard time parsing documents from Curitiba's town hall
Hi
I have started a small personal project to collect, monitor and consolidate government data (especially for Curitiba, the town where I live). Some interesting numbers are available, but only in PDF format.
I have tried many different parameters in excalibur, but I am having a hard time getting it to break the rows correctly. Since the web UI has a link for reporting extraction problems, here I am :) The final implementation will use camelot directly from Python, but first I need to figure out whether these documents can be parsed, and how.
This is the document I was actually trying to parse: https://mid.curitiba.pr.gov.br/contaspublicas/2020/01/An7_RP_1B20.pdf
But that's not the only format. I have also found one that has tables across multiple pages: https://mid.curitiba.pr.gov.br/contaspublicas/2020/02/An2_Fun_RREO_2B20.pdf
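For context, here is a minimal sketch of what the eventual Python side might look like. It assumes the PDFs have already been downloaded locally, that the `stream` flavor with a `row_tol` value is a reasonable starting point (not confirmed to fix the row breaks), and that the header row repeats on each page of a multi-page table. `merge_page_tables` and `extract_rows` are hypothetical helper names, not part of camelot.

```python
from typing import List

Row = List[str]


def merge_page_tables(pages: List[List[Row]]) -> List[Row]:
    """Concatenate per-page row lists, keeping the header row only once.

    Assumption: the first row of the first non-empty page is the header,
    and later pages may or may not repeat it.
    """
    merged: List[Row] = []
    header = None
    for rows in pages:
        if not rows:
            continue  # skip pages where nothing was extracted
        if header is None:
            header = rows[0]
            merged.append(header)
        # Drop the first row only when it repeats the header.
        start = 1 if rows[0] == header else 0
        merged.extend(rows[start:])
    return merged


def extract_rows(pdf_path: str, row_tol: int = 10) -> List[Row]:
    """Read every page with camelot's stream flavor and merge the tables."""
    import camelot  # imported lazily so merge_page_tables works without it

    tables = camelot.read_pdf(
        pdf_path, pages="all", flavor="stream", row_tol=row_tol
    )
    return merge_page_tables([t.df.values.tolist() for t in tables])
```

The merging step is kept separate from the camelot call so it can be checked on plain lists before the extraction parameters are settled.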
Could you please give me some guidance on what to try to get these documents parsed correctly, or on how I can help debug what is causing the parsing to fail?
Thank you very much