excalibur Having a hard time parsing documents from Curitiba's town hall

Open boiko opened this issue 4 years ago • 0 comments

I have started a small personal project to collect, monitor and consolidate government data (especially for the town where I live, Curitiba). Some interesting numbers are available, but only in PDF format.

I have tried many different parameters in excalibur, but I am having a hard time getting it to split the rows correctly. Since there was a link in the web UI to report extraction problems, here I am :) The final implementation will use camelot directly from Python, but first I need to figure out whether and how these documents can be parsed.

This is the document I was actually trying to parse: https://mid.curitiba.pr.gov.br/contaspublicas/2020/01/An7_RP_1B20.pdf
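
To make "many different parameters" concrete, here is a rough sketch of the kind of sweep that could be run with camelot directly. It is only an illustration: the helper names `candidate_settings` and `try_settings` and the specific values are assumptions, not settings known to work on this document. `flavor`, `line_scale`, `row_tol` and `edge_tol` are keyword arguments of `camelot.read_pdf`, and `parsing_report` is the per-table report camelot exposes.

```python
from itertools import product


def candidate_settings():
    """Return keyword-argument dicts to try with camelot.read_pdf.

    The values are illustrative guesses, not tuned for any document.
    """
    # Lattice mode relies on ruling lines; line_scale affects how short a
    # line segment can be and still be detected.
    settings = [{"flavor": "lattice", "line_scale": scale} for scale in (15, 40)]
    # Stream mode groups text into rows by vertical position; row_tol is the
    # tolerance used when combining text into the same row.
    for row_tol, edge_tol in product((2, 5, 10), (50, 500)):
        settings.append({"flavor": "stream", "row_tol": row_tol, "edge_tol": edge_tol})
    return settings


def try_settings(pdf_path, pages="1"):
    """Run every candidate and collect camelot's parsing report per table."""
    import camelot  # imported lazily so candidate_settings() works without it

    results = []
    for kwargs in candidate_settings():
        tables = camelot.read_pdf(pdf_path, pages=pages, **kwargs)
        results.append((kwargs, [t.parsing_report for t in tables]))
    return results
```

Calling `try_settings` on a locally downloaded copy of the PDF and comparing the reports side by side might show which flavor gets closest to the expected rows.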

But that's not the only format. I have also found one that has tables across multiple pages: https://mid.curitiba.pr.gov.br/contaspublicas/2020/02/An2_Fun_RREO_2B20.pdf
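
For the multi-page case, one possible approach is to extract each page separately and stitch the rows together afterwards. The sketch below is a hypothetical helper (`stitch_pages` is not part of camelot or excalibur) and assumes that the first row of the first page is the header and that later pages may or may not repeat it. Each page is a plain list of rows, such as what `table.df.values.tolist()` would give for a camelot table.

```python
def stitch_pages(pages):
    """Concatenate per-page row lists, keeping the header row only once.

    Assumes the first row of the first non-empty page is the header.
    """
    merged = []
    header = None
    for rows in pages:
        if not rows:
            continue  # skip pages where nothing was extracted
        if header is None:
            header = rows[0]
            merged.append(header)
            body = rows[1:]
        else:
            # Drop the first row only when it repeats the header exactly.
            body = rows[1:] if rows[0] == header else rows
        merged.extend(body)
    return merged
```

Exact header matching is a deliberate simplification; if the extracted header text differs slightly between pages, a looser comparison would be needed.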

Could you please give me some guidance on what to try to get these documents parsed correctly, or on how I can help debug what is causing the parsing to fail?

Thank you very much

Nov 06 '20 18:11 boiko
