Spider: Illinois Finance Authority

Open pjsier opened this issue 5 years ago • 18 comments

URL: https://www.il-fa.com/
Documents URL: https://www.il-fa.com/public-access/board-documents/
Spider Name: il_finance_authority
Agency Name: Illinois Finance Authority

See the contribution guide for information on how to get started

pjsier avatar Oct 24 '19 02:10 pjsier

I'd like to work on this issue

aneesh404 avatar Oct 24 '19 02:10 aneesh404

@aneesh404 sounds good!

pjsier avatar Oct 24 '19 02:10 pjsier

Hi! I'm sorry I'm not getting time to work on this issue. Please feel free to assign it to someone else.

aneesh404 avatar Oct 28 '19 11:10 aneesh404

Hi! It looks like this issue isn't claimed. Is it ok if I work on this issue?

janeskim avatar Oct 30 '19 00:10 janeskim

@janeskim all yours!

pjsier avatar Oct 30 '19 00:10 pjsier

If this one is open I'm going to work on it

mesterhammerfic avatar Mar 06 '20 23:03 mesterhammerfic

@mesterhammerfic sounds great! Assigning you now

pjsier avatar Mar 09 '20 17:03 pjsier

Hi, I was wondering if I could work on this issue if it hasn't been active recently. Thanks!

ledaliang avatar Jun 16 '20 15:06 ledaliang

@ledaliang thanks for your interest! We try to limit contributors to one issue at a time, but once your other PR is merged you can feel free to work on this one

pjsier avatar Jun 16 '20 15:06 pjsier

Hi there, I've only just asked for a Slack invite, but could I start working on this now?

PatrickKlingler avatar Jun 26 '20 19:06 PatrickKlingler

@PatrickKlingler sure! Marking it claimed now

pjsier avatar Jun 26 '20 19:06 pjsier

Hey Patrick would it be possible to add another PDF parser?

The PyPDF2 parser does not seem to work for the PDFs on IFA's website, i.e. it returns an empty string. I copied this code to parse the PDF: https://github.com/City-Bureau/city-scrapers/blob/main/city_scrapers/spiders/il_pollution_control.py#L103

Apparently PyPDF2 is limited to certain kinds of PDF encodings: https://stackoverflow.com/questions/30272269/python-text-extraction-does-not-work-on-some-pdfs

I ended up using pdfplumber and that works but it would introduce another dependency.

PatrickKlingler avatar Jun 26 '20 22:06 PatrickKlingler

@PatrickKlingler gotcha, we've run into issues with PyPDF2 so I think it's fine to add something additional here, but on other projects we've been working with pdfminer.six directly. If it works for you I'm fine with adding pdfminer.six as a dependency here since we'll try to eventually remove PyPDF2. We have an example of using it here https://github.com/City-Bureau/city-scrapers-cle/blob/46cf904f87f7c78fe2733eafc4ac97a68ce47d02/city_scrapers/spiders/cuya_developmental_disabilities.py#L36-L44
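For anyone following along, here is a minimal sketch of the approach discussed above: try one PDF text extractor and fall back to another when the first returns an empty string. It is not code from this repo or from the linked spiders. The helper names `first_nonempty_text` and `pdfminer_extract` are hypothetical, and the only library call assumed is `pdfminer.high_level.extract_text`, which accepts a path or a file-like object.

```python
import io
from typing import Callable, Iterable


def pdfminer_extract(pdf_bytes: bytes) -> str:
    """Extract text with pdfminer.six (imported lazily so it stays optional)."""
    from pdfminer.high_level import extract_text

    return extract_text(io.BytesIO(pdf_bytes))


def first_nonempty_text(
    pdf_bytes: bytes, extractors: Iterable[Callable[[bytes], str]]
) -> str:
    """Run each extractor in order and return the first non-blank result.

    Returns an empty string if every extractor comes back blank, which is
    the symptom described above for some of the IFA PDFs.
    """
    for extract in extractors:
        text = extract(pdf_bytes)
        if text and text.strip():
            return text
    return ""
```

In a spider this might be called as `first_nonempty_text(response.body, [pdfminer_extract])`, with further extractors appended to the list if one parser alone proves unreliable.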

pjsier avatar Jun 27 '20 11:06 pjsier

@PatrickKlingler wanted to follow up on this, we just replaced PyPDF2 with pdfminer.six throughout all of our repos so hopefully that makes this easier!

pjsier avatar Jul 14 '20 13:07 pjsier

Good to hear!

Haven't been able to get to this in a while, but I'll have some time this weekend!

PatrickKlingler avatar Jul 14 '20 16:07 PatrickKlingler

Hey, it seems like this issue has been open for a while. I would like to tackle it as my first contribution, if that's fine by you. It also seems like a good opportunity since I have built projects using Scrapy before.

solisedwin avatar Sep 29 '20 06:09 solisedwin

@solisedwin yep, this has been inactive more than 30 days so it's all yours if you're interested! I can assign you now

pjsier avatar Sep 29 '20 12:09 pjsier

Hey, I'm still working on this web crawler. I've just been rewriting and fine-tuning it for better code readability. Should have it done soon. Thanks

solisedwin avatar Nov 05 '20 01:11 solisedwin