city-scrapers Spider: Illinois Finance Authority

URL: https://www.il-fa.com/
Documents URL: https://www.il-fa.com/public-access/board-documents/
Spider Name: il_finance_authority
Agency Name: Illinois Finance Authority

See the contribution guide for information on how to get started.

Oct 24 '19 02:10 pjsier

I'd like to work on this issue

Oct 24 '19 02:10 aneesh404

@aneesh404 sounds good!

Oct 24 '19 02:10 pjsier

Hi! I'm sorry, I haven't had time to work on this issue. Please feel free to assign it to someone else.

Oct 28 '19 11:10 aneesh404

Hi! It looks like this issue isn't claimed. Is it ok if I work on this issue?

Oct 30 '19 00:10 janeskim

@janeskim all yours!

Oct 30 '19 00:10 pjsier

If this one is open I'm going to work on it

Mar 06 '20 23:03 mesterhammerfic

@mesterhammerfic sounds great! Assigning you now

Mar 09 '20 17:03 pjsier

Hi, I was wondering if I could work on this issue if it hasn't been active recently. Thanks!

Jun 16 '20 15:06 ledaliang

@ledaliang thanks for your interest! We try to limit contributors to one issue at a time, but once your other PR is merged you can feel free to work on this one

Jun 16 '20 15:06 pjsier

Hi there, I've only just asked for a Slack invite, but could I start working on this now?

Jun 26 '20 19:06 PatrickKlingler

@PatrickKlingler sure! Marking it claimed now

Jun 26 '20 19:06 pjsier

Hey Patrick, would it be possible to add another PDF parser?

PyPDF2 does not seem to work for the PDFs on IFA's website: it returns an empty string. I copied this code to parse the PDF: https://github.com/City-Bureau/city-scrapers/blob/main/city_scrapers/spiders/il_pollution_control.py#L103

Apparently PyPDF2 is limited to certain kinds of PDF encodings: https://stackoverflow.com/questions/30272269/python-text-extraction-does-not-work-on-some-pdfs

I ended up using pdfplumber, which works, but it would introduce another dependency.
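One way to handle a parser that silently returns an empty string is to try extractors in order and keep the first non-blank result. The sketch below is only an illustration of that idea, not code from the spider: `extract_with_fallback` and `pdfplumber_extractor` are hypothetical names, and the pdfplumber import is deferred so the helper itself has no extra dependency.

```python
from io import BytesIO
from typing import Callable, Iterable


def extract_with_fallback(
    pdf_bytes: bytes, extractors: Iterable[Callable[[bytes], str]]
) -> str:
    """Return the first non-blank text produced by the given extractors.

    Hypothetical helper: each extractor takes raw PDF bytes and returns text.
    """
    for extract in extractors:
        try:
            text = extract(pdf_bytes)
        except Exception:
            # A parser that cannot read this PDF is skipped, not fatal.
            continue
        if text and text.strip():
            return text
    # Every extractor failed or returned only whitespace.
    return ""


def pdfplumber_extractor(pdf_bytes: bytes) -> str:
    """Extract text page by page with pdfplumber (assumed to be installed)."""
    import pdfplumber  # imported lazily; this is the extra dependency

    with pdfplumber.open(BytesIO(pdf_bytes)) as pdf:
        # extract_text() can return None for pages without text.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

Passing the preferred parser first and `pdfplumber_extractor` last would keep the new dependency as a fallback only, though whether that is worth the complexity depends on how many of the PDFs actually fail.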

Jun 26 '20 22:06 PatrickKlingler

@PatrickKlingler gotcha. We've run into issues with PyPDF2 too, so I think it's fine to add something additional here, but on other projects we've been working with pdfminer.six directly. If it works for you, I'm fine with adding pdfminer.six as a dependency here, since we'll try to eventually remove PyPDF2. We have an example of using it here: https://github.com/City-Bureau/city-scrapers-cle/blob/46cf904f87f7c78fe2733eafc4ac97a68ce47d02/city_scrapers/spiders/cuya_developmental_disabilities.py#L36-L44
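For reference, a minimal sketch of reading a downloaded PDF with pdfminer.six, assuming the PDF body is already in memory as bytes (for example, a Scrapy response body). It uses pdfminer.six's high-level `extract_text` function and is not necessarily how the linked example does it; `pdf_bytes_to_text` and `collapse_whitespace` are hypothetical helper names.

```python
import re
from io import BytesIO


def pdf_bytes_to_text(pdf_bytes: bytes) -> str:
    """Extract text from in-memory PDF bytes using pdfminer.six."""
    # Imported lazily so this module still loads where pdfminer.six is absent.
    from pdfminer.high_level import extract_text

    # extract_text accepts a file path or a file-like object.
    return extract_text(BytesIO(pdf_bytes))


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace so later regex matching is simpler."""
    return re.sub(r"\s+", " ", text).strip()
```

In a spider callback this might be called as `collapse_whitespace(pdf_bytes_to_text(response.body))` before searching the text for meeting dates.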

Jun 27 '20 11:06 pjsier

@PatrickKlingler wanted to follow up on this: we just replaced PyPDF2 with pdfminer.six throughout all of our repos, so hopefully that makes this easier!

Jul 14 '20 13:07 pjsier

Good to hear!

Haven't been able to get to this in a while, but I'll have some time this weekend!

Jul 14 '20 16:07 PatrickKlingler

Hey, it seems like this issue has been open for a while. I would like to tackle it as my first contribution, if that's fine by you. It also seems like a good opportunity since I have built projects using Scrapy before.

Sep 29 '20 06:09 solisedwin

@solisedwin yep, this has been inactive more than 30 days so it's all yours if you're interested! I can assign you now

Sep 29 '20 12:09 pjsier

Hey, I'm still working on this web crawler. I've just been rewriting it and fine-tuning it for better code readability. I should have it done soon. Thanks!

Nov 05 '20 01:11 solisedwin
