city-scrapers
Spider: Illinois Finance Authority
URL: https://www.il-fa.com/
Documents URL: https://www.il-fa.com/public-access/board-documents/
Spider Name: il_finance_authority
Agency Name: Illinois Finance Authority
See the contribution guide for information on how to get started
I'd like to work on this issue
@aneesh404 sounds good!
Hi! I'm sorry I'm not getting time to work on this issue. Please feel free to assign it to someone else.
Hi! It looks like this issue isn't claimed. Is it ok if I work on this issue?
@janeskim all yours!
If this one is open I'm going to work on it
@mesterhammerfic sounds great! Assigning you now
Hi, I was wondering if I could work on this issue if it hasn't been active recently. Thanks!
@ledaliang thanks for your interest! We try to limit contributors to one issue at a time, but once your other PR is merged you can feel free to work on this one
Hi there I've only just asked for a Slack invite, but could I start working on this now?
@PatrickKlingler sure! Marking it claimed now
Hey Patrick would it be possible to add another PDF parser?
The PyPDF2 parser does not seem to work for the PDFs on IFA's website, i.e. it returns an empty string. I copied this code to parse the PDF: https://github.com/City-Bureau/city-scrapers/blob/main/city_scrapers/spiders/il_pollution_control.py#L103
Apparently PyPDF2 is limited to certain kinds of PDF encodings: https://stackoverflow.com/questions/30272269/python-text-extraction-does-not-work-on-some-pdfs
I ended up using pdfplumber, and that works, but it would introduce another dependency.
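For readers following along, the problem described above (one parser returning an empty string for a given PDF) can be handled by trying extractors in order until one yields text. This is only a minimal sketch, not the code from this issue: `first_nonempty_text` and `pdfplumber_extract` are hypothetical helper names, and the pdfplumber wrapper assumes that package is installed.

```python
from io import BytesIO


def first_nonempty_text(pdf_bytes, extractors):
    """Return text from the first extractor that yields non-whitespace text.

    Each extractor is a callable taking a binary file-like object and
    returning a string (or None). Returns "" if none of them find text.
    """
    for extract in extractors:
        # Give every extractor a fresh stream positioned at the start.
        text = extract(BytesIO(pdf_bytes)) or ""
        if text.strip():
            return text
    return ""


def pdfplumber_extract(fileobj):
    """Hypothetical wrapper: join the text of every page using pdfplumber."""
    import pdfplumber  # third-party dependency, imported only when needed

    with pdfplumber.open(fileobj) as pdf:
        # extract_text() may return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

A spider could call something like `first_nonempty_text(response.body, [pdfplumber_extract])` and treat an empty result as "no text layer found" rather than as a parsed agenda.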
@PatrickKlingler gotcha, we've run into issues with PyPDF2 so I think it's fine to add something additional here, but on other projects we've been working with pdfminer.six directly. If it works for you I'm fine with adding pdfminer.six as a dependency here since we'll try to eventually remove PyPDF2. We have an example of using it here: https://github.com/City-Bureau/city-scrapers-cle/blob/46cf904f87f7c78fe2733eafc4ac97a68ce47d02/city_scrapers/spiders/cuya_developmental_disabilities.py#L36-L44
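As a rough illustration of the pdfminer.six route, the sketch below uses the library's high-level `extract_text` function on in-memory bytes. It is not necessarily how the linked spider does it; `pdf_bytes_to_text` and its `extract` parameter are hypothetical names introduced here, and the optional parameter exists only so the helper can be exercised without a real PDF.

```python
from io import BytesIO


def pdf_bytes_to_text(pdf_bytes, extract=None):
    """Extract text from raw PDF bytes and collapse runs of whitespace.

    By default this uses pdfminer.six's high-level extract_text, which
    accepts a path or a binary file-like object. `extract` is a hypothetical
    hook for swapping in another callable with the same shape.
    """
    if extract is None:
        # Imported lazily so the module loads even without pdfminer.six.
        from pdfminer.high_level import extract_text

        extract = extract_text
    text = extract(BytesIO(pdf_bytes)) or ""
    # PDF text often carries stray line breaks; normalize to single spaces.
    return " ".join(text.split())
```

In a Scrapy callback this would typically be called as `pdf_bytes_to_text(response.body)`, with the result then searched for dates or meeting titles.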
@PatrickKlingler wanted to follow up on this, we just replaced PyPDF2 with pdfminer.six throughout all of our repos so hopefully that makes this easier!
Good to hear!
Haven't been able to get to this in a while, but I'll have some time this weekend!
Hey, it seems like this issue has been open for a while. I would like to tackle it as my first contribution. It also seems like a good opportunity since I have built projects using Scrapy before. If that's fine by you.
@solisedwin yep, this has been inactive more than 30 days so it's all yours if you're interested! I can assign you now
Hey, I'm still working on this web crawler. I've just been rewriting and fine-tuning it for better code readability. Should have it done soon. Thanks