city-scrapers
Spider: Chicago Southwest Home Equity Commission I
URL: https://swhomeequity.com/agenda-%26-minutes
Spider Name: chi_southwest_home_equity_i
Agency Name: Chicago Southwest Home Equity Commission I
See the contribution guide for information on how to get started
Hi, just cloned the repo. I'm going to make a branch for this spider
@pjsier looks like most info besides the date and meeting type == BOARD is contained in a pdf.
Let's discuss how to proceed but since this is my first, I think I'll start work on a more straightforward one.
@mattpair that approach makes sense to me. Let me know if you need any help finding a clearer spider to work on, but we have a good number available.
I'll resume work on this one
If this issue is unclaimed, I would like to work on that.
@haidtang sure! In general we like contributors to stick to one issue at a time, so I'll assign you to this one for now and not the other. Let me know if you'd like to switch that though.
@pjsier I got it. Could you please switch me to the other issue #566, I think that I have a better clue on how to deal with that one. Thank you so much.
Sure thing!
Hey is this issue unclaimed? I'd be happy to work on this one if so. Also, it seems like it will involve reading pdfs; has that been done within this project before?
It looks like it's been inactive for more than a month, so it's all yours if you're interested!
Hi @pjsier, most of the information for these meetings is included in PDFs of the minutes and agenda for each meeting. Are there any other spiders that have downloaded / parsed PDFs in this project already?
Just working on my own computer, I'm able to parse the files using the package pdfminer.six, which I just found via googling but was the most highly recommended Python package for reading PDFs that I could find. That package also requires that the files are downloaded, so I'm using tempfile, which should delete the files after the text is extracted from them.
My question though is do you want to introduce that new package into the project? And is the fact that it needs to download files going to be a problem for running the spider on different computers?
@egfrank thanks for checking that out! We're currently using PyPDF2, but we've used pdfminer.six on another project. Here's an example in chi_human_relations:
https://github.com/City-Bureau/city-scrapers/blob/a6a0ea801c94ab8cbab8345cf34053fd3e49fe5e/city_scrapers/spiders/chi_human_relations.py#L56-L68
For now let's see if the parsing will work in PyPDF2, but if you run into more issues I think it would be fine to add pdfminer.six (which is in the Pipfile for city-scrapers-cle).
Related to the tempfile, you should be able to use BytesIO instead since we're using Python 3, and there's an example of that in the chi_human_relations example. BytesIO should work for both PyPDF2 and pdfminer.
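For reference, here is a rough sketch of the BytesIO idea, not the code from chi_human_relations. It assumes the PDF bytes are already in memory (for example a Scrapy response.body) and that pdfminer.six is installed; extract_pdf_text and its extract parameter are made-up names for illustration only.

```python
from io import BytesIO


def extract_pdf_text(pdf_bytes, extract=None):
    """Return the text of a PDF held in memory, without writing a temp file.

    pdf_bytes: raw bytes of the PDF (e.g. the body of a downloaded response).
    extract:   hypothetical hook, a callable that takes a binary file-like
               object and returns a string. When omitted, this falls back to
               pdfminer.six's high-level extract_text, which accepts either
               a path or a file-like object.
    """
    if extract is None:
        # Assumption: pdfminer.six is available in the environment.
        from pdfminer.high_level import extract_text as extract

    # BytesIO wraps the bytes in a seekable, file-like object, so nothing
    # has to be saved to disk or cleaned up afterwards.
    return extract(BytesIO(pdf_bytes))
```

Inside a spider callback this would be called as something like extract_pdf_text(response.body). A PyPDF2 reader could be plugged in through the extract hook in the same way, since it also accepts a file-like object.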
Let me know if you run into any issues with this, and thanks again for doing that research!
Oh awesome I don't know why I missed that in the codebase! Sweet okay I'll look at PyPDF2 and BytesIO.
I finally looked back at this and opened up a new PR! https://github.com/City-Bureau/city-scrapers/pull/973
Sorry about the delay. Once my branch got out of date it was difficult to get the checks to pass, and it ended up being easier to start fresh.