
Spider: Chicago Southwest Home Equity Commission I

Open pjsier opened this issue 5 years ago • 14 comments

URL: https://swhomeequity.com/agenda-%26-minutes
Spider Name: chi_southwest_home_equity_i
Agency Name: Chicago Southwest Home Equity Commission I

See the contribution guide for information on how to get started

pjsier avatar Feb 03 '19 19:02 pjsier

Hi, just cloned the repo. I'm going to make a branch for this spider

mattpair avatar Apr 05 '19 03:04 mattpair

@pjsier looks like most info besides the date and meeting type == BOARD is contained in a pdf.

Let's discuss how to proceed but since this is my first, I think I'll start work on a more straightforward one.

mattpair avatar Apr 05 '19 03:04 mattpair

@mattpair that approach makes sense to me. Let me know if you need any help finding a more straightforward spider to work on; we have a good number available.

pjsier avatar Apr 05 '19 12:04 pjsier

I'll resume work on this one

mattpair avatar May 08 '19 21:05 mattpair

If this issue is unclaimed, I would like to work on that.

haidtang avatar Nov 17 '19 21:11 haidtang

@haidtang sure! In general we like contributors to stick to one issue at a time, so I'll assign you to this one for now and not the other. Let me know if you'd like to switch that though.

pjsier avatar Nov 18 '19 00:11 pjsier

@pjsier I got it. Could you please switch me to the other issue, #566? I think that I have a better clue on how to deal with that one. Thank you so much.

haidtang avatar Nov 18 '19 01:11 haidtang

Sure thing!

pjsier avatar Nov 18 '19 01:11 pjsier

Hey is this issue unclaimed? I'd be happy to work on this one if so. Also, it seems like it will involve reading pdfs; has that been done within this project before?

egfrank avatar Jan 24 '20 06:01 egfrank

It looks like it's been inactive for more than a month, so it's all yours if you're interested!

pjsier avatar Jan 24 '20 19:01 pjsier

Hi @pjsier, most of the information for these meetings is included in PDFs of the minutes and agenda for each meeting. Are there any other spiders that have downloaded / parsed PDFs in this project already?

Just working on my own computer, I'm able to parse the files using the package pdfminer.six, which I just found via googling but was the most highly recommended Python package for reading PDFs that I could find. That package also requires that the files are downloaded, so I'm using tempfile, which should delete the files after the text is extracted from them.

My question though is do you want to introduce that new package into the project? And is the fact that it needs to download files going to be a problem for running the spider on different computers?
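
For reference, the tempfile approach described above might look roughly like the sketch below. The names `extract_via_tempfile` and `extract_fn` are hypothetical and not from the spider itself; `extract_fn` stands in for whatever path-based extractor is used (with pdfminer.six that could be `pdfminer.high_level.extract_text`, which is an assumption about how the package would be called here).

```python
import os
import tempfile


def extract_via_tempfile(pdf_bytes, extract_fn):
    """Write PDF bytes to a temporary file, run a path-based extractor on it,
    and clean the file up afterwards.

    extract_fn is a hypothetical callable that takes a file path and returns text.
    """
    # TemporaryDirectory removes the directory and its contents on exit,
    # so nothing is left behind on the machine running the spider.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "document.pdf")
        with open(path, "wb") as f:
            f.write(pdf_bytes)
        return extract_fn(path)
```

Because the file only lives inside the `with` block, this should behave the same on any computer that has a writable temp directory.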

egfrank avatar Feb 08 '20 18:02 egfrank

@egfrank thanks for checking that out! We're currently using PyPDF2, but we've used pdfminer.six on another project. Here's an example in chi_human_relations:

https://github.com/City-Bureau/city-scrapers/blob/a6a0ea801c94ab8cbab8345cf34053fd3e49fe5e/city_scrapers/spiders/chi_human_relations.py#L56-L68

For now let's see if the parsing will work in PyPDF2, but if you run into more issues I think it would be fine to add pdfminer.six (which is in the Pipfile for city-scrapers-cle).

Related to the tempfile, you should be able to use BytesIO instead since we're using Python 3, and there's an example of that in the chi_human_relations example. BytesIO should work for both PyPDF2 and pdfminer.
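
As a rough sketch of the BytesIO idea (not the code from the linked spider): the PDF bytes are wrapped in an in-memory stream and handed straight to the reader, so nothing is written to disk. The function name `extract_pdf_text` and the `reader_cls` parameter are hypothetical. The default assumes a PyPDF2 release that exposes `PdfReader`, `reader.pages`, and `page.extract_text()`; older releases used `PdfFileReader` with different method names.

```python
from io import BytesIO


def extract_pdf_text(pdf_bytes, reader_cls=None):
    """Return the text of every page of a PDF held in memory as bytes.

    reader_cls is a hypothetical hook for swapping the reader class;
    by default PyPDF2's PdfReader is used (assumes a PyPDF2 version that has it).
    """
    if reader_cls is None:
        import PyPDF2  # imported lazily so the helper can be used with another reader

        reader_cls = PyPDF2.PdfReader
    # BytesIO gives the reader a file-like object without touching the filesystem.
    reader = reader_cls(BytesIO(pdf_bytes))
    # extract_text() can return None or an empty string for image-only pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```

In a Scrapy callback this would be called as something like `extract_pdf_text(response.body)`.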

Let me know if you run into any issues with this, and thanks again for doing that research!

pjsier avatar Feb 08 '20 21:02 pjsier

Oh awesome I don't know why I missed that in the codebase! Sweet okay I'll look at PyPDF2 and BytesIO.

egfrank avatar Feb 08 '20 21:02 egfrank

I finally looked back at this and opened up a new PR! https://github.com/City-Bureau/city-scrapers/pull/973

Sorry about the delay - once my branch got out of date it was difficult to get the checks to pass, and it ended up being easier to start fresh.

egfrank avatar Sep 16 '20 21:09 egfrank