Parse Klamath County Results
Klamath produces image PDFs, so these files will need to be OCRd before any parsing:
- [x] 2016 primary
- [x] 2014 general
- [x] 2014 primary
- [x] 2012 general
- [x] 2012 primary
- [x] 2010 general
- [x] 2010 primary
- [x] 2008 general
- [x] 2008 primary
- [x] 2006 general
- [x] 2006 primary
- [x] 2004 general
- [x] 2004 primary
- [x] 2002 general
- [x] 2002 primary
- [x] 2000 general
- [ ] 2000 primary
I'm tackling this now. It's the first time I've contributed to Open Elections, but I think I have a good grasp of what I need to do after reading the docs.
I downloaded and OCRd all the PDFs with pypdfocr. I'm extracting the 2002 primary data with Tabula and now cleaning and checking.
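For anyone following along, here is a minimal sketch of what a cleaning-and-checking pass over OCR'd vote counts could look like in Python. It is not the script used for these files. The names `clean_count` and `check_row`, the list of character substitutions, and the assumption that every count is a whole number are all hypothetical and would need adjusting to the actual Tabula output.

```python
import re

# Hypothetical map of letters that OCR commonly confuses with digits.
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def clean_count(raw):
    """Turn an OCR'd vote-count cell into an int, or None if it is not numeric."""
    text = raw.strip().translate(OCR_DIGIT_FIXES)
    # Assumption: counts are whole numbers, so commas, periods and spaces
    # inside a cell are separators or OCR noise and can be dropped.
    text = re.sub(r"[,.\s]", "", text)
    return int(text) if re.fullmatch(r"[0-9]+", text) else None


def check_row(candidate_counts, reported_total):
    """Return True when the cleaned candidate counts sum to the reported total."""
    cleaned = [clean_count(c) for c in candidate_counts]
    total = clean_count(reported_total)
    if total is None or any(c is None for c in cleaned):
        return False  # something still needs a manual look
    return sum(cleaned) == total
```

Rows where `check_row` returns `False` are the ones worth comparing against the original PDF by hand, since either a cell failed to parse or the numbers do not add up.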
Thanks, @JasonBernert! That sounds like a great approach - let me know if you run into any issues or questions.
Hey @JasonBernert, how's it coming? I am having trouble getting pypdfocr working, so I can't OCR things now… Do you have any OCRed files I could work on in the meantime? Also, in case you haven't run across it, I recommend OpenRefine. Once you get the hang of it, it makes cleaning up OCRed data like this much easier.
Hey @nk9! It's a little messy. pypdfocr is great for batch processing, but doesn't do a great job on PDF images with low DPI. I downloaded the Adobe Acrobat free trial. It's great at OCR, but takes a bit longer. I'm cleaning up 2002, 2004, and 2008 results now. It looks like 2006 will have to be entered in by hand. Want to try Acrobat OCR on Clatsop County results? Or tackle the 2006 results?
It turns out I have access to Acrobat myself, so I'm good to go on the OCR front. Unfortunately, it doesn't seem to embed the OCRed text in the document itself… which seems like the primary thing people would want to do. :-( Anyway, I can start on Clatsop County now.
@JasonBernert @nk9 if you're running into issues with OCR, I've got Able2Extract, which has been pretty good.
Got the last two elections from Klamath, in 2000.
@nk9 Looks like 2000 primary results are Democratic-only.