Data import: Buzzfeed NYPD disciplinary data

Open redshiftzero opened this issue 6 years ago • 6 comments

We should cross-link to these NYPD records from OpenOversight: https://www.buzzfeed.com/kendalltaggart/nypd-police-misconduct-database

This means, for every entry in their database, we should make an incident and link to the NYPD officers involved. Since we don't have the roster, we can create the officers as we go. I uploaded one incident as a demonstration: https://openoversight.com/incidents/2.

These are the kinds of datasets for which we could write an import script like the one in #392. Note that developer time is limited, but a lot of people who are not developers are interested in helping OpenOversight, so we could e.g. organize events where people enter this kind of data into the database manually (by temporarily making their accounts area coordinators).

See also: https://github.com/joshtemple/nypd-cases

redshiftzero avatar Jul 15 '18 07:07 redshiftzero

This is an interesting project that I would like to work on. Looking into it, I have two questions, or points of discussion:

  • The badge number (or tax number): The last two digits of every cop's badge number are redacted. So I am not sure whether they should be added to the database in some way: on the one hand we don't know the exact number, but on the other hand the number is necessary to differentiate between cops with the same name. /EDIT: I just realized that this is connected to/depends on #462
  • How to incorporate data from the PDFs: The OCR text files provided by the file host documentcloud.org are not very good, and I doubt they can be used to extract data like the incident date or (even harder) a description of the incident. The options that I see here are to either use better OCR software or leave this part to volunteers.

abandoned-prototype avatar Jul 26 '18 16:07 abandoned-prototype

great questions @abandoned-prototype !

> The last two digits of every cop's badge number are redacted. So I am not sure if they should be added to the database in some way, as on the one hand we don't know the exact number, but on the other hand, the number is necessary to differentiate between cops with the same name.

clarifying question re: NY probably for @camfassett: is the tax registry number a static number that does not appear on an officer's uniform? if so then we should implement #462 and input the numbers exactly as is on the BuzzFeed database, e.g. 1234xx, and when we get more information we can fill in the missing digits.
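A minimal sketch of how the "store `1234xx` now, fill in the missing digits later" idea could work. The function names and the use of `x` as the mask character are assumptions for illustration, not existing OpenOversight code, and the numbers are made up.

```python
import re

# Assumption: redacted digits are written as "x" or "X", as in "1234xx".
MASK_CHARS = "xX"


def masked_to_regex(masked: str) -> re.Pattern:
    """Turn a partially redacted number such as '1234xx' into a regex."""
    parts = []
    for ch in masked:
        if ch in MASK_CHARS:
            parts.append(r"\d")  # a redacted position may be any digit
        else:
            parts.append(re.escape(ch))  # known positions must match exactly
    return re.compile("".join(parts))


def matches_masked(masked: str, full: str) -> bool:
    """True if the full number is consistent with the redacted one."""
    return masked_to_regex(masked).fullmatch(full) is not None


def fill_in(masked: str, candidates: list[str]) -> str:
    """Return the full number if exactly one candidate fits, else keep the mask."""
    hits = [c for c in candidates if matches_masked(masked, c)]
    return hits[0] if len(hits) == 1 else masked
```

Keeping the mask when more than one candidate fits avoids silently attaching an incident to the wrong officer.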

> How to incorporate data from the pdfs:

so we have someone who is currently working on the OCR (privately, off GitHub), and I will update on the status of that in the next few days. They are not working on the upload though, just extracting the text from the PDFs, so figuring out how to insert that data into OpenOversight (ideally in an idempotent manner) would be a great contribution in the meantime. How were you thinking of doing the upload? manage.py command?
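To make "idempotent" concrete, here is a rough sketch using only the standard-library `sqlite3` module. The `incidents` table and its columns are hypothetical and do not reflect the real OpenOversight schema; the point is only that a unique key (here the case number) lets the same file be imported twice without creating duplicates.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table: the UNIQUE constraint on case_number is what
    # makes repeated imports safe.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS incidents (
            id INTEGER PRIMARY KEY,
            case_number TEXT NOT NULL UNIQUE,
            officer_name TEXT NOT NULL,
            tax_number TEXT
        )
        """
    )


def import_rows(conn: sqlite3.Connection, rows: list[dict]) -> int:
    """Insert rows, skipping case numbers already present.

    Returns the number of rows that were actually inserted.
    """
    inserted = 0
    with conn:  # commit on success, roll back on error
        for row in rows:
            cur = conn.execute(
                "INSERT OR IGNORE INTO incidents "
                "(case_number, officer_name, tax_number) VALUES (?, ?, ?)",
                (row["case_number"], row["officer_name"], row["tax_number"]),
            )
            inserted += cur.rowcount  # 1 if inserted, 0 if ignored
    return inserted
```

A manage.py command could wrap the same idea: look up each record by its unique key first and only create what is missing.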

redshiftzero avatar Jul 31 '18 04:07 redshiftzero

> clarifying question re: NY probably for @camfassett: is the tax registry number a static number that does not appear on an officer's uniform? if so then we should implement #462 and input the numbers exactly as is on the BuzzFeed database, e.g. 1234xx, and when we get more information we can fill in the missing digits.

Yes, the tax registry number is a static number and does not appear on uniforms. The badge/shield number is distinct from the tax registry number. It's not ideal that we don't have the complete tax registry numbers, but let's definitely import what we have! :)

Thanks @abandoned-prototype!

ssempervirens avatar Jul 31 '18 14:07 ssempervirens

Hey, I just barely started coding, but I would be willing to manually enter data and learn to code as I go?

ereynolds123 avatar Jun 27 '19 23:06 ereynolds123

Just wanted to update that I am finally back to actually working on this. The main focus right now is getting the OCR of the PDFs right, and between tesseract and some image transformations I am getting reasonable results. I also plan to cross-reference the .csv file that BuzzFeed provided, which seems to be manually extracted and contains the key information for each case (name and tax registry number, to the extent available).

abandoned-prototype avatar Nov 30 '19 03:11 abandoned-prototype

I might have put a little more time into this than I should have, but I finally have a csv file that contains the data of 517 of the 540 pages (output.txt) (.txt for github). There are certainly many small and probably some bigger errors, but one thing I made sure of was to cross-reference the provided csv (http://data.buzzfeed.com/projects/2018-04-nypd/nypd-discipline.csv), which has the name, tax number and case number for each case, so those values are most likely correct. Next I will probably manually add the remaining 17 pages. If anyone is interested in the code I used to OCR and parse this, let me know. I still plan to polish that code a bit, and hopefully it's reusable for similar projects.
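For anyone curious what such a cross-referencing step could look like, here is a rough sketch, not the code actually used. The column names (`case_number`, `name`, `tax_number`) and the sample values are assumptions and may not match the real BuzzFeed csv. It looks up the closest case number in the reference csv with `difflib` and overwrites the OCR'd key fields with the trusted values.

```python
import csv
import difflib
import io


def load_reference(csv_text: str) -> dict:
    """Index the reference csv by case number (column names are assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["case_number"]: row for row in reader}


def reconcile(ocr_record: dict, reference: dict, cutoff: float = 0.8):
    """Replace OCR'd key fields with values from the reference csv.

    Returns a corrected copy, or None if no case number is close enough,
    so that the page can be reviewed by hand.
    """
    candidates = difflib.get_close_matches(
        ocr_record["case_number"], list(reference), n=1, cutoff=cutoff
    )
    if not candidates:
        return None
    trusted = reference[candidates[0]]
    merged = dict(ocr_record)  # keep OCR-only fields such as the description
    merged.update(
        case_number=trusted["case_number"],
        name=trusted["name"],
        tax_number=trusted["tax_number"],
    )
    return merged
```

Returning `None` instead of guessing keeps badly garbled pages out of the import until a person has looked at them.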

Then it's time to see what tools we have and which need to be built to import this data into our database.

abandoned-prototype avatar Jan 09 '20 05:01 abandoned-prototype