CROL-Overview
City Record Online parsing libraries and supporting files
City Record Online Workgroup (CROW) - Parsing
This is the main repository for CROW's parsing work. For Notice Schema development, see https://github.com/CityOfNewYork/CROL-Schema.
Disclaimer: If document versions conflict, treat the documents referenced on GitHub as the latest version.
Important Docs
- Gold standard - a human-parsed file that shows the "correct" extraction of the different objects.
- [The Main Schema](https://docs.google.com/spreadsheets/d/1str6vjjHS5EA_2ww9r4WjHA1t32Z00uLLbviegTc8WI/edit#gid=1430366155) - a reference file that shows what all the output fields should be and the source from which each can be derived.
Open Standard Links
- [Reference Standards](https://docs.google.com/document/d/1USFMTHfrmBzDvNW08b2f6osyl9I375d7h47uGcvxXjY/edit)
Community Links
About
As the City embarks on implementing Intro 363-2014 and unlocking its daily actions, we are working together with the Department of Citywide Administrative Services (DCAS) to publish the City Record as open, clean and structured data. At the same time, we are unlocking decades of historical information and making it accessible to all, at no charge.
Our goal is to optimize the utility of City Record content by making the data accessible and structured: addresses, dates, persons, subjects, agencies, contract types and more are parsed and made available as individual objects. This way, residents, organizations, and small and large businesses alike will be able to access the data, interact with it and stay informed, whether through notifications, visualizations or other easy-to-use community tools.
Project Partners
- City of New York
- BetaNYC
- Commune
- Citizens Union
- Dev Bootcamp
- Ontodia
- Socrata
- Sunlight Foundation
Achieved Milestones
- Came together to form a CROW parsing and scraping volunteer team
- Set up collaboration framework with DCAS
- Scraped PDFs from 2008 - 2014
- Proposed public notice schema
- Added “addresses” and “time & dates” fields to the City’s input workflow
Tasks
For a list of current tasks, please see Issues.
Phase 1: Parsers and Schema
- Develop a set of collaboratively produced, open source parser libraries to populate the Public Notice Data Standard schema using the DCAS pipeline
- Work with DCAS to implement the pipeline into the City’s workflow by August 1, and adopt it as the City’s way of publishing the City Record data
- Publish a Public Notice Data Standard and documentation on an interactive website
Phase 2: PDF Scraping
- Scrape the archival PDFs
- Adapt and apply the parsers so they can parse and structure the data in the PDFs
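
To give a rough idea of what "structuring" a notice might look like, here is a minimal Python sketch that pulls dates and street addresses out of plain notice text. It is only an illustration under stated assumptions: the regular expressions, the `ParsedNotice` class, the `parse_notice` function and the sample sentence are all hypothetical and are not taken from the CROW parsers or from the Main Schema, whose actual field names are not reproduced here.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical pattern: long-form dates such as "March 12, 2015".
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)

# Hypothetical pattern: a house number, one or more capitalized words,
# then a common street suffix. Real notices will need far more cases.
ADDRESS_RE = re.compile(
    r"\b\d+\s+(?:[A-Z][a-z]+\s)+(?:Street|Avenue|Place|Road)\b"
)


@dataclass
class ParsedNotice:
    """Illustrative container; not the Public Notice Data Standard."""
    dates: List[str] = field(default_factory=list)
    addresses: List[str] = field(default_factory=list)


def parse_notice(text: str) -> ParsedNotice:
    """Extract date and address strings from one notice's text."""
    return ParsedNotice(
        dates=DATE_RE.findall(text),
        addresses=ADDRESS_RE.findall(text),
    )


if __name__ == "__main__":
    sample = (
        "PLEASE TAKE NOTICE that a public hearing will be held on "
        "March 12, 2015 at 22 Reade Street, New York, NY."
    )
    print(parse_notice(sample))
```

Text extracted from the archival PDFs is likely to be messier than this sample (line breaks inside addresses, abbreviated months, OCR errors), so patterns like these would need to be checked against the gold standard file before being relied on.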