permitdata.org

Restructure specification document and data tables

Open c-sayre opened this issue 9 years ago • 4 comments

The specification document would benefit from more closely following the structure of other conceptually similar data standards, such as https://developers.google.com/transit/gtfs/reference.

Start with the title, version number (less than 1 for draft status), and date. For example:

Building & Land Development Specification (BLDS) standard

Version 0.m.n

Revised April 1, 2015

The document should make no assumptions about the knowledge of the reader. In particular, it must state that BLDS defines a set of files in Comma-Separated Values (CSV) format. These are separate plain text files, not tables in a single database file. Therefore, the data in the files is just text, not formally typed variables in a programming language or database engine (RDBMS).

It is essential to give each CSV file a required filename. Show a table of the files, such as this:

| File Name | Required? | Description |
| --- | --- | --- |
| publication_info.csv | yes | Contains information about this particular permits data source. |
| permits.csv | yes | Contains core information about current permits. |
| permits_history.csv | no | Contains historical information about permits, if any. |
| contractors.csv | no | Contains information about all licensed contractors working on permits, if any. |
| permit_contractor.csv | no | Shows which contractors are working on which permits. |
| inspections.csv | no | Contains information about inspections, if any. |

State some basic file requirements, such as:

  • All files must be saved as comma-delimited text and meet the requirements of the CSV file format.
  • The first line of each file must contain the specified field names.
  • Files may be published individually or collected in a ZIP file. The suggested naming convention for the latter is BLDS_PublisherName_YYYYMMDD.zip.
  • etc.
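
To illustrate how a consumer could check these requirements mechanically, here is a minimal Python sketch. The file names and required flags come from the table above; the `expected_headers` argument, the `PermitNum` and `Description` field names in the usage note, and the UTF-8 encoding are assumptions made for illustration, not part of the spec.

```python
import csv
import io
import zipfile
from datetime import date

# File names and "Required?" flags from the proposed file table.
BLDS_FILES = {
    "publication_info.csv": True,
    "permits.csv": True,
    "permits_history.csv": False,
    "contractors.csv": False,
    "permit_contractor.csv": False,
    "inspections.csv": False,
}


def zip_name(publisher, published):
    """Build the suggested archive name BLDS_PublisherName_YYYYMMDD.zip."""
    return f"BLDS_{publisher}_{published:%Y%m%d}.zip"


def check_archive(zf, expected_headers):
    """Return a list of problems found in an open BLDS ZIP archive.

    expected_headers maps a file name to its list of field names
    (hypothetical here; the real lists would come from the spec).
    """
    problems = []
    names = set(zf.namelist())
    for filename, required in BLDS_FILES.items():
        if filename not in names:
            if required:
                problems.append(f"missing required file: {filename}")
            continue
        with zf.open(filename) as raw:
            # Assumption: files are UTF-8 encoded text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            header = next(csv.reader(text), [])
        wanted = expected_headers.get(filename)
        if wanted is not None and header != wanted:
            problems.append(f"unexpected header in {filename}: {header}")
    return problems
```

A publisher could call `zip_name("Springfield", date(2015, 4, 1))` to name the archive and run `check_archive(zf, {"permits.csv": ["PermitNum", "Description"]})` before publishing.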

Note that in the list of files above I have included one called publication_info.csv. It is important to supply information about the source of the data, including the name and contact of the publishing entity, the date generated, the BLDS version used, and so on.

I have also included another new file called permit_contractor.csv. This file shows the many-to-many relationships between permits and contractors. This minimizes the data redundancy that would otherwise exist if contractor data is forced into the permits file or if permits data is forced into the contractors file.
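
As a rough sketch of how the link file would be used, the Python below resolves the contractors for each permit from three small CSV texts. The column names (`PermitNum`, `ContractorId`, `CompanyName`) and the sample rows are hypothetical and only illustrate the many-to-many idea.

```python
import csv
import io

# Hypothetical sample contents; column names are illustrative only.
PERMITS = "PermitNum,Description\nP-1,Deck\nP-2,Reroof\n"
CONTRACTORS = "ContractorId,CompanyName\nC-1,Acme Builders\nC-2,Best Roofing\n"
# One row per (permit, contractor) pair: the many-to-many link.
PERMIT_CONTRACTOR = "PermitNum,ContractorId\nP-1,C-1\nP-2,C-1\nP-2,C-2\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def contractors_by_permit(permits_text, contractors_text, links_text):
    """Map each permit number to the names of its contractors."""
    names = {
        row["ContractorId"]: row["CompanyName"]
        for row in read_rows(contractors_text)
    }
    # Start every permit with an empty list so permits without
    # contractors still appear in the result.
    result = {row["PermitNum"]: [] for row in read_rows(permits_text)}
    for link in read_rows(links_text):
        # A KeyError here would mean the link file references an
        # unknown permit or contractor.
        result[link["PermitNum"]].append(names[link["ContractorId"]])
    return result
```

Note that each contractor's name appears once in the contractors text, however many permits it is linked to.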

It is also important to take all the primary contractor fields out of the permits file; they don't logically belong there: the phone, address, etc. of a contractor are not information about a permit. The spec document mentions the creation of "additional datasets for each contractor", which sounds like one file per contractor (was that intended?). All contractors -- including the primary one on a specific permit -- must be listed in one file.

Several of these structural issues relate to the theoretical concept of database normalization. Even though the spec is not a database file, it is still important that the multi-file data structure be normalized, especially since many users of the data will want to import it into an RDBMS. Practical considerations can override perfect normalization, but the normalized form should be the starting point.
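
To make the import case concrete, here is a hedged sketch of loading normalized files into SQLite with Python's standard `sqlite3` module. The table layout, the column names, and the choice of `TEXT` for every column are assumptions for illustration; the real schema would follow whatever fields the spec finally defines.

```python
import csv
import io
import sqlite3

# Hypothetical normalized schema: contractor details live only in
# "contractors"; "permit_contractor" holds the many-to-many links.
SCHEMA = """
CREATE TABLE permits (
    PermitNum TEXT PRIMARY KEY,
    Description TEXT
);
CREATE TABLE contractors (
    ContractorId TEXT PRIMARY KEY,
    CompanyName TEXT,
    Phone TEXT
);
CREATE TABLE permit_contractor (
    PermitNum TEXT NOT NULL REFERENCES permits (PermitNum),
    ContractorId TEXT NOT NULL REFERENCES contractors (ContractorId),
    PRIMARY KEY (PermitNum, ContractorId)
);
"""


def load_csv(conn, table, text):
    """Insert the rows of one CSV text into the table of the same name.

    Table and column names are interpolated into the SQL, so this sketch
    is only suitable for trusted input such as the spec's own file names.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", data
    )


def build_database(files):
    """Create an in-memory database from a {table: csv_text} mapping.

    Parent tables must come before permit_contractor in the mapping.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    for table, text in files.items():
        load_csv(conn, table, text)
    conn.commit()
    return conn
```

With this layout, a contractor's phone number is stored once and reached from a permit through a join on `permit_contractor`.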

Some general points about how fields are treated in the spec document:

  • Remember that fields will contain only text, so technically they cannot have any other "data type". Instead of being assigned a "data type", each field should have a "format" (string, numeric, integer, date, time, Boolean, currency [2 decimals]).
  • Do not have separate tables of required, recommended, and optional fields. Combine them into one table, add a Required? column, and list the required ones first. Consider eliminating the distinction of "recommended".
  • There are many pairs of "raw" and "mapped" fields. The "raw" fields have simple names and the mapped fields append "Mapped" to the simple names of the raw fields. The use of the term "mapped" might be confusing, especially since buildings have a geographic location and a couple of other fields contain GIS coordinates. Since the idea is to provide both free-form and standardized values, a better convention might be to append "Raw" to the raw fields and "Std" to the standardized fields. E.g., FieldNameRaw and FieldNameStd.
  • For the standardized ("mapped") fields, clearly state the full enumerated list and explicitly state these exact values are the only valid ones.
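
The "format" and enumerated-list points above can be sketched as a small Python validator. Everything specific here is an assumption for illustration: the field names (`IssuedDate`, `Fee`, `PermitClassStd`), the YYYY-MM-DD date layout, the `true`/`false` Boolean spelling, and the placeholder enumerated values. The spec itself would have to define all of these.

```python
import re
from datetime import datetime


def _is_date(value):
    """Accept only YYYY-MM-DD text that is also a real calendar date."""
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value) is None:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


# Each "format" is a rule for the text of a field, not a data type.
FORMAT_CHECKS = {
    "string": lambda v: True,
    "integer": lambda v: re.fullmatch(r"-?[0-9]+", v) is not None,
    "numeric": lambda v: re.fullmatch(r"-?[0-9]+(\.[0-9]+)?", v) is not None,
    "currency": lambda v: re.fullmatch(r"-?[0-9]+\.[0-9]{2}", v) is not None,
    "date": _is_date,
    "boolean": lambda v: v in ("true", "false"),
}

# Hypothetical field table: name -> (format, allowed values or None).
FIELD_SPECS = {
    "PermitNum": ("string", None),
    "IssuedDate": ("date", None),
    "Fee": ("currency", None),
    # Placeholder enumerated list for a standardized ("Std") field.
    "PermitClassStd": ("string", {"Residential", "Non-Residential"}),
}


def validate_row(row):
    """Return a list of problems for one row (a dict of field -> text)."""
    problems = []
    for name, (fmt, allowed) in FIELD_SPECS.items():
        value = row.get(name, "")
        if value == "":
            continue  # Empty means not supplied; required-ness is a separate check.
        if not FORMAT_CHECKS[fmt](value):
            problems.append(f"{name}: {value!r} is not a valid {fmt}")
        elif allowed is not None and value not in allowed:
            problems.append(f"{name}: {value!r} is not one of the enumerated values")
    return problems
```

Because the comparison against the enumerated list is exact, a value such as `residential` would be rejected, which matches the requirement that only the listed values are valid.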

c-sayre, Mar 29 '15 02:03