
U.S. federal government data collection use case

Open joehand opened this issue 8 years ago • 1 comment

From @feomike on August 19, 2014 15:19

The intent of this issue is to offer a potential use case for dat developers to consider in future development.

Federal government data collection (general)

The federal government collects all kinds of data. A typical data collection goes something like this: the government provides a data template (e.g. a fixed-field-length or CSV file, an XML spec, or a similar data specification), describes valid values and business rules for that spec (e.g. field 1 can only be a number between 1 and 10), and then also builds a portal to manage the users submitting data (data producers are given a login/password, must go to a site, enter metadata, upload a file, and then wait for a response saying whether the data was accepted). One can only imagine the arcane ways this technology approach has been implemented across the government.
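
To make the valid-values/business-rules idea concrete, here is a minimal sketch of one such rule expressed as data plus a checker, in TypeScript. The rule shape, the field name, and the checkRow helper are all hypothetical illustrations, not part of any actual federal spec or of dat itself:

```typescript
// Hypothetical shape for a field-level business rule.
interface FieldRule {
  field: string;
  validate: (value: string) => boolean;
  message: string;
}

const rules: FieldRule[] = [
  {
    field: "field1",
    // "field 1 can only be a number between 1 and 10"
    validate: (v) => /^\d+$/.test(v) && Number(v) >= 1 && Number(v) <= 10,
    message: "field1 must be an integer between 1 and 10",
  },
];

// Check one row of a submission against every rule; return the
// messages for the rules it fails.
function checkRow(row: Record<string, string>): string[] {
  return rules
    .filter((r) => !r.validate(row[r.field] ?? ""))
    .map((r) => r.message);
}
```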

Generally speaking, dat might help solve problems that cost both the federal government and the data producers who must submit data. Dat could prove useful along these lines:

  • An API for datasets that would allow reporting and error checking of valid values/business rules. I know this more or less exists, but what I am interested in here is the opportunity to throw a truckload of rules/queries, for reporting purposes, at a dataset and easily generate aggregate statistics (both errors and summaries); a sketch of this follows the list.
  • It might be interesting to identify opportunities to actually implement business rules in the data definitions, though I understand very well why this might not be something to take on.
  • The real value could come from embracing distributed/federated data so that cumbersome and expensive 'portals' need not be built. Say a data producer has a local instance of dat with its dataset ready to go. The normal federal method would be to export that data into a CSV/fixed-field/XML file, then go to a bad Java form page to log in with a username and password the producer uses once a year (so they need to reset it or have been locked out), enter the metadata, upload the file (multiple times, because of web timeout issues), and then wait for a bad response on the rejection/acceptance of the data. What if, in a distributed/federated model, dat allowed a data producer to grant access to a single dataset to a single consumer (the federal government)? In that capacity there is no bad portal: the transfer of data is a pull request (or similar) from the data producer to the data collector (the government), there is no management of users/passwords, and there is simply a system of producers and consumers.
  • These data are often sensitive in nature, containing personally identifiable information (PII) or other data that requires significant security controls. The federal government has a whole host of security guidelines governing these kinds of things, commonly called FISMA. For use in the federal government it would be advantageous if some of the security controls outlined in FISMA could be implemented (caution: this part is not for the faint of heart).
  • Having dat as an underlying technology is the benefit; requiring every data producer to actually install, configure, and manage dat might be too much overhead. Some way to use dat without the IT management burden would be interesting, since some data providers are very small, unique operations with little IT background.
  • Resubmission/diff: once a dataset is collected, resubmission is often needed because of changes in the time period, data errors caught post facto on either the producer or collector side, and/or audits. A resubmission that constitutes a data diff sent as a request/message would be helpful (again, this seems to be stock with dat); a minimal diff sketch also follows below.
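
As a sketch of the first bullet (batch rule-checking with aggregate statistics), the following reuses the hypothetical checkRow from the earlier sketch to tally error counts across a whole dataset. None of this is an actual dat API; it only illustrates the kind of reporting being asked for:

```typescript
// Run every rule against every row and aggregate the failures,
// producing summary statistics for the whole submission.
function summarize(rows: Record<string, string>[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    for (const message of checkRow(row)) {
      counts.set(message, (counts.get(message) ?? 0) + 1);
    }
  }
  // e.g. Map { "field1 must be an integer between 1 and 10" => 3 }
  return counts;
}
```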

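For the resubmission/diff bullet, here is a minimal sketch of the kind of row-level diff a resubmission message could carry, keyed by a record id. This is a deliberately naive illustration; dat's own replication/versioning would presumably handle this more efficiently, and nothing here is dat's actual API:

```typescript
type Row = Record<string, string>;

// Compare two versions of a dataset, each keyed by record id, and
// report what a resubmission would need to transmit.
function diffDatasets(prev: Map<string, Row>, next: Map<string, Row>) {
  const added: Row[] = [];
  const removed: Row[] = [];
  const changed: Row[] = [];
  for (const [id, row] of next) {
    const old = prev.get(id);
    if (!old) added.push(row);
    // Naive field comparison; real versioning would be smarter.
    else if (JSON.stringify(old) !== JSON.stringify(row)) changed.push(row);
  }
  for (const [id, row] of prev) {
    if (!next.has(id)) removed.push(row);
  }
  return { added, removed, changed };
}
```
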
A specific case

The Home Mortgage Disclosure Act (HMDA), passed in 1975, collects data from financial institutions (banks and non-depository lenders that loan money for mortgages). This data is used to help identify unfair lending practices, to provide public access for sunlight, and to ensure financial institutions are making enough credit available to their communities.

HMDA data is loan-level data (i.e. the government requires financial institutions to report all loans meeting certain criteria). Loan-level data includes the originator, borrower information (including race and ethnicity), loan amount, location, and more. The full current specification of the data to be submitted is found at http://www.ffiec.gov/hmda/fileformats.htm.
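
For illustration only, here is a loose TypeScript shape covering the loan-level fields mentioned above. The field names and types are invented for readability; the authoritative field list and encodings are in the FFIEC file formats linked above:

```typescript
// Illustrative shape only; see the FFIEC spec for the real fields.
interface LoanRecord {
  respondentId: string;       // originating institution identifier
  loanAmount: number;         // units/encoding per the FFIEC spec
  applicantRace: string;      // coded value per the spec
  applicantEthnicity: string; // coded value per the spec
  location: string;           // e.g. a census tract or similar geography
}
```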

HMDA data is collected (from January to March, for the previous calendar year) and amounts to about 18-20 million rows of data annually, from roughly 7,200 different financial institutions. Institutions range from large multi-billion-dollar corporations (e.g. Wells Fargo) to very small institutions (even limited liability corporations) that might not hold deposits at all but have carved out a business lending for some specific need (e.g. Mike's LLC, which is not real).

Capacity to produce the file format and satisfy all of the subsequent business rules for the data (http://www.ffiec.gov/hmda/edits.htm) varies with the size of the submitting institution (e.g. Wells Fargo likely has very large IT capacity; Mike's LLC is perhaps an MS Excel user).

The government does a thorough post-submission analysis of the data with more robust edits/business rules to evaluate the quality of the submitted data. These evaluations can result in resubmission of data (e.g. all of Mike's LLC's 10,000 rows of loans were reportedly made on April 1); a sketch of one such check follows.
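
A minimal sketch of the kind of post-submission quality edit described above: flag a submission in which every loan reports the same action date (the "all made on April 1" case). The actionDate field name is illustrative, not from the FFIEC spec:

```typescript
// Flag a submission whose loans all share one action date, which is
// implausible for any real lending institution of meaningful size.
function allSameActionDate(loans: { actionDate: string }[]): boolean {
  if (loans.length < 2) return false;
  return loans.every((l) => l.actionDate === loans[0].actionDate);
}
```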

Copied from original issue: maxogden/dat#153

joehand · Jun 17 '16 18:06