
Discussion: data quality / data handling issues

Open mattyweb opened this issue 4 years ago • 5 comments

I was running some API tests and noticed discrepancies between some of the results, which started me digging into the 311 data, where I saw some issues. Apologies if this ground has already been well trodden, but I'd like to understand how the project has decided to handle problems with the quality of the data coming from the 311 system.

First, a quick recap of the current data loading approach as I understand it:

Automated (nightly) 311 data loading from Socrata:

  • Ingests data as-is from Socrata to stage table
  • Copies stage data to the requests table
  • Creates/refreshes materialized views with minor validation (e.g. lat/long not null)
  • API then queries the views

Manual NC geojson data loading:

  • File is exported manually
  • A transformation is applied to clean the names
  • File is included in client app bundle
  • App uses info in file to query the API (which in turn queries the DB)

Some problems:

  1. 311 data and NC file are updated on different schedules: could get out of sync
  2. 311 data and NC file have different NCs: file is missing NCs (2, 65, 117, 122, 123)
  3. NCs change over time but 311 data is not updated (e.g. if a boundary changes, should the data reside with the old or the new NC?)
  4. 311 data and NC file have completely different sets of regions (NC groupings)
  5. Some 311 data have no location info at all (no lat/long, no NC, etc.)

I'm sure there are more that I'm missing.
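Problem 2 in the list above could be checked automatically rather than by eye. Here is a minimal sketch, assuming the NC ids have already been pulled out of the requests table and out of the NC file into plain Python lists. The function name and the sample values are hypothetical and not part of the project's code.

```python
def missing_from_file(request_nc_ids, file_nc_ids):
    """Return the NC ids that appear in requests but not in the NC file."""
    # Requests with no NC at all (None) are a separate problem, so skip them.
    seen = {nc_id for nc_id in request_nc_ids if nc_id is not None}
    return sorted(seen - set(file_nc_ids))


# Hypothetical sample values: real ids would come from the requests table
# and from the features in the NC geojson file.
request_nc_ids = [4, 4, 2, None, 65, 6, 117, 122, 123]
file_nc_ids = [4, 6]

print(missing_from_file(request_nc_ids, file_nc_ids))
```

Running something like this after each nightly load would surface a sync problem as soon as the two sources drift apart.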

Anyway, it feels like we need some general rules to abide by for handling bad data so we can get consistent results that we can defend when the city or council members point out something they don't expect. If these are already documented somewhere please let me know.

mattyweb avatar Sep 05 '20 19:09 mattyweb

Some more information about data quality issues...

  • 0.13% of the data is missing lat/long (probably OK to ignore?)
  • all data missing lat/long is also missing NC
  • 1.5% of the data is associated with a missing NC (not sure if it's OK to ignore)
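For reference, the figures above can be reproduced with a small summary function. This is only a sketch: the field names (`latitude`, `longitude`, `nc`) and the sample rows are assumptions for illustration, not the actual column names in the requests table.

```python
def summarize(rows, known_nc_ids):
    """Summarize location-related data quality for a list of request dicts."""
    total = len(rows)
    no_location = [
        r for r in rows if r["latitude"] is None or r["longitude"] is None
    ]
    # Requests that carry an NC id which is not in the NC file.
    unknown_nc = [
        r for r in rows if r["nc"] is not None and r["nc"] not in known_nc_ids
    ]
    return {
        "pct_no_location": 100 * len(no_location) / total if total else 0.0,
        "pct_unknown_nc": 100 * len(unknown_nc) / total if total else 0.0,
        # True when every request without lat/long also has no NC.
        "no_location_implies_no_nc": all(r["nc"] is None for r in no_location),
    }


# Hypothetical rows; field names are assumed for this sketch.
rows = [
    {"latitude": 34.05, "longitude": -118.24, "nc": 4},
    {"latitude": 34.10, "longitude": -118.30, "nc": 4},
    {"latitude": None, "longitude": None, "nc": None},
    {"latitude": 34.20, "longitude": -118.50, "nc": 65},
]
print(summarize(rows, known_nc_ids={4}))
```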

The missing NCs (FYI, none appear here: https://empowerla.org/councils/):

  • (2,"OLD NORTHRIDGE CC") -- maybe this NC got split into multiple? could possibly re-encode
  • (65,"BRENTWOOD CC") -- maybe this is no longer in 311 data?
  • (117,"PACIFIC PALISADES NC") -- maybe this is no longer in 311 data?
  • (122,"HISTORIC FILIPINOTOWN NC") -- ?
  • (123,"UNITED FOR VICTORY") -- ?

Some weird NC geojson file stuff:

  • North Hills West is in the North East Valley region
  • North Hills East is in the North West Valley region
  • There are literally dozens of overlapping geometries in the file (this would make it difficult for us to encode data points to NCs ourselves if that needed to happen)
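To narrow down which geometries might overlap, a cheap first pass is to compare bounding boxes. The sketch below is not a true polygon-overlap test: neighboring councils will often have intersecting bounding boxes without actually overlapping, so the pairs it returns are only candidates that would still need checking with a proper geometry library such as Shapely. It assumes standard GeoJSON `Polygon` features and a hypothetical `name` property; the real file may use a different property key and may contain `MultiPolygon` features, which are skipped here.

```python
from itertools import combinations


def bbox(ring):
    """Bounding box (min_x, min_y, max_x, max_y) of a list of [lng, lat] points."""
    xs = [point[0] for point in ring]
    ys = [point[1] for point in ring]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a, b):
    """True if two boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def candidate_overlaps(features):
    """Return pairs of feature names whose bounding boxes overlap."""
    boxes = {}
    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # MultiPolygon handling is left out of this sketch
        # coordinates[0] is the exterior ring of a GeoJSON Polygon.
        # "name" is an assumed property key.
        boxes[feature["properties"]["name"]] = bbox(geometry["coordinates"][0])
    return [
        (name_a, name_b)
        for (name_a, box_a), (name_b, box_b) in combinations(boxes.items(), 2)
        if boxes_overlap(box_a, box_b)
    ]


def square(name, x0, y0, x1, y1):
    """Build a hypothetical square Polygon feature for the example."""
    ring = [[x0, y0], [x1, y0], [x1, y1], [x0, y1], [x0, y0]]
    return {
        "type": "Feature",
        "properties": {"name": name},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


features = [square("A", 0, 0, 2, 2), square("B", 1, 1, 3, 3), square("C", 5, 5, 6, 6)]
print(candidate_overlaps(features))
```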

mattyweb avatar Sep 05 '20 19:09 mattyweb

Hey @mattyweb, thanks for starting this. It's an important discussion, and we should definitely address a lot of these issues in version 2. Adding other devs: @adamkendis @hannahlivnat @tan-nate @JRHutson

Your description of the process above is mostly complete. There's some minor data cleaning that happens in the staging table before we copy to the requests table. Also, the API calls are based on the constants file (client/components/common/CONSTANTS.js), rather than the geojson file, which I believe is currently only used inside the Map component. Ideally we would merge the constants with the geojson and move everything to the back end.

The constants file also has some additional notes about data issues, which I'm copying here:

  NC regions and names are from here:
    https://empowerla.org/councils-by-service-region/

  Note that Central Avenue Historic is listed there (Region 9 - South LA 2),
  but it isn't anywhere in the DB, so we don't know the id number. It also
  isn't in the nc-boundary json. It's in this list with an id of -1, which
  won't return any results but also won't hurt anything.

  Also note the 4 councils that are commented at the end of the list.
    Historic Filipinotown NC -- #122
    Old Northridge CC -- -- #2
    Brentwood CC -- -- #65
    Pacific Palisades CC --  -- #117
  These are NOT listed on empowerla's website, and they aren't in the
  nc-boundary json. But there are requests associated with them in the DB.

In addition, there are at least two other issues with the geojson. It gives 'NORTH WESTWOOD NC' an id of 0, even though there are requests in the city's database that give it an id of 127 (in addition to some requests that give it an id of 0). And it gives 'HISTORIC CULTURAL NORTH NC' an id of 0, even though the city's database gives it an id of 128 (and sometimes 0).
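Conflicts like these can be found by grouping ids by council name and flagging any name that maps to more than one id. A rough sketch, assuming the (id, name) pairs have been collected from both the geojson and the requests data; the function name and the 'SOME OTHER NC' row are made up for illustration.

```python
from collections import defaultdict


def conflicting_ids(pairs):
    """Map each NC name to its ids, keeping only names with more than one id."""
    ids_by_name = defaultdict(set)
    for nc_id, name in pairs:
        ids_by_name[name].add(nc_id)
    return {
        name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1
    }


# (id, name) pairs as they might appear across the geojson and the requests
# data. 'SOME OTHER NC' is a hypothetical, conflict-free entry.
pairs = [
    (0, "NORTH WESTWOOD NC"),
    (127, "NORTH WESTWOOD NC"),
    (0, "HISTORIC CULTURAL NORTH NC"),
    (128, "HISTORIC CULTURAL NORTH NC"),
    (4, "SOME OTHER NC"),
]
print(conflicting_ids(pairs))
```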

I don't think we've ever updated the geojson (at least not since I've been here) so we don't really have a process for that. And I'm not sure the city does either. We recently got a note from a user saying that the boundaries were off.

UPDATE: Here are "certified" versions of the neighborhood council boundaries: https://data.lacity.org/A-Well-Run-City/Neighborhood-Councils-Certified-/fu65-dz2f

And the city council boundaries: https://data.lacity.org/A-Well-Run-City/Council-Districts/5v3h-vptv

jmensch1 avatar Sep 05 '20 20:09 jmensch1

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in X days.

github-actions[bot] avatar May 04 '21 02:05 github-actions[bot]

This issue was never added to the v2 board, so it did not get prioritized. We will be continuing the cleanup of the board over the next few weeks and will review this issue as part of that.

ExperimentsInHonesty avatar Aug 27 '21 04:08 ExperimentsInHonesty

@adamkendis what role should be responsible for leading this discussion?

EchoProject avatar Oct 15 '21 02:10 EchoProject