justice40-tool icon indicating copy to clipboard operation
justice40-tool copied to clipboard

Produce FIPS list for each dataset

Open emma-nechamkin opened this issue 2 years ago • 20 comments

Overview

The zip file contains all of the data for which tracts we push from end-to-end on our ETL (so only what makes it "out" alive). We should verify that this is all of the data we could conceivably have manually and by data source.

What does it mean to verify a data source? Here, our goal is to make sure that the data we see in the backend is comprehensive relative to what data we "should" have (see the Wisconsin error). We are not looking in code nor are we trying to "fix" anything -- this is purely a diagnostic exercise.

Files to look at

This issue has two comments below, each of which contain spreadsheets with information. In the ZIP file, there are Excel sheets per data source.

Each data source excel sheet has five columns:

  • fips_code: two digit census code
  • <DATASETNAME>__count_tracts: how many census tracts the dataset has within the fips code
  • state_name: name of the state represented by the fips code
  • state_abbreviation: abbreviation of the state represented by the fips code
  • region: the region of the state (census defines this)
  • division: the division of the state (census defines this)

Example of how to read it: in the sheet hud_housing__fips_check.xlsx, the first row tells me that the hud housing datasource includes 1181 tracts in Alabama.

There is also the whole_list.xlsx file, which allows us to compare across data sources. This file has a slightly more confusing structure.

  • each column has the count of tracts within a FIPS code
  • each row represents a data source's results

Example of how to read it: the first row tells me that census_acs_median_income_2019__count_tracts has 1178 tracts in fips code 01. This means that the data source census acs median income includes 1178 tracts in fips code 01.

What to do per data source

  • [ ] Check that the data that we have matches the source data in terms of coverage*: Start in the DATASOURCE__fips_check.csv file. For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data. The source data is either on S3 or publicly-available. (This checks for errors in the original data we have, a la the WI error)
  • [ ] Check that number of tracts included by FIPS code makes sense: Open the whole_list.xlsx file. The ground truth for all but Island Areas is the census_acs_2010__count_tracts. The ground truth for the Island Areas is census_dicennial_2010__count_tracts. Compare the number of tracts per FIPS code and note down any anomalies.

Template to fill out per data source

DATASOURCE NAME

High-level checks

  • [ ] Our data matches the source data in terms of state-level coverage
  • [ ] We have somewhat complete data for all 50 states
  • [ ] We have somewhat complete data for Puerto Rico
  • [ ] We have somewhat complete data for the Island Areas

More information

  • What aberrations did you see?
  • Is there anything else of note?

emma-nechamkin avatar Aug 18 '22 20:08 emma-nechamkin

fips_check.zip this has all of the tracts we have in the pipeline

emma-nechamkin avatar Aug 18 '22 21:08 emma-nechamkin

This has the "overall list" so you can see relatively WHERE data is coming from. whole_list.xlsx

emma-nechamkin avatar Aug 19 '22 14:08 emma-nechamkin

I think @katherinedm-usds and Sindu could do this!

emma-nechamkin avatar Aug 19 '22 21:08 emma-nechamkin

@emma-nechamkin - love the details here TY!

vim-usds avatar Aug 19 '22 22:08 vim-usds

@katherinedm-usds - thought I would update here so that we can keep all info on this ticket in one place! I'm writing as I'm discovering so please excuse all the things you may know already!

Catching up

Looking at Emma's instructions:

Check that the data that we have matches the source data in terms of coverage*: Start in the DATASOURCE__fips_check.csv file. For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data. The source data is either on S3 or publicly-available. (This checks for errors in the original data we have, a la the WI error)

When I download fips_check I get a list of xlsx files, per data source:

Screen Shot 2022-08-22 at 8 25 10 PM

Opening the first one census_acs_2010_fips_check, leads to the following FIPS having < 10 tracts:

  • 60 (AS)
  • 66 (GU)
  • 69 (MP)
  • 78 (VI)

(later realizing that those are the ones we skip in the case of census_acs_2010, so maybe this purposeful skipping changes the testing strategy? - not sure)

OK back to Emma's instructions:

For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data. The source data is either on S3 or publicly-available.

For each DATASOURCE_fips_check file shown above, below, is a link to the source data (S3 or publicly available) for each data source, followed by the FIPS codes that have < 10 tracts.

[DATASOURCE_fips_check] : [FIPS codes < 10 tracts]

geocorr__fips_check: AS, GU, MP, VI doe_energy_burden__fips_check: AS, GU, MP, VI cdc_places__fips_check: AS, GU, MP, VI housing_and_transportation_index__fips_check: AS, GU, MP, VI national_risk_index__fips_check: AS, GU, MP, VI fsf_flood_risk__fips_check: AS, GU, MP, VI nlcd_nature_deprived__fips_check: AS, GU, MP, VI, HI, PR travel_composite__fips_check: AS, GU, MP, VI, PR hud_housing__fips_check: AS, GU, MP, VI historic_redlining__fips_check: many - check file us_army_fuds__fips_check: VI, VT epa_rsei__fips_check: none fsf_wildfire_risk__fips_check: AK, AS, GU, MP, VI, HI, PR eamlis__fips_check: check file - there are many cdc_life_expectancy__fips_check: AS, GU, ME, MP, PR, VI, WI ejscreen__fips_check: AS, GU, MP, VI

Still working on these: census_acs_2010__fips_check census_acs_median_income_2019__fips_check census_decennial_2010__fips_check census_acs_2019__fips_check

I think these are obtained by some API call? I was trying to read the code and that's what I think it's doing. Not totally sure tho. If they are API calls, I bet there's a way to create static postman links to each? @esfoobar-usds

Let me know if that is good place to start!

EXTRA (how to figure out the source data URLs from the code)

Let's find out more about these source data. We can go to the sources folder (in the BE release branch, not main) here: https://github.com/usds/justice40-tool/tree/4bf7773797d2fdf28e01d4f466b15812c5578c4b/data/data-pipeline/data_pipeline/etl/sources

Here we can see a folder for each data source. When I click into the first one calenviroscreen, I see an option to click into the etl.py. This file will have a class in it that will tell us if the data source is coming from S3 or if it's from a public URL.

If we look within the class definition, we see that this one is defined by an AWS_JUSTICE40_DATASOURCES_URL, so we know this data source's source data can be found on S3 and we also know that it's named CalEnviroScreen_4.0_2021.zip.

The exact code is here:

class CalEnviroScreenETL(ExtractTransformLoad):
    def __init__(self):
        self.CALENVIROSCREEN_FTP_URL = (
            settings.AWS_JUSTICE40_DATASOURCES_URL
            + "/CalEnviroScreen_4.0_2021.zip"
        )

vim-usds avatar Aug 23 '22 05:08 vim-usds

@katherinedm-usds - to answer your question from stand up today, the base URL for S3 is

https://justice40-data.s3.amazonaws.com/data-sources/

So placing the source data file after this URL will allow you to download it:

https://justice40-data.s3.amazonaws.com/data-sources/CalEnviroScreen_4.0_2021.zip

Let me know if you have any other questions. I'd love to level 1 support and Emma/Jorge can be level 2! Thx!

vim-usds avatar Aug 23 '22 15:08 vim-usds

FYI @vim-usds: the census data is not requested directly via an API call, but through a Python library called censusdata which handles the URL endpoints and authentication necessary to retrieve the data.

esfoobar-usds avatar Aug 23 '22 18:08 esfoobar-usds

Test update, specifically looking at whole_list and within each FIPS comparing for each dataset the number of tracts with data available vs. the highest number of tracts that data should be available for, in order to understand if there's unexpected and problematic gaps in the data:

  1. I started to go through whole_list FIPS-by-FIPS and note which datasets seemed to have relatively large numbers of tracts missing
  • FIPS 01 Alabama 1101 tracts for cdc_life_expectancy__count_tracts vs. 1181 (6.8%)
  • FIPS 02 Alaska 137 tracts for epa_rsei__count_tracts vs. 167 (18% of tracts)
  • FIPS 02 Alaska missing nlcd_nature_deprived__count_tracts data
  • FIPS 02 Alaska missing fsf_wildfire_risk__count_tracts data
  • FIPS 04 Arizona 1477 tracts for epa_rsei__count_tracts vs. 1526 (3.2% of tracts)
  • FIPS 04 Arizona 1299 tracts for cdc_life_expectancy__count_tracts vs. 1526 (14.9% of tracts)
  • FIPS 05 Arkansas data seems good
  • FIPS 06 California 7516 tracts for cdc_life_expectancy__count_tracts vs. 8057 (6.7% of tracts)
  • FIPS 08 Colorado 1206 tracts for epa_rsei__count_tracts vs. 1249
  • FIPS 09 Connecticut 783 tracts for cdc_life_expectancy__count_tracts vs. 833
  1. Looked into cdc_life_expectancy__count_tracts which across multiple states often has a relatively large number of missing tracts
  • Based on USALEEP limitations https://usaleep.nptoolkit.org/wp-content/uploads/sites/19/2018/05/USALEEP-Data-Uses-and-Limitations_FINAL.pdf:

If no life expectancy estimate was included for your census tract, that was because the number of deaths and/or the estimated size of the population in your census tract was too small to calculate an accurate life table and estimate the life expectancy with a reasonable degree of confidence. The life expectancy estimate from a neighboring census tract may be useful as a substitute, especially if the two census tracts are similar in terms of the ages, race/ethnicity, and other characteristics (such as education and income) of the residents. Another option is to use county level life expectancy estimates, although these estimates will mask disparities among census tracts within the county.

  • Next steps - @emma-nechamkin either pull for me a small sample of tract IDs where there’s no data to confirm the above is why they are missing data, or look this up in raw data cdc_life_expectancy__fips_check. US_A 2 does not show tract IDs where there is no life expectancy age set, it just shows there are 65,535 tracts with life expectancy set
  1. Looked into epa_rsei__count_tracts which is also missing tracts https://www.epa.gov/system/files/documents/2022-06/RSEI%20Methodology%20V2.3.10.pdf

TRI reporting requirements require the reporting facility to indicate the waste management activity used at the off-site facility. If this information is not reported (despite the requirement), the off-site transfer is not evaluated in the RSEI algorithm, but is flagged as a missing value and assigned a zero quantity.

  • But their methodology has multiple ways to fill in missing weights, it doesn’t seem like we should be missing any tracts
  • Next step for @emma-nechamkin to either pull a few tract IDs with missing data for me to look at or she checks source data epa_rsei__fips_check

After this I plan to go through the rest of FIPS to see what additional dataset jump out, and also compare datasets with FIPS that have <10 tracts to source which Vim helpfully provided links o

katherinedm-usds avatar Aug 24 '22 16:08 katherinedm-usds

@katherinedm-usds - I am able to do some very super basic data manipulation using Python (specifically Jupyter notebooks). If I can help here please let me know. if I can't answer it, may we can push it up to @emma-nechamkin?

vim-usds avatar Aug 24 '22 16:08 vim-usds

I think you can stay on your other tasks @vim-usds bc it should be pretty quick for Emma to pull for me a few sample tract IDss to check - thank you for offering!

katherinedm-usds avatar Aug 24 '22 17:08 katherinedm-usds

Here are 10 tracts for CDC life expectancy that are nulls in our data and are in the FIPS codes above:

['04013092723',
 '06037186301',
 '06037221820',
 '06029003803',
 '09001045300',
 '06037433901',
 '06053980000',
 '04013111501',
 '06017030901',
 '04003001701']

For the RSEI data, we aren't using it in the actual score so I think we can ignore for now (or file away in our brains that yes, we have missing data, but it's not super urgent to identify exactly what that data is? wdyt?)

emma-nechamkin avatar Aug 24 '22 20:08 emma-nechamkin

for these:

FIPS 02 Alaska missing nlcd_nature_deprived__count_tracts data FIPS 02 Alaska missing fsf_wildfire_risk__count_tracts data

I believe this is correct! We know the NLCD data is missing in AK and HI (I believe!) and I think the wildfire also uses CONUS like NCLD (conus = contiguous us states).

Please lmk if this is what you need @katherinedm-usds, you are very much my hero!

emma-nechamkin avatar Aug 24 '22 20:08 emma-nechamkin

Test update, using attached spreadsheet which is an updated version of whole_list; second column below each FIPS code is the percentage of missing tracts in the respective dataset of that state and highlighted in green are those datasets/states with >3% missing tract data.

To add onto Katherine's running list of flagged FIPS/datasets (and excluding notes on CDC Life Expectancy dataset for now)--

  • FIPS 15 Hawaii: 8.55% missing tracts for census_acs_median_income_2019__count_tracts
  • FIPS 15 Hawaii: 8.55% missing tracts for fsf_flood_risk__count_tracts,
  • FIPS 15 Hawaii 5.70% missing tracts for travel_composite__count_tractss
  • FIPS 15 Hawaii: 9.97% missing tracts for cdc_places__count_tracts
  • FIPS 15 Hawaii: 11.11% missing tracts for doe_energy_burden__count_tract
  • FIPS 15 Hawaii: 8.55% missing tracts for geocorr__count_tracts
  • FIPS 15 Hawaii: 10.54 missing tracts for census_acs_2010__count_tracts
  • FIPS 15 Hawaii: 7.12% missing tracts for housing_and_transportation_index__count_tracts
  • FIPS 15 Hawaii: 7.12% missing tracts for national_risk_index__count_tracts
  • FIPS 21 Kentucky: 5.11% missing tracts for fsf_flood_risk__count_tracts and fsf_wildfire_risk__count_tracts
  • FIPS 22 Louisiana: 3.14% missing tracts for fsf_flood_risk__count_tracts and fsf_wildfire_risk__count_tracts
  • FIPS 21 Maine: 8.94% missing tracts for fsf_flood_risk__count_tracts and fsf_wildfire_risk__count_tracts
  • FIPS 30 Montana: 16.61% missing tracts for epa_rsei__count_tracts
  • FIPS 35 New Mexico: 7.41% missing tracts for epa_rsei__count_tracts
  • FIPS 38 North Dakota: 7.32% missing tracts for epa_rsei__count_tracts
  • FIPS 46 South Dakota: 10.81% missing tracts for epa_rsei__count_tracts
  • FIPS 56 Wyoming: 6.82% missing tracts for epa_rsei__count_tracts

Holding off on territories for now!

CEJST Dataset Tracts per FIPS TESTING.xlsx

sindu-ri avatar Aug 26 '22 17:08 sindu-ri

@sindu-ri this is fantastic, thank you for all these findings!

katherinedm-usds avatar Aug 26 '22 19:08 katherinedm-usds

For PR: We will be missing much data, but do have workforce and EJScreen for sure. We should also null out all of LINGUISTIC ISOLATION Islands: ONLY workforce data, all from Census 2010 AK and HI: no FSF wildfire

emma-nechamkin avatar Aug 26 '22 20:08 emma-nechamkin

Recapping notes and next steps coming out of Sindu's findings (thank you @sindu-ri!) and other discussion from Friday, to be continued this week

Hawaii

  1. Katherine look at data source documentation for the datasets Sindu listed to see if it aligns with the % of tracts Sindu found missing in HI
  • 8.55% missing tracts for census_acs_median_income_2019__count_tracts
  • 8.55% missing tracts for fsf_flood_risk__count_tracts,
  • 5.70% missing tracts for travel_composite__count_tractss
  • 9.97% missing tracts for cdc_places__count_tracts
  • 11.11% missing tracts for doe_energy_burden__count_tract
  • 8.55% missing tracts for geocorr__count_tracts
  • 10.54 missing tracts for census_acs_2010__count_tracts
  • 7.12% missing tracts for housing_and_transportation_index__count_tracts
  • 7.12% missing tracts for national_risk_index__count_tracts
  1. check with @emma-nechamkin if we can get total list of missing tracts for the datasets that have the exact same number of missing tracts (i.e. same 8.55% missing tracts for median_income and energy_burden), or otherwise try to understand why these have the shared # missing tracts - are all the tracts on very sparsely populate islands, etc.

Kentucky, Louisiana, Maine with relatively large % missing tracts for fsf_flood_risk__count_tracts and fsf_flood_risk__count_tracts

  • Next steps, sample tract IDs

Puerto Rico

The datasets that make it through the staging pipeline are

  • census_acs_median_income_2019__count_tracts
  • epa_rsei__count_tracts
  • fsf_flood_risk__count_tracts
  • doe_energy_burden__count_tracts
  • us_army_fuds__count_tracts
  • census_acs_2010__count_tracts
  • ejscreen__count_tracts
  • census_acs_2019__count_tracts
  • hud_housing__count_tracts but we are not incorporating doe_energy_burden__count_tracts and hud_housing__count_tracts into the score, and as @emma-nechamkin notes in her comments above, we "do have workforce and EJScreen for sure. We should also null out all of LINGUISTIC ISOLATION". New separate bug ticket to be be opened.

Re 3 new datasets, eamlis__count_tracts, historic_redlining__count_tracts, us_army_fuds__count_tracts

  • I won't go further in checking, Matt wrote a suite of tests

Re doublechecking FIPS with < 10 tracts are not represented in source data

  • i.e. make sure AS 60, GU 66, MP 69, VI 78 are not in https://justice40-data.s3.amazonaws.com/data-sources/DOE_LEAD_AMI_TRACT_2018_ALL.csv.zip
  • I'm still checking this as of 8/29; I'll need to use my new computer for this

katherinedm-usds avatar Aug 29 '22 07:08 katherinedm-usds

@katherinedm-usds - for the FUDS data, checked number of occurrences of VT and VI in the geojson file:

Came up with this

  • there are 14 rows that are for VT
  • there are 11 rows that are for VI

In case we're interested in how this was done:

Screen Shot 2022-08-30 at 10 36 15 AM

vim-usds avatar Aug 30 '22 20:08 vim-usds

@katherinedm-usds - here is the results for geocorr:

Screen Shot 2022-08-30 at 2 10 38 PM

vim-usds avatar Aug 30 '22 21:08 vim-usds

this is great thanks @vim-usds ! looks like there's a discrepancy between the number of tracts for VI and VT in data source us_army_fuds__fips_check vs. our fips_check that we can check with Emma about after standup

katherinedm-usds avatar Aug 31 '22 09:08 katherinedm-usds

okay so there are a few things going on here. The data in that geojson is not going to perfectly match the data that we see in usa.csv for a few reasons:

  1. Not every FUDS is eligible for inclusion
  2. We assign FUDS to their census tracts. Some FUDS cannot be assigned to a census tract because they are like, in the ocean. For those, we would not include them.

This seems okay to me.

Note that this matches what I see for fuds projects here: https://ags03.sec.usace.army.mil/portal/apps/webappviewer/index.html?id=5a541ac5c0064c01a685a72f16854fbf

emma-nechamkin avatar Aug 31 '22 16:08 emma-nechamkin

Last major comment to sum up and add details from remaining testing and checking.

  1. Checking that the data we have matches the source coverage

We skipped checking the following for the reasons specified

For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data

Additional work is needed for these

  1. Additional follow up on Hawaii after comments/testing approximately 9 days ago

  • We saw that percentages of missing tracts for the various data sources Sindu's analysis flagged seemed ok, not more than what was missing from the census
  • In checking mismatch between national_risk_index__count_tracts (visual estimate - not more than a dozen tracts) vs. hazards.fema.gov/nri/map, but it makes sense that most tracts are able to incorporate the subset of indicators from NRI - population loss rate, building value loss rate, and agricultural value loss rate are well represented in the HI data

katherinedm-usds avatar Sep 07 '22 06:09 katherinedm-usds

See also https://github.com/usds/justice40-tool/pull/1887/files for the refactored datasets now getting their state/territory FIPS codes checked as appropriate!

lucasmbrown-usds avatar Sep 09 '22 12:09 lucasmbrown-usds