justice40-tool
justice40-tool copied to clipboard
Produce FIPS list for each dataset
Overview
The zip file contains all of the data for which tracts we push from end-to-end on our ETL (so only what makes it "out" alive). We should verify that this is all of the data we could conceivably have manually and by data source.
What does it mean to verify a data source? Here, our goal is to make sure that the data we see in the backend is comprehensive relative to what data we "should" have (see the Wisconsin error). We are not looking in code nor are we trying to "fix" anything -- this is purely a diagnostic exercise.
Files to look at
This issue has two comments below, each of which contain spreadsheets with information. In the ZIP file, there are Excel sheets per data source.
Each data source excel sheet has five columns:
-
fips_code
: two digit census code -
<DATASETNAME>__count_tracts
: how many census tracts the dataset has within the fips code -
state_name
: name of the state represented by the fips code -
state_abbreviation
: abbreviation of the state represented by the fips code -
region
: the region of the state (census defines this) -
division
: the division of the state (census defines this)
Example of how to read it: in the sheet hud_housing__fips_check.xlsx
, the first row tells me that the hud housing datasource includes 1181 tracts in Alabama.
There is also the whole_list.xlsx
file, which allows us to compare across data sources. This file has a slightly more confusing structure.
- each column has the count of tracts within a FIPS code
- each row represents a data source's results
Example of how to read it: the first row tells me that census_acs_median_income_2019__count_tracts
has 1178 tracts in fips code 01. This means that the data source census acs median income
includes 1178 tracts in fips code 01.
What to do per data source
- [ ] Check that the data that we have matches the source data in terms of coverage*: Start in the
DATASOURCE__fips_check.csv
file. For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data. The source data is either on S3 or publicly-available. (This checks for errors in the original data we have, a la the WI error) - [ ] Check that number of tracts included by FIPS code makes sense: Open the
whole_list.xlsx
file. The ground truth for all but Island Areas is thecensus_acs_2010__count_tracts
. The ground truth for the Island Areas iscensus_dicennial_2010__count_tracts
. Compare the number of tracts per FIPS code and note down any anomalies.
Template to fill out per data source
DATASOURCE NAME
High-level checks
- [ ] Our data matches the source data in terms of state-level coverage
- [ ] We have somewhat complete data for all 50 states
- [ ] We have somewhat complete data for Puerto Rico
- [ ] We have somewhat complete data for the Island Areas
More information
- What aberrations did you see?
- Is there anything else of note?
fips_check.zip this has all of the tracts we have in the pipeline
This has the "overall list" so you can see relatively WHERE data is coming from. whole_list.xlsx
I think @katherinedm-usds and Sindu could do this!
@emma-nechamkin - love the details here TY!
@katherinedm-usds - thought I would update here so that we can keep all info on this ticket in one place! I'm writing as I'm discovering so please excuse all the things you may know already!
Catching up
Looking at Emma's instructions:
Check that the data that we have matches the source data in terms of coverage*: Start in the DATASOURCE__fips_check.csv file. For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data. The source data is either on S3 or publicly-available. (This checks for errors in the original data we have, a la the WI error)
When I download fips_check
I get a list of xlsx files, per data source:
Opening the first one census_acs_2010_fips_check
, leads to the following FIPS having < 10 tracts:
- 60 (AS)
- 66 (GU)
- 69 (MP)
- 78 (VI)
(later realizing that those are the ones we skip in the case of census_acs_2010, so maybe this purposeful skipping changes the testing strategy? - not sure)
OK back to Emma's instructions:
For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data. The source data is either on S3 or publicly-available.
For each DATASOURCE_fips_check file shown above, below, is a link to the source data (S3 or publicly available) for each data source, followed by the FIPS codes that have < 10 tracts.
[DATASOURCE_fips_check] : [FIPS codes < 10 tracts]
geocorr__fips_check: AS, GU, MP, VI doe_energy_burden__fips_check: AS, GU, MP, VI cdc_places__fips_check: AS, GU, MP, VI housing_and_transportation_index__fips_check: AS, GU, MP, VI national_risk_index__fips_check: AS, GU, MP, VI fsf_flood_risk__fips_check: AS, GU, MP, VI nlcd_nature_deprived__fips_check: AS, GU, MP, VI, HI, PR travel_composite__fips_check: AS, GU, MP, VI, PR hud_housing__fips_check: AS, GU, MP, VI historic_redlining__fips_check: many - check file us_army_fuds__fips_check: VI, VT epa_rsei__fips_check: none fsf_wildfire_risk__fips_check: AK, AS, GU, MP, VI, HI, PR eamlis__fips_check: check file - there are many cdc_life_expectancy__fips_check: AS, GU, ME, MP, PR, VI, WI ejscreen__fips_check: AS, GU, MP, VI
Still working on these: census_acs_2010__fips_check census_acs_median_income_2019__fips_check census_decennial_2010__fips_check census_acs_2019__fips_check
I think these are obtained by some API call? I was trying to read the code and that's what I think it's doing. Not totally sure tho. If they are API calls, I bet there's a way to create static postman links to each? @esfoobar-usds
Let me know if that is good place to start!
EXTRA (how to figure out the source data URLs from the code)
Let's find out more about these source data. We can go to the sources folder (in the BE release branch, not main) here: https://github.com/usds/justice40-tool/tree/4bf7773797d2fdf28e01d4f466b15812c5578c4b/data/data-pipeline/data_pipeline/etl/sources
Here we can see a folder for each data source. When I click into the first one calenviroscreen, I see an option to click into the etl.py. This file will have a class in it that will tell us if the data source is coming from S3 or if it's from a public URL.
If we look within the class definition, we see that this one is defined by an AWS_JUSTICE40_DATASOURCES_URL, so we know this data source's source data can be found on S3 and we also know that it's named CalEnviroScreen_4.0_2021.zip
.
The exact code is here:
class CalEnviroScreenETL(ExtractTransformLoad):
def __init__(self):
self.CALENVIROSCREEN_FTP_URL = (
settings.AWS_JUSTICE40_DATASOURCES_URL
+ "/CalEnviroScreen_4.0_2021.zip"
)
@katherinedm-usds - to answer your question from stand up today, the base URL for S3 is
https://justice40-data.s3.amazonaws.com/data-sources/
So placing the source data file after this URL will allow you to download it:
https://justice40-data.s3.amazonaws.com/data-sources/CalEnviroScreen_4.0_2021.zip
Let me know if you have any other questions. I'd love to level 1 support and Emma/Jorge can be level 2! Thx!
FYI @vim-usds: the census data is not requested directly via an API call, but through a Python library called censusdata which handles the URL endpoints and authentication necessary to retrieve the data.
Test update, specifically looking at whole_list and within each FIPS comparing for each dataset the number of tracts with data available vs. the highest number of tracts that data should be available for, in order to understand if there's unexpected and problematic gaps in the data:
- I started to go through whole_list FIPS-by-FIPS and note which datasets seemed to have relatively large numbers of tracts missing
- FIPS 01 Alabama 1101 tracts for cdc_life_expectancy__count_tracts vs. 1181 (6.8%)
- FIPS 02 Alaska 137 tracts for epa_rsei__count_tracts vs. 167 (18% of tracts)
- FIPS 02 Alaska missing nlcd_nature_deprived__count_tracts data
- FIPS 02 Alaska missing fsf_wildfire_risk__count_tracts data
- FIPS 04 Arizona 1477 tracts for epa_rsei__count_tracts vs. 1526 (3.2% of tracts)
- FIPS 04 Arizona 1299 tracts for cdc_life_expectancy__count_tracts vs. 1526 (14.9% of tracts)
- FIPS 05 Arkansas data seems good
- FIPS 06 California 7516 tracts for cdc_life_expectancy__count_tracts vs. 8057 (6.7% of tracts)
- FIPS 08 Colorado 1206 tracts for epa_rsei__count_tracts vs. 1249
- FIPS 09 Connecticut 783 tracts for cdc_life_expectancy__count_tracts vs. 833
- Looked into cdc_life_expectancy__count_tracts which across multiple states often has a relatively large number of missing tracts
- Based on USALEEP limitations https://usaleep.nptoolkit.org/wp-content/uploads/sites/19/2018/05/USALEEP-Data-Uses-and-Limitations_FINAL.pdf:
If no life expectancy estimate was included for your census tract, that was because the number of deaths and/or the estimated size of the population in your census tract was too small to calculate an accurate life table and estimate the life expectancy with a reasonable degree of confidence. The life expectancy estimate from a neighboring census tract may be useful as a substitute, especially if the two census tracts are similar in terms of the ages, race/ethnicity, and other characteristics (such as education and income) of the residents. Another option is to use county level life expectancy estimates, although these estimates will mask disparities among census tracts within the county.
- Next steps - @emma-nechamkin either pull for me a small sample of tract IDs where there’s no data to confirm the above is why they are missing data, or look this up in raw data cdc_life_expectancy__fips_check. US_A 2 does not show tract IDs where there is no life expectancy age set, it just shows there are 65,535 tracts with life expectancy set
- Looked into epa_rsei__count_tracts which is also missing tracts https://www.epa.gov/system/files/documents/2022-06/RSEI%20Methodology%20V2.3.10.pdf
TRI reporting requirements require the reporting facility to indicate the waste management activity used at the off-site facility. If this information is not reported (despite the requirement), the off-site transfer is not evaluated in the RSEI algorithm, but is flagged as a missing value and assigned a zero quantity.
- But their methodology has multiple ways to fill in missing weights, it doesn’t seem like we should be missing any tracts
- Next step for @emma-nechamkin to either pull a few tract IDs with missing data for me to look at or she checks source data epa_rsei__fips_check
After this I plan to go through the rest of FIPS to see what additional dataset jump out, and also compare datasets with FIPS that have <10 tracts to source which Vim helpfully provided links o
@katherinedm-usds - I am able to do some very super basic data manipulation using Python (specifically Jupyter notebooks). If I can help here please let me know. if I can't answer it, may we can push it up to @emma-nechamkin?
I think you can stay on your other tasks @vim-usds bc it should be pretty quick for Emma to pull for me a few sample tract IDss to check - thank you for offering!
Here are 10 tracts for CDC life expectancy that are nulls in our data and are in the FIPS codes above:
['04013092723',
'06037186301',
'06037221820',
'06029003803',
'09001045300',
'06037433901',
'06053980000',
'04013111501',
'06017030901',
'04003001701']
For the RSEI data, we aren't using it in the actual score so I think we can ignore for now (or file away in our brains that yes, we have missing data, but it's not super urgent to identify exactly what that data is? wdyt?)
for these:
FIPS 02 Alaska missing nlcd_nature_deprived__count_tracts data FIPS 02 Alaska missing fsf_wildfire_risk__count_tracts data
I believe this is correct! We know the NLCD data is missing in AK and HI (I believe!) and I think the wildfire also uses CONUS like NCLD (conus = contiguous us states).
Please lmk if this is what you need @katherinedm-usds, you are very much my hero!
Test update, using attached spreadsheet which is an updated version of whole_list; second column below each FIPS code is the percentage of missing tracts in the respective dataset of that state and highlighted in green are those datasets/states with >3% missing tract data.
To add onto Katherine's running list of flagged FIPS/datasets (and excluding notes on CDC Life Expectancy dataset for now)--
- FIPS 15 Hawaii: 8.55% missing tracts for census_acs_median_income_2019__count_tracts
- FIPS 15 Hawaii: 8.55% missing tracts for fsf_flood_risk__count_tracts,
- FIPS 15 Hawaii 5.70% missing tracts for travel_composite__count_tractss
- FIPS 15 Hawaii: 9.97% missing tracts for cdc_places__count_tracts
- FIPS 15 Hawaii: 11.11% missing tracts for doe_energy_burden__count_tract
- FIPS 15 Hawaii: 8.55% missing tracts for geocorr__count_tracts
- FIPS 15 Hawaii: 10.54 missing tracts for census_acs_2010__count_tracts
- FIPS 15 Hawaii: 7.12% missing tracts for housing_and_transportation_index__count_tracts
- FIPS 15 Hawaii: 7.12% missing tracts for national_risk_index__count_tracts
- FIPS 21 Kentucky: 5.11% missing tracts for fsf_flood_risk__count_tracts and fsf_wildfire_risk__count_tracts
- FIPS 22 Louisiana: 3.14% missing tracts for fsf_flood_risk__count_tracts and fsf_wildfire_risk__count_tracts
- FIPS 21 Maine: 8.94% missing tracts for fsf_flood_risk__count_tracts and fsf_wildfire_risk__count_tracts
- FIPS 30 Montana: 16.61% missing tracts for epa_rsei__count_tracts
- FIPS 35 New Mexico: 7.41% missing tracts for epa_rsei__count_tracts
- FIPS 38 North Dakota: 7.32% missing tracts for epa_rsei__count_tracts
- FIPS 46 South Dakota: 10.81% missing tracts for epa_rsei__count_tracts
- FIPS 56 Wyoming: 6.82% missing tracts for epa_rsei__count_tracts
Holding off on territories for now!
@sindu-ri this is fantastic, thank you for all these findings!
For PR: We will be missing much data, but do have workforce and EJScreen for sure. We should also null out all of LINGUISTIC ISOLATION Islands: ONLY workforce data, all from Census 2010 AK and HI: no FSF wildfire
Recapping notes and next steps coming out of Sindu's findings (thank you @sindu-ri!) and other discussion from Friday, to be continued this week
Hawaii
- Katherine look at data source documentation for the datasets Sindu listed to see if it aligns with the % of tracts Sindu found missing in HI
- 8.55% missing tracts for census_acs_median_income_2019__count_tracts
- 8.55% missing tracts for fsf_flood_risk__count_tracts,
- 5.70% missing tracts for travel_composite__count_tractss
- 9.97% missing tracts for cdc_places__count_tracts
- 11.11% missing tracts for doe_energy_burden__count_tract
- 8.55% missing tracts for geocorr__count_tracts
- 10.54 missing tracts for census_acs_2010__count_tracts
- 7.12% missing tracts for housing_and_transportation_index__count_tracts
- 7.12% missing tracts for national_risk_index__count_tracts
- check with @emma-nechamkin if we can get total list of missing tracts for the datasets that have the exact same number of missing tracts (i.e. same 8.55% missing tracts for median_income and energy_burden), or otherwise try to understand why these have the shared # missing tracts - are all the tracts on very sparsely populate islands, etc.
Kentucky, Louisiana, Maine with relatively large % missing tracts for fsf_flood_risk__count_tracts and fsf_flood_risk__count_tracts
- Next steps, sample tract IDs
Puerto Rico
The datasets that make it through the staging pipeline are
- census_acs_median_income_2019__count_tracts
- epa_rsei__count_tracts
- fsf_flood_risk__count_tracts
- doe_energy_burden__count_tracts
- us_army_fuds__count_tracts
- census_acs_2010__count_tracts
- ejscreen__count_tracts
- census_acs_2019__count_tracts
- hud_housing__count_tracts but we are not incorporating doe_energy_burden__count_tracts and hud_housing__count_tracts into the score, and as @emma-nechamkin notes in her comments above, we "do have workforce and EJScreen for sure. We should also null out all of LINGUISTIC ISOLATION". New separate bug ticket to be be opened.
Re 3 new datasets, eamlis__count_tracts, historic_redlining__count_tracts, us_army_fuds__count_tracts
- I won't go further in checking, Matt wrote a suite of tests
Re doublechecking FIPS with < 10 tracts are not represented in source data
- i.e. make sure AS 60, GU 66, MP 69, VI 78 are not in https://justice40-data.s3.amazonaws.com/data-sources/DOE_LEAD_AMI_TRACT_2018_ALL.csv.zip
- I'm still checking this as of 8/29; I'll need to use my new computer for this
@katherinedm-usds - for the FUDS data, checked number of occurrences of VT and VI in the geojson file:
Came up with this
- there are 14 rows that are for VT
- there are 11 rows that are for VI
In case we're interested in how this was done:
@katherinedm-usds - here is the results for geocorr:
this is great thanks @vim-usds ! looks like there's a discrepancy between the number of tracts for VI and VT in data source us_army_fuds__fips_check vs. our fips_check that we can check with Emma about after standup
okay so there are a few things going on here. The data in that geojson is not going to perfectly match the data that we see in usa.csv for a few reasons:
- Not every FUDS is eligible for inclusion
- We assign FUDS to their census tracts. Some FUDS cannot be assigned to a census tract because they are like, in the ocean. For those, we would not include them.
This seems okay to me.
Note that this matches what I see for fuds projects here: https://ags03.sec.usace.army.mil/portal/apps/webappviewer/index.html?id=5a541ac5c0064c01a685a72f16854fbf
Last major comment to sum up and add details from remaining testing and checking.
-
Checking that the data we have matches the source coverage
We skipped checking the following for the reasons specified
- geocorr__fips_check - we don't need to check
- housing_and_transportation_index__fips_check - we don't need to check
- fsf_flood_risk__fips_check - it's point data
- travel_composite__fips_check: it's shapefile data that @emma-nechamkin can check
- us_army_fuds__fips_check: it's point data
- epa_rsei__fips_check: no FIPS with <10 tracts, also Emma commented above that we're not using in score
- eamlis__fips_check: point data
For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data
- doe_energy_burden__fips_check: checked; AS, GU, MP, VI not in data source
- cdc_places__fips_check: checked; AS, GU, MP, VI not in data source
- national_risk_index__fips_check: checked; AS, GU, MP, VI not in data source
- fsf_wildfire_risk__fips_check: checked; AK, AS, GU, MP, VI, HI, PR are not there
- cdc_life_expectancy__fips_check: checked; AS, GU, ME, MP, PR, VI, WI are not in datasource, however (ME) shouldn't have 0 tracts with data, ME_A [XLS – 18 KB] from https://www.cdc.gov/nchs/nvss/usaleep/usaleep.html#life-expectancy includes Maine data; UPDATE this has been added to https://github.com/usds/justice40-tool/issues/1831
- ejscreen__fips_check: checked; AS, GU, MP, VI are not there
Additional work is needed for these
- hud_housing__fips_check: it's a lot of separate files to check so I could use help from @vim-usds to check to make sure AS, GU, MP, VI are not there
- historic_redlining__fips_check
- nlcd_nature_deprived__fips_check: need to check again for presence of AS, GU, MP, VI, HI, PR as attempting to open usa_conus_nat_dep__compiled_by_TPL.csv 885MB source file kept crashing my computer
-
Additional follow up on Hawaii after comments/testing approximately 9 days ago
- We saw that percentages of missing tracts for the various data sources Sindu's analysis flagged seemed ok, not more than what was missing from the census
- In checking mismatch between national_risk_index__count_tracts (visual estimate - not more than a dozen tracts) vs. hazards.fema.gov/nri/map, but it makes sense that most tracts are able to incorporate the subset of indicators from NRI - population loss rate, building value loss rate, and agricultural value loss rate are well represented in the HI data
See also https://github.com/usds/justice40-tool/pull/1887/files for the refactored datasets now getting their state/territory FIPS codes checked as appropriate!