Overview

The zip file contains all of the data for which tracts we push from end-to-end on our ETL (so only what makes it "out" alive). We should verify that this is all of the data we could conceivably have manually and by data source.

What does it mean to verify a data source? Here, our goal is to make sure that the data we see in the backend is comprehensive relative to what data we "should" have (see the Wisconsin error). We are not looking in code nor are we trying to "fix" anything -- this is purely a diagnostic exercise.

Files to look at

This issue has two comments below, each of which contain spreadsheets with information. In the ZIP file, there are Excel sheets per data source.

Each data source excel sheet has five columns:

fips_code: two digit census code
<DATASETNAME>__count_tracts: how many census tracts the dataset has within the fips code
state_name: name of the state represented by the fips code
state_abbreviation: abbreviation of the state represented by the fips code
region: the region of the state (census defines this)
division: the division of the state (census defines this)

Example of how to read it: in the sheet hud_housing__fips_check.xlsx, the first row tells me that the hud housing datasource includes 1181 tracts in Alabama.

There is also the `whole_list.xlsx` file, which allows us to compare across data sources. This file has a slightly more confusing structure.

each column has the count of tracts within a FIPS code
each row represents a data source's results

Example of how to read it: the first row tells me that census_acs_median_income_2019__count_tracts has 1178 tracts in fips code 01. This means that the data source census acs median income includes 1178 tracts in fips code 01.

What to do per data source

[ ] Check that the data that we have matches the source data in terms of coverage*: Start in the DATASOURCE__fips_check.csv file. For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data. The source data is either on S3 or publicly-available. (This checks for errors in the original data we have, a la the WI error)
[ ] Check that number of tracts included by FIPS code makes sense: Open the whole_list.xlsx file. The ground truth for all but Island Areas is the census_acs_2010__count_tracts. The ground truth for the Island Areas is census_dicennial_2010__count_tracts. Compare the number of tracts per FIPS code and note down any anomalies.

Template to fill out per data source

DATASOURCE NAME

High-level checks

[ ] Our data matches the source data in terms of state-level coverage
[ ] We have somewhat complete data for all 50 states
[ ] We have somewhat complete data for Puerto Rico
[ ] We have somewhat complete data for the Island Areas

More information

What aberrations did you see?
Is there anything else of note?

Aug 18 '22 20:08 emma-nechamkin

fips_check.zip this has all of the tracts we have in the pipeline

Aug 18 '22 21:08 emma-nechamkin

This has the "overall list" so you can see relatively WHERE data is coming from. whole_list.xlsx

Aug 19 '22 14:08 emma-nechamkin

I think @katherinedm-usds and Sindu could do this!

Aug 19 '22 21:08 emma-nechamkin

@emma-nechamkin - love the details here TY!

Aug 19 '22 22:08 vim-usds

@katherinedm-usds - thought I would update here so that we can keep all info on this ticket in one place! I'm writing as I'm discovering so please excuse all the things you may know already!

Catching up

Looking at Emma's instructions:

Check that the data that we have matches the source data in terms of coverage*: Start in the DATASOURCE__fips_check.csv file. For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data. The source data is either on S3 or publicly-available. (This checks for errors in the original data we have, a la the WI error)

When I download fips_check I get a list of xlsx files, per data source:

Screen Shot 2022-08-22 at 8 25 10 PM

Opening the first one census_acs_2010_fips_check, leads to the following FIPS having < 10 tracts:

60 (AS)
66 (GU)
69 (MP)
78 (VI)

(later realizing that those are the ones we skip in the case of census_acs_2010, so maybe this purposeful skipping changes the testing strategy? - not sure)

OK back to Emma's instructions:

For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data. The source data is either on S3 or publicly-available.

For each DATASOURCE_fips_check file shown above, below, is a link to the source data (S3 or publicly available) for each data source, followed by the FIPS codes that have < 10 tracts.

[DATASOURCE_fips_check] : [FIPS codes < 10 tracts]

Still working on these: census_acs_2010__fips_check census_acs_median_income_2019__fips_check census_decennial_2010__fips_check census_acs_2019__fips_check

I think these are obtained by some API call? I was trying to read the code and that's what I think it's doing. Not totally sure tho. If they are API calls, I bet there's a way to create static postman links to each? @esfoobar-usds

Let me know if that is good place to start!

EXTRA (how to figure out the source data URLs from the code)

Let's find out more about these source data. We can go to the sources folder (in the BE release branch, not main) here: https://github.com/usds/justice40-tool/tree/4bf7773797d2fdf28e01d4f466b15812c5578c4b/data/data-pipeline/data_pipeline/etl/sources

Here we can see a folder for each data source. When I click into the first one calenviroscreen, I see an option to click into the etl.py. This file will have a class in it that will tell us if the data source is coming from S3 or if it's from a public URL.

If we look within the class definition, we see that this one is defined by an AWS_JUSTICE40_DATASOURCES_URL, so we know this data source's source data can be found on S3 and we also know that it's named CalEnviroScreen_4.0_2021.zip.

The exact code is here:

class CalEnviroScreenETL(ExtractTransformLoad):
    def __init__(self):
        self.CALENVIROSCREEN_FTP_URL = (
            settings.AWS_JUSTICE40_DATASOURCES_URL
            + "/CalEnviroScreen_4.0_2021.zip"
        )

Aug 23 '22 05:08 vim-usds

@katherinedm-usds - to answer your question from stand up today, the base URL for S3 is

https://justice40-data.s3.amazonaws.com/data-sources/

So placing the source data file after this URL will allow you to download it:

https://justice40-data.s3.amazonaws.com/data-sources/CalEnviroScreen_4.0_2021.zip

Let me know if you have any other questions. I'd love to level 1 support and Emma/Jorge can be level 2! Thx!

Aug 23 '22 15:08 vim-usds

FYI @vim-usds: the census data is not requested directly via an API call, but through a Python library called censusdata which handles the URL endpoints and authentication necessary to retrieve the data.

Aug 23 '22 18:08 esfoobar-usds

Test update, specifically looking at whole_list and within each FIPS comparing for each dataset the number of tracts with data available vs. the highest number of tracts that data should be available for, in order to understand if there's unexpected and problematic gaps in the data:

I started to go through whole_list FIPS-by-FIPS and note which datasets seemed to have relatively large numbers of tracts missing

FIPS 01 Alabama 1101 tracts for cdc_life_expectancy__count_tracts vs. 1181 (6.8%)
FIPS 02 Alaska 137 tracts for epa_rsei__count_tracts vs. 167 (18% of tracts)
FIPS 02 Alaska missing nlcd_nature_deprived__count_tracts data
FIPS 02 Alaska missing fsf_wildfire_risk__count_tracts data
FIPS 04 Arizona 1477 tracts for epa_rsei__count_tracts vs. 1526 (3.2% of tracts)
FIPS 04 Arizona 1299 tracts for cdc_life_expectancy__count_tracts vs. 1526 (14.9% of tracts)
FIPS 05 Arkansas data seems good
FIPS 06 California 7516 tracts for cdc_life_expectancy__count_tracts vs. 8057 (6.7% of tracts)
FIPS 08 Colorado 1206 tracts for epa_rsei__count_tracts vs. 1249
FIPS 09 Connecticut 783 tracts for cdc_life_expectancy__count_tracts vs. 833

Looked into cdc_life_expectancy__count_tracts which across multiple states often has a relatively large number of missing tracts

Based on USALEEP limitations https://usaleep.nptoolkit.org/wp-content/uploads/sites/19/2018/05/USALEEP-Data-Uses-and-Limitations_FINAL.pdf:

If no life expectancy estimate was included for your census tract, that was because the number of deaths and/or the estimated size of the population in your census tract was too small to calculate an accurate life table and estimate the life expectancy with a reasonable degree of confidence. The life expectancy estimate from a neighboring census tract may be useful as a substitute, especially if the two census tracts are similar in terms of the ages, race/ethnicity, and other characteristics (such as education and income) of the residents. Another option is to use county level life expectancy estimates, although these estimates will mask disparities among census tracts within the county.

Next steps - @emma-nechamkin either pull for me a small sample of tract IDs where there’s no data to confirm the above is why they are missing data, or look this up in raw data cdc_life_expectancy__fips_check. US_A 2 does not show tract IDs where there is no life expectancy age set, it just shows there are 65,535 tracts with life expectancy set

Looked into epa_rsei__count_tracts which is also missing tracts https://www.epa.gov/system/files/documents/2022-06/RSEI%20Methodology%20V2.3.10.pdf

TRI reporting requirements require the reporting facility to indicate the waste management activity used at the off-site facility. If this information is not reported (despite the requirement), the off-site transfer is not evaluated in the RSEI algorithm, but is flagged as a missing value and assigned a zero quantity.

But their methodology has multiple ways to fill in missing weights, it doesn’t seem like we should be missing any tracts
Next step for @emma-nechamkin to either pull a few tract IDs with missing data for me to look at or she checks source data epa_rsei__fips_check

After this I plan to go through the rest of FIPS to see what additional dataset jump out, and also compare datasets with FIPS that have <10 tracts to source which Vim helpfully provided links o

Aug 24 '22 16:08 katherinedm-usds

@katherinedm-usds - I am able to do some very super basic data manipulation using Python (specifically Jupyter notebooks). If I can help here please let me know. if I can't answer it, may we can push it up to @emma-nechamkin?

Aug 24 '22 16:08 vim-usds

I think you can stay on your other tasks @vim-usds bc it should be pretty quick for Emma to pull for me a few sample tract IDss to check - thank you for offering!

Aug 24 '22 17:08 katherinedm-usds

Here are 10 tracts for CDC life expectancy that are nulls in our data and are in the FIPS codes above:

['04013092723',
 '06037186301',
 '06037221820',
 '06029003803',
 '09001045300',
 '06037433901',
 '06053980000',
 '04013111501',
 '06017030901',
 '04003001701']

For the RSEI data, we aren't using it in the actual score so I think we can ignore for now (or file away in our brains that yes, we have missing data, but it's not super urgent to identify exactly what that data is? wdyt?)

Aug 24 '22 20:08 emma-nechamkin

for these:

FIPS 02 Alaska missing nlcd_nature_deprived__count_tracts data FIPS 02 Alaska missing fsf_wildfire_risk__count_tracts data

I believe this is correct! We know the NLCD data is missing in AK and HI (I believe!) and I think the wildfire also uses CONUS like NCLD (conus = contiguous us states).

Please lmk if this is what you need @katherinedm-usds, you are very much my hero!

Aug 24 '22 20:08 emma-nechamkin

Test update, using attached spreadsheet which is an updated version of whole_list; second column below each FIPS code is the percentage of missing tracts in the respective dataset of that state and highlighted in green are those datasets/states with >3% missing tract data.

To add onto Katherine's running list of flagged FIPS/datasets (and excluding notes on CDC Life Expectancy dataset for now)--

FIPS 15 Hawaii: 8.55% missing tracts for census_acs_median_income_2019__count_tracts
FIPS 15 Hawaii: 8.55% missing tracts for fsf_flood_risk__count_tracts,
FIPS 15 Hawaii 5.70% missing tracts for travel_composite__count_tractss
FIPS 15 Hawaii: 9.97% missing tracts for cdc_places__count_tracts
FIPS 15 Hawaii: 11.11% missing tracts for doe_energy_burden__count_tract
FIPS 15 Hawaii: 8.55% missing tracts for geocorr__count_tracts
FIPS 15 Hawaii: 10.54 missing tracts for census_acs_2010__count_tracts
FIPS 15 Hawaii: 7.12% missing tracts for housing_and_transportation_index__count_tracts
FIPS 15 Hawaii: 7.12% missing tracts for national_risk_index__count_tracts
FIPS 21 Kentucky: 5.11% missing tracts for fsf_flood_risk__count_tracts and fsf_wildfire_risk__count_tracts
FIPS 22 Louisiana: 3.14% missing tracts for fsf_flood_risk__count_tracts and fsf_wildfire_risk__count_tracts
FIPS 21 Maine: 8.94% missing tracts for fsf_flood_risk__count_tracts and fsf_wildfire_risk__count_tracts
FIPS 30 Montana: 16.61% missing tracts for epa_rsei__count_tracts
FIPS 35 New Mexico: 7.41% missing tracts for epa_rsei__count_tracts
FIPS 38 North Dakota: 7.32% missing tracts for epa_rsei__count_tracts
FIPS 46 South Dakota: 10.81% missing tracts for epa_rsei__count_tracts
FIPS 56 Wyoming: 6.82% missing tracts for epa_rsei__count_tracts

Holding off on territories for now!

CEJST Dataset Tracts per FIPS TESTING.xlsx

Aug 26 '22 17:08 sindu-ri

@sindu-ri this is fantastic, thank you for all these findings!

Aug 26 '22 19:08 katherinedm-usds

For PR: We will be missing much data, but do have workforce and EJScreen for sure. We should also null out all of LINGUISTIC ISOLATION Islands: ONLY workforce data, all from Census 2010 AK and HI: no FSF wildfire

Aug 26 '22 20:08 emma-nechamkin

Recapping notes and next steps coming out of Sindu's findings (thank you @sindu-ri!) and other discussion from Friday, to be continued this week

Hawaii

Katherine look at data source documentation for the datasets Sindu listed to see if it aligns with the % of tracts Sindu found missing in HI

8.55% missing tracts for census_acs_median_income_2019__count_tracts
8.55% missing tracts for fsf_flood_risk__count_tracts,
5.70% missing tracts for travel_composite__count_tractss
9.97% missing tracts for cdc_places__count_tracts
11.11% missing tracts for doe_energy_burden__count_tract
8.55% missing tracts for geocorr__count_tracts
10.54 missing tracts for census_acs_2010__count_tracts
7.12% missing tracts for housing_and_transportation_index__count_tracts
7.12% missing tracts for national_risk_index__count_tracts

check with @emma-nechamkin if we can get total list of missing tracts for the datasets that have the exact same number of missing tracts (i.e. same 8.55% missing tracts for median_income and energy_burden), or otherwise try to understand why these have the shared # missing tracts - are all the tracts on very sparsely populate islands, etc.

Kentucky, Louisiana, Maine with relatively large % missing tracts for fsf_flood_risk__count_tracts and fsf_flood_risk__count_tracts

Next steps, sample tract IDs

Puerto Rico

The datasets that make it through the staging pipeline are

census_acs_median_income_2019__count_tracts
epa_rsei__count_tracts
fsf_flood_risk__count_tracts
doe_energy_burden__count_tracts
us_army_fuds__count_tracts
census_acs_2010__count_tracts
ejscreen__count_tracts
census_acs_2019__count_tracts
hud_housing__count_tracts but we are not incorporating doe_energy_burden__count_tracts and hud_housing__count_tracts into the score, and as @emma-nechamkin notes in her comments above, we "do have workforce and EJScreen for sure. We should also null out all of LINGUISTIC ISOLATION". New separate bug ticket to be be opened.

Re 3 new datasets, eamlis__count_tracts, historic_redlining__count_tracts, us_army_fuds__count_tracts

I won't go further in checking, Matt wrote a suite of tests

Re doublechecking FIPS with < 10 tracts are not represented in source data

i.e. make sure AS 60, GU 66, MP 69, VI 78 are not in https://justice40-data.s3.amazonaws.com/data-sources/DOE_LEAD_AMI_TRACT_2018_ALL.csv.zip
I'm still checking this as of 8/29; I'll need to use my new computer for this

Aug 29 '22 07:08 katherinedm-usds

@katherinedm-usds - for the FUDS data, checked number of occurrences of VT and VI in the geojson file:

Came up with this

there are 14 rows that are for VT
there are 11 rows that are for VI

In case we're interested in how this was done:

Screen Shot 2022-08-30 at 10 36 15 AM

Aug 30 '22 20:08 vim-usds

@katherinedm-usds - here is the results for geocorr:

Screen Shot 2022-08-30 at 2 10 38 PM

Aug 30 '22 21:08 vim-usds

this is great thanks @vim-usds ! looks like there's a discrepancy between the number of tracts for VI and VT in data source us_army_fuds__fips_check vs. our fips_check that we can check with Emma about after standup

Aug 31 '22 09:08 katherinedm-usds

okay so there are a few things going on here. The data in that geojson is not going to perfectly match the data that we see in usa.csv for a few reasons:

Not every FUDS is eligible for inclusion
We assign FUDS to their census tracts. Some FUDS cannot be assigned to a census tract because they are like, in the ocean. For those, we would not include them.

This seems okay to me.

Note that this matches what I see for fuds projects here: https://ags03.sec.usace.army.mil/portal/apps/webappviewer/index.html?id=5a541ac5c0064c01a685a72f16854fbf

Aug 31 '22 16:08 emma-nechamkin

Last major comment to sum up and add details from remaining testing and checking.

Checking that the data we have matches the source coverage

We skipped checking the following for the reasons specified

geocorr__fips_check - we don't need to check
housing_and_transportation_index__fips_check - we don't need to check
fsf_flood_risk__fips_check - it's point data
travel_composite__fips_check: it's shapefile data that @emma-nechamkin can check
us_army_fuds__fips_check: it's point data
epa_rsei__fips_check: no FIPS with <10 tracts, also Emma commented above that we're not using in score
eamlis__fips_check: point data

For every FIPS code with <10 tracts represented, verify that those states are not represented in the source data

doe_energy_burden__fips_check: checked; AS, GU, MP, VI not in data source
cdc_places__fips_check: checked; AS, GU, MP, VI not in data source
national_risk_index__fips_check: checked; AS, GU, MP, VI not in data source
fsf_wildfire_risk__fips_check: checked; AK, AS, GU, MP, VI, HI, PR are not there
cdc_life_expectancy__fips_check: checked; AS, GU, ME, MP, PR, VI, WI are not in datasource, however (ME) shouldn't have 0 tracts with data, ME_A [XLS – 18 KB] from https://www.cdc.gov/nchs/nvss/usaleep/usaleep.html#life-expectancy includes Maine data; UPDATE this has been added to https://github.com/usds/justice40-tool/issues/1831
ejscreen__fips_check: checked; AS, GU, MP, VI are not there

Additional work is needed for these

hud_housing__fips_check: it's a lot of separate files to check so I could use help from @vim-usds to check to make sure AS, GU, MP, VI are not there
historic_redlining__fips_check
nlcd_nature_deprived__fips_check: need to check again for presence of AS, GU, MP, VI, HI, PR as attempting to open usa_conus_nat_dep__compiled_by_TPL.csv 885MB source file kept crashing my computer

Additional follow up on Hawaii after comments/testing approximately 9 days ago

We saw that percentages of missing tracts for the various data sources Sindu's analysis flagged seemed ok, not more than what was missing from the census
In checking mismatch between national_risk_index__count_tracts (visual estimate - not more than a dozen tracts) vs. hazards.fema.gov/nri/map, but it makes sense that most tracts are able to incorporate the subset of indicators from NRI - population loss rate, building value loss rate, and agricultural value loss rate are well represented in the HI data

Sep 07 '22 06:09 katherinedm-usds

See also https://github.com/usds/justice40-tool/pull/1887/files for the refactored datasets now getting their state/territory FIPS codes checked as appropriate!

Sep 09 '22 12:09 lucasmbrown-usds

justice40-tool
justice40-tool copied to clipboard

Produce FIPS list for each dataset

Overview

Files to look at

Each data source excel sheet has five columns:

There is also the `whole_list.xlsx` file, which allows us to compare across data sources. This file has a slightly more confusing structure.

What to do per data source