Add DQ scoring to FBS
Major changes:
- Data quality scoring implemented for FBS
- New functions
  - adjust_dqi_reliability_collection_scores() modifies data reliability and data collection scores based on source and target sector levels
  - assign_temporal_correlation() assigns temporal DQ scores based on the difference between the year of the data and the target year of the FBS
  - assign_geographical_correlation() assigns DQ scores for geoscale based on the data geoscale vs the target FBS geoscale
  - assign_technological_correlation() assigns DQ scores based on the difference between source and target sectors
- Modified how data are merged on location so we can correctly merge state with county data
- Modified how activities are mapped to sectors
- Changed how activities are mapped to properly account for data quality scores
  - Technological scores
  - Modify data reliability and data collection scores after mapping
- First map to the sector year identified in the data crosswalk, then convert to the target sector year. Previously, we converted the crosswalk to the target sector year immediately
- Modified NAICS year conversion method
  - Pull all NAICS6 and determine mapping changes for child NAICS to parent NAICS in generate_naics_crosswalk_conversion_ratios()
  - For example, if we are converting NAICS4 across years, we identify all child NAICS6 and determine how those NAICS6 map between years. If there are 4 child NAICS6 and one child NAICS6 maps to a different parent NAICS4 in the target year, then ¼ of the original NAICS4 parent value is mapped to a different NAICS4 in the target year
    - Conversion is not based on numeric values within the FBS because we might only have NAICS4 values, not NAICS6, and therefore do not have the data to create proportional conversions
  - We previously mapped all activities to NAICS6+, then converted, then aggregated. This is not a good method for a multitude of reasons, and it is especially problematic when assigning DQ scores
- New subset_sector_key()
  - Subsets the sector key to return the industry that most closely maps activity/source sectors to target sectors
  - Drops parent sectors within the crosswalk, assigns tech corr scoring, and modifies datareliability and datacollection scores based on mapping
- Modified how NAICS are converted to target NAICS years
  - Had a data check that tested whether a sector-like activity was found in any NAICS year outside of the target year and, if so, mapped it to the target year. This did not always map correctly because a sector could be found in multiple NAICS years, and the NAICS years map differently to the target year
  - Revised this function to check for the closest NAICS year to the target year and use that year to map to target NAICS
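
The two NAICS-year steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the flowsa implementation: the concordance rows are made-up codes, `conversion_ratios()` and `closest_naics_year()` are hypothetical helper names, each concordance row is given equal weight, and ties between equally distant NAICS years go to the earlier year.

```python
from collections import Counter, defaultdict

# Hypothetical NAICS6 concordance between a source and a target NAICS year.
# The codes are invented for illustration; they are not real NAICS sectors.
NAICS6_CONCORDANCE = [
    ("111110", "111110"),
    ("111120", "111120"),
    ("111130", "111130"),
    ("111140", "112940"),  # this child moves to a different parent in the target year
]


def conversion_ratios(concordance, level):
    """Return {source_parent: {target_parent: ratio}} at the given sector length.

    Every concordance row gets equal weight (an assumption), so the ratio is
    the share of a parent's child rows that land under each target-year parent.
    """
    counts = defaultdict(Counter)
    for source6, target6 in concordance:
        counts[source6[:level]][target6[:level]] += 1
    return {
        parent: {t: n / sum(targets.values()) for t, n in targets.items()}
        for parent, targets in counts.items()
    }


def closest_naics_year(candidate_years, target_year):
    """Pick the NAICS year nearest the target year (ties go to the earlier year)."""
    return min(candidate_years, key=lambda y: (abs(y - target_year), y))


# Three of four children stay under parent 1111, one moves to parent 1129.
ratios = conversion_ratios(NAICS6_CONCORDANCE, level=4)
```

Because the ratios come from counts of child sectors rather than from flow amounts, they can be computed even when the FBS only carries NAICS4 values.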
Minor changes:
- Correct error in attribute_flows_to_sectors()
- Original group_total assignment was based on original df FlowAmount values, but we reset the index, so needed to base group_total on new index of the df
- Adds FIPS scale (1,3,5) to FIPS_Crosswalk
- Add NAICS 2002, 2007, 2022 crosswalks
- Expand NAICS_Crosswalk_TimeSeries to include NAICS 2022
- New NAICS_Year_Concordance which maps published 6-digit sectors across years
- New Sector_Levels csv which labels sector level and sector length for all sectors
- In source_catalog.yaml
- Correct BLS_QCEW NAICS years for 2011, 2022, and 2023
- BLS QCEW estimate_suppressed_qcew()
- Update the function to only estimate suppressed data up to the max sector level. We no longer estimate suppressed 6-digit sectors when our target is 3-digit
- Data Quality scores
- Update GHGI scores
- Consistent fips scale assignments. National = 5, state = 2, county = 1
- URL updates to government FBA links
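
The estimate_suppressed_qcew() change listed above (only estimating suppressed data up to the max sector level) could look something like the filter below. This is a hedged sketch, not the actual function: the record layout, the `sectors_to_estimate()` helper, and the sample sectors are all assumptions made for illustration.

```python
# Hypothetical QCEW-style records as (sector, value); None marks a suppressed value.
records = [
    ("11", 100.0),
    ("111", None),
    ("1111", None),
    ("111110", None),
]


def sectors_to_estimate(records, max_sector_length):
    """Return suppressed sectors that are no more detailed than max_sector_length."""
    return [
        sector
        for sector, value in records
        if value is None and len(sector) <= max_sector_length
    ]


# With a 3-digit target, the 4- and 6-digit suppressed sectors are skipped.
to_estimate = sectors_to_estimate(records, max_sector_length=3)
```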
FBA changes
- BLS_QCEW: expand to include 2000 – 2023, add county FBS, some changes to target_naics_year to match those of the FBA
I reviewed the FBS generation in the action at 59d24a9. For the CRHW national FBS, the facilities that come in as 5-digit NAICS instead of 6-digit are getting dropped. I think this happens only when there is a single 6-digit child for that 5-digit sector.
Also seeing that the 5 digit NAICS with multiple children are not being handled correctly:
Old (correct): (21222 split evenly between 212221 and 212222)
New (incorrect): All of 21222 is assigned to 212222
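
For reference, the expected (old) behavior for a parent sector can be sketched as an even split across its children. This is only an illustration of the behavior described in this comment, not the flowsa code: `split_evenly()` is a hypothetical helper, the amounts are made up, and the single-child codes are invented.

```python
def split_evenly(parent, amount, children):
    """Divide a parent sector's amount equally among its child sectors."""
    if not children:
        raise ValueError(f"no child sectors found for {parent}")
    share = amount / len(children)
    return {child: share for child in children}


# Sector codes from the comment above; the amount is a made-up value.
allocation = split_evenly("21222", 10.0, ["212221", "212222"])

# A parent with a single child (invented codes) should keep its full amount
# rather than being dropped.
single = split_evenly("99999", 10.0, ["999990"])
```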
> for the CRHW national FBS, the facilities that come in as 5 digit NAICS instead of 6 are getting dropped. I think this is only when there is a single 6 digit child for that 5 digit.
This was resolved by c09f6f3
In the revised map_to_sectors() under proportional attribution, the grouped df that enters indicates the group_id, which is later used during proportional attribution as the groupby_col. Somewhere in map_to_sectors() this value is getting reset.
I believe that ebe8ae6 addresses this, though need to confirm it doesn't impact other methods negatively. I was reviewing this in the context of GHG_national, which was showing major diffs (and duplicate values). It now looks correct and shows no change from remote.
We decided to drop the county employment FBS (or perhaps all but one example), as well as the interim national and state employment FBS files (like 2000-2012), right?
Using collapse_FlowBySector() is causing DQI info to be dropped
merging with develop to consolidate changes for v2.1 release @bl-young - moving documentation to new PR #455