Add DQ scoring to FBS

Open catherinebirney opened this issue 10 months ago • 4 comments

Major changes:

  • Data quality scoring implemented for FBS
    • New adjust_dqi_reliability_collection_scores() to modify data reliability and data collection based on source and target sector levels
    • assign_temporal_correlation() assigns temporal DQ based on difference between year of data and target year of FBS
    • assign_geographical_correlation() assigns DQ for geoscale based on data geoscale vs target FBS geoscale
    • assign_technological_correlation() assigns DQ scores based on difference between source and target sectors
  • Modified how data are merged on location so we can correctly merge state with county data
  • Modified how activities are mapped to sectors
    • Changed how activities are mapped to properly account for data quality scores
      • Technological scores
      • Modify data reliability and data collection scores after mapping
    • First map to the sector year identified in the data crosswalk, then later convert to the target sector year. Previously, we immediately converted the crosswalk to the target sector year
    • Modified NAICS year conversion method
      • Pull all NAICS6 and determine mapping changes for child NAICS to parent NAICS in generate_naics_crosswalk_conversion_ratios()
      • For example, if we are converting NAICS4 across years, we identify all child NAICS6 and determine how those NAICS6 map between years. If there are 4 child NAICS6 and one child NAICS6 maps to a different parent NAICS4 in the target year, then ¼ of the original NAICS4 parent value is mapped to a different NAICS4 in the target year
      • Conversion is not based on numeric values within the FBS because we might only have NAICS4 values, not NAICS6, and therefore do not have the data to create proportional conversions
      • We previously mapped all activities to NAICS6+, then converted, then aggregated. This is not a good method for a multitude of reasons, and it is especially problematic when assigning DQ scores
    • New subset_sector_key()
      • Subsets the sector key to return the industry that most closely maps activity/source sectors to target sectors
      • Drops parent sectors within the crosswalk, assigns technological correlation scores, and modifies data reliability and data collection scores based on the mapping
  • Modified how naics are converted to target naics years
    • Previously, a data check tested whether a sector-like activity was found in any NAICS year outside of the target year and, if so, mapped it to the target year. This did not always map correctly because a sector could be found in multiple NAICS years, and those NAICS years map differently to the target year
      • Revised this function to check for the closest NAICS year to the target year and use that year to map to the target NAICS
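
The NAICS year conversion described above (equal weighting of child NAICS6 rather than weighting by FBS values) can be sketched as follows. This is a minimal illustration and not the code in generate_naics_crosswalk_conversion_ratios(): the function name `conversion_ratios`, the input format, and the sector codes are hypothetical, and it assumes each source NAICS6 maps to exactly one target NAICS6.

```python
from collections import defaultdict


def conversion_ratios(parent_length, naics6_concordance):
    """Share of each source-year parent sector assigned to each
    target-year parent sector, weighting every child NAICS6 equally.

    naics6_concordance: iterable of (source_naics6, target_naics6) pairs
    (hypothetical input format, assumed one-to-one).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for source6, target6 in naics6_concordance:
        # Truncate both codes to the parent level, e.g. 4 digits for NAICS4
        counts[source6[:parent_length]][target6[:parent_length]] += 1

    ratios = {}
    for source_parent, targets in counts.items():
        total_children = sum(targets.values())
        ratios[source_parent] = {
            target_parent: n / total_children
            for target_parent, n in targets.items()
        }
    return ratios


# Made-up codes: four child NAICS6 under parent 9991, one of which
# moves to parent 9992 in the target year.
concordance = [
    ("999110", "999110"),
    ("999120", "999120"),
    ("999130", "999130"),
    ("999140", "999210"),
]
ratios = conversion_ratios(4, concordance)
print(ratios)
```

With these inputs, ¾ of the parent 9991 value stays in 9991 and ¼ moves to 9992, matching the example in the description. The ratios depend only on the concordance, so they can be computed even when the FBS holds no NAICS6 values.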

Minor changes:

  • Correct error in attribute_flows_to_sectors()
    • Original group_total assignment was based on original df FlowAmount values, but we reset the index, so needed to base group_total on new index of the df
  • Adds FIPS scale (1,3,5) to FIPS_Crosswalk
  • Add NAICS 2002, 2007, 2022 crosswalks
  • Expand NAICS_Crosswalk_TimeSeries to include NAICS 2022
  • New NAICS_Year_Concordance which maps published 6-digit sectors across years
  • New Sector_Levels csv which labels sector level and sector length for all sectors
  • In source_catalog.yaml
    • Correct BLS_QCEW NAICS years for 2011, 2022, and 2023
  • BLS QCEW estimate_suppressed_qcew()
    • Update the function to only estimate suppressed data up to the max sector level. We no longer estimate suppressed 6-digit sectors when our target is 3-digit
  • Data Quality scores
    • Update GHGI scores
  • Consistent fips scale assignments. National = 5, state = 2, county = 1
  • url updates to government FBA links

FBA changes

  • BLS_QCEW: expand to include 2000 – 2023, add county FBS, some changes to target_naics_year to match those of the FBA

catherinebirney avatar Feb 28 '25 21:02 catherinebirney

I reviewed the FBS generation in the action at 59d24a9. For the CRHW national FBS, the facilities that come in as 5 digit NAICS instead of 6 are getting dropped. I think this is only when there is a single 6 digit child for that 5 digit.

image

Also seeing that the 5 digit NAICS with multiple children are not being handled correctly:

image

Old (correct): (21222 split evenly between 212221 and 212222)

image

New (incorrect): All of 21222 is assigned to 212222

image
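
The expected ("old") behavior described above, where a 5 digit NAICS value is divided equally among its 6 digit children, can be sketched as below. This is only an illustration of the arithmetic: `split_evenly` is a hypothetical helper, not a flowsa function, and the flow amount of 100.0 is made up.

```python
def split_evenly(parent_value, child_sectors):
    """Divide a parent sector's value equally among its child sectors."""
    share = parent_value / len(child_sectors)
    return {child: share for child in child_sectors}


# 21222 with children 212221 and 212222, using an arbitrary flow amount
allocation = split_evenly(100.0, ["212221", "212222"])
print(allocation)
```

Under this sketch each child receives half of the parent value, whereas the "new" output shown above assigns the full amount to 212222.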

bl-young avatar May 09 '25 13:05 bl-young

for the CRHW national FBS, the facilities that come in as 5 digit NAICS instead of 6 are getting dropped. I think this is only when there is a single 6 digit child for that 5 digit.

This was resolved by c09f6f3

bl-young avatar May 09 '25 18:05 bl-young

In the revised map_to_sectors() under proportional attribution, the grouped df that enters indicates the group_id, which later is used during proportional attribution as the groupby_col. Somewhere in map_to_sectors() this value is getting reset.

bl-young avatar May 09 '25 18:05 bl-young

In the revised map_to_sectors() under proportional attribution, the grouped df that enters indicates the group_id, which later is used during proportional attribution as the groupby_col. Somewhere in map_to_sectors() this value is getting reset.

I believe that ebe8ae6 addresses this, though need to confirm it doesn't impact other methods negatively. I was reviewing this in the context of GHG_national, which was showing major diffs (and duplicate values). It now looks correct and shows no change from remote.

bl-young avatar May 09 '25 19:05 bl-young

We decided to drop the county employment FBS (or perhaps all but one example). As well as the interim national and state employment FBS files (like 2000-2012), right?

bl-young avatar Jun 04 '25 13:06 bl-young

using collapse_FlowBySector() is causing DQI info to be dropped

bl-young avatar Jun 06 '25 01:06 bl-young

merging with develop to consolidate changes for v2.1 release @bl-young - moving documentation to new PR #455

catherinebirney avatar Jul 02 '25 17:07 catherinebirney