ElectricityLCI icon indicating copy to clipboard operation
ElectricityLCI copied to clipboard

Make it to the Milestone

Open dt-woods opened this issue 1 year ago • 16 comments

The road to Version 2.0:

  • [x] #200
  • [x] #196
    • [x] #195
    • [x] #184
  • [ ] #204
    • [ ] #214
      • Unanswered.
  • [ ] Test existing configurations
    • ELCI_1
      • WORKING!
    • ELCI_2
      • [x] #211
    • ELCI_3 (untested)
  • [x] Create 2020--2022 configuration settings
  • [ ] Test new configurations
    • [ ] 2020
      • [ ] #207
        • Temporary fix with EPA API!
      • [ ] #212
        • SOLVED!
      • [ ] #232
        • In progress
    • [ ] 2021
      • [ ] #230
        • Temporary fix!
      • [ ] #231
        • In progress
    • [x] 2022
  • [ ] Run main.py to generate new inventories
    • [x] 2020
    • [x] 2022
  • [ ] Fix errors after running main.py
    • [ ] #238
      • (should be 2300-7650 V, not 120 V)
    • [ ] Flow metadata in JSON-LD is limited to just UUID; should it include CAS number of FEDEFL version number (e.g., version 1.2.2)? See discussion: https://github.com/USEPA/ElectricityLCI/discussions/239
    • [ ] Locations appear in the process and product system names, which "appear to be in duplicate"; this may just be a quirk of openLCA v2
    • [x] Uncertainty quantities are erroneous in exchange tables
      • See https://github.com/USEPA/ElectricityLCI/discussions/240 for implementation
    • [x] Update CEMS data URL for API calls
  • [x] Post-process inventories
    • [ ] #220
      • SOLVED! Do not include impact assessments with baseline.
    • [ ] #217
      • FIXED! Now in run_post_process
    • [ ] #216
      • FIXED! Now in run_post_process
  • [ ] Miscellaneous data updates
    • [ ] #215

The 2020 model config

Notes following the testing of ElectricityLCI development branch using the ELCI_2020_config.yml (2/22/24).

  • fixed missing fuel category in stewi's getInventoryFacilities from local parquet data store
  • pandas 2.2 has a new slew of complaints
    • primary_fuel["primary fuel percent gen"].fillna(value=0, inplace=True) in eia923_generation.py
      • Setting a value to a copy.
      • For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
    • database_reg = database_reg[database_filt] in eia_io_trading.py:989
      • Boolean series key reindexed to match dataframe index
  • unmatched aggregation areas during transmission and losses step (no consumption mix at user created)
    • Gridforce South
    • Griffith Energy, LLC
    • NaturEner Power Watch, LLC (GWA)
    • NaturEner Wind Watch, LLC
    • New Brunswick System Operator
    • Southeastern Power Administration
    • Canada
  • Import into openLCA 2.0
    • Run validation
      • Found 'Water, saline (Elementary Flows/resource/water)' and 'Water, fresh (Elementary Flows/resource/water)' used as an output, which, obviously, should be their 'Elementary Flows/emission/water' versions.
      • #233
    • Build product systems
      • Working

dt-woods avatar Jun 14 '23 16:06 dt-woods

The List of Why's

  • Why are we using esupy's Path in globals.py?

dt-woods avatar Jul 27 '23 18:07 dt-woods

The List of Why's

  • Why are we using esupy's Path in globals.py?

Our team moved all StEWI path management to our common esupy library which serves all the tools in our ecosystems. Most likely this is an update to provide the same functionality as before from StEWI

WesIngwersen avatar Jul 31 '23 19:07 WesIngwersen

Sure, @WesIngwersen, but it's only good if it works. See error message below:

>>> upstream_df = electricitylci.get_upstream_process_df(
...             config.model_specs.eia_gen_year)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/ElectricityLCI/electricitylci/__init__.py", line 301, in get_upstream_process_df
    from electricitylci.combinator import concat_map_upstream_databases
  File "~/ElectricityLCI/electricitylci/combinator.py", line 4, in <module>
    import electricitylci.generation as gen
  File "~/ElectricityLCI/electricitylci/generation.py", line 21, in <module>
    from electricitylci.elementaryflows import map_emissions_to_fedelemflows
  File "~/ElectricityLCI/electricitylci/elementaryflows.py", line 2, in <module>
    import fedelemflowlist
  File "~/Envs/ebm/lib/python3.11/site-packages/fedelemflowlist/__init__.py", line 11, in <module>
    from fedelemflowlist.flowlist import read_in_flowclass_file
  File "~/Envs/ebm/lib/python3.11/site-packages/fedelemflowlist/flowlist.py", line 8, in <module>
    from fedelemflowlist.globals import log, inputpath, flow_list_specs,\
  File "~/Envs/ebm/lib/python3.11/site-packages/fedelemflowlist/globals.py", line 23, in <module>
    fedefl_path.local_path + "/fedelemflowlist/")
    ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for +: 'PosixPath' and 'str'

dt-woods avatar Aug 03 '23 18:08 dt-woods

Current status of testing ELCI_1.

  • Something's wrong with the following BA's Electricity; at grid; generation mix processes (they are missing a provider in openLCA):
    • Avista Corporation
    • Alcoa Power Generating, Inc.
    • Arizona Public Service Company
    • Arlington Valley, LLC
    • Associated Electric Cooperative, Inc.
    • Balancing Authority of Northern California
  • New Smyrna Beach, Utilities Commission of, has no input for its Electricity; at grid; consumption mix!
  • eia_trans_dist_grid_loss is having trouble matching dictionary for:
    • Alcoa Power Generating, Inc. - Yadkin Division
    • NaturEner Power Watch, LLC (GWA)
    • NaturEner Wind Watch, LLC
    • New Brunswick System Operator
    • New Harquahala Generating Company, LLC - HGBA
    • Southeastern Power Administration
    • Canada

Testing 'Electricity - HYDRO - Alcoa Power Generating, Inc. - Yadkin Division' in olcaschema_genmix in generation_mix.py.

Why are there two Carbon dioxide emissions to air (internalIds 1 and 2)? I thought aggregation included combining same/similar emissions. Same concern with methane (internalIds 4 and 5).

>>> matching_dict['exchanges']
[{'amount': 5.3867483297713465e-21,
  'description': 'ap42, 2016',
  'dqEntry': '(1;1;1;1;2)',
  'flow': {'@type': 'Flow',
   '@id': 'b6f010fb-a764-3063-af2d-bcb8309a97b7',
   'category': 'Elementary Flows/emission/air',
   'name': 'Carbon dioxide'},
  'flowProperty': {'@type': 'FlowProperty',
   '@id': '93a60a56-a3c8-11da-a746-0800200b9a66',
   'category': 'Technical flow properties',
   'name': 'Mass'},
  'internalId': 1,
  'isAvoidedProduct': False,
  'isInput': False,
  'isQuantitativeReference': False,
  'unit': {'@type': 'Unit',
   '@id': '20aadc24-a391-41cf-b340-3e4529f44bde',
   'name': 'kg'}},
 {'amount': 4.310418323231294,
  'description': 'netl, 2016',
  'dqEntry': '(1;1;1;1;2)',
  'flow': {'@type': 'Flow',
   '@id': 'b6f010fb-a764-3063-af2d-bcb8309a97b7',
   'category': 'Elementary Flows/emission/air',
   'name': 'Carbon dioxide'},
  'flowProperty': {'@type': 'FlowProperty',
   '@id': '93a60a56-a3c8-11da-a746-0800200b9a66',
   'category': 'Technical flow properties',
   'name': 'Mass'},
  'internalId': 2,
  'isAvoidedProduct': False,
  'isInput': False,
  'isQuantitativeReference': False,
  'unit': {'@type': 'Unit',
   '@id': '20aadc24-a391-41cf-b340-3e4529f44bde',
   'name': 'kg'}},
 {'amount': 9740.278957008288,
  'description': 'eGRID, 2016',
  'dqEntry': '(1;1;1;1;2)',
  'flow': {'@type': 'Flow',
   '@id': '8c8fb17a-8d64-3fd0-9a4c-9acbca4abe07',
   'category': 'Elementary Flows/resource',
   'name': 'Heat'},
  'flowProperty': {'@type': 'FlowProperty',
   '@id': 'f6811440-ee37-11de-8a39-0800200c9a66',
   'category': 'Technical flow properties',
   'name': 'Energy'},
  'internalId': 3,
  'isAvoidedProduct': False,
  'isInput': True,
  'isQuantitativeReference': False,
  'unit': {'@type': 'Unit',
   '@id': '52765a6c-3896-43c2-b2f4-c679acf13efe',
   'name': 'MJ'}},
 {'amount': 5.3867483297713465e-21,
  'description': 'ap42, 2016',
  'dqEntry': '(1;1;1;1;2)',
  'flow': {'@type': 'Flow',
   '@id': 'aab83476-ec6c-3742-af85-15d320b7ce80',
   'category': 'Elementary Flows/emission/air',
   'name': 'Methane'},
  'flowProperty': {'@type': 'FlowProperty',
   '@id': '93a60a56-a3c8-11da-a746-0800200b9a66',
   'category': 'Technical flow properties',
   'name': 'Mass'},
  'internalId': 4,
  'isAvoidedProduct': False,
  'isInput': False,
  'isQuantitativeReference': False,
  'unit': {'@type': 'Unit',
   '@id': '20aadc24-a391-41cf-b340-3e4529f44bde',
   'name': 'kg'}},
 {'amount': 0.009277911629821063,
  'description': 'netl, 2016',
  'dqEntry': '(1;1;1;1;2)',
  'flow': {'@type': 'Flow',
   '@id': 'aab83476-ec6c-3742-af85-15d320b7ce80',
   'category': 'Elementary Flows/emission/air',
   'name': 'Methane'},
  'flowProperty': {'@type': 'FlowProperty',
   '@id': '93a60a56-a3c8-11da-a746-0800200b9a66',
   'category': 'Technical flow properties',
   'name': 'Mass'},
  'internalId': 5,
  'isAvoidedProduct': False,
  'isInput': False,
  'isQuantitativeReference': False,
  'unit': {'@type': 'Unit',
   '@id': '20aadc24-a391-41cf-b340-3e4529f44bde',
   'name': 'kg'}},
 {'amount': 5.3867483297713465e-21,
  'description': 'ap42, 2016',
  'dqEntry': '(1;1;1;1;2)',
  'flow': {'@type': 'Flow',
   '@id': '4382ba18-dd21-3837-80b2-94283ef5490e',
   'category': 'Elementary Flows/emission/air',
   'name': 'Nitrogen oxides'},
  'flowProperty': {'@type': 'FlowProperty',
   '@id': '93a60a56-a3c8-11da-a746-0800200b9a66',
   'category': 'Technical flow properties',
   'name': 'Mass'},
  'internalId': 6,
  'isAvoidedProduct': False,
  'isInput': False,
  'isQuantitativeReference': False,
  'unit': {'@type': 'Unit',
   '@id': '20aadc24-a391-41cf-b340-3e4529f44bde',
   'name': 'kg'}},
 {'amount': 5.3867483297713465e-21,
  'description': 'ap42, 2016',
  'dqEntry': '(1;1;1;1;2)',
  'flow': {'@type': 'Flow',
   '@id': 'cfee0524-7ad6-300b-b050-6249135a2492',
   'category': 'Elementary Flows/emission/air',
   'name': 'Nitrous oxide'},
  'flowProperty': {'@type': 'FlowProperty',
   '@id': '93a60a56-a3c8-11da-a746-0800200b9a66',
   'category': 'Technical flow properties',
   'name': 'Mass'},
  'internalId': 7,
  'isAvoidedProduct': False,
  'isInput': False,
  'isQuantitativeReference': False,
  'unit': {'@type': 'Unit',
   '@id': '20aadc24-a391-41cf-b340-3e4529f44bde',
   'name': 'kg'}},
 {'amount': 5.3867483297713465e-21,
  'description': 'ap42, 2016',
  'dqEntry': '(1;1;1;1;2)',
  'flow': {'@type': 'Flow',
   '@id': 'f4973035-59f5-3bdc-b257-b274dcc04e0f',
   'category': 'Elementary Flows/emission/air',
   'name': 'Sulfur dioxide'},
  'flowProperty': {'@type': 'FlowProperty',
   '@id': '93a60a56-a3c8-11da-a746-0800200b9a66',
   'category': 'Technical flow properties',
   'name': 'Mass'},
  'internalId': 8,
  'isAvoidedProduct': False,
  'isInput': False,
  'isQuantitativeReference': False,
  'unit': {'@type': 'Unit',
   '@id': '20aadc24-a391-41cf-b340-3e4529f44bde',
   'name': 'kg'}},
 {'amount': 1.0,
  'flow': {'@type': 'Flow',
   '@id': 'fc406690-160c-37d5-bf36-added9542164',
   'category': 'Technosphere Flows/22: Utilities/2211: Electric Power Generation, Transmission and Distribution',
   'name': 'Electricity, AC, 2300-7650 V'},
  'flowProperty': {'@type': 'FlowProperty',
   '@id': 'f6811440-ee37-11de-8a39-0800200c9a66',
   'category': 'Technical flow properties',
   'name': 'Energy'},
  'internalId': 9,
  'isAvoidedProduct': False,
  'isInput': False,
  'isQuantitativeReference': True,
  'unit': {'@type': 'Unit',
   '@id': '92e3bd49-8ed5-4885-9db6-fc88c7afcfcb',
   'name': 'MWh'}}]

dt-woods avatar Nov 29 '23 22:11 dt-woods

Regarding the "duplicate" emissions - those emissions are from two different sources. The reported emissions from the hydro plants are 0, converted here to X x 10^-21 to avoid issues with generating log-normal distributions.

Thet NETL-sourced data is from our own analysis of hydro plants.

I think because I wanted to make clear that the non-zero emissions were coming from an NETL analysis, I left this as-is. I'm open to changing. Can maybe add something to the hydro process metadata that describes all of this.

m-jamieson avatar Nov 30 '23 17:11 m-jamieson

My guess as to the missing BAAs, etc. is that the BA_Codes_930.xlsx needs an update. Some of the BAAs have been retired since the 2016 results. That's at least a first pass.

https://www.eia.gov/electricity/gridmonitor/about

m-jamieson avatar Nov 30 '23 17:11 m-jamieson

I'm still testing ELCI_1, so it's the 2016 data that's not working.

dt-woods avatar Nov 30 '23 18:11 dt-woods

My guess as to the missing BAAs, etc. is that the BA_Codes_930.xlsx needs an update. Some of the BAAs have been retired since the 2016 results. That's at least a first pass.

https://www.eia.gov/electricity/gridmonitor/about

I see "BA_Codes_930.xlsx" referenced in combinatory.py to create the ba_codes data frame, which is used for mapping names to codes, and in eia_io_trading.py to create the US and NA balancing authority data frames. I only see "US" and "Canada" worksheets being referenced.

In the latest EIA-930 data reference tables you linked to has these data in a single worksheet, "BAs" that includes US, CAN, and MEX. There are activation and retirement dates that could be used to filter the list based on the analysis year (or eGRID/EIA year).

I can create a method for retrieving this Excel workbook, rather than use the dated workbook provided in the data folder, and write methods for filtering the BAs so they can be easily parsed as US and non-US. Is that the path you want to take?

dt-woods avatar Nov 30 '23 18:11 dt-woods

I'd say it's probably the best way forward - provided they don't move/change that excel workbook!

m-jamieson avatar Nov 30 '23 18:11 m-jamieson

Regarding the "duplicate" emissions - those emissions are from two different sources. The reported emissions from the hydro plants are 0, converted here to X x 10^-21 to avoid issues with generating log-normal distributions.

Thet NETL-sourced data is from our own analysis of hydro plants.

I think because I wanted to make clear that the non-zero emissions were coming from an NETL analysis, I left this as-is. I'm open to changing. Can maybe add something to the hydro process metadata that describes all of this.

Okay. So, it's intentional and it's not the problem. I see it now in openLCA: Screen Shot 2023-11-30 at 13 25 01

dt-woods avatar Nov 30 '23 18:11 dt-woods

I'd say it's probably the best way forward - provided they don't move/change that excel workbook!

I can make it an untracked data file in the repository, so the user only needs to generate it once, it'll stay local to limit the number (and need) for internet connection.

I'm surprised we don't have something from stewi that could be used to get a full list of BA names and codes for a given generation year!

dt-woods avatar Nov 30 '23 18:11 dt-woods

Regarding the "duplicate" emissions - those emissions are from two different sources. The reported emissions from the hydro plants are 0, converted here to X x 10^-21 to avoid issues with generating log-normal distributions.

Regarding those log-normal distributions, they feel a little bogus. I'm seeing about 20% success rate in their calculation and I'm not sure what uncertainty they are capturing.

dt-woods avatar Nov 30 '23 18:11 dt-woods

Hit a point of constant segmentation faults, which appeared to be linked to combinator.py. Tried importing required modules and got errors during stewi phase. Deleted all facilitymatcher, stewi, and stewicombo folders and hit the following error message (after including the download_if_missing parameter).

The linkage is:

  • combinator.py --> generation.py
  • generation.py --> egrid_filter.py
  • egrid_filter.py --> egrid_emissions_and_waste_by_facility.py
  • egrid_emissions_and_waste_by_facility.py --> stewicombo.combineInventoriesforFacilitiesinBaseInventory
2024-01-09 16:24:58.377:INFO:processed_data_mgmt:load_preprocessed_output:Returning /Users/davi/Library/Application Support/facilitymatcher/FRS Data Files/NATIONAL_ENVIRONMENTAL_INTEREST_FILE.CSV
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[5], line 1
----> 1 gen_df, gen_dict = run_generation()

File ~/Repositories/keylogiclca/ElectricityLCI/electricitylci/main.py:168, in run_generation()
    162 if config.model_specs.include_upstream_processes is True:
    163     # Create dataframe with all generation process data; includes
    164     # upstream and Canadian data.
    165     # NOTE: Only nuclear ('NUC') stage codes have electricity data;
    166     #       all others are nans.
    167     logging.info("get upstream process")
--> 168     upstream_df = get_upstream_process_df(config.model_specs.eia_gen_year)
    169     logging.info("write upstream process to dict")
    170     upstream_dict = write_upstream_process_database_to_dict(upstream_df)

File ~/Repositories/keylogiclca/ElectricityLCI/electricitylci/__init__.py:450, in get_upstream_process_df(eia_gen_year)
    448 import electricitylci.nuclear_upstream as nuke
    449 import electricitylci.power_plant_construction as ppc
--> 450 import electricitylci.combinator as combine
    452 logging.info("Generating upstream inventories...")
    453 coal_df = coal.generate_upstream_coal(eia_gen_year)

File ~/Repositories/keylogiclca/ElectricityLCI/electricitylci/combinator.py:18
     16 from electricitylci.model_config import model_specs
     17 from electricitylci.eia860_facilities import eia860_balancing_authority
---> 18 from electricitylci.generation import add_temporal_correlation_score
     19 import fedelemflowlist as fedefl
     22 

File ~/Repositories/keylogiclca/ElectricityLCI/electricitylci/generation.py:42
     40 from electricitylci.utils import make_valid_version_num
     41 from electricitylci.utils import set_dir
---> 42 from electricitylci.egrid_filter import (
     43     electricity_for_selected_egrid_facilities,
     44     emissions_and_waste_for_selected_egrid_facilities,
     45 )
     46 from electricitylci.egrid_emissions_and_waste_by_facility import (
     47     emissions_and_wastes_by_facility,
     48 )
     49 import facilitymatcher.globals as fmglob  # package under development

File ~/Repositories/keylogiclca/ElectricityLCI/electricitylci/egrid_filter.py:24
     15 from electricitylci.egrid_facilities import (
     16     egrid_facilities,
     17     list_facilities_w_percent_generation_from_primary_fuel_category_greater_than_min
     18 )
     19 from electricitylci.egrid_energy import (
     20     list_egrid_facilities_with_positive_generation,
     21     list_egrid_facilities_in_efficiency_range,
     22     egrid_net_generation
     23 )
---> 24 from electricitylci.egrid_emissions_and_waste_by_facility import (
     25     emissions_and_wastes_by_facility
     26 )
     27 from electricitylci.egrid_FRS_matches import list_FRS_ids_filtered_for_NAICS

File ~/Repositories/keylogiclca/ElectricityLCI/electricitylci/egrid_emissions_and_waste_by_facility.py:76
     72     base_inventory = "RCRAInfo"
     74 # HOTFIX: work-around ParseError [2023-12-19; TWD]
     75 # Ref: https://github.com/USEPA/standardizedinventories/issues/151
---> 76 emissions_and_wastes_by_facility = cbi(
     77     base_inventory,
     78     model_specs.inventories_of_interest,
     79     filter_for_LCI=True,
     80     download_if_missing=True,
     81 )
     82 # Drop SRS fields
     83 emissions_and_wastes_by_facility = emissions_and_wastes_by_facility.drop(
     84     columns=['SRS_ID', 'SRS_CAS'])

File ~/Envs/ebm/lib/python3.11/site-packages/stewicombo/__init__.py:83, in combineInventoriesforFacilitiesinBaseInventory(base_inventory, inventory_dict, filter_for_LCI, remove_overlap, keep_sec_cntx, **kwargs)
     69 """Combine stewi inventories for all facilities present in base_inventory.
     70 
     71 The base_inventory must be in the inventory_dict
   (...)
     80 :return: Flow-By-Facility Combined Format
     81 """
     82 inventory_acronyms = list(inventory_dict.keys())
---> 83 facilitymatches = facilitymatcher.get_matches_for_inventories(
     84     inventory_acronyms)
     85 inventories = getInventoriesforFacilityMatches(inventory_dict,
     86                                                facilitymatches,
     87                                                filter_for_LCI,
     88                                                base_inventory,
     89                                                keep_sec_cntx=keep_sec_cntx,
     90                                                **kwargs)
     91 inventories = addChemicalMatches(inventories)

File ~/Envs/ebm/lib/python3.11/site-packages/facilitymatcher/__init__.py:20, in get_matches_for_inventories(inventory_list)
     13 def get_matches_for_inventories(inventory_list=stewi_inventories):
     14     """Return all facility matches for given inventories.
     15 
     16     :param inventory_list: list of inventories for desired matches using
     17         StEWI inventory names e.g. ['NEI','TRI']
     18     :return: dataframe in FacilityMatches standard output format
     19     """
---> 20     facilitymatches = get_fm_file('FacilityMatchList_forStEWI')
     21     facilitymatches = filter_by_inventory_list(facilitymatches, inventory_list)
     22     return facilitymatches

File ~/Envs/ebm/lib/python3.11/site-packages/facilitymatcher/globals.py:116, in get_fm_file(file_name, download_if_missing)
    114     download_from_remote(file_meta, paths)
    115 elif file_name == 'FacilityMatchList_forStEWI':
--> 116     write_fm.write_facility_matches()
    117 elif file_name == 'FRS_NAICSforStEWI':
    118     write_naics.write_NAICS_matches()

File ~/Envs/ebm/lib/python3.11/site-packages/facilitymatcher/WriteFacilityMatchesforStEWI.py:24, in write_facility_matches()
     20 # Import FRS bridge which provides ID matches
     21 col_dict = {'REGISTRY_ID': "str",
     22             'PGM_SYS_ACRNM': "str",
     23             'PGM_SYS_ID': "str"}
---> 24 FRS_Bridges = fmg.read_FRS_file(file, col_dict)
     26 # Programs of interest
     27 stewi_programs = fmg.get_programs_for_inventory_list(fmg.stewi_inventories)

File ~/Envs/ebm/lib/python3.11/site-packages/facilitymatcher/globals.py:74, in read_FRS_file(file_name, col_dict)
     72 file_meta.name_data = strip_file_extension(file_meta.name_data)
     73 file_meta.ext = 'csv'
---> 74 df = load_preprocessed_output(file_meta, paths)
     75 df_FRS = pd.DataFrame()
     76 for k, v in col_dict.items():

File ~/Envs/ebm/lib/python3.11/site-packages/esupy/processed_data_mgmt.py:51, in load_preprocessed_output(file_meta, paths)
     49 if isinstance(f, Path):
     50     log.info(f'Returning {f}')
---> 51     df = read_into_df(f)
     52     return df
     53 else:

File ~/Envs/ebm/lib/python3.11/site-packages/esupy/processed_data_mgmt.py:233, in read_into_df(fpath)
    231     df = pd.read_parquet(fpath)
    232 elif ext == '.csv':
--> 233     df = pd.read_csv(fpath)
    234 elif ext == '.rds':
    235     try:

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    944     dtype_backend=dtype_backend,
    945 )
    946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:617, in _read(filepath_or_buffer, kwds)
    614     return parser
    616 with parser:
--> 617     return parser.read(nrows)

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1748, in TextFileReader.read(self, nrows)
   1741 nrows = validate_integer("nrows", nrows)
   1742 try:
   1743     # error: "ParserBase" has no attribute "read"
   1744     (
   1745         index,
   1746         columns,
   1747         col_dict,
-> 1748     ) = self._engine.read(  # type: ignore[attr-defined]
   1749         nrows
   1750     )
   1751 except Exception:
   1752     self.close()

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py:234, in CParserWrapper.read(self, nrows)
    232 try:
    233     if self.low_memory:
--> 234         chunks = self._reader.read_low_memory(nrows)
    235         # destructive to chunks
    236         data = _concatenate_chunks(chunks)

File parsers.pyx:843, in pandas._libs.parsers.TextReader.read_low_memory()

File parsers.pyx:904, in pandas._libs.parsers.TextReader._read_rows()

File parsers.pyx:879, in pandas._libs.parsers.TextReader._tokenize_rows()

File parsers.pyx:890, in pandas._libs.parsers.TextReader._check_tokenize_status()

File parsers.pyx:2050, in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 16484: invalid start byte

dt-woods avatar Jan 09 '24 21:01 dt-woods

^^^ The error appears to be in the 1 GB CSV, NATIONAL_ENVIRONMENTAL_INTEREST_FILE.CSV

dt-woods avatar Jan 09 '24 21:01 dt-woods

^^^ The error appears to be in the 1 GB CSV, NATIONAL_ENVIRONMENTAL_INTEREST_FILE.CSV

This is "new" right? You had gotten the 2016 data to run (existing ELCI_1) without hitting this correct? So this is something we run into with new data? Also interesting because I've run a 2020 result before and not run into this.

m-jamieson avatar Jan 09 '24 22:01 m-jamieson

If this is a stewi issue we can post it there and I can take a closer look - especially if you can show a reproducible example within stewi/facilitymatcher

bl-young avatar Jan 09 '24 22:01 bl-young