ParserError in RCRAInfo
So I tried accessing other years of RCRAInfo data (2013, 2015, 2017, and 2019). All worked except 2017, which produced the error shown below. I wasn't able to track down the CSV file it keeps crashing on; maybe there's a debug statement that points to it.
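Roughly, the check looked like this (a simplified sketch, not my exact session):

import stewi

# Pull each year's RCRAInfo inventory; only 2017 fails, raising the
# ParserError shown in the log and traceback below.
for year in (2013, 2015, 2017, 2019):
    try:
        df = stewi.getInventory('RCRAInfo', year)
        print(year, 'ok:', len(df), 'rows')
    except Exception as err:
        print(year, 'failed:', err)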
INFO RCRAInfo_2017 not found in ~/stewi/flowbyfacility
INFO requested inventory does not exist in local directory, it will be generated...
INFO file extraction complete
INFO organizing data for BR_REPORTING from 2017...
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_0.csv
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_1.csv
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_2.csv
INFO saving to ~/stewi/RCRAInfo Data Files/RCRAInfo_by_year/br_reporting_2017.csv...
INFO generating inventory files for 2017
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
Cell In[9], line 1
----> 1 stewi.getInventory('RCRAInfo', 2017)
File ~/Envs/ebm/lib/python3.11/site-packages/stewi/__init__.py:82, in getInventory(inventory_acronym, year, stewiformat, filters, filter_for_LCI, US_States_Only, download_if_missing, keep_sec_cntx)
66 """Return or generate an inventory in a standard output format.
67
68 :param inventory_acronym: like 'TRI'
(...)
79 :return: dataframe with standard fields depending on output format
80 """
81 f = ensure_format(stewiformat)
---> 82 inventory = read_inventory(inventory_acronym, year, f,
83 download_if_missing)
85 if (not keep_sec_cntx) and ('Compartment' in inventory):
86 inventory['Compartment'] = (inventory['Compartment']
87 .str.partition('/')[0])
File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:268, in read_inventory(inventory_acronym, year, f, download_if_missing)
265 else:
266 log.info('requested inventory does not exist in local directory, '
267 'it will be generated...')
--> 268 generate_inventory(inventory_acronym, year)
269 inventory = load_preprocessed_output(meta, paths)
270 if inventory is None:
File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:313, in generate_inventory(inventory_acronym, year)
309 RCRAInfo.main(Option = 'A', Year = [year],
310 Tables = ['BR_REPORTING', 'HD_LU_WASTE_CODE'])
311 RCRAInfo.main(Option = 'B', Year = [year],
312 Tables = ['BR_REPORTING'])
--> 313 RCRAInfo.main(Option = 'C', Year = [year])
314 elif inventory_acronym == 'TRI':
315 import stewi.TRI as TRI
File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:444, in main(**kwargs)
441 organize_br_reporting_files_by_year(kwargs['Tables'], year)
443 elif kwargs['Option'] == 'C':
--> 444 Generate_RCRAInfo_files_csv(year)
446 elif kwargs['Option'] == 'D':
447 """State totals are compiled from the Trends Analysis website
448 and stored as csv. New years will be added as data becomes
449 available"""
File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:219, in Generate_RCRAInfo_files_csv(report_year)
216 fieldstokeep = pd.read_csv(RCRA_DATA_PATH.joinpath('RCRA_required_fields.txt'),
217 header=None)
218 # on_bad_lines requires pandas >= 1.3
--> 219 df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
220 low_memory=False, on_bad_lines='skip',
221 encoding='ISO-8859-1')
223 log.info(f'completed reading {filepath}')
224 # Checking the Waste Generation Data Health
File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
935 kwds_defaults = _refine_defaults_read(
936 dialect,
937 delimiter,
(...)
944 dtype_backend=dtype_backend,
945 )
946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)
File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:617, in _read(filepath_or_buffer, kwds)
614 return parser
616 with parser:
--> 617 return parser.read(nrows)
File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1748, in TextFileReader.read(self, nrows)
1741 nrows = validate_integer("nrows", nrows)
1742 try:
1743 # error: "ParserBase" has no attribute "read"
1744 (
1745 index,
1746 columns,
1747 col_dict,
-> 1748 ) = self._engine.read( # type: ignore[attr-defined]
1749 nrows
1750 )
1751 except Exception:
1752 self.close()
File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py:239, in CParserWrapper.read(self, nrows)
236 data = _concatenate_chunks(chunks)
238 else:
--> 239 data = self._reader.read(nrows)
240 except StopIteration:
241 if self._first_chunk:
File parsers.pyx:825, in pandas._libs.parsers.TextReader.read()
File parsers.pyx:913, in pandas._libs.parsers.TextReader._read_rows()
File parsers.pyx:890, in pandas._libs.parsers.TextReader._check_tokenize_status()
File parsers.pyx:2058, in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Originally posted by @dt-woods in https://github.com/USEPA/standardizedinventories/issues/146#issuecomment-1819942869
@bl-young, any updates on this front? I'm still getting the ParserError.
So I took a look at the CSV file that is generated. If you provide pandas.read_csv with nrows, it successfully reads the data up to a point. I tried reading the number of lines in the CSV using a basic approach:
>>> from stewi.RCRAInfo import DIR_RCRA_BY_YEAR
>>> report_year = 2017
>>> filepath = DIR_RCRA_BY_YEAR.joinpath(f'br_reporting_{str(report_year)}.csv')
>>> with open(filepath, 'r') as f:
...     count = sum(1 for _ in f)
...
>>> print(count)
2119285
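As a cross-check (a sketch I have not actually run against this file): counting records with csv.reader, which honors quoting, can give a different number than counting physical newlines whenever fields contain embedded or unbalanced quote characters.

import csv

# Count logical CSV records rather than physical lines; a mismatch with the
# newline count above would point to quoting problems in the file. Note that
# this may itself raise csv.Error ("field larger than field limit") if
# unbalanced quotes produce a runaway field.
with open(filepath, 'r', encoding='ISO-8859-1', newline='') as f:
    record_count = sum(1 for _ in csv.reader(f))
print(record_count)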
I can open this in pandas.
>>> import pandas as pd
>>> from stewi.RCRAInfo import RCRA_DATA_PATH
>>> fieldstokeep = pd.read_csv(RCRA_DATA_PATH.joinpath('RCRA_required_fields.txt'),
...                            header=None)
>>> df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
... low_memory=False, on_bad_lines='skip', encoding='ISO-8859-1', sep=",",
... nrows=2119285)
>>> df.head()
Handler ID State ... Generation Tons Waste Code Group
0 AK0000384040 AK ... 12.25 K171
1 AK0000384040 AK ... 0.20 K171
2 AK0000384040 AK ... 0.40 K050
3 AK0000384040 AK ... 1.50 K050
4 AK0000384040 AK ... 0.05 K050
>>> df.tail(1).to_dict()
{'Handler ID': {2119284: 'IDD073114654'},
'State': {2119284: 'ID'},
'Handler Name': {2119284: 'US ECOLOGY IDAHO INC SITE B'},
'Location Street Number': {2119284: '20400'},
'Location Street 1': {2119284: 'LEMLEY RD'},
'Location Street 2': {2119284: nan},
'Location City': {2119284: 'GRAND VIEW'},
'Location State': {2119284: 'ID'},
'Location Zip': {2119284: '83624'},
'County Name': {2119284: 'OWYHEE'},
'Generator ID Included in NBR': {2119284: 'Y'},
'Generator Waste Stream Included in NBR': {2119284: 'N'},
'Waste Description': {2119284: '43435-0'},
'Primary NAICS': {2119284: nan},
'Source Code': {2119284: nan},
'Form Code': {2119284: nan},
'Management Method': {2119284: nan},
'Federal Waste Flag': {2119284: nan},
'Generation Tons': {2119284: nan},
'Waste Code Group': {2119284: nan}}
I'm not certain this count is accurate, because pandas was able to read more rows than that. I can go higher!
>>> df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
... low_memory=False, on_bad_lines='skip', encoding='ISO-8859-1', sep=",",
... nrows=2367000)
>>> df.tail(1).to_dict()
{'Handler ID': {2366999: 'IDD073114654'},
'State': {2366999: 'ID'},
'Handler Name': {2366999: 'US ECOLOGY IDAHO INC SITE B'},
'Location Street Number': {2366999: '20400'},
'Location Street 1': {2366999: 'LEMLEY RD'},
'Location Street 2': {2366999: nan},
'Location City': {2366999: 'GRAND VIEW'},
'Location State': {2366999: 'ID'},
'Location Zip': {2366999: '83624'},
'County Name': {2366999: 'OWYHEE'},
'Generator ID Included in NBR': {2366999: 'Y'},
'Generator Waste Stream Included in NBR': {2366999: 'N'},
'Waste Description': {2366999: '43435-0'},
'Primary NAICS': {2366999: nan},
'Source Code': {2366999: nan},
'Form Code': {2366999: nan},
'Management Method': {2366999: nan},
'Federal Waste Flag': {2366999: nan},
'Generation Tons': {2366999: nan},
'Waste Code Group': {2366999: nan}}
I'm not sure where the upper limit for nrows is, or what happens when nrows exceeds the actual number of rows in the file.
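(As far as I know, if nrows is larger than the number of rows in the file, pandas simply reads to the end, so overshooting nrows by itself shouldn't raise.) If the underlying problem is unbalanced quoting (my guess, given the "possible malformed input file" message and the mostly-empty trailing rows above), a chunked read like this sketch might narrow down roughly where the C parser loses track of records:

import pandas as pd

# Read in chunks with the same options as above and count how many rows the
# C parser manages before it raises; this approximates the point in the file
# where tokenization goes wrong.
reader = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
                     encoding='ISO-8859-1', on_bad_lines='skip',
                     chunksize=100_000)
rows_ok = 0
try:
    for chunk in reader:
        rows_ok += len(chunk)
except pd.errors.ParserError as err:
    print(f'ParserError after roughly {rows_ok} rows: {err}')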
No, I have not had a chance to look closely yet. These ParserErrors can be tricky to track down.
For consistency, and in the meantime, I would recommend using the already processed versions, such as via
getInventory(..., download_if_missing=True)
if that works for your application.
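For example:

import stewi

# Use the already-processed inventory (downloaded if not available locally)
# instead of regenerating it from the raw BR_REPORTING files.
inventory = stewi.getInventory('RCRAInfo', 2017, download_if_missing=True)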
Yep. That seems to work! Thanks again for supporting the daisy chain of kwargs down through stewicombo to getInventory.