statcast throwing KeyError on certain dates in 2023
While downloading all of the Statcast data, I kept hitting an error at around 98% of the progress. I eventually narrowed it down to 2023-06-25 being the first problematic day. Days after this one also trigger the error, but I stopped at 06-25 because this amount of data is enough for my current purposes.
The code I'm executing is this:
stats = statcast(start_dt="2023-06-25")
Upon execution, my terminal looks like this:
This is a large query, it may take a moment to complete
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "c:\Users\nosoa\Documents\glb\getstats.py", line 6, in <module>
stats = statcast(start_dt="2023-06-25")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\statcast.py", line 113, in statcast
return _handle_request(start_dt_date, end_dt_date, 1, verbose=verbose,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\statcast.py", line 76, in _handle_request
dataframe_list.append(future.result())
^^^^^^^^^^^^^^^
File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\_base.py", line 401, in __get_result
raise self._exception
File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache.py", line 58, in _cached
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\statcast.py", line 31, in _small_request
data = data.sort_values(
^^^^^^^^^^^^^^^^^
File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pandas\core\frame.py", line 6740, in sort_values
keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pandas\core\frame.py", line 6740, in <listcomp>
keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pandas\core\generic.py", line 1778, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'game_date'
I have tested stats = statcast(start_dt="2023-06-25") on both Colab (Python 3.10.12) and my local environment (Python 3.11.2), and it worked fine in both.
Judging from the dataframe_list.append(future.result()) line in the traceback, something may have gone wrong in concurrent mode.
Maybe turning off parallelism will work?
stats = statcast(start_dt="2023-06-25", parallel=False)
I discovered the issue -- there must have been something corrupted in the cache. Disabling the cache fixed the problem. But attempting to purge the cache also results in an error.
Traceback (most recent call last):
File "c:\Users\nosoa\Documents\glb\getstats.py", line 5, in <module>
pybaseball.cache.purge()
File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache.py", line 31, in purge
records = [cache_record.CacheRecord(filename) for filename in record_files]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache.py", line 31, in <listcomp>
records = [cache_record.CacheRecord(filename) for filename in record_files]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache_record.py", line 23, in __init__
self.data = cast(Dict[str, Any], file_utils.load_json(filename))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\file_utils.py", line 28, in load_json
return cast(JSONData, json.load(json_file))
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^^^^^^^^
File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 37 (char 36)
I found that some cache files were not saved completely, which is why cache.purge() cannot parse them.
In my case, the files whose names start with _small_request all contain only
{"func": "_small_request", "args": [
Since that is not valid JSON, it raises the decode error.
You can find the cache files in /Users/{user_name}/.pybaseball/cache, or in Colab at /root/.pybaseball/cache.
IMO, for now we can only delete those invalid cache files manually, since they also do not contain an expiry time.
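A minimal sketch of that manual cleanup, assuming the record files are the *.json files under the cache directory (the find_corrupt_cache_records helper is mine, not part of pybaseball):

```python
import json
from pathlib import Path

def find_corrupt_cache_records(cache_dir):
    """Return paths of cache record files that fail to parse as JSON."""
    corrupt = []
    for path in Path(cache_dir).glob("*.json"):
        try:
            with open(path) as f:
                json.load(f)  # a truncated write raises JSONDecodeError here
        except (json.JSONDecodeError, OSError):
            corrupt.append(path)
    return corrupt

if __name__ == "__main__":
    cache_dir = Path.home() / ".pybaseball" / "cache"
    for path in find_corrupt_cache_records(cache_dir):
        print(f"removing corrupt record: {path}")
        path.unlink()  # delete only the unparseable record files
```

Note this only removes the unparseable record files; any data files they pointed to will be left behind as orphans until a proper purge works again.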
Should be fixed in #438