batting_stats() and pitching_stats() error out for certain players/years

Open yzhang3283 opened this issue 2 years ago • 1 comments

running pitching_stats with Kent Tekulve's playerid results in this error:

>>> pitching_stats(start_season=1979, end_season=1980, players=1012905)   
Traceback (most recent call last):
  File "/home/yzhang/.envs/deadball_env/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 934, in _finalize_columns_and_data
    columns = _validate_or_indexify_columns(contents, columns)
  File "/home/yzhang/.envs/deadball_env/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 981, in _validate_or_indexify_columns
    raise AssertionError(
AssertionError: 334 columns passed, passed data had 1 columns

the same type of error also seems to happen when calling batting_stats for 1970 Roberto Clemente. 1969 and 1971 both show up normally:

>>> batting_stats(1969, 1971, players=1002340)
      IDfg  Season              Name Team  Age    G   AB  ...  Events  CStr%  CSW%  xBA  xSLG  xwOBA  L-WAR
0  1002340    1969  Roberto Clemente  PIT   34  138  507  ...       0    NaN   NaN  NaN   NaN    NaN    7.0
1  1002340    1971  Roberto Clemente  PIT   36  132  522  ...       0    NaN   NaN  NaN   NaN    NaN    6.5

[2 rows x 320 columns]

whereas 1970 Clemente search returns a similar error (attaching full error trace)

>>> batting_stats(1970, players=1002340)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yzhang/Downloads/pybaseball/pybaseball/cache/cache.py", line 58, in _cached
    result = func(*args, **kwargs)
  File "/home/yzhang/Downloads/pybaseball/pybaseball/datasources/fangraphs.py", line 176, in fetch
    return super().fetch(*args, **kwargs)
  File "/home/yzhang/Downloads/pybaseball/pybaseball/datasources/fangraphs.py", line 154, in fetch
    self.html_accessor.get_tabular_data_from_options(
  File "/home/yzhang/Downloads/pybaseball/pybaseball/datasources/html_table_processor.py", line 90, in get_tabular_data_from_options
    return self.get_tabular_data_from_url(
  File "/home/yzhang/Downloads/pybaseball/pybaseball/datasources/html_table_processor.py", line 78, in get_tabular_data_from_url
    return self.get_tabular_data_from_html(
  File "/home/yzhang/Downloads/pybaseball/pybaseball/datasources/html_table_processor.py", line 59, in get_tabular_data_from_html
    return self.get_tabular_data_from_element(
  File "/home/yzhang/Downloads/pybaseball/pybaseball/datasources/html_table_processor.py", line 50, in get_tabular_data_from_element
    fg_data = pd.DataFrame(data_rows, columns=headings)
  File "/home/yzhang/.envs/deadball_env/lib/python3.9/site-packages/pandas/core/frame.py", line 782, in __init__
    arrays, columns, index = nested_data_to_arrays(
  File "/home/yzhang/.envs/deadball_env/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 498, in nested_data_to_arrays
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "/home/yzhang/.envs/deadball_env/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 840, in to_arrays
    content, columns = _finalize_columns_and_data(arr, columns, dtype)
  File "/home/yzhang/.envs/deadball_env/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 937, in _finalize_columns_and_data
    raise ValueError(err) from err
ValueError: 320 columns passed, passed data had 1 columns

looks like an element might not be parsed correctly, so it just returns [[None]] when creating the DataFrame in get_tabular_data_from_element in HTMLTableProcessor. not sure if fangraphs uses different DOM elements for some players/years that don't match the hardcoded rows/cells xpath?
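
To illustrate the mismatch described above, here is a minimal sketch of a width check that could run before the DataFrame is built. The helper name `find_bad_rows` and the three sample headings are hypothetical and are not part of pybaseball; the only detail taken from the traceback is that rows and headings are handed to `pd.DataFrame(data_rows, columns=headings)`.

```python
from typing import Any, List, Sequence


def find_bad_rows(data_rows: Sequence[Sequence[Any]], headings: Sequence[str]) -> List[int]:
    """Return the indices of rows whose cell count differs from the heading count."""
    expected = len(headings)
    return [i for i, row in enumerate(data_rows) if len(row) != expected]


# Made-up headings for illustration; the real table in the report has 320 columns.
headings = ["IDfg", "Season", "Name"]

# A normally parsed row has one cell per heading.
good_rows = [[1002340, 1969, "Roberto Clemente"]]

# The suspected failure case: no cells are found, leaving a single [None] row.
empty_result = [[None]]

print(find_bad_rows(good_rows, headings))     # no mismatched rows
print(find_bad_rows(empty_result, headings))  # row 0 has 1 cell instead of 3
```

A check along these lines would let the caller report something clearer (for example, that no rows matched the query) instead of surfacing the pandas column-count error.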

yzhang3283 · Jun 17 '23 02:06

Just add the qual=0 argument to your code.

The default is qual=y, which applies the qualification minimum. For batting that is 3.1 PA per team game, or about 502 PA over a 162-game season. For pitching, qual=y means a minimum of 1 IP per team game, or 162 IP over a 162-game season.

Therefore, if you want one player's data regardless of his PA or IP, just set qual=0:

- pitching_stats(start_season=1979, end_season=1980, players=1012905)
- batting_stats(1969, 1971, players=1002340)
+ pitching_stats(start_season=1979, end_season=1980, players=1012905, qual=0)
+ batting_stats(1969, 1971, players=1002340, qual=0)
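
As a rough check of the thresholds mentioned in this comment, the arithmetic can be sketched as follows. The function name and the 162-game season length are assumptions for illustration; the per-game rates (3.1 PA, 1 IP) are the ones quoted in this comment, and the AB totals come from the batting_stats output in the original report.

```python
def qualifying_minimum(per_game_rate: float, team_games: int = 162) -> int:
    """Season minimum implied by a per-team-game rate, rounded to the nearest whole number."""
    return round(per_game_rate * team_games)


batting_min = qualifying_minimum(3.1)   # plate appearances
pitching_min = qualifying_minimum(1.0)  # innings pitched

print(batting_min, pitching_min)

# AB totals from the issue's batting_stats output. PA is never lower than AB,
# so an AB total at or above the minimum implies a qualified season.
clemente_ab = {1969: 507, 1971: 522}
for season, ab in clemente_ab.items():
    print(season, ab >= batting_min)
```

This is consistent with 1969 and 1971 showing up under the default filter; a season that falls short of the minimum would presumably need qual=0 to appear.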

ref: https://github.com/jldbc/pybaseball/blob/master/docs/batting_stats.md

ss77995ss · Aug 14 '23 18:08