pybaseball

pitching_stats_range consistently failing to fetch data

Open JGHB opened this issue 2 years ago • 14 comments

I have discovered that pitching_stats_range will not fetch data. Calling the function for any date range regularly results in an index out of range error. Occasionally the call succeeds, but this is rare. Has anyone else encountered this bug?

Here is the error output for your reference: [screenshot: Screen Shot 2023-03-05 at 5 02 30 PM]

JGHB avatar Mar 05 '23 22:03 JGHB

The query works for me:

Python 3.10.6 (main, Aug 30 2022, 05:12:36) [Clang 13.1.6 (clang-1316.0.21.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybaseball import pitching_stats_range
>>> pitching_stats_range("2021-04-03")
                Name  Age  #days     Lev         Date             Tm  ...    PU   WHIP  BAbip   SO9  SO/W   mlbID
1    Tyler Alexander   26    700  Maj-AL  Apr 3, 2021        Detroit  ...  0.00  3.000  0.667  13.5   NaN  641302
2      Yency Almonte   27    700  Maj-NL  Apr 3, 2021       Colorado  ...  0.00  3.000  0.750  18.0   NaN  622075
3       Jose Alvarez   32    700  Maj-NL  Apr 3, 2021  San Francisco  ...  0.00  0.000  0.000   9.0   NaN  501625
4     Tyler Anderson   31    700  Maj-NL  Apr 3, 2021     Pittsburgh  ...  0.07  1.400  0.308  12.6   3.5  542881
5       Chris Archer   32    700  Maj-AL  Apr 3, 2021      Tampa Bay  ...  0.00  2.500  0.500   9.0   2.0  502042
..               ...  ...    ...     ...          ...            ...  ...   ...    ...    ...   ...   ...     ...
119      Matt Wisler   28    700  Maj-NL  Apr 3, 2021  San Francisco  ...   NaN  0.000    NaN  27.0   NaN  605538
120    Nick Wittgren   30    700  Maj-AL  Apr 3, 2021      Cleveland  ...  0.00  7.500  0.600   0.0   0.0  621295
121    Jake Woodford   24    700  Maj-NL  Apr 3, 2021      St. Louis  ...  0.00  1.714  0.400  11.6   1.5  663765
122  Brandon Workman   32    700  Maj-NL  Apr 3, 2021        Chicago  ...  0.00  0.000  0.000  18.0   NaN  519443
123     Huascar Ynoa   23    700  Maj-NL  Apr 3, 2021        Atlanta  ...  0.00  1.000  0.250   0.0   NaN  660623

[119 rows x 45 columns]

The error seems to indicate that it didn't fetch the table right. First, I'd confirm your internet is connected, then purge and/or disable your cache and try again:

from pybaseball import cache
cache.disable()

or

from pybaseball import cache
cache.purge()

If that fails, try printing the URL that's being called and/or the soup that's returned to see why there isn't a table in there.

tjburch avatar Mar 05 '23 22:03 tjburch

I get the same error. It happens if I call either batting_stats_range or pitching_stats_range more than four times. I tried both purging and disabling the cache, but neither solves the problem.

JTMachen avatar Mar 07 '23 20:03 JTMachen

Can you list what version you're running? If not 2.2.5, upgrade and confirm it still happens there.

tjburch avatar Mar 08 '23 14:03 tjburch

I'm running 2.2.5 and I got to seven function calls, but I still get the "list index out of range" error.

JTMachen avatar Mar 09 '23 21:03 JTMachen

You're probably hitting request limits. Try putting a sleep of a few seconds before calling a bunch in a loop.
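A minimal sketch of that idea, assuming the failure surfaces as an IndexError as reported above. `call_with_backoff` is a hypothetical helper, not part of pybaseball, and the delay values are illustrative guesses rather than known limits of the site:

```python
import time


def call_with_backoff(fetch, *args, retries=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch(*args), waiting longer after each IndexError before retrying.

    Hypothetical helper, not part of pybaseball. The delay doubles after every
    failed attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(retries + 1):
        try:
            return fetch(*args)
        except IndexError:
            if attempt == retries:
                raise  # out of retries, surface the original error
            sleep(base_delay * 2 ** attempt)
```

It could be used as `call_with_backoff(pitching_stats_range, "2021-04-03", "2021-04-03")`. If the site has already locked you out for hours, short retries like these probably won't help.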

tjburch avatar Mar 09 '23 21:03 tjburch

Any update here?

tjburch avatar Mar 14 '23 12:03 tjburch

Sleeping doesn't do it. I tried sleeping up to 20 seconds between each pull and still got the error. Even walking away for a couple of hours didn't help; I kept getting the index error. It might be a pulls-per-day limit.

JTMachen avatar Mar 14 '23 13:03 JTMachen

That is very strange. Usually the cooldown is like an hour or two.

I would try the following:

In the get_soup function, add a URL printout right after the URL is built (after line 21 here): just do print(url), then paste that into your browser to check that the URL you're passing is valid and that there's a table on the page.

If that's OK, then I'd add a print(soup) right before the error, after the get_soup call (after line 63 here). The output is going to be a mess, but it might have some information; if there's an error in the response, it's usually findable in there.
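If reading the whole soup printout is too messy, a rough check like the following can tell you whether the page contains any table at all. This is a standard-library sketch, assuming you have the page source as a string (for example `str(soup)`); `TableCounter` and `count_tables` are hypothetical names, not pybaseball functions:

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_tables(html):
    """Return how many <table> elements the page source contains."""
    counter = TableCounter()
    counter.feed(html)
    return counter.tables
```

A count of zero would be consistent with the "list index out of range" error, since there would be no table to pick out of the response.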

tjburch avatar Mar 14 '23 13:03 tjburch

I'm also experiencing this issue with batting_stats_range(). A few calls and then I get a list index out of range error.

If the requests are so tightly constricted here, does anyone know where I can get game level batting and pitching statistics? It seems like all the functions in this library automatically sum the data in given ranges...

tbryan2 avatar Mar 22 '23 21:03 tbryan2

I usually loop through the dates I'm looking for, calling the range functions for each single day and concat the dataframes into one large one. But there isn't a way to take the start date and end date and get the single games that way.
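Roughly, the per-day loop looks like the sketch below. `daily_dates` and `collect_daily` are hypothetical helper names, and `fetch` stands in for whichever range function you call with the same start and end date:

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """Return every date from start to end (inclusive) as 'YYYY-MM-DD' strings."""
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    days = (last - first).days
    return [(first + timedelta(days=n)).isoformat() for n in range(days + 1)]


def collect_daily(fetch, start, end):
    """Call fetch(day, day) for each day and pair the result with its date."""
    return [(day, fetch(day, day)) for day in daily_dates(start, end)]
```

With pybaseball, `fetch` would be `pitching_stats_range` or `batting_stats_range`, and the resulting frames could be combined with `pd.concat`. Given the limits discussed in this thread, you would also want a pause between calls.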

Update on the pulls. The URL keeps changing, but it's something similar to:

https://www.baseball-reference.com/leagues/daily.fcgi?user_team=&bust_cache=&type=b&lastndays=7&dates=fromandto&fromandto=2022-04-07.2022-04-07&level=mlb&franch=&stat=&stat_value=0

with the 2022-04-07 replaced by other dates, like 2022-06-15, 2022-9-10, etc., so the URLs are all valid. When I find the URL it breaks on and try to fetch and parse it with BeautifulSoup on my end, I end up with an "HTTPError: HTTP Error 429: Too Many Requests." This error doesn't go away unless I wait several hours before running it again.
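For anyone fetching the URL directly with urllib, here is a small sketch for spotting the 429 and deciding how long to wait. I don't know whether Baseball Reference sends a Retry-After header, so that part is an assumption; `rate_limit_wait` is a hypothetical helper and the one-hour default is a guess, not a documented limit:

```python
from urllib.error import HTTPError


def rate_limit_wait(exc, default=3600.0):
    """Return seconds to wait if exc is an HTTP 429, else None.

    Uses the Retry-After header when the server sends it as a number of
    seconds; otherwise falls back to `default` (an assumed value).
    """
    if not isinstance(exc, HTTPError) or exc.code != 429:
        return None
    retry_after = exc.headers.get("Retry-After") if exc.headers else None
    try:
        return float(retry_after)
    except (TypeError, ValueError):
        # Header missing, or given as an HTTP date rather than seconds
        return default
```

You would call it from an `except HTTPError as exc:` block around the request and sleep for the returned number of seconds before trying again.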

JTMachen avatar Mar 22 '23 22:03 JTMachen

Right, I'm doing the concat method you mentioned. I'm even sleeping for 5-10 seconds and outputting to a local Postgres database.

No matter what, if I make more than 4-5 requests I am getting the IndexError. Is Baseball Reference really that stingy with requests?

tbryan2 avatar Mar 24 '23 02:03 tbryan2

I was also having the same issue with batting_stats_range, but sleeping seems to have fixed it for me. Here's what's working for me; the same code works for batting_stats_range. Before adding the sleep, I had the same experience: I could fetch four or five times, but then I would get locked out for a number of hours.

import time

import pandas as pd
from pybaseball import pitching_stats_range

frames = []  # one DataFrame per successfully fetched day

for month in range(4, 7):  # April through June
    for i in range(1, 32):
        # April and June have only 30 days
        if month in (4, 6) and i == 31:
            continue
        time.sleep(10)  # pause between requests to avoid being rate limited
        day = f"2021-{month:02d}-{i:02d}"
        try:
            temp = pitching_stats_range(day, day)
        except Exception:
            print(day + " Failed")
            continue
        if len(temp) > 0:
            print(day + " Success")
            frames.append(temp.assign(Date=day))

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
all_pitching_stats = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

JGHB avatar Mar 27 '23 17:03 JGHB

I've gotten about halfway through a full season (~100 loops) before this fails, and I'm sleeping 15 seconds between calls. Will keep trying things.

klatta87 avatar Apr 27 '23 17:04 klatta87

This issue has gotten worse. I can only pull a single day's worth of data. I tried sleeping for upwards of 15 seconds, but it only pulls a single day's worth of data before throwing the same error.

JTMachen avatar Jul 30 '23 06:07 JTMachen