
add statcast cliplink

Open Maradonna90 opened this issue 3 years ago • 8 comments

It would be nice to have a link, or at least the hash value, that points to the video clip for a specific pitch. Right now I have to redo the search on the site to obtain it.

Maradonna90 avatar Aug 29 '20 22:08 Maradonna90

Is it available in the CSV export somewhere?

schorrm avatar Aug 30 '20 10:08 schorrm

I've tried to make this. The best I could do was a pretty clunky process of scraping Gameday XML data to get the play ID used in Baseball Savant's video pages, and then joining that back to the Statcast event. It went something like:

  • Find a day's games from e.g. https://gd2.mlb.com/components/game/mlb/year_2020/month_07/day_25/scoreboard.xml
  • Find those games' events from e.g. http://gd2.mlb.com/components/game/mlb/year_2020/month_08/day_12/gid_2020_08_12_miamlb_tormlb_1/inning/inning_all.xml (where the game ids can be found in the previous bullet's XML)
  • Take the GUIDs from those games (atbat level wasn't hard to match with statcast, haven't tried pitch level before)
  • Join those with Statcast (the columns ['game_date', 'home_team', 'away_team', 'at_bat_number'] collectively seemed like a unique combo to match a Gameday GUID with a Statcast batted ball event)
  • Plug the GUID into the Baseball Savant URL structure (e.g. https://baseballsavant.mlb.com/sporty-videos?playId=f9329f41-5c6a-431e-b001-a1a4ec7fb846, where playId is the GUID)

This approach was IMO way too slow to include in the default statcast scraper, but might be fine as a standalone gameday id or replay link scraper. Or maybe there's a faster way to get the GUIDs if you already know the statcast game id?
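
The join and URL steps from the list above can be sketched roughly as follows. This is a minimal illustration, not the scraper itself: it assumes the Gameday GUIDs have already been collected into rows carrying the four join columns plus a hypothetical `play_guid` field, and it uses plain dicts so it runs without any dependencies. The sample rows are made up, and the GUID is just the example from the list, not necessarily from that game.

```python
# Columns that together identify one at-bat, as described in the list above.
JOIN_KEYS = ("game_date", "home_team", "away_team", "at_bat_number")
VIDEO_URL = "https://baseballsavant.mlb.com/sporty-videos?playId={guid}"


def key_of(row):
    """Build the composite join key for one row."""
    return tuple(row[k] for k in JOIN_KEYS)


def attach_clip_links(statcast_rows, gameday_rows):
    """Left-join Gameday GUIDs onto Statcast rows and add a clip_url field.

    `play_guid` is a hypothetical field name for the scraped Gameday GUID.
    Rows with no matching GUID get clip_url = None.
    """
    guid_by_key = {key_of(r): r["play_guid"] for r in gameday_rows}
    linked = []
    for row in statcast_rows:
        guid = guid_by_key.get(key_of(row))
        url = VIDEO_URL.format(guid=guid) if guid else None
        linked.append({**row, "clip_url": url})
    return linked


# Illustrative data only.
statcast_rows = [
    {"game_date": "2020-08-12", "home_team": "TOR", "away_team": "MIA",
     "at_bat_number": 1, "events": "single"},
    {"game_date": "2020-08-12", "home_team": "TOR", "away_team": "MIA",
     "at_bat_number": 2, "events": "strikeout"},
]
gameday_rows = [
    {"game_date": "2020-08-12", "home_team": "TOR", "away_team": "MIA",
     "at_bat_number": 1, "play_guid": "f9329f41-5c6a-431e-b001-a1a4ec7fb846"},
]

linked = attach_clip_links(statcast_rows, gameday_rows)
```

With a pandas DataFrame the same step would be a left `merge` on the same four columns.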

jldbc avatar Aug 30 '20 19:08 jldbc

I think the best way to do this might actually be to ask Tom Tango nicely on Twitter.

schorrm avatar Aug 30 '20 20:08 schorrm

They must have that GUID on the backend somewhere. If someone wants to try to convince Tango... Anyway, yeah. I don't see this exposed anywhere. From @jldbc's description, blah, someone can go implement that if they want to and I'll merge it happily, but 🤢

schorrm avatar Aug 31 '20 06:08 schorrm

Well, I solved it for now by making a very specific search request to Baseball Savant with the pitch data I am given. It looks like this:

import requests
from bs4 import BeautifulSoup
from pybaseball import playerid_lookup

def savant_clip(pitch):
    # Search URL template; the [placeholders] are filled in from the pitch below.
    search_url = "https://baseballsavant.mlb.com/statcast_search?hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=[count]%7C&hfSea=[season]%7C&hfSit=&player_type=pitcher&hfOuts=[outs]%7C&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=[date_min]&game_date_lt=[date_max]&hfInfield=&team=&position=&hfOutfield=&hfRO=&home_road=&hfFlag=&hfPull=&pitchers_lookup%5B%5D=[pitcher_id]&metric_1=api_p_release_speed&metric_1_gt=[min_speed]&metric_1_lt=[max_speed]&hfInn=[Inning]|&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_pas=0&type=details&player_id=[pitcher_id]"
    # Assumes player_name is "First Last"; playerid_lookup takes (last, first).
    p_name = pitch['player_name'].split()
    p_id = playerid_lookup(p_name[1], p_name[0])['key_mlbam'].values
    # date_min/date_max were not defined in the original snippet. Assumption:
    # a one-day window on the pitch's game_date, formatted as YYYY-MM-DD.
    game_date = str(pitch['game_date'])[:10]
    pitch_map = {
        "[Inning]": int(pitch['inning']),
        "[pitcher_id]": p_id[0],
        "[date_min]": game_date,
        "[date_max]": game_date,
        "[count]": str(int(pitch['balls']))+str(int(pitch['strikes'])),
        "[season]": game_date[:4],  # was hard-coded to "2020"
        "[outs]": int(pitch['outs_when_up']),
        "[min_speed]": int(pitch["release_speed"]-1),
        "[max_speed]": int(pitch["release_speed"]+1)
    }
    for k, v in pitch_map.items():
        search_url = search_url.replace(k, str(v))
    site = requests.get(search_url)
    soup = BeautifulSoup(site.text, features="lxml")
    for link in soup.find_all('a'):
        href = link.get('href')
        # Skip anchors without a site-relative href instead of crashing on them.
        if not href or not href.startswith('/'):
            continue
        clip_savant = requests.get("https://baseballsavant.mlb.com"+href)
        clip_soup = BeautifulSoup(clip_savant.text, features='lxml')
        video_obj = clip_soup.find("video", id="sporty")
        if video_obj is None:
            continue
        source = video_obj.find('source')
        if source is not None:
            # Return the first clip found.
            return source.get('src')
    # No link on the results page led to a video.
    return None

It's of course not fully finished, and one could include more parameters to reduce the chance of getting more than one result. I haven't checked performance, but it should be OK. Plus, this could be executed in parallel to the original search request.
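
On the point about running things in parallel: each lookup is a couple of HTTP requests, so a thread pool is one plausible way to overlap them. The sketch below is only an illustration under that assumption. `clips_in_parallel` and `fake_lookup` are hypothetical names, and the stand-in lookup exists only so the example runs offline; in real use `savant_clip` would be passed in its place.

```python
from concurrent.futures import ThreadPoolExecutor


def clips_in_parallel(pitches, lookup, max_workers=4):
    """Call lookup(pitch) for every pitch on a thread pool.

    Results come back in the same order as the input. A lookup that raises
    is recorded as None so one bad pitch does not abort the whole batch.
    """
    def safe(pitch):
        try:
            return lookup(pitch)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, pitches))


# Stand-in for savant_clip so the sketch makes no network calls.
def fake_lookup(pitch):
    if pitch["release_speed"] is None:
        raise ValueError("no speed recorded")
    return f"clip-for-inning-{pitch['inning']}"


results = clips_in_parallel(
    [{"inning": 1, "release_speed": 95.1}, {"inning": 2, "release_speed": None}],
    fake_lookup,
)
```

Keeping `max_workers` small seems sensible here, since every worker is hitting the same site.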

Maradonna90 avatar Aug 31 '20 22:08 Maradonna90

@schorrm I'm friendly with Tango on twitter and can ask him for help here if still needed, although it looks like @Maradonna90 may have solved this?

Let me know if I should ping him.

kmedved avatar Aug 31 '20 23:08 kmedved

@kmedved I find @Maradonna90's approach here very interesting, I am inclined to leave it open at least for a bit (especially since I never really thought we needed the feature until he asked for it :) )

schorrm avatar Aug 31 '20 23:08 schorrm

I know this issue is over a year old, but I figured I'd chime in and share what I've found.

I built a wrapper to use MLB Video Room's GraphQL API - querying for video feeds based on a few key pitch identifiers (similar to what @Maradonna90 shared above). Was able to get multiple video feeds & resolutions per pitch, which was cool.

I realized there is a much simpler way to do this:

  1. Using the normal Statcast CSV data, find the pitch you want a clip for.
  2. Pass the appropriate game_pk to this endpoint: https://baseballsavant.mlb.com/gf?game_pk={GAME_PK}
  3. In the response, "team_home" and "team_away" list Statcast data for each pitch, including the "play_id".
  4. Use this URL to get to the video page https://baseballsavant.mlb.com/sporty-videos?playId={PLAY_ID}
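
The four steps above could be sketched like this. The two URLs and the "team_home", "team_away" and "play_id" keys come from the steps; everything else is an assumption for illustration. In particular, the `ab_number` and `pitch_number` keys used to pick out a pitch are guesses that should be checked against a real response, and the sample feed is made up so the example runs without a network call.

```python
import json
from urllib.request import urlopen

GF_URL = "https://baseballsavant.mlb.com/gf?game_pk={game_pk}"
VIDEO_URL = "https://baseballsavant.mlb.com/sporty-videos?playId={play_id}"


def fetch_game_feed(game_pk):
    """Download and parse the game feed JSON (network call, not run here)."""
    with urlopen(GF_URL.format(game_pk=game_pk), timeout=30) as resp:
        return json.load(resp)


def find_clip_url(feed, at_bat_number, pitch_number):
    """Return the video page URL for one pitch, or None if it is not found.

    Assumes each pitch entry has "ab_number" and "pitch_number" keys;
    those two names are unverified guesses.
    """
    for side in ("team_home", "team_away"):
        for pitch in feed.get(side, []):
            if (pitch.get("ab_number") == at_bat_number
                    and pitch.get("pitch_number") == pitch_number):
                play_id = pitch.get("play_id")
                return VIDEO_URL.format(play_id=play_id) if play_id else None
    return None


# Made-up feed in the assumed shape, for offline illustration only.
sample_feed = {
    "team_home": [
        {"ab_number": 1, "pitch_number": 1, "play_id": "aaaa-1111"},
        {"ab_number": 1, "pitch_number": 2,
         "play_id": "f9329f41-5c6a-431e-b001-a1a4ec7fb846"},
    ],
    "team_away": [
        {"ab_number": 2, "pitch_number": 1, "play_id": "bbbb-2222"},
    ],
}

url = find_clip_url(sample_feed, at_bat_number=1, pitch_number=2)
```

For a real game, `find_clip_url(fetch_game_feed(game_pk), ...)` would use the `game_pk`, `at_bat_number` and `pitch_number` values from the Statcast row.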

coperyan avatar Nov 05 '21 09:11 coperyan