pybaseball
Created a new function to retrieve box scores from Baseball Reference. Quick example:
from datetime import datetime
# box_score is the new function added in this PR
datetime_object = datetime.strptime('May 05 2021', '%b %d %Y')
visitor_batting_df, home_batting_df, visitor_pitching_df, home_pitching_df = box_score('OAK', datetime_object, 0)
# Print the first row of each pitching table
print(f"{visitor_pitching_df.loc[0, 'Pitching']} vs {home_pitching_df.loc[0, 'Pitching']}")
I had to parse through HTML comments to get the data I wanted because of how bbref sets up its box score pages. Alternatively, I have a version that uses Selenium and a ChromeDriver, which works a little cleaner (the tables aren't in comments after the page loads), but for now I am submitting this version to avoid a new dependency.
@TheCleric @bdilday if either of you can take a look?
I think I'd rather have the cleaner Selenium version. It doesn't seem like a crazy dependency for a library whose job is largely to scrape the web.
@schorrm @TheCleric any thoughts?
@bdilday I'm not a fan of Selenium for this, since the page doesn't need anything like JavaScript. As it is, we've done similar things with just the XPath parser, which can be used to parse into HTML comments.
EDIT: I found another PR where I provided some example code for something similar: https://github.com/jldbc/pybaseball/pull/137#discussion_r496769328
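For anyone following along, here is a minimal sketch of the general idea of pulling a table out of an HTML comment without a browser. It is not the code from this PR or from the linked discussion: it swaps in Python's standard-library html.parser instead of the XPath parser so that it runs with no extra dependencies, and the class names, the tables_from_comments helper, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the raw text of every HTML comment in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class TableRowParser(HTMLParser):
    """Collect the cell text of each <tr> in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def tables_from_comments(html):
    """Return the rows of every commented-out fragment that holds a table."""
    collector = CommentCollector()
    collector.feed(html)
    tables = []
    for comment in collector.comments:
        if "<table" not in comment:
            continue  # ordinary comment, not a hidden table
        parser = TableRowParser()
        parser.feed(comment)
        tables.append(parser.rows)
    return tables


# Hypothetical page: the table exists only inside a comment.
SAMPLE_HTML = """
<div id="all_pitching">
<!--
<table id="pitching"><tr><th>Pitching</th><th>IP</th></tr>
<tr><td>Example Pitcher</td><td>6.0</td></tr></table>
-->
</div>
"""

print(tables_from_comments(SAMPLE_HTML))
```

The point is just that the commented-out markup is ordinary HTML once it is extracted, so it can be fed back through the same parser; no JavaScript execution is involved.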
I was playing around with the Selenium version and have changed my mind; I now agree with not using it. The main reason for my change of heart is that I didn't fully realize how much slower Selenium was until I ran a batch of calls. For example, getting all 162 box scores for the Dodgers' games this past season took the non-Selenium version 62 seconds, while the Selenium version took around 15 minutes.
For 15 min vs 62 seconds, that's a pretty clear winner here, even if Selenium would be cleaner.
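To put those numbers on a per-game basis, here is a quick back-of-the-envelope calculation. It uses only the figures reported above; the Selenium total is "around 15 minutes", so the results are approximate, and the variable and function names are just for illustration.

```python
GAMES = 162
PLAIN_SECONDS = 62            # non-Selenium batch, as reported above
SELENIUM_SECONDS = 15 * 60    # "around 15 minutes", so approximate


def per_game(total_seconds, games=GAMES):
    """Average seconds spent per box score."""
    return total_seconds / games


slowdown = SELENIUM_SECONDS / PLAIN_SECONDS

print(f"non-Selenium: {per_game(PLAIN_SECONDS):.2f} s/game")
print(f"Selenium:     {per_game(SELENIUM_SECONDS):.2f} s/game")
print(f"Selenium is roughly {slowdown:.1f}x slower")
```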
This has been open for a year. Are we merging this into the project or not? @schorrm @tjburch