[feature] Scrape the yearly rookies page

Open kflorence opened this issue 1 year ago • 15 comments

Here is the 2023 page: https://www.baseball-reference.com/leagues/majors/2023-rookies.shtml

At a minimum it should probably include player name, team and ID. It would be great to also know which players exceeded rookie limits, but that information does not exist on this page. It seems to only exist on individual player pages.

kflorence avatar Feb 22 '24 22:02 kflorence

What exactly do you need? I can help.

luhcartimods avatar Mar 21 '24 13:03 luhcartimods

Hey @luhcartimods -- I need a programmatic way of getting rookie eligibility for a list of players. As a backup, since that doesn't seem available on bbref as far as I can tell (except on individual player pages), if I were to get the IP/PA data off the rookie eligibility page I can calculate it myself (ignoring time on roster requirements, which I also don't see anywhere).
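
For reference, the backup calculation could look roughly like this. It is a minimal sketch, not anything pybaseball provides: the `CareerTotals` record and `is_rookie_eligible` helper are hypothetical names, and the thresholds assume MLB's published rookie limits of 130 at-bats or 50 innings pitched in prior seasons. The roster-time requirement is ignored, as noted above. The rookies page reports PA rather than AB, so feeding PA into this check can flag a player as over the limit too early.

```python
from dataclasses import dataclass

# Assumed thresholds: a player stays a rookie unless he has EXCEEDED
# 130 at-bats or 50 innings pitched in previous seasons.
# Days on the active roster are deliberately not modeled here.
MAX_ROOKIE_AT_BATS = 130
MAX_ROOKIE_INNINGS = 50.0


@dataclass
class CareerTotals:
    """Hypothetical record of career totals before the season in question."""
    name: str
    at_bats: int = 0
    innings_pitched: float = 0.0  # true innings, e.g. 50 1/3 stored as 50.33


def is_rookie_eligible(totals: CareerTotals) -> bool:
    """Return True if neither playing-time limit has been exceeded."""
    return (totals.at_bats <= MAX_ROOKIE_AT_BATS
            and totals.innings_pitched <= MAX_ROOKIE_INNINGS)


# Illustrative inputs only; these numbers are made up for the example.
print(is_rookie_eligible(CareerTotals("Hitter A", at_bats=120)))
print(is_rookie_eligible(CareerTotals("Pitcher B", innings_pitched=61.0)))
```

Keeping the limits as module-level constants makes it easy to swap in league-specific rules.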

kflorence avatar Mar 21 '24 14:03 kflorence

Can't you download the data directly from the table and import it somehow? Or is that not what you're looking for?

luhcartimods avatar Mar 21 '24 17:03 luhcartimods

If it's currently possible to do with this tool, I'm not sure how to do it. Do you have an example I can try?

kflorence avatar Mar 21 '24 20:03 kflorence

You can download data directly from Sports Reference [image], and I think you can sort it with the pandas module, though that may not work!

luhcartimods avatar Mar 22 '24 06:03 luhcartimods

I don't know if this is exactly what you're looking for, but here is some general data. I'm not sure what you would do with it, though:

[code]
Name Yrs PA Debut Age Tm
Andrew Abbott 1 0 Jun 5 2023 24 CIN
Logan Allen 1 0 Apr 23 2023 24 CLE
Jake Alu 1 175 May 9 2023 26 WSN
Francisco Alvarez 2 437 Sep 30 2022 21 NYM
Miguel Amaya 1 156 May 4 2023 24 CHC
Grant Anderson 1 0 May 30 2023 26 TEX
Gabriel Arias 2 402 Apr 20 2022 23 CLE
Javier Assad 2 0 Aug 23 2022 25 CHC
Pedro Avila 4 3 Apr 11 2019 26 SDP
Sam Bachman 1 0 May 26 2023 23 LAA
Ji Hwan Bae 2 408 Sep 23 2022 23 PIT
Patrick Bailey 1 353 May 19 2023 24 SFG
Luken Baker 1 99 Jun 4 2023 26 STL
Peyton Battenfield 1 0 Apr 12 2023 25 CLE
Brett Baty 2 431 Aug 17 2022 23 NYM
Tristan Beck 1 0 Apr 20 2023 27 SFG
Brennan Bernardino 2 0 Jul 31 2022 31 BOS
Tanner Bibee 1 0 Apr 26 2023 24 CLE
Osvaldo Bido 1 0 Jun 14 2023 27 PIT
Dairon Blanco 2 145 May 20 2022 30 KCR
Ronel Blanco 2 0 Apr 8 2022 29 HOU
Cody Bradford 1 0 May 15 2023 25 TEX
Taj Bradley 1 0 Apr 12 2023 22 TBR
Will Brennan 2 500 Sep 21 2022 25 CLE
Jhony Brito 2 0 Apr 2 2023 25 NYY
Hunter Brown 2 0 Sep 5 2022 24 HOU
Alec Burleson 2 400 Sep 8 2022 24 STL
Jose Caballero 1 280 Apr 15 2023 26 SEA
Isaiah Campbell 1 0 Jul 7 2023 25 SEA
Yennier Cano 2 0 May 11 2022 29 BAL
Dominic Canzone 1 182 Jul 8 2023 25 TOT
Conner Capel 2 145 Jun 27 2022 26 OAK
Drew Carlton 3 0 Sep 4 2021 27 SDP
Corbin Carroll 2 760 Aug 29 2022 22 ARI
Triston Casas 2 597 Sep 4 2022 23 BOS
Cade Cavalli 1 0 Aug 26 2022 24
Oscar Colas 1 263 Mar 30 2023 24 CHW
Tom Cosgrove 2 0 Apr 29 2023 27 SDP
Austin Cox 1 0 May 4 2023 26 KCR
Fernando Cruz 2 0 Sep 2 2022 33 CIN
Xzavion Curry 2 0 Aug 15 2022 24 CLE
Tyler Cyr 2 0 Aug 21 2022 30 LAD
Davis Daniel 1 0 Sep 7 2023 26 LAA
Henry Davis 1 255 Jun 19 2023 23 PIT
Noah Davis 2 0 Oct 5 2022 26 COL
Elly De La Cruz 1 427 Jun 6 2023 21 CIN
Jonny DeLuca 1 45 Jun 7 2023 24 LAD
Jhonathan Diaz 3 0 Sep 17 2021 26 LAA
Jordan Diaz 2 344 Sep 18 2022 22 OAK
Yainer Diaz 2 386 Sep 2 2022 24 HOU
Brenton Doyle 1 431 Apr 24 2023 25 COL
Christian Encarnacion-Strand 1 241 Jul 17 2023 23 CIN
Mason Englert 1 0 Mar 30 2023 23 DET
Lucas Erceg 1 0 May 19 2023 28 OAK
Jeremiah Estrada 3 0 Aug 30 2022 24 CHC
Angel Felipe 1 0 Jul 7 2023 25 OAK
Freddy Fermin 2 242 Jul 15 2022 28 KCR
Jose Ferrer 1 0 Jul 1 2023 23 WSN
J.P. France 1 0 May 6 2023 28 HOU
Bowden Francis 2 0 Apr 27 2022 27 TOR
Sal Frelick 1 223 Jul 22 2023 23 MIL
David Fry 1 113 May 1 2023 27 CLE
Shintaro Fujinami 1 0 Apr 1 2023 29 TOT
Deivi Garcia 3 0 Aug 30 2020 24 TOT
Maikel Garcia 2 538 Jul 15 2022 23 KCR
Robert Garcia 1 0 Jul 14 2023 27 TOT
Zack Gelof 1 300 Jul 14 2023 23 OAK
Luis Gil 2 0 Aug 3 2021 25
Michael Grove 3 0 May 15 2022 26 LAD
Dalton Guthrie 2 56 Sep 6 2022 27 PHI
Ian Hamilton 4 0 Aug 31 2018 28 NYY
Hogan Harris 1 0 Apr 14 2023 26 OAK
Grant Hartwig 1 0 Jun 19 2023 25 NYM
Gunnar Henderson 2 754 Aug 31 2022 22 BAL
Jose Hernandez 1 0 Apr 1 2023 25 PIT
Sean Hjelle 2 0 May 6 2022 26 SFG
Bryan Hoeing 2 0 Aug 20 2022 26 MIA
Gavin Hollowell 2 0 Sep 19 2022 25 COL
Tyler Holton 2 0 Apr 28 2022 27 DET
Brent Honeywell Jr. 2 0 Apr 11 2021 28 TOT
Jake Irvin 1 0 May 3 2023 26 WSN
Andre Jackson 3 4 Aug 16 2021 27 TOT
Alek Jacob 1 0 Jul 15 2023 25 SDP
Joe Jacques 1 0 Jun 12 2023 28 BOS
Drey Jameson 2 0 Sep 15 2022 25 ARI
Bryce Johnson 2 67 Aug 3 2022 27 SFG
Nolan Jones 2 518 Jul 8 2022 25 COL
Ben Joyce 1 0 May 29 2023 22 LAA
Edouard Julien 1 408 Apr 12 2023 24 MIN
Corey Julks 1 323 Mar 31 2023 27 HOU
Josh Jung 2 617 Sep 9 2022 25 TEX
Kevin Kelly 1 0 Apr 1 2023 25 TBR
Michael Kelly 2 0 Jun 16 2022 30 CLE
Zack Kelly 2 0 Aug 29 2022 28 BOS
Ray Kerr 2 0 Apr 24 2022 28 SDP
Grae Kessinger 1 45 Jun 7 2023 25 HOU
Joe La Sorsa 1 0 May 29 2023 25 TOT
Casey Legumina 1 0 Apr 15 2023 26 CIN
Royce Lewis 2 280 May 6 2022 24 MIN
Matthew Liberatore 2 0 May 21 2022 23 STL
Otto Lopez 2 11 Aug 17 2021 24
Nathan Lukes 1 31 Mar 30 2023 28 TOR
Alec Marsh 1 0 Jun 30 2023 25 KCR
Miles Mastrobuoni 2 166 Sep 22 2022 27 CHC
Luis Matos 1 253 Jun 14 2023 21 SFG
James McArthur 1 0 Jun 28 2023 26 KCR
Easton McGee 2 0 Oct 2 2022 25 SEA
Scott McGough 2 0 Aug 20 2015 33 ARI
Matt McLain 1 403 May 15 2023 23 CIN
Luis Medina 1 0 Apr 26 2023 24 OAK
Bobby Miller 1 0 May 23 2023 24 LAD
Bryce Miller 1 0 May 2 2023 24 SEA
Mason Miller 1 0 Apr 19 2023 24 OAK
Tyson Miller 3 0 Aug 17 2020 27 TOT
Garrett Mitchell 2 141 Aug 27 2022 24 MIL
Carmen Mlodzinski 1 0 Jun 16 2023 24 PIT
Andruw Monasterio 1 315 May 28 2023 26 MIL
Bryce Montes de Oca 1 0 Sep 3 2022 27
Kyle Muller 3 13 Jun 16 2021 25 OAK
Chris Murphy 1 0 Jun 7 2023 25 BOS
Parker Mushinski 2 0 Apr 17 2022 27 HOU
James Naile 2 0 Jun 27 2022 30 STL
Bo Naylor 2 238 Oct 1 2022 23 CLE
Ryne Nelson 2 0 Sep 5 2022 25 ARI
Zach Neto 1 329 Apr 15 2023 22 LAA
Ryan Noda 1 495 Mar 30 2023 27 OAK
Logan O'Hoppe 2 215 Sep 28 2022 23 LAA
Reese Olson 1 0 Jun 2 2023 23 DET
Luis Ortiz 4 0 Sep 7 2018 27 PHI
Luis Ortiz 2 0 Sep 13 2022 24 PIT
James Outman 3 593 Jul 31 2022 26 LAD
Daniel Palencia 1 0 Jul 4 2023 23 CHC
Ryan Pepiot 2 0 May 11 2022 25 LAD
Oswald Peraza 2 248 Sep 2 2022 23 NYY
Carlos Perez 2 71 Aug 26 2022 26 CHW
Eury Perez 1 0 May 12 2023 20 MIA
Blake Perkins 1 168 Apr 19 2023 26 MIL
Brandon Pfaadt 1 0 May 3 2023 24 ARI
Israel Pineda 1 14 Sep 11 2022 23
Heliot Ramos 2 82 Apr 10 2022 23 SFG
Henry Ramos 2 141 Sep 5 2021 31 CIN
Zach Remillard 1 160 Jun 17 2023 29 CHW
Endy Rodriguez 1 204 Jul 17 2023 23 PIT
Grayson Rodriguez 1 0 Apr 5 2023 23 BAL
Johan Rojas 1 164 Jul 15 2023 22 PHI
Eguy Rosario 3 46 Aug 26 2022 23 SDP
Esteury Ruiz 2 533 Jul 12 2022 24 OAK
Blake Sabol 1 344 Mar 30 2023 25 SFG
Cesar Salazar 1 19 Apr 2 2023 27 HOU
Cole Sands 2 0 May 1 2022 25 MIN
Gregory Santos 3 0 Apr 22 2021 23 CHW
Casey Schmitt 1 277 May 9 2023 24 SFG
Jesse Scholtens 1 0 Apr 7 2023 29 CHW
Connor Seabold 3 0 Sep 11 2021 27 COL
Kodai Senga 1 0 Apr 2 2023 30 NYM
Emmet Sheehan 1 0 Jun 16 2023 23 LAD
Jared Shuster 1 0 Apr 2 2023 24 ATL
Chase Silseth 2 0 May 13 2022 23 LAA
Tyler Soderstrom 1 138 Jul 14 2023 21 OAK
George Soriano 1 0 Apr 16 2023 24 MIA
Jose Soriano 1 0 Jun 3 2023 24 LAA
Lenyn Sosa 2 209 Jun 23 2022 23 CHW
Spencer Steer 2 773 Sep 2 2022 25 CIN
Brett Sullivan 1 86 Apr 18 2023 29 SDP
Thomas Szapucki 2 1 Jun 30 2021 27
Freddy Tarnok 2 0 Aug 17 2022 24 OAK
Cody Thomas 2 78 Sep 1 2022 28 OAK
Michael Toglia 2 272 Aug 30 2022 24 COL
Justin Topa 4 1 Sep 1 2020 32 SEA
Ezequiel Tovar 2 650 Sep 23 2022 21 COL
Jared Triolo 1 209 Jun 28 2023 25 PIT
Brice Turang 1 448 Mar 30 2023 23 MIL
Abner Uribe 1 0 Jul 8 2023 23 MIL
Enmanuel Valdez 1 149 Apr 19 2023 24 BOS
Carlos Vargas 1 0 Mar 30 2023 23 ARI
Miguel Vargas 2 354 Aug 3 2022 23 LAD
Gus Varland 2 0 Mar 30 2023 26 TOT
Louie Varland 2 0 Sep 7 2022 25 MIN
Mark Vientos 2 274 Sep 11 2022 23 NYM
Anthony Volpe 1 601 Mar 30 2023 22 NYY
Cole Waites 2 0 Sep 13 2022 25 SFG
Ken Waldichuk 2 0 Sep 1 2022 25 OAK
Jordan Walker 1 465 Mar 30 2023 21 STL
Josh Walker 1 0 May 16 2023 28 NYM
Ryan Walker 1 0 May 21 2023 27 SFG
Matt Wallner 2 319 Sep 17 2022 25 MIN
Thaddeus Ward 1 0 Apr 1 2023 26 WSN
Zack Weiss 3 0 Apr 12 2018 31 TOT
Greg Weissert 2 0 Aug 25 2022 28 NYY
Joey Wentz 2 0 May 11 2022 25 DET
Hayden Wesneski 2 0 Sep 6 2022 25 CHC
Jordan Westburg 1 228 Jun 26 2023 24 BAL
Brendan White 1 0 Jun 14 2023 24 DET
Joey Wiemer 1 410 Apr 1 2023 24 MIL
Gavin Williams 1 0 Jun 21 2023 23 CLE
Brandon Williamson 1 0 May 16 2023 25 CIN
Bryan Woo 1 0 Jun 3 2023 23 SEA
Masataka Yoshida 1 580 Mar 30 2023 29 BOS
Angel Zerpa 3 0 Sep 30 2021 23 KCR
Totals 318 24272
[/code]

Provided by [url=https://www.sports-reference.com/sharing.html?utm_source=direct&utm_medium=Share&utm_campaign=ShareTool]Baseball-Reference.com[/url]: [url=https://www.baseball-reference.com/leagues/majors/2023-rookies.shtml?sr&utm_source=direct&utm_medium=Share&utm_campaign=ShareTool#misc_batting]View Original Table[/url] Generated 3/22/2024.
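
If anyone wants to load text like this into Python, here is a rough sketch using only the standard library. It assumes the pasted export has one record per line in the order Name, Yrs, PA, Debut, Age, Tm, with the team sometimes missing. The `ROW` pattern and `parse_row` helper are hypothetical names made up for this example, not part of pybaseball or the Sports Reference share tool.

```python
import re

# One record per line: name, years, PA, debut date, age, optional team code.
# The name is matched lazily so multi-word names ("Elly De La Cruz") work.
ROW = re.compile(
    r"^(?P<name>.+?) (?P<yrs>\d+) (?P<pa>\d+) "
    r"(?P<debut>[A-Z][a-z]{2} \d{1,2} \d{4}) (?P<age>\d+)"
    r"(?: (?P<team>[A-Z]{3}))?$"
)


def parse_row(line):
    """Parse one line into a dict, or return None for header/totals lines."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    for key in ("yrs", "pa", "age"):
        row[key] = int(row[key])
    return row


sample = [
    "Name Yrs PA Debut Age Tm",
    "Jake Alu 1 175 May 9 2023 26 WSN",
    "Cade Cavalli 1 0 Aug 26 2022 24",
    "Brent Honeywell Jr. 2 0 Apr 11 2021 28 TOT",
    "Totals 318 24272",
]
rows = [row for row in map(parse_row, sample) if row is not None]
rows.sort(key=lambda row: row["pa"], reverse=True)  # most PA first
print([row["name"] for row in rows])
```

The header and totals lines do not match the pattern, so they drop out without special handling.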

luhcartimods avatar Mar 22 '24 06:03 luhcartimods

Interesting, I didn't know you could export CSV directly from the site. I assume that requires an account? I'll check it out.

kflorence avatar Mar 22 '24 16:03 kflorence

In case you didn't know already, you can also export just the selected data, like this: [image] You can choose which data you want, and it will come out in text format like what I sent before, which would be easier to import.

luhcartimods avatar Mar 22 '24 18:03 luhcartimods

OK, I made a Python script for PA and I'm making one for IP now.

luhcartimods avatar Mar 23 '24 09:03 luhcartimods

I made both scripts. How can I share them with you? [image] This is how it works. I don't know if this is what you want, though.

luhcartimods avatar Mar 23 '24 09:03 luhcartimods

Sorry for the slow reply @luhcartimods -- had some folks visiting over the weekend. You should stick those python scripts into a gist though so that others can reference them as needed.

I think the CSV download directly from baseball-reference will work for my use case, thanks for pointing that out. I'll be using it as part of a larger ETL job that calculates keeper prices for the upcoming season based on players' prices in the previous season plus inflation. In our league, rookie-eligible players do not gain any inflation, which is why I needed a programmatic way to look that up. I will need to use something like the player ID map on https://www.smartfantasybaseball.com/tools/ to link these players to Yahoo player IDs.
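
As a rough illustration of that step, here is a minimal sketch. Everything in it is an assumption made for the example: the `name` and `yahoo_id` keys stand in for whatever columns the ID map actually uses, and the pricing formula (previous price times one plus the inflation rate) is a placeholder for the league's real rule.

```python
def normalize(name):
    """Lowercase, drop periods, and collapse whitespace for loose matching."""
    return " ".join(name.lower().replace(".", "").split())


def link_yahoo_ids(player_names, id_map_rows):
    """Map each player name to a Yahoo ID, or None when no match is found.

    id_map_rows is a list of dicts with hypothetical 'name' and
    'yahoo_id' keys, e.g. rows read from a CSV with csv.DictReader.
    """
    by_name = {normalize(row["name"]): row["yahoo_id"] for row in id_map_rows}
    return {name: by_name.get(normalize(name)) for name in player_names}


def keeper_price(previous_price, inflation_rate, rookie_eligible):
    """Rookie-eligible players keep last season's price; others gain inflation."""
    if rookie_eligible:
        return previous_price
    return round(previous_price * (1 + inflation_rate), 2)


# Made-up ID and prices, for illustration only.
ids = link_yahoo_ids(["J.P. France"], [{"name": "JP France", "yahoo_id": "0000"}])
print(ids)
print(keeper_price(10, 0.10, rookie_eligible=True))
print(keeper_price(10, 0.10, rookie_eligible=False))
```

Matching on names alone is fragile. The 2023 table above lists two different players named Luis Ortiz, so a real job would want a stable ID or a second key such as team or debut date.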

I think I'll leave this issue open for now since it seems like it would be useful to have the tool provide it directly in some way.

kflorence avatar Mar 26 '24 00:03 kflorence

@kflorence I think I could try to tackle this issue. Can I take it on this weekend and next week and then check back? I am new to open source, so it may take me a bit to get comfortable, if that is alright.

marinersjk00 avatar May 11 '24 18:05 marinersjk00

Sure thing @marinersjk00, take your time. I have no expectations for a timeline on this, just think it would be a useful addition.

kflorence avatar May 11 '24 20:05 kflorence

[image: Screen Shot 2024-05-13 at 2 30 39 PM]

Hey @kflorence , is this along the lines of what you're looking for?

The webscraper is pretty slow. I think it could be a bit faster if I used a .csv file but I wanted to go directly from the Google Sheet from the link you provided since that has live updates. Let me know which one you think would be better.

Also, obviously there are a bunch of duplicates that I need to remove that are also probably slowing things down tremendously. I should be able to get those eliminated once I understand BeautifulSoup a little better, but wanted to check in and make sure I'm on the right track.
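
For the duplicates, one option that does not depend on BeautifulSoup is to filter the scraped rows after the fact. This is only a sketch: `drop_duplicates` is a hypothetical helper, and the rows are assumed to be dicts with `name` and `debut` keys.

```python
def drop_duplicates(rows, key):
    """Keep the first row for each key value, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        marker = key(row)
        if marker not in seen:
            seen.add(marker)
            unique.append(row)
    return unique


scraped = [
    {"name": "Luis Ortiz", "debut": "Sep 7 2018"},
    {"name": "Luis Ortiz", "debut": "Sep 13 2022"},
    {"name": "Luis Ortiz", "debut": "Sep 7 2018"},  # repeated row
]
# Keying on name AND debut keeps the two different players named Luis Ortiz.
cleaned = drop_duplicates(scraped, key=lambda row: (row["name"], row["debut"]))
print(len(cleaned))
```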

marinersjk00 avatar May 13 '24 21:05 marinersjk00

Hey again @kflorence

I've improved it a bit from what I posted earlier. Only the initial web scrape takes some time (about 20 seconds or so on my machine). After that, it can handle repeated queries in the while loop by reusing the dataframe from the initial scrape. I hope this is what you're looking for, or at least along the right lines.
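
The load-once, query-many pattern can be sketched like this. It is not the actual script: `RookieLookup` and `fake_scrape` are hypothetical stand-ins, with `fake_scrape` playing the role of the slow web scrape and a plain dict standing in for the dataframe.

```python
class RookieLookup:
    """Runs the (slow) loader once on first use, then answers from memory."""

    def __init__(self, loader):
        self._loader = loader
        self._rows = None
        self.load_count = 0  # how many times the loader actually ran

    def _ensure_loaded(self):
        if self._rows is None:
            self._rows = {row["name"].lower(): row for row in self._loader()}
            self.load_count += 1

    def find(self, name):
        """Return the row for a player name (case-insensitive), or None."""
        self._ensure_loaded()
        return self._rows.get(name.strip().lower())


def fake_scrape():
    # Stand-in for the real scrape; values taken from the table in this thread.
    return [{"name": "Jake Alu", "pa": 175}, {"name": "Zach Neto", "pa": 329}]


lookup = RookieLookup(fake_scrape)
for query in ["jake alu", "Zach Neto", "nobody"]:
    print(query, lookup.find(query))
print("loader calls:", lookup.load_count)
```

A name-keyed dict keeps only one row per name, so a real version would need a different key for players who share a name.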

[image: Screen Shot 2024-05-13 at 4 52 51 PM]

marinersjk00 avatar May 13 '24 23:05 marinersjk00