Prediction of the winner of a tennis match with machine learning

Tennis-Prediction Repository

The goal of this project is to predict the outcome of a tennis match using the data of both players. The data used comes from Jeff Sackmann's repository.

Installation

To clone the repository together with the data, you also need to clone the submodules:

git clone --recurse-submodules https://github.com/VincentAuriau/Tennis-Prediction.git

Usage

You can find examples in /examples:

Loading players' statistics at match time + match outcome

Example

from data.data_loader import matches_data_loader
data_df = matches_data_loader(path_to_data="submodules/tennis_atp")

data_df lets you access information about the players (statistics prior to the match) along with statistics of the match itself. A basic example statistic: the victory percentage of the best-ranked player in a match, depending on the players' rankings.

[Figures: number of ATP main matches depending on players' rank | victory % of the best-ranked player]
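As a rough sketch of how such a statistic could be computed with pandas, the snippet below works on a small toy frame that reuses the Ranking_1, Ranking_2 and Winner column names from the sample data row in this README. It assumes that Winner is 0 when player 1 wins and 1 when player 2 wins. The toy values, the helper name best_ranked_win_rate and the rank buckets are made up for illustration.

```python
import pandas as pd

# Toy stand-in for data_df: made-up values, only the columns used below.
# Assumption: Winner == 0 means player 1 won, Winner == 1 means player 2 won.
toy_df = pd.DataFrame(
    {
        "Ranking_1": [5, 120, 40, 300, 15, 80],
        "Ranking_2": [50, 30, 45, 250, 200, 10],
        "Winner": [0, 1, 1, 0, 0, 0],
    }
)


def best_ranked_win_rate(df, bins=(0, 10, 50, 100, 500)):
    """Victory % of the best-ranked player, grouped by that player's rank."""
    out = df.copy()
    # A lower ranking number means a better-ranked player.
    out["best_rank"] = out[["Ranking_1", "Ranking_2"]].min(axis=1)
    best_is_p1 = out["Ranking_1"] <= out["Ranking_2"]
    out["best_won"] = (best_is_p1 & (out["Winner"] == 0)) | (
        ~best_is_p1 & (out["Winner"] == 1)
    )
    # Arbitrary rank buckets, right-inclusive: (0, 10], (10, 50], ...
    out["rank_bucket"] = pd.cut(out["best_rank"], bins=list(bins))
    grouped = out.groupby("rank_bucket", observed=True)["best_won"]
    return pd.DataFrame(
        {"n_matches": grouped.size(), "victory_pct": 100 * grouped.mean()}
    )


stats = best_ranked_win_rate(toy_df)
print(stats)
```

With the real data, toy_df would be replaced by data_df, provided the Winner encoding assumed here holds.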

It can also easily be used to compute players' statistics over their career and/or at match time. Here is a simple example with Stan Wawrinka:

[Figures: Stan's victory % in main ATP matches | Stan's career aces % diff with adversary]
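A career-long statistic of this kind could be sketched as follows. The example uses a toy frame with made-up players and results, and reuses the tournament_date, Name_1, Name_2 and Winner column names from the sample data row. The helper running_victory_pct is hypothetical, and the Winner encoding (0 for player 1, 1 for player 2) is an assumption.

```python
import pandas as pd

# Toy rows with made-up players and results (not real matches).
toy_df = pd.DataFrame(
    {
        "tournament_date": [20030106, 20030203, 20030310, 20030407],
        "Name_1": ["A.Player", "B.Other", "A.Player", "C.Third"],
        "Name_2": ["B.Other", "A.Player", "C.Third", "A.Player"],
        "Winner": [0, 0, 1, 1],  # assumption: 0 -> player 1 won, 1 -> player 2 won
    }
)


def running_victory_pct(df, player_name):
    """Cumulative victory % of one player, in chronological order."""
    is_p1 = df["Name_1"] == player_name
    is_p2 = df["Name_2"] == player_name
    played = df[is_p1 | is_p2].sort_values("tournament_date").copy()
    won = ((played["Name_1"] == player_name) & (played["Winner"] == 0)) | (
        (played["Name_2"] == player_name) & (played["Winner"] == 1)
    )
    played["won"] = won.astype(int)
    # The expanding mean includes the current match; apply .shift(1) to get
    # the value as it stood before each match instead.
    played["victory_pct"] = 100 * played["won"].expanding().mean()
    return played[["tournament_date", "won", "victory_pct"]]


career = running_victory_pct(toy_df, "A.Player")
print(career)
```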

Here is an example of a data row:

id | tournament | tournament_level | tournament_date | tournament_surface | round | best_of | match_id | Winner | Score
atp_matches_qual_chall_2003_5427 | San Benedetto CH | C | 20030811 | Clay | SF | 3 | 20030811 | 0 | 2-6 7-5 7-5

Additional match statistics: elapsed_minutes, aces_nb_1, doublefaults_nb_1, svpt_1, 1stIn_1, 1stWon_1, 2ndWon_1, SvGms_1, bpSaved_1, bpFaced_1, aces_nb_2, doublefaults_nb_2, svpt_2, 1stIn_2, 1stWon_2, 2ndWon_2, SvGms_2, bpSaved_2, bpFaced_2

Name_1 | ID_1 | Ranking_1 | Ranking_Points_1 | Ranking_History_1 | Best_Rank_1 | Birth_Year_1 | Versus_1 | Hand_1 | Last_Tournament_Date_1 | Height_1 | Matches_1 | Matchs_Clay_1 | Matches_Carpet_1 | Matches_Grass_1 | Matches_Hard_1 | Victories_Percentage_1 | Clay_Victories_Percentage_1 | Carpet_Victories_Percentage_1 | Grass_Victories_Percentage_1 | Hard_Victories_Percentage_1 | Aces_Percentage_1 | Doublefaults_Percentage_1 | First_Save_Success_Percentage_1 | Winning_on_1st_Serve_Percentage_1 | Winning_on_2nd_Serve_Percentage_1 | Overall_Win_on_Serve_Percentage_1 | BreakPoint_Face_Percentage_1 | BreakPoint_Saved_Percentage_1 | last_rankings_1 | last_ranking_points_1
Stan.Wawrinka | 104527 | 184 | 114 | {20030616: [387, 68], 20030707: [363, 74], 20030714: [348, 79], 20030721: [303, 99], 20030811: [284, 114]} | 284 | 19850328 | [] | R | 20030721 | 183 | [['V', 'atp_matches_qual_chall_2003_3466'], ['D', 'atp_matches_qual_chall_2003_3481'], ['D', 'atp_matches_2003_4049'], ['V', 'atp_matches_2003_4315'], ['D', 'atp_matches_2003_4328'], ['V', 'atp_matches_2003_4773'], ['D', 'atp_matches_2003_4782'], ['V', 'atp_matches_qual_chall_2003_5408'], ['V', 'atp_matches_qual_chall_2003_5419'], ['V', 'atp_matches_qual_chall_2003_5424']] | ['V', 'D', 'D', 'V', 'D', 'V', 'D', 'V', 'V', 'V'] | [] | [] | [] | 60 | 60 | 0 | 0 | 0 | 3.41880341880342 | 4.27350427350427 | 64.957264957265 | 54.985754985755 | 15.6695156695157 | 70.6552706552707 | 11.3960113960114 | 7.69230769230769 | [303, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 387] | [99, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 68]

Name_2 | ID_2 | Ranking_2 | Ranking_Points_2 | Ranking_History_2 | Best_Rank_2 | Birth_Year_2 | Versus_2 | Hand_2 | Last_Tournament_Date_2 | Height_2 | Matches_2 | Matchs_Clay_2 | Matches_Carpet_2 | Matches_Grass_2 | Matches_Hard_2 | Victories_Percentage_2 | Clay_Victories_Percentage_2 | Carpet_Victories_Percentage_2 | Grass_Victories_Percentage_2 | Hard_Victories_Percentage_2 | Aces_Percentage_2 | Doublefaults_Percentage_2 | First_Save_Success_Percentage_2 | Winning_on_1st_Serve_Percentage_2 | Winning_on_2nd_Serve_Percentage_2 | Overall_Win_on_Serve_Percentage_2 | BreakPoint_Face_Percentage_2 | BreakPoint_Saved_Percentage_2 | last_rankings_2 | last_ranking_points_2
Martin.Vassallo Arguello | 103506 | 125 | 296 | {19990201: [817, 13], 20000710: [398, 61], 20000731: [354, 75], 20000807: [377, 70], 20010625: [459, 48], 20010709: [405, 61], 20010813: [391, 68], 20010820: [374, 72], 20010827: [342, 88], 20010917: [291, 117], 20010924: [286, 122], etc...} | 123 | 19800210 | [] | R | 20030804 | 183 | [['V', 'atp_matches_qual_chall_1999_380'], ['D', 'atp_matches_qual_chall_1999_393'], ['V', 'atp_matches_qual_chall_2000_3972'], ['V', 'atp_matches_qual_chall_2000_3988'], ['D', 'atp_matches_qual_chall_2000_3996'], ['D', 'atp_matches_qual_chall_2000_4725'], ['D', 'atp_matches_qual_chall_2000_4758'], ['V', 'atp_matches_qual_chall_2001_3699'], etc...] | ['V', 'D', 'V', 'V', 'D', 'V', 'V', 'D', 'V', 'V', 'D', 'V', 'D', 'V', 'V', 'D', 'V', 'V', 'V', 'D', 'V', 'D', 'V', 'V', 'V', 'V', 'D', 'D', 'V', 'V', 'V', 'D', 'V', 'D', 'V', 'D', 'D', etc...] | ['D', 'V', 'D', 'V', 'D'] | ['D'] | ['D', 'D', 'D', 'V', 'V', 'V', 'V', 'D', 'V', 'D', 'V', 'D'] | 61.0294117647059 | 63.5593220338983 | 40 | 0 | 50 | 4.82456140350877 | 5.26315789473684 | 61.4035087719298 | 46.4912280701754 | 18.859649122807 | 65.3508771929825 | 9.64912280701754 | 5.70175438596491 | [157, 136, 165, 158, 9999, 9999, 204, 198, 197, 188, 204, 233] | [247, 304, 220, 232, 0, 0, 164, 172, 177, 188, 167, 137]
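The list- and dict-valued columns in this row can be unpacked with plain Python. The sketch below copies two values from the Stan Wawrinka row and checks them against the scalar columns next to them. It assumes that 'V' marks a victory and 'D' a defeat, that each Ranking_History entry is a [rank, ranking points] pair keyed by a YYYYMMDD date, and that the values are available as Python objects rather than strings. The helper names are hypothetical.

```python
# Values copied from the sample row for Stan Wawrinka.
clay_results = ["V", "D", "D", "V", "D", "V", "D", "V", "V", "V"]  # Matchs_Clay_1
ranking_history = {  # Ranking_History_1
    20030616: [387, 68],
    20030707: [363, 74],
    20030714: [348, 79],
    20030721: [303, 99],
    20030811: [284, 114],
}


def victory_percentage(results):
    """Share of 'V' entries in a list of results, as a percentage.

    Assumption: 'V' is a victory and 'D' a defeat. An empty list gives 0.0,
    which matches the 0 shown for surfaces with no matches in the sample row.
    """
    if not results:
        return 0.0
    return 100 * sum(r == "V" for r in results) / len(results)


def best_rank(history):
    """Lowest (best) rank in a {date: [rank, points]} history (assumed layout)."""
    return min(rank for rank, _points in history.values())


print(victory_percentage(clay_results))  # compare with Clay_Victories_Percentage_1
print(best_rank(ranking_history))  # compare with Best_Rank_1
```

On this row the two results agree with Clay_Victories_Percentage_1 (60) and Best_Rank_1 (284), which is consistent with the assumed meaning of these columns but does not confirm how the repository computes them.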

Training/testing on match outcomes:

Example

A generic function lets you evaluate your model with a train/test scheme without much work. Your model only needs a scikit-learn-like interface. By varying the years, the columns used for modelling, and the models and hyperparameters, you can easily build your own best-performing model.

from sklearn.ensemble import RandomForestClassifier
from evaluation.train_test import train_test_evaluation

test_score = train_test_evaluation(
    train_years=[2020, 2021],
    test_years=[2022, 2023],
    model_class=RandomForestClassifier,
    model_params={"n_estimators": 2000, "max_depth": None},
    match_features=[],
    player_features=["Ranking"],
    encoding_params={},
    additional_features=[],
    save_path="./results",
    save_all_results=False
)

print("Test Score", test_score)
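To illustrate what a scikit-learn-like model could look like, here is a minimal sketch of a custom baseline classifier that always predicts the best-ranked player as the winner. It assumes that train_test_evaluation instantiates model_class with model_params and then calls fit and predict, which this README does not spell out. The class BestRankedWinsClassifier, its parameters, the feature layout (two columns holding the rankings of player 1 and player 2) and the 0/1 winner labels are hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class BestRankedWinsClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical baseline: predict that the best-ranked player wins.

    Assumes X holds the two players' rankings in the given columns and that
    label 0 means player 1 won while label 1 means player 2 won.
    """

    def __init__(self, rank_1_col=0, rank_2_col=1):
        self.rank_1_col = rank_1_col
        self.rank_2_col = rank_2_col

    def fit(self, X, y):
        # Nothing to learn; only record the classes, as scikit-learn expects.
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Predict 1 when player 2 has the lower (better) ranking number.
        return (X[:, self.rank_2_col] < X[:, self.rank_1_col]).astype(int)


# Toy check with made-up rankings and outcomes.
X = np.array([[5, 50], [120, 30], [40, 45]])
y = np.array([0, 1, 1])
model = BestRankedWinsClassifier().fit(X, y)
predictions = model.predict(X)
accuracy = model.score(X, y)  # mean accuracy, provided by ClassifierMixin
print(predictions, accuracy)
```

If the assumption about the interface holds, such a class could be passed as model_class with model_params={} and player_features=["Ranking"], giving a reference score to compare other models against.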

Models and hyperparameters can easily be compared using the file results.csv saved in save_path.

[Figure: performances of different models]

If the argument save_all_results is set to True, the whole CSV of test data is saved, which allows a more in-depth analysis of the results.

[Figures: model precision compared with the best-ranked-player-wins strategy | model precision depending on players' ranks]

Encoding match

Example

In order to represent the history of a player, one can use MatchEncoders:

from history_modeling.encoding_model import PCAMatchEncoder

model = PCAMatchEncoder(num_pca_features=2)
model.fit(data_df, transform_data=True)
X_r, match_info = model.predict(data_df, transform_data=True)
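For readers unfamiliar with the underlying idea, the sketch below shows a generic two-component PCA projection with scikit-learn on a toy matrix of numeric match statistics. It is not the implementation of PCAMatchEncoder, whose internals are not described here. The random data, the number of features and the standardisation step are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy stand-in for numeric match statistics: 50 matches x 6 features.
# Random values; real inputs could be columns such as aces or double faults.
match_features = rng.normal(size=(50, 6))

# Standardise each feature so that no single scale dominates the projection.
scaled = StandardScaler().fit_transform(match_features)

# Project every match onto the first two principal components.
pca = PCA(n_components=2)
encoded = pca.fit_transform(scaled)

print(encoded.shape)
print(pca.explained_variance_ratio_)
```

Each match is then described by two numbers, which is what makes a 2D scatter plot of matches possible.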

[Figure: 2D representation of match outcome]