
Error using fb_player_scouting_report() function

Open benjaminrholmes opened this issue 2 years ago • 8 comments

Hello,

This is my first time using the worldfootballR package, and I have come across an error using the example code in the docs:

CODE:

install.packages("worldfootballR")
library(worldfootballR)
library(dplyr)


scout <- fb_player_scouting_report(player_url = "https://fbref.com/en/players/d70ce98e/Lionel-Messi",
                                   pos_versus = "primary") %>%
               dplyr::filter(scouting_period == "Last 365 Days")

OUTPUT: Error in open.connection(x, "rb") (worldfootball.R#9): HTTP error 403.

Also sessionInfo() OUTPUT:

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidyr_1.2.0          dplyr_1.0.9          GGally_2.1.2         ggplot2_3.3.6        worldfootballR_0.5.6

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8         plyr_1.8.7         RColorBrewer_1.1-2 pillar_1.7.0       compiler_4.1.2    
 [6] tools_4.1.2        bit_4.0.4          gtable_0.3.0       lubridate_1.8.0    jsonlite_1.8.0    
[11] lifecycle_1.0.1    tibble_3.1.6       pkgconfig_2.0.3    rlang_1.0.2        DBI_1.1.2         
[16] cli_3.3.0          curl_4.3.2         parallel_4.1.2     withr_2.4.3        httr_1.4.2        
[21] stringr_1.4.0      janitor_2.1.0      xml2_1.3.3         generics_0.1.2     vctrs_0.4.1       
[26] hms_1.1.1          grid_4.1.2         bit64_4.0.5        tidyselect_1.1.2   reshape_0.8.9     
[31] snakecase_0.11.0   glue_1.6.1         R6_2.5.1           fansi_1.0.2        vroom_1.5.7       
[36] purrr_0.3.4        readr_2.1.2        tzdb_0.2.0         magrittr_2.0.2     scales_1.1.1      
[41] ellipsis_0.3.2     assertthat_0.2.1   rvest_1.0.2        colorspace_2.0-2   utf8_1.2.2        
[46] stringi_1.7.6      munsell_0.5.0      crayon_1.5.0      

Any help you can provide is much appreciated

Cheers

benjaminrholmes avatar Jun 21 '22 08:06 benjaminrholmes

Hi, as defined in a Google search:

The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it.

It would appear you have been blocked from accessing their servers for being in violation of their terms (see here: https://www.sports-reference.com/bot-traffic.html).

I think if you give it some time, you'll be allowed to scrape again. Remember to be mindful and ensure your time_pause number is set sufficiently high in all FBref functions.
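
As a rough illustration of the pacing advice (not the package's internals), a loop can sleep before each successive request. This is a sketch under assumptions: `fetch_page` and `fetch_politely` are hypothetical names, `fetch_page` is a stand-in that makes no network call, and the 0.1-second pause only keeps the demo fast. The FBref functions take a `time_pause` argument for the same purpose.

```r
# Hypothetical stand-in for a scraping call; it only records when it ran.
fetch_page <- function(url) {
  list(url = url, fetched_at = Sys.time())
}

# Call `fetch` on each URL, sleeping `pause` seconds before every request.
fetch_politely <- function(urls, fetch, pause = 3) {
  lapply(urls, function(u) {
    Sys.sleep(pause)
    fetch(u)
  })
}

urls <- c("https://example.com/a", "https://example.com/b")
pages <- fetch_politely(urls, fetch_page, pause = 0.1)
length(pages)
```

With a real scraper in place of `fetch_page`, a pause of several seconds would be the more cautious choice.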

I will close this issue now as it's not related to the functioning of the library. Reach out if there's anything else though.

JaseZiv avatar Jun 22 '22 08:06 JaseZiv

Hi,

I did think that was the case; however, my own Python scripts scraping FBref seem to be fine, which made me doubt it was an over-request issue. I also waited 24 hours after I last executed the fb_player_scouting_report function and still had no luck. I will wait a few more days and retry.

Thank you for your help.

benjaminrholmes avatar Jun 22 '22 09:06 benjaminrholmes

Hi @JaseZiv, I have the same issue with functions that extract data from FBref. I've tried different networks and different laptops, but it throws HTTP error 403 anyway. I haven't used worldfootballR since the end of the 2021-2022 season, so it's hard to believe I violated their data scraping terms. It worked fine for the entire season, right up to the very last day, but now it doesn't.

@benjaminrholmes does it work for you now?

artiebits avatar Aug 01 '22 20:08 artiebits

@artiebits Can you please send through the code that you used to get the 403?

JaseZiv avatar Aug 02 '22 22:08 JaseZiv

Thanks for reopening the issue and investigating it.

library(worldfootballR)
library(lubridate)
library(dplyr)

countries <- c("ENG", "ESP")

for (country in countries) {
  print(paste("Getting data for", country))

  data <- get_match_results(country = country, gender = "M", season_end_year = 2010:2022)

  fixture <- data %>%
    filter(Date >= lubridate::today()) %>%
    select(Date, Time, Home, Away)

  history <- data %>%
    filter(Date < lubridate::today()) %>%
    select(Date, Home, Away, HomeGoals, AwayGoals)

  write.csv(fixture, paste0("data/", country, "-fixture.csv"))
  write.csv(history, paste0("data/", country, ".csv"))
}

print("All data downloaded")

artiebits avatar Aug 03 '22 05:08 artiebits

Two more things...

What version of the library are you using?

Additionally, can you paste in the output you get from running this line of code: httr::GET("http://httpbin.org/user-agent")

JaseZiv avatar Aug 03 '22 05:08 JaseZiv

The version is 0.5.7.

The output:

Response [http://httpbin.org/user-agent]
  Date: 2022-08-03 06:09
  Status: 200
  Content-Type: application/json
  Size: 61 B
{
  "user-agent": "libcurl/7.79.1 r-curl/4.3.2 httr/1.4.3"
}

artiebits avatar Aug 03 '22 06:08 artiebits

Before you run any of the FBref functions, add this to the start of your script (or substitute the user agent if you're on another software):

httr::set_config(httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"))

I don't know that this will help, but it is worth a shot. Otherwise, I suspect your IP has been blocked more permanently than they say on their site.

JaseZiv avatar Aug 03 '22 20:08 JaseZiv

Unfortunately, it doesn't help. If I place httr::GET("http://httpbin.org/user-agent") after the code you proposed, then I see that my user agent has changed. However, I still get the same error :/

artiebits avatar Aug 11 '22 18:08 artiebits

Yeah this then looks like it could be a flat ban on your IP... you might have to reach out to them to see if you can get it lifted?

JaseZiv avatar Aug 12 '22 02:08 JaseZiv

Hi guys. I had the same issue; I tried the suggestions above without success. However, the HTTP 403 error only appears when I'm using VSCode (running the same code in RStudio works fine). I guess the problem is in the VSCode extension and not with the IP address.

matheussrod avatar Aug 16 '22 23:08 matheussrod

You could be right... I find that when I run some functions in RStudio locally, they run fine, but when I run the same functions in GitHub Actions, I get 403s...

JaseZiv avatar Aug 16 '22 23:08 JaseZiv

Hey everyone, I'm seeing the same 403 issue with get_team_match_results() for FBref. It works perfectly in RStudio running on a Linux server, but once the task is scheduled using cron it returns 403, despite editing user_agent. The same happens in GitHub Actions, and running from a Databricks cluster also returns 403 every time. I imagine any deployed Shiny app using these functions would also fail.
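
One way to compare the environments mentioned here is to print a few session facts in each and look for differences. The sketch below uses only base R. `session_diagnostics` is a hypothetical helper name, and whether the failing request path actually honours the `HTTPUserAgent` option is an assumption, not something confirmed in this thread.

```r
# Collect a few facts about the current R session so they can be compared
# across environments (RStudio, VSCode, cron, GitHub Actions).
session_diagnostics <- function() {
  list(
    interactive     = interactive(),
    # Base R's default user agent option; RStudio sets its own value here.
    http_user_agent = getOption("HTTPUserAgent", default = NA_character_),
    # "1" inside an RStudio session, empty elsewhere.
    rstudio         = Sys.getenv("RSTUDIO"),
    # "true" on GitHub Actions runners, empty elsewhere.
    github_actions  = Sys.getenv("GITHUB_ACTIONS")
  )
}

diag <- session_diagnostics()
str(diag)
```

If `http_user_agent` differs between a working and a failing environment, that would point towards the user agent rather than the IP address.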

oliverp6 avatar Aug 18 '22 10:08 oliverp6

FYI I found a hacky workaround for this error (for me at least). It does seem like a user-agent issue, but I can only get one user agent to work: "RStudio Desktop (2022.7.1.554); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)"

In the get_team_match_results function I edited the section that uses xml2::read_html and replaced it with an rvest::html_session call using the RStudio Desktop user agent, then piped that to read_html(), and that seems to have solved my issue. Hopefully it works for you guys too!

function (team_url, time_pause = 3) 
{
    time_wait <- time_pause
    get_each_team_log <- function(team_url, time_pause = time_wait) {
        pb$tick()
        Sys.sleep(time_pause)
        # Namespaced so the calls work without attaching httr or xml2.
        ua <- httr::user_agent("RStudio Desktop (2022.7.1.554); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)")
        team_page <- rvest::html_session(team_url, ua) %>% xml2::read_html()
        team_name <- sub(".*\\/", "", team_url) %>% gsub("-Stats", 
            "", .) %>% gsub("-", " ", .)
        opponent_names <- team_page %>% rvest::html_nodes(".left:nth-child(10) a") %>% 
            rvest::html_text()
        team_log <- team_page %>% rvest::html_nodes("#all_matchlogs") %>% 
            rvest::html_nodes("table") %>% rvest::html_table() %>% 
            data.frame()
        team_log$Opponent <- opponent_names
        team_log <- team_log %>% dplyr::mutate(Team_Url = team_url, 
            Team = team_name) %>% dplyr::select(.data$Team_Url, 
            .data$Team, dplyr::everything(), -.data$Match.Report)
        team_log <- team_log %>% dplyr::mutate(Attendance = gsub(",", 
            "", .data$Attendance) %>% as.numeric(), GF = as.character(.data$GF), 
            GA = as.character(.data$GA))
        return(team_log)
    }
    pb <- progress::progress_bar$new(total = length(team_url))
    all_team_logs <- team_url %>% purrr::map_df(get_each_team_log)
}
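
For reference, the same idea can be sketched without editing the package function: send the request with an explicit user agent, then parse the response body. This is a minimal sketch, not the fix that shipped in the package. `read_with_ua` is a hypothetical helper name, and it assumes the httr and xml2 packages are installed. Note that `rvest::html_session()` was deprecated in rvest 1.0.0 in favour of `rvest::session()`.

```r
# Fetch a page with an explicit user agent and return the parsed HTML.
# All calls are namespaced, so nothing needs to be attached with library().
read_with_ua <- function(url, agent) {
  resp <- httr::GET(url, httr::user_agent(agent))
  # Turn HTTP errors such as 403 into an R error with a readable message.
  httr::stop_for_status(resp)
  xml2::read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
}
```

Calling `read_with_ua()` with a team URL and the user agent string above would return a parsed document that can be piped into the same `rvest::html_nodes()` calls.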

oliverp6 avatar Aug 18 '22 17:08 oliverp6

Hi all,

Hoping this issue has been resolved for a lot of the FBref functions as of version 0.5.12.3000. The fix has not yet been implemented for every function.

Thanks to @oliverp6 for the inspiration and @tonyelhabr for the help implementing this!

Will keep this issue open for a little while to confirm things

JaseZiv avatar Aug 22 '22 03:08 JaseZiv