nflverse-pbp icon indicating copy to clipboard operation
nflverse-pbp copied to clipboard

duplicate game_id + play_id?

Open nickwan opened this issue 4 years ago • 8 comments

about 3% of the data has duplicate game_id + play_id combos and i'm unsure exactly why. seems like you can just run a straight up drop_duplicates() function on the full rows and most drop out. if you subset to game_id and play_id then that drops more rows.

specifically, looks like it is mostly on passing plays. seems pretty evenly distributed across all seasons (1300-2000 dupes per season). can't seem to find any particular pattern in the desc.

perhaps this is a known problem already in a different repo. anyway, hope that helps!

nickwan avatar Jun 08 '20 16:06 nickwan

Thanks for finding more stuff!

Is this in the newer data folder (not legacy-data)? Some of these are errors in the underlying data (ie NFL repeating play IDs) but will look into this

guga31bb avatar Jun 08 '20 16:06 guga31bb

yep yep, newest data! downloaded it the other night. i think the "best" news for the duped data is that there weren't visible discrepancies between the rows that i saw. i ran a standard deviation check on duped rows and all had SD==0 for the numerical columns. doesn't solve the riddle though -- if there were some systematic different then maybe it'd be due to some sort of join somewhere in the preprocessing pipeline. but that doesn't seem to be the case

nickwan avatar Jun 08 '20 16:06 nickwan

There was actually an issue with joining in the pipeline that got fixed a couple days ago (I was something dumb when joining CP to the full data) so if you have an older version of the data, re-downloading it would fix this. I'm actually re-scraping the data and saving csv.gz and .parquet files to make it easier for non-R users to try things out so once I close the other issue you might want to re-download and see if the dups are still there

guga31bb avatar Jun 08 '20 17:06 guga31bb

hm okay. i downloaded the data saturday at 3:15pm, so unsure if that's pre or post fix. will re-download when you push up the new stuff

nickwan avatar Jun 08 '20 17:06 nickwan

Okay I THINK we're good but please let me know if you find something to the contrary. game_id, qtr, half_seconds_remaining, desc should uniquely identify plays (play_id has some errors from NFL where they are duplicated, especially in older seasons)

pbp <- purrr::map_df(1999:2019, function(x) {
  readRDS(
    url(
      glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds")
    )
  )
})

dupes <- pbp %>% janitor::get_dupes(game_id, qtr, half_seconds_remaining, desc)
No duplicate combinations found of: game_id, qtr, half_seconds_remaining, desc

guga31bb avatar Jun 08 '20 23:06 guga31bb

I'm going to guess it's a similar issue that mlb has it their play IDs. The game level ID is unique but for some infrastructure reason the play IDs can be duped and the broad reason is due to processing on separate servers. Which... Seems near impossible to incur duped IDs ever but alas.

Thanks Ben!

nickwan avatar Jun 08 '20 23:06 nickwan

@guga31bb and @nickwan:

I am using data cloned on Aug 13 2020. I use (gameid, drive, playid) as a composite key in my PostgreSQL database. I found the following duplicates when I loaded the 1999-2019 csv data into my database:

Key (gameid, drive, playid)=(2000_03_PIT_CLE, 18, 2767) already exists. Key (gameid, drive, playid)=(2000_03_PIT_CLE, 18, 2768) already exists. Key (gameid, drive, playid)=(2000_06_WAS_PHI, 12, 1825) already exists. Key (gameid, drive, playid)=(2000_06_WAS_PHI, 12, 1825) already exists. Key (gameid, drive, playid)=(2000_11_OAK_DEN, 15, 2323) already exists. Key (gameid, drive, playid)=(2000_11_OAK_DEN, 15, 2323) already exists. Key (gameid, drive, playid)=(2002_05_KC_NYJ, 8, 1020) already exists. Key (gameid, drive, playid)=(2005_04_SEA_WAS, 13, 2861) already exists. Key (gameid, drive, playid)=(2005_04_SEA_WAS, 13, 2861) already exists. Key (gameid, drive, playid)=(2006_10_TB_CAR, 1, 103) already exists. Key (gameid, drive, playid)=(2006_10_TB_CAR, 1, 103) already exists. Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists. Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists. Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists. Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists. Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists. Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists. Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists. Key (gameid, drive, playid)=(2007_08_IND_CAR, 22, 4382) already exists.

I have addressed this in my database; but I thought it important to flag for this project. The play records (desc etc.) are unique; but the playid is not.

MargareeMan avatar Aug 15 '20 23:08 MargareeMan

Yep those are errors on the data we've been given. I'll un-close this just in case other people come here with the same question, but this isn't something that we can fix.

guga31bb avatar Aug 15 '20 23:08 guga31bb