retrosheet_daily table missing game.source

Open segiddins opened this issue 4 years ago • 2 comments

cwdaily outputs daily lines for each player, which include the source for the game information. For games with multiple sources, there will be multiple daily entries for a given (player_id, game_id) tuple, and right now there's no column that can be used to disambiguate.

E.g. select * from retrosheet_daily where game_dt = '1943-06-19' and player_id = 'mackr101';

yields 5 rows for 2 games (the two halves of a doubleheader), with each game having a box score and a deduced game, according to https://raw.githubusercontent.com/chadwickbureau/retrosplits/master/daybyday/playing-1943.csv. I'm not sure why mack in particular has 2 deduced game entries for CHA194306191, but that's probably an issue in Chadwick.
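To find every affected (player_id, game_id) tuple rather than checking one player at a time, a grouped count should do it. Below is a minimal sketch against an in-memory SQLite stand-in for retrosheet_daily. Only the table name and the game_id, game_dt, and player_id columns come from this issue; the sample rows, the second game ID (CHA194306192), and the other player ID are made up for illustration, and the real table has many more columns.

```python
import sqlite3

# In-memory stand-in for the real database; the schema is a reduced, assumed subset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE retrosheet_daily (game_id TEXT, game_dt TEXT, player_id TEXT)"
)

# Illustrative rows only: 3 entries for the first game and 2 for the second,
# mirroring the 5 rows described above. CHA194306192 and otherp001 are hypothetical.
sample_rows = [
    ("CHA194306191", "1943-06-19", "mackr101"),
    ("CHA194306191", "1943-06-19", "mackr101"),
    ("CHA194306191", "1943-06-19", "mackr101"),
    ("CHA194306192", "1943-06-19", "mackr101"),
    ("CHA194306192", "1943-06-19", "mackr101"),
    ("CHA194306191", "1943-06-19", "otherp001"),
]
conn.executemany("INSERT INTO retrosheet_daily VALUES (?, ?, ?)", sample_rows)

# Any (player_id, game_id) pair with more than one row cannot be disambiguated
# without a source column.
duplicates = conn.execute(
    """
    SELECT player_id, game_id, COUNT(*) AS n_rows
    FROM retrosheet_daily
    GROUP BY player_id, game_id
    HAVING COUNT(*) > 1
    ORDER BY game_id, player_id
    """
).fetchall()

for player_id, game_id, n_rows in duplicates:
    print(player_id, game_id, n_rows)
```

The same SELECT should run unchanged against the real table, assuming the column names match.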

segiddins, May 25 '21 01:05

Hmm, I put in some protection against this problem here, but looks like it's not working: https://github.com/droher/boxball/blob/72c7bc05993968b0897c1bcf9f662ed1e82b2776/extract/parsers/retrosheet.py#L61

I'll try to patch. Adding a general source column across all of these tables would be a great idea. For now, I do have an extra retrosheet_deduced_game table that you can join on to find which games have deduced entries -- I know that doesn't help with disambiguation, though.
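A rough sketch of that join, again using an in-memory SQLite stand-in. It assumes retrosheet_deduced_game holds one row per game and exposes a game_id column, which is not confirmed anywhere in this thread; the sample rows (including BOS194306200 and playr001) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, reduced schemas: the real tables have more columns, and the
# game_id column on retrosheet_deduced_game is an assumption.
conn.executescript(
    """
    CREATE TABLE retrosheet_daily (game_id TEXT, player_id TEXT);
    CREATE TABLE retrosheet_deduced_game (game_id TEXT);
    """
)
conn.executemany(
    "INSERT INTO retrosheet_daily VALUES (?, ?)",
    [
        ("CHA194306191", "mackr101"),   # game that has a deduced entry
        ("BOS194306200", "playr001"),   # hypothetical game without one
    ],
)
conn.execute("INSERT INTO retrosheet_deduced_game VALUES ('CHA194306191')")

# LEFT JOIN keeps every daily row and flags those whose game has a deduced entry.
flagged = conn.execute(
    """
    SELECT d.game_id, d.player_id, dg.game_id IS NOT NULL AS has_deduced
    FROM retrosheet_daily AS d
    LEFT JOIN retrosheet_deduced_game AS dg ON dg.game_id = d.game_id
    ORDER BY d.game_id, d.player_id
    """
).fetchall()

for game_id, player_id, has_deduced in flagged:
    print(game_id, player_id, bool(has_deduced))
```

This only marks which games are affected; it still cannot tell which daily row came from which source.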

droher, Jun 26 '21 14:06

This hasn't been resolved in the code, but I've manually removed the duplicated games from my Retrosheet fork, so the newly published version should be free of this bug.

droher, Apr 16 '22 01:04