soccerdata
Flatten the fbref index and columns
Hi,
Your package looks great! I am thinking of making a couple of fbref examples using soccerdata for mplsoccer. I was wondering whether you would consider flattening the multi-level index and columns. I have previously flattened the columns like this: https://github.com/andrewRowlinson/outliers-football/blob/master/scrape_utils.py#L34-L40
Thanks,
Andy
I think the multi-level index is very convenient here because the data is logically hierarchical: a league contains multiple seasons, each season contains multiple teams, ... The multi-level index also has three main benefits:
- Easy manipulation via stack() and unstack(). For example, if we want to easily compare xG across seasons, we can use df.unstack('season') to line everything up side-by-side.
- Pandas provides convenient syntactic sugar for slicing and filtering on indexes.
- The same indexes are used across data sources, which makes it easy to merge data from different sources.
And of course, if you do not like the multi-level index, all you need is a simple df.reset_index(), but I think using the multi-level index is a good default.
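To make the points above concrete, here is a minimal sketch on toy data. The league, team, and xG values are invented for illustration, and the index level names are an assumption about what a soccerdata frame looks like, not a copy of real output.

```python
import pandas as pd

# Toy data, invented for illustration; level names are assumed.
index = pd.MultiIndex.from_tuples(
    [
        ("ENG-Premier League", "2021", "Arsenal"),
        ("ENG-Premier League", "2021", "Chelsea"),
        ("ENG-Premier League", "2122", "Arsenal"),
        ("ENG-Premier League", "2122", "Chelsea"),
    ],
    names=["league", "season", "team"],
)
df = pd.DataFrame({"xG": [53.5, 64.0, 60.5, 63.4]}, index=index)

# Line the seasons up side by side: one row per (league, team),
# one column per season.
xg_by_season = df["xG"].unstack("season")

# Slice on a single index level without resetting the index.
arsenal = df.xs("Arsenal", level="team")

# Turn the index levels into ordinary columns if the hierarchy is not wanted.
flat = df.reset_index()
```

After reset_index(), league, season, and team are plain columns, so anything that expects a flat table works as usual.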
I do not have a strong preference for multi-level columns or flat columns. I assume the latter is a bit more convenient for novice users, but the former has some advantages too. Again, you have logical groups of stats here, and with the multi-level columns, you can easily select one of these groups. For example, if you are only interested in the "per 90" stats, you can simply do df["Per 90 Minutes"]. With flat columns that would require something like df[['Gls_p90', 'Ast_p90', ...]]. In the end, I stuck to the default because I do not see a convincing reason to change it, but I am curious why you would like to flatten them.
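A small sketch of the difference, again on invented numbers. The group and stat names mimic the fbref layout mentioned above, and the underscore-joined flat names are just one possible flattening scheme, not something soccerdata provides.

```python
import pandas as pd

# Toy two-level columns; values are made up for illustration.
columns = pd.MultiIndex.from_tuples(
    [
        ("Performance", "Gls"),
        ("Performance", "Ast"),
        ("Per 90 Minutes", "Gls"),
        ("Per 90 Minutes", "Ast"),
    ]
)
df = pd.DataFrame(
    [[61, 41, 1.61, 1.08], [76, 53, 2.00, 1.39]],
    index=pd.Index(["Arsenal", "Chelsea"], name="team"),
    columns=columns,
)

# Multi-level columns: one key selects the whole group.
per90 = df["Per 90 Minutes"]

# Flat columns (assumed naming: "<group>_<stat>"): the group has to be
# rebuilt by listing or pattern-matching the column names.
flat = df.copy()
flat.columns = ["_".join(col) for col in df.columns]
per90_flat = flat[[c for c in flat.columns if c.startswith("Per 90 Minutes_")]]
```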
Actually, I would drop all columns that can be derived from combining other columns. For example, all per 90 stats can be computed by dividing the "performance" and "expected" groups by ("Playing time", "Min"). That would make the multi-level columns obsolete too, but I expect many people would complain that some columns are missing if I did that 🙃
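For what it is worth, deriving the per 90 stats yourself could look roughly like this. The column labels follow the names used in this comment and the numbers are invented; I am also assuming "Min" holds total minutes, so minutes divided by 90 gives the number of full matches.

```python
import pandas as pd

# Toy data; labels follow the names used in this thread, values are made up.
columns = pd.MultiIndex.from_tuples(
    [
        ("Playing time", "Min"),
        ("Performance", "Gls"),
        ("Performance", "Ast"),
    ]
)
df = pd.DataFrame(
    [[3420, 61, 41], [3420, 76, 53]],
    index=pd.Index(["Arsenal", "Chelsea"], name="team"),
    columns=columns,
)

# Number of full 90-minute units played (assumes "Min" is total minutes).
nineties = df[("Playing time", "Min")] / 90

# Divide every stat in the group by the 90s played, row by row.
per90 = df["Performance"].div(nineties, axis=0)
```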
No worries, I agree indexes are easy to change with reset_index.
With columns, I guess all the standard stuff is harder, e.g. renaming a column or loading it into a database.
I expect the main use case is also for making pizza charts and radars. People will need to select stats from several of the top-level columns, and I wonder how they would do this in practice. It might get tricky quickly.
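One way this could work in practice is to pass a list of (group, stat) tuples and flatten only the result. This is a sketch on invented data, and the joined names are my own choice rather than anything the package does.

```python
import pandas as pd

# Toy two-level columns; values are made up for illustration.
columns = pd.MultiIndex.from_tuples(
    [
        ("Performance", "Gls"),
        ("Performance", "Ast"),
        ("Per 90 Minutes", "Gls"),
        ("Per 90 Minutes", "Ast"),
    ]
)
df = pd.DataFrame(
    [[61, 41, 1.61, 1.08], [76, 53, 2.00, 1.39]],
    index=pd.Index(["Arsenal", "Chelsea"], name="team"),
    columns=columns,
)

# Pick individual stats from different top-level groups.
wanted = [("Performance", "Gls"), ("Per 90 Minutes", "Ast")]
radar = df.loc[:, wanted].copy()

# Flatten just the selection, e.g. to get simple labels for a chart.
radar.columns = ["_".join(col) for col in radar.columns]
```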
Well, I guess hierarchical columns have some slight advantages in some use cases and some slight disadvantages in other use cases. For all of your use cases, you could also claim that hierarchical columns are an advantage. For example, to rename the same column in each group of stats you can do team_season_stats.rename(columns={"Gls": "goals"}, level=1), and when creating a radar chart you probably want to select a single group of stats most of the time (e.g., you do not want to mix "per 90" stats with aggregated stats in a single chart).
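Here is the rename on a toy frame. The frame is a stand-in for team_season_stats with invented values; only the rename call itself is taken from the comment above.

```python
import pandas as pd

# Stand-in for team_season_stats; values are made up for illustration.
columns = pd.MultiIndex.from_tuples(
    [
        ("Performance", "Gls"),
        ("Performance", "Ast"),
        ("Per 90 Minutes", "Gls"),
        ("Per 90 Minutes", "Ast"),
    ]
)
team_season_stats = pd.DataFrame(
    [[61, 41, 1.61, 1.08]],
    index=pd.Index(["Arsenal"], name="team"),
    columns=columns,
)

# level=1 applies the mapping to the second column level only, so "Gls"
# is renamed in every group at once.
renamed = team_season_stats.rename(columns={"Gls": "goals"}, level=1)
```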
Another issue that I see is that it will require quite a few manual overrides to get meaningful column names. If flattened, I would prefer just "goals" instead of "Performance_gls". These manual overrides come with an additional maintenance cost.
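To illustrate what I mean by manual overrides, a flattening step might look something like this. The overrides dict and the flatten helper are hypothetical; nothing like them exists in the package.

```python
import pandas as pd

# Toy two-level columns; values are made up for illustration.
columns = pd.MultiIndex.from_tuples(
    [
        ("Performance", "Gls"),
        ("Performance", "Ast"),
        ("Per 90 Minutes", "Gls"),
    ]
)
df = pd.DataFrame([[61, 41, 1.61]], columns=columns)

# Hypothetical hand-maintained mapping to friendlier names. Every stat
# that should not get a mechanical name needs an entry here.
overrides = {
    ("Performance", "Gls"): "goals",
    ("Performance", "Ast"): "assists",
}


def flatten(col):
    """Return the override if one exists, else the joined level names."""
    return overrides.get(col, "_".join(col))


flat = df.copy()
flat.columns = [flatten(col) for col in df.columns]
```

Any column without an entry falls back to the mechanical name, so the mapping has to be kept in sync whenever fbref adds or renames a stat.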
Hence, I think it is better to stick to the default here. After all, this is a scraper, not a data manipulation library.
I'll leave the issue open for a while and reconsider if others can provide convincing arguments.