separate contributors table into two
As I understand it, the contributors table is updated in essentially two halves: one half is updated by tasks that fetch data from the GitHub API (core), and the other half is updated by data parsed from git (facade).
When only one half is updated, the columns belonging to the other half are left null.
This is not great database design, especially given that this table in particular already sees very high contention (#3343).
I think this table should be split in two: each of the two processes deals with a largely separate, only slightly overlapping set of columns anyway - a relationship that may be better served by a foreign key.
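For illustration, a minimal sketch of what the split could look like. This uses SQLite and made-up table and column names (`contributors_github`, `contributors_git`, etc.) purely to show the foreign-key idea; Augur's actual schema differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Core half: data fetched from the GitHub API (names are hypothetical).
conn.execute("""
CREATE TABLE contributors_github (
    cntrb_id INTEGER PRIMARY KEY,
    gh_login TEXT,
    gh_user_id INTEGER
)
""")

# Facade half: data parsed from git logs, linked back by foreign key.
conn.execute("""
CREATE TABLE contributors_git (
    git_id INTEGER PRIMARY KEY,
    cntrb_id INTEGER REFERENCES contributors_github(cntrb_id),
    commit_email TEXT,
    commit_name TEXT
)
""")

# Each pipeline now writes only to its own table, so there are no
# rows where half the columns are null.
conn.execute("INSERT INTO contributors_github VALUES (1, 'octocat', 583231)")
conn.execute(
    "INSERT INTO contributors_git VALUES (NULL, 1, 'octocat@github.com', 'The Octocat')"
)

# Consumers that need both halves join across the foreign key.
row = conn.execute("""
    SELECT g.gh_login, f.commit_email
    FROM contributors_github g
    JOIN contributors_git f ON f.cntrb_id = g.cntrb_id
""").fetchone()
print(row)  # ('octocat', 'octocat@github.com')
```

The join also makes the partially-overlapping relationship explicit: a git-side row with no GitHub match simply has no parent row, rather than a half-null record.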
The only challenge I see with this is how the migration will work. Doing the migration without data loss would require at least copying half of the table's columns into a new table before deleting them from the original. This means disk space for somewhere between half a copy and a whole copy of that table would be needed.
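A sketch of that copy step (again with hypothetical names, in SQLite for brevity). The `CREATE TABLE ... AS SELECT` is the moment the extra disk space is consumed, since both copies exist until the duplicated columns are dropped:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical wide table holding both halves of the data.
conn.execute(
    "CREATE TABLE contributors (cntrb_id INTEGER PRIMARY KEY, gh_login TEXT, commit_email TEXT)"
)
conn.execute("INSERT INTO contributors VALUES (1, 'octocat', 'octo@example.com')")

# Copy the facade half out. The database briefly holds both copies,
# which is the disk-space overhead described above.
conn.execute("""
    CREATE TABLE contributors_git AS
    SELECT cntrb_id, commit_email FROM contributors
""")

# Afterward the duplicated columns would be dropped from the original
# (ALTER TABLE ... DROP COLUMN, or a table rebuild), freeing the space.
old = conn.execute("SELECT COUNT(*) FROM contributors").fetchone()[0]
new = conn.execute("SELECT COUNT(*) FROM contributors_git").fetchone()[0]
assert old == new  # sanity check: no rows lost in the copy
```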
In my experience this table is one of the largest in Augur, so that may be... interesting.
I am going to give this a push!
Thanks @MoralCode for the clear explanation.
I agree that the current contributors table design causes both data inconsistency (partial nulls) and contention between the core and facade tasks.
Splitting it into two logical tables - one for Git data and one for GitHub data linked through a foreign key - sounds like a cleaner long-term solution.
The migration challenge makes sense, especially given how large the current table is. Would it make sense to explore a staged approach, for example:
- Creating the new tables alongside the existing one
- Backfilling data gradually (perhaps repository by repository)
- Then switching references once consistency checks pass?
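The repository-by-repository backfill could look something like this sketch (SQLite, hypothetical table and column names). Copying in per-repo transactions keeps each unit of work small and restartable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical old wide table and the new git-side table.
conn.execute(
    "CREATE TABLE contributors (cntrb_id INTEGER, repo_id INTEGER, gh_login TEXT, commit_email TEXT)"
)
conn.execute(
    "CREATE TABLE contributors_git (cntrb_id INTEGER, repo_id INTEGER, commit_email TEXT)"
)
conn.executemany("INSERT INTO contributors VALUES (?,?,?,?)", [
    (1, 10, 'alice', 'alice@example.com'),
    (2, 10, None, 'bob@example.com'),       # git-only contributor
    (3, 20, 'carol', 'carol@example.com'),
])

def backfill_repo(conn, repo_id):
    """Copy one repository's git-side columns into the new table.

    One transaction per repo keeps the work resumable: if the backfill
    is interrupted, already-copied repos don't need to be redone.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute("""
            INSERT INTO contributors_git (cntrb_id, repo_id, commit_email)
            SELECT cntrb_id, repo_id, commit_email
            FROM contributors WHERE repo_id = ?
        """, (repo_id,))

repo_ids = [r for (r,) in conn.execute(
    "SELECT DISTINCT repo_id FROM contributors").fetchall()]
for repo_id in repo_ids:
    backfill_repo(conn, repo_id)

copied = conn.execute("SELECT COUNT(*) FROM contributors_git").fetchone()[0]
print(copied)  # 3
```

A consistency check before the switch-over could then compare per-repo row counts between the old and new tables.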
Also, given that Git and GitHub contributors overlap only partially, it might be helpful to define a matching strategy early (e.g., by email or name normalization) to minimize orphan rows when we join the two tables later.
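As a sketch of what such a matching strategy might normalize on (the exact rules would need tuning against real Augur data; the "+tag" stripping and accent folding below are just illustrative assumptions):

```python
import unicodedata

def normalize_email(email: str) -> str:
    """Build a matching key from an email address.

    Illustrative rules: trim whitespace, lowercase, and drop a
    "+tag" suffix in the local part so alice+ci@ and alice@ match.
    """
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"

def normalize_name(name: str) -> str:
    """Build a matching key from a display name.

    Case-folds, collapses whitespace, and strips accents so that
    e.g. "José García" and "jose garcia" produce the same key.
    """
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return " ".join(folded.lower().split())

assert normalize_email(" Alice+ci@Example.COM ") == "alice@example.com"
assert normalize_name("  José   García ") == "jose garcia"
```

Rows whose keys match on neither email nor name would stay as deliberate orphans rather than being force-joined.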
For migration, if there’s a staging DB or test dataset available, I can try running a small-scale migration locally to estimate the space overhead.
Would love to know more about this!
I think a phased approach like that would probably end up being a little bit worse: you'd still have the problem of temporarily needing extra space to store duplicate copies of the data, but you'd be spreading it over a longer period, requiring people with large instances to keep that extra disk space available for longer.
I suspect this may not be as big of an issue in general as I'm thinking, because larger instances will probably have more headroom, or more ability to just buy more storage in AWS or something. And smaller instances probably won't have enough data to migrate for it to take up much more space than an Augur collection run would normally take.
I'd also worry about the risk of starting a big migration like this and it never getting finished, leaving us with duplicate tables that exist for reasons known only to people who've been on the project for a very long time.