domjudge
domjudge copied to clipboard
Importing of files with data source = 0 broken
When we import files with data source = 0
, we get errors about duplicate keys.
This is because we store the ID of the teams, organizations and groups as externalid
on the entity, but then use the 'internal' ID to look up stuff.
We could easily change this to always use the external ID to look up, but there are two issues with that:
- We don't display the external ID anywhere when
data source = 0
. - The API uses the internal ID's everywhere with
data source = 0
.
Since ID's in imports can be any format, we can not store them in our internal, numeric ID column.
I see several options and I don't know which is best:
- Disallow importing files with
data source = 0
. - Always show the external ID field and use that for imports. Don't use it in the API with
data source = 0
. - Drop
data source = 0
, making us always require an external ID and thus basically making that the new primary key.
If someone knows a better way I am all ears, since I don't like any of these.
I would like to get some input from @eldering on this
Another option:
- When
data source = 0
only allow numeric ID's and store them as our internal ID. When you supply a non-numeric ID, add a warning saying 'usedata source = 1
'.
Michael and me had a discussion, and we think we came up with something that could work such that we can get rid of data source 0, leaving basically two modes:
- Primary mode
- Shadow mode
In both modes, we show the external ID fields everywhere and we make sure they are always filled with something automatically if the user doesn't enter a value:
- For users we can use the username as external ID. If a user with such an external ID already exists, produce an error.
- For teams, contests, problems, team categories and team affiliations we, when saving the entity:
- Start a transaction
- Save the entity
- If the entity doesn't have any errors and the external ID is empty, set the primary key value on the entity.
- Validate the entity again.
- If this results in a non unique error, display it to the user and rollback the transaction (thus not saving the entity).
- Otherwise save the entity and party.
- For submissions, judgings, runs and clarifications this depends on the mode:
- For primary mode, we are in control and can always use the internal ID as external ID.
- For shadow mode, the shadow will give the data to us, so we can save it. If someone still creates a submission or clarification in our UI, use a UUID to make sure it's unique.
For imports we know we always have the external ID, so we can just set it. Of course we still check for uniqueness.
Doing this I think we can get rid of data source = 0 mode, and thus a lot of code and logic about it. For example the API always can use the external ID field, the event log service becomes easier, UI checks can be gone, etc.
What do you all think?
After thinking about this a bit, I wonder if this remark is what we actually want:
For submissions, judgings, runs and clarifications this depends on the mode:
- For primary mode, we are in control and can always use the internal ID as external ID.
That means we need to do the trick where we save the submission in a transaction to get the primary key ID and then re-save it. How performant is this? It requires two queries for every new submission, judging and run instead of one.
But the only alternative I can think of is to always use UUID's for these, which seems to be annoying as well when referring to a submission in the API, so maybe we should accept that this is a bit slower?
After thinking about this a bit, I wonder if this remark is what we actually want:
For submissions, judgings, runs and clarifications this depends on the mode:
- For primary mode, we are in control and can always use the internal ID as external ID.
That means we need to do the trick where we save the submission in a transaction to get the primary key ID and then re-save it. How performant is this? It requires two queries for every new submission, judging and run instead of one.
But the only alternative I can think of is to always use UUID's for these, which seems to be annoying as well when referring to a submission in the API, so maybe we should accept that this is a bit slower?
Is it easy to do it for only one (let's say run
as that one has the most entries) and compare the difference?
After thinking about this a bit, I wonder if this remark is what we actually want:
For submissions, judgings, runs and clarifications this depends on the mode:
- For primary mode, we are in control and can always use the internal ID as external ID.
That means we need to do the trick where we save the submission in a transaction to get the primary key ID and then re-save it. How performant is this? It requires two queries for every new submission, judging and run instead of one. But the only alternative I can think of is to always use UUID's for these, which seems to be annoying as well when referring to a submission in the API, so maybe we should accept that this is a bit slower?
Is it easy to do it for only one (let's say
run
as that one has the most entries) and compare the difference?
I think we need a real big stress test to really see the difference, which is hard. You will probably not notice it with small loads. But maybe you don't notice it at all... I mean we already do multiple queries on every page / API action.
OK I got the initial thing working quite nicely, and will create a branch to track progress.
Some remarks:
- Judgements and runs don't have an external ID and don't need one; we always expose our internal ones in the API currently. We might want to change this such that in shadow mode we expose the external ones, but those are stored in a different table.
- I have a generic way to generate the external ID from the internal ID now, which requires a small change to the controllers that create the entities, but not much more. This works for all entities where the external ID is auto generated, which basically is all entities you can enter in the UI. It uses a Doctrine event, so even works in the API and will throw a readable error back to the user there.
After the initial commit I will do, I will start removing code that takes care if the distinction in many places.
* Judgements and runs don't have an external ID and don't need one; we always expose our internal ones in the API currently. We might want to change this such that in shadow mode we expose the external ones, but those are stored in a different table.
What would we expose if we are ahead of the primary?
* Judgements and runs don't have an external ID and don't need one; we always expose our internal ones in the API currently. We might want to change this such that in shadow mode we expose the external ones, but those are stored in a different table.
What would we expose if we are ahead of the primary?
Whatever we have already. And if we disagree our scoreboard endpoint wouldn't match our judgements.
But I'm leaving this change out of this feature. This is something we can discuss and implement separately.
After talking with Tobi, we decided to change a bit: Instead of using internal ID's, let's use UUID's for teams, contests, problems, team categories and team affiliations and for users if we have a collision.
For submissions etc, we can use the internal ID when not shadowing and a UUID when shadowing.
After talking with Tobi, we decided to change a bit: Instead of using internal ID's, let's use UUID's for teams, contests, problems, team categories and team affiliations and for users if we have a collision.
For submissions etc, we can use the internal ID when not shadowing and a UUID when shadowing.
Are these UUIDs then displayed? (Which I think would be a horrible thing)
After talking with Tobi, we decided to change a bit: Instead of using internal ID's, let's use UUID's for teams, contests, problems, team categories and team affiliations and for users if we have a collision. For submissions etc, we can use the internal ID when not shadowing and a UUID when shadowing.
Are these UUIDs then displayed? (Which I think would be a horrible thing)
They will be used:
- In the API as ID, so in API URL's
- In the UI as 'external ID'
But that latter thing would make the tables a bit wide, so maybe that indeed is not desired.
I'm fine sticking with the original solution, but the UX is a bit weird if an ID is reused: you will get send back to the form telling you you need to enter an external ID yourself since the auto generated one is in use.
Yeah, I didn't think about the UI. Jaap brings up a good point.
Another option would be to reduce the likelihood that a collision happens, e.g. by not just using our internal IDs but dj-<id>
or similar?
OK I like that idea.
Yeah, I didn't think about the UI. Jaap brings up a good point.
Another option would be to reduce the likelihood that a collision happens, e.g. by not just using our internal IDs but
dj-<id>
or similar?
This is also something that can easily be changed afterwards, right? So maybe first stick with simply reusing the internal id, and if we see that it causes too many collisions, then do somethink else like a dj-
prefix?
Yeah, I didn't think about the UI. Jaap brings up a good point. Another option would be to reduce the likelihood that a collision happens, e.g. by not just using our internal IDs but
dj-<id>
or similar?This is also something that can easily be changed afterwards, right? So maybe first stick with simply reusing the internal id, and if we see that it causes too many collisions, then do somethink else like a
dj-
prefix?
@meisterT had a fair point: imagine you create a team manually first, them import teams from a JSON, the change of a collision are basically 100%, since the JSON will probably have 1..n and those are already used by manually created teams.
And to add to that: this will make sure that we lose data that was added manually as we overwrite it.