domjudge icon indicating copy to clipboard operation
domjudge copied to clipboard

Importing of files with data source = 0 broken

Open nickygerritsen opened this issue 1 year ago • 3 comments

When we import files with data source = 0, we get errors about duplicate keys. This is because we store the ID of the teams, organizations and groups as externalid on the entity, but then use the 'internal' ID to look up stuff.

We could easily change this to always use the external ID to look up, but there are two issues with that:

  1. We don't display the external ID anywhere when data source = 0.
  2. The API uses the internal ID's everywhere with data source = 0.

Since ID's in imports can be any format, we can not store them in our internal, numeric ID column.

I see several options and I don't know which is best:

  1. Disallow importing files with data source = 0.
  2. Always show the external ID field and use that for imports. Don't use it in the API with data source = 0.
  3. Drop data source = 0, making us always require an external ID and thus basically making that the new primary key.

If someone knows a better way I am all ears, since I don't like any of these.

nickygerritsen avatar Jun 22 '23 15:06 nickygerritsen

I would like to get some input from @eldering on this

meisterT avatar Jun 22 '23 15:06 meisterT

Another option:

  1. When data source = 0 only allow numeric ID's and store them as our internal ID. When you supply a non-numeric ID, add a warning saying 'use data source = 1'.

nickygerritsen avatar Jun 22 '23 15:06 nickygerritsen

Michael and me had a discussion, and we think we came up with something that could work such that we can get rid of data source 0, leaving basically two modes:

  • Primary mode
  • Shadow mode

In both modes, we show the external ID fields everywhere and we make sure they are always filled with something automatically if the user doesn't enter a value:

  • For users we can use the username as external ID. If a user with such an external ID already exists, produce an error.
  • For teams, contests, problems, team categories and team affiliations we, when saving the entity:
    • Start a transaction
    • Save the entity
    • If the entity doesn't have any errors and the external ID is empty, set the primary key value on the entity.
    • Validate the entity again.
    • If this results in a non unique error, display it to the user and rollback the transaction (thus not saving the entity).
    • Otherwise save the entity and party.
  • For submissions, judgings, runs and clarifications this depends on the mode:
    • For primary mode, we are in control and can always use the internal ID as external ID.
    • For shadow mode, the shadow will give the data to us, so we can save it. If someone still creates a submission or clarification in our UI, use a UUID to make sure it's unique.

For imports we know we always have the external ID, so we can just set it. Of course we still check for uniqueness.

Doing this I think we can get rid of data source = 0 mode, and thus a lot of code and logic about it. For example the API always can use the external ID field, the event log service becomes easier, UI checks can be gone, etc.

What do you all think?

nickygerritsen avatar Jan 12 '24 14:01 nickygerritsen

After thinking about this a bit, I wonder if this remark is what we actually want:

For submissions, judgings, runs and clarifications this depends on the mode:

  • For primary mode, we are in control and can always use the internal ID as external ID.

That means we need to do the trick where we save the submission in a transaction to get the primary key ID and then re-save it. How performant is this? It requires two queries for every new submission, judging and run instead of one.

But the only alternative I can think of is to always use UUID's for these, which seems to be annoying as well when referring to a submission in the API, so maybe we should accept that this is a bit slower?

nickygerritsen avatar May 20 '24 18:05 nickygerritsen

After thinking about this a bit, I wonder if this remark is what we actually want:

For submissions, judgings, runs and clarifications this depends on the mode:

  • For primary mode, we are in control and can always use the internal ID as external ID.

That means we need to do the trick where we save the submission in a transaction to get the primary key ID and then re-save it. How performant is this? It requires two queries for every new submission, judging and run instead of one.

But the only alternative I can think of is to always use UUID's for these, which seems to be annoying as well when referring to a submission in the API, so maybe we should accept that this is a bit slower?

Is it easy to do it for only one (let's say run as that one has the most entries) and compare the difference?

vmcj avatar May 20 '24 18:05 vmcj

After thinking about this a bit, I wonder if this remark is what we actually want:

For submissions, judgings, runs and clarifications this depends on the mode:

  • For primary mode, we are in control and can always use the internal ID as external ID.

That means we need to do the trick where we save the submission in a transaction to get the primary key ID and then re-save it. How performant is this? It requires two queries for every new submission, judging and run instead of one. But the only alternative I can think of is to always use UUID's for these, which seems to be annoying as well when referring to a submission in the API, so maybe we should accept that this is a bit slower?

Is it easy to do it for only one (let's say run as that one has the most entries) and compare the difference?

I think we need a real big stress test to really see the difference, which is hard. You will probably not notice it with small loads. But maybe you don't notice it at all... I mean we already do multiple queries on every page / API action.

nickygerritsen avatar May 20 '24 18:05 nickygerritsen

OK I got the initial thing working quite nicely, and will create a branch to track progress.

Some remarks:

  • Judgements and runs don't have an external ID and don't need one; we always expose our internal ones in the API currently. We might want to change this such that in shadow mode we expose the external ones, but those are stored in a different table.
  • I have a generic way to generate the external ID from the internal ID now, which requires a small change to the controllers that create the entities, but not much more. This works for all entities where the external ID is auto generated, which basically is all entities you can enter in the UI. It uses a Doctrine event, so even works in the API and will throw a readable error back to the user there.

After the initial commit I will do, I will start removing code that takes care if the distinction in many places.

nickygerritsen avatar May 21 '24 20:05 nickygerritsen

* Judgements and runs don't have an external ID and don't need one; we always expose our internal ones in the API currently. We might want to change this such that in shadow mode we expose the external ones, but those are stored in a different table.

What would we expose if we are ahead of the primary?

vmcj avatar May 22 '24 05:05 vmcj

* Judgements and runs don't have an external ID and don't need one; we always expose our internal ones in the API currently. We might want to change this such that in shadow mode we expose the external ones, but those are stored in a different table.

What would we expose if we are ahead of the primary?

Whatever we have already. And if we disagree our scoreboard endpoint wouldn't match our judgements.

But I'm leaving this change out of this feature. This is something we can discuss and implement separately.

nickygerritsen avatar May 22 '24 08:05 nickygerritsen

After talking with Tobi, we decided to change a bit: Instead of using internal ID's, let's use UUID's for teams, contests, problems, team categories and team affiliations and for users if we have a collision.

For submissions etc, we can use the internal ID when not shadowing and a UUID when shadowing.

nickygerritsen avatar May 25 '24 13:05 nickygerritsen

After talking with Tobi, we decided to change a bit: Instead of using internal ID's, let's use UUID's for teams, contests, problems, team categories and team affiliations and for users if we have a collision.

For submissions etc, we can use the internal ID when not shadowing and a UUID when shadowing.

Are these UUIDs then displayed? (Which I think would be a horrible thing)

eldering avatar May 25 '24 13:05 eldering

After talking with Tobi, we decided to change a bit: Instead of using internal ID's, let's use UUID's for teams, contests, problems, team categories and team affiliations and for users if we have a collision. For submissions etc, we can use the internal ID when not shadowing and a UUID when shadowing.

Are these UUIDs then displayed? (Which I think would be a horrible thing)

They will be used:

  • In the API as ID, so in API URL's
  • In the UI as 'external ID'

But that latter thing would make the tables a bit wide, so maybe that indeed is not desired.

I'm fine sticking with the original solution, but the UX is a bit weird if an ID is reused: you will get send back to the form telling you you need to enter an external ID yourself since the auto generated one is in use.

nickygerritsen avatar May 25 '24 13:05 nickygerritsen

Yeah, I didn't think about the UI. Jaap brings up a good point.

Another option would be to reduce the likelihood that a collision happens, e.g. by not just using our internal IDs but dj-<id> or similar?

meisterT avatar May 25 '24 13:05 meisterT

OK I like that idea.

nickygerritsen avatar May 25 '24 13:05 nickygerritsen

Yeah, I didn't think about the UI. Jaap brings up a good point.

Another option would be to reduce the likelihood that a collision happens, e.g. by not just using our internal IDs but dj-<id> or similar?

This is also something that can easily be changed afterwards, right? So maybe first stick with simply reusing the internal id, and if we see that it causes too many collisions, then do somethink else like a dj- prefix?

eldering avatar May 25 '24 14:05 eldering

Yeah, I didn't think about the UI. Jaap brings up a good point. Another option would be to reduce the likelihood that a collision happens, e.g. by not just using our internal IDs but dj-<id> or similar?

This is also something that can easily be changed afterwards, right? So maybe first stick with simply reusing the internal id, and if we see that it causes too many collisions, then do somethink else like a dj- prefix?

@meisterT had a fair point: imagine you create a team manually first, them import teams from a JSON, the change of a collision are basically 100%, since the JSON will probably have 1..n and those are already used by manually created teams.

nickygerritsen avatar May 25 '24 14:05 nickygerritsen

And to add to that: this will make sure that we lose data that was added manually as we overwrite it.

meisterT avatar May 25 '24 14:05 meisterT