uberon icon indicating copy to clipboard operation
uberon copied to clipboard

Automatic ID allocation for AI-created PRs

Open gouttegd opened this issue 7 months ago • 3 comments

Introduction

The use of AI agents (such as Dragon AI) that automatically open pull requests to change the ontology creates a particular problem regarding the allocation of term IDs.

As soon as the agent has more than one PRs open at the same time, and if it creates new terms in those PRs, it runs into the risk of generating conflicting IDs across the PRs.

This is a problem that theoretically also exists for human editors, but for human editors it is mitigated in two ways:

  1. All editors have their own ID range, so even if two editors create each a PR at around the same, they cannot cause ID conflicts.

  2. Protégé (used by most if not all human editors) can keep track of used IDs independently of Git (so, even across PRs), so even if one single editor is working on different PRs simultaneously, conflicts should still be avoided.

With an AI agent, both those mitigations are unavailable: all PRs from an agent will use the same ID range, and all invocations of the agent happen in their own context (their own branch), where the agent is not aware (cannot be aware) of what is happening in the other contexts.

For now, people using the agent have been working around the issue by basically keeping track themselves of which IDs have been used in presently open PRs, and explicitly telling the agent not to use these IDs (example here). Needless to say, this is hardly a satisfying solution.

So, in this ticket I want to discuss a better solution.

Principle

The basic idea is the one proposed by @balhoff in this ROBOT ticket.

  1. Let the agent create new terms using temporary IDs picked in a distinguishable namespace.

  2. Once the PR is ready (all checks passed, all reviewers happy with the changes, etc.), before it is merged (or just after it has been merged; see later for a discussion about the options), temporary IDs are automatically replaced by newly minted definitive IDs.

An implementation of the proposed mint command is available in the KGCL plugin for ROBOT, which is itself available in the ODK.

Of note, contrary to the original idea from @balhoff, here I’d suggest that the temporary ID namespace be merely a dedicated range within the normal UBERON namespace, rather than a completely different namespace such as http://purl.obolibrary.org/temp#, and also that the IDs picked within that range look like “normal IDs”, rather than UUIDs. This is to avoid making the PRs harder to review. All that matters for the mint command to work is that the temporary IDs start by a distinguishable prefix, which can easily be achieved simply by deciding, for example, that all IDs within the 9900000–10000000 range are “temporary”. Then temporary IDs can be distinguished simply by their http://purl.obolibrary.org/obo/UBERON_99 prefix, but they would still look like “normal” IDs for reviewers (instead of http://purl.obolibrary.org/temp#82CBF624-B72F-4F23-8E77-640D96EAC86F).

Implementation

Allocating ID ranges

We would need to allocate two ID ranges:

  • One range for the temporary IDs, such as 9900000–10000000 as proposed above (but in reality the exact range does not actually matter, the approach would work just as fine with any other range). This is the range that the AI agents should be instructed to use.
  • One range for the definitive IDs (again, can be any range).

Both ranges can (and in fact, should) be properly added to the Uberon ID range file:

Datatype: idrange:XX
    Annotations:
        allocatedto: "Temporary IDs"
    EquivalentTo:
        xsd:integer[>= 9900000, < 10000000]

Datatype: idrange:XY
    Annotations:
        allocatedto: "Definitive auto-allocated IDs"
    EquivalentTo:
        xsd:integer[>= 1200000, < 1300000]

Replacing temporary IDs by definitive IDs

This is the matter of a single command (assuming the KGCL ROBOT plugin is available, which would be the case if the ODK is used):

robot --catalog catalog-v001.xml \
      kgcl:mint -i uberon-edit.obo \
                --temp-id-prefix http://purl.obolibrary.org/obo/UBERON_99 \
                --id-range-name "Definitive auto-allocated IDs" \
      convert -f obo --check false -o uberon-edit.obo

Of course the command could conveniently be hidden away by a helper rule in the custom Makefile:

PHONY: allocate-definitive-ids
allocate-definitive-ids: | all_robot_plugins
        $(ROBOT) kgcl:mint -i $(SRC) \
                           --temp-id-prefix http://purl.obolibrary.org/obo/UBERON_99 \
                           --id-range-name "Definitive auto-allocated IDs" \
                 convert -f obo --check false -o $(SRC)

so that editors (or a GitHub Action, see below) would just need to run

sh run.sh make allocate-definitive-ids

When should temporary IDs be replaced by definitive IDs?

This I believe is the main question raised by this whole approach. Options are:

A) Before a PR is merged.

In that case, the difficulty is that the ID replacement should really occur just before the PR is merged. If temporary IDs are replaced, and then the PR is left for a while before being merged, then we again risk running into the very conflicts we want to avoid (if another PR with auto-generated IDs is merged in the meantime). Also, this requires that the PR be up-to-date with the master branch, at the time the ID replacement is done.

Unfortunately, AFAIK it is not possible to trigger a GitHub Actions workflow that runs exactly just before a PR is merged. So this would have to rely on whoever is merging the PR: Once they know that they want to merge the PR (all checks passed, all comments addressed, etc.), that person would need to manually trigger the ID replacement workflow, wait for it to finish (it shouldn’t take more than a couple of minutes), and then immediately merge the PR.

B) After a PR is merged.

This would have the advantage that it can be fully automatized. We can have a workflow that runs whenever commits are pushed to the master branch (which automatically covers the case when a PR is merged), performs the ID replacement, and immediately commits it to the master branch.

One drawback would be that the temporary IDs would always end up in the history of the master branch, which would make the history more cumbersome to peruse (whereas if we replace IDs before merging, we have the option of then making a “squash merge”, so that only the definitive IDs would end up in the history of the master branch).

Also, this would require some configuration on the GitHub repository, because in Uberon the master branch is “protected”, meaning a GitHub Action would not be allowed to push a commit to it.

gouttegd avatar Jun 17 '25 16:06 gouttegd

@gouttegd I had originally suggested UUIDs so that there was no chance of identifier collision (at least, astronomically unlikely). There may be some situation where two work branches are merged prior to merging to master.

balhoff avatar Jun 17 '25 18:06 balhoff

Yes, I understand the reason. And I am open to use UUIDs as originally proposed – from the point of view of the mint command it doesn’t matter what the temporary IDs are, as long as they have a distinguishing prefix.

But I have several concerns with UUIDs as temporary IDs.

  • It will be harder to review the PRs if they contain stuff like
+[Term]
+id: http://purl.obolibrary.org/temp#88084323-77f1-484c-8bb7-1929ee470b4b
+name: my new term
+...
+[Term]
+id: http://purl.obolibrary.org/temp#b7acd290-24f4-40c9-bcb5-332f59fefb2b
+name: another new term
+is_a: http://purl.obolibrary.org/temp#88084323-77f1-484e-8bb7-1929ee470b4b
+...

compared to something like

+[Term]
+id: UBERON:9900001
+name: my new term
+...
+[Term]
+id: UBERON:9900002
+name: another new term
+is_a: UBERON:9900001
+...
  • I don’t particularly trust a LLM not to make mistakes when referencing a temporary ID (if they need not only to create a new term but to insert axioms that use the new term, as in the example above). Reviewers will need to be very careful that all ID references are correct, which would be much more difficult with UUIDs.

  • I also don’t particularly trust a LLM to be efficient at generating random numbers, so I am not even sure the risk of collision would really be “astronomically unlikely”.

But, as said, I am still open to use UUIDs in the end, if other people disagree that those concerns are blocking.

gouttegd avatar Jun 17 '25 19:06 gouttegd

I like this.

I wouldn't trust the LLM to make the random number. I would have other orchestrating process inject instructions like "use UBERON:99NNN IDs", and then (deterministcally) check after the fact. But the essence of the proposal remains the same.

cmungall avatar Jun 17 '25 20:06 cmungall

Now I read the mature PR I withdraw concerns about the random number part, as the proposed workflow protects against inter-PR collisions. I don't have major concerns with intra-PR collisions, in the unlikely event they happen things will be flagged anyway due to e.g. duplicate labels when collided terms are merged

cmungall avatar Jul 02 '25 14:07 cmungall