augur icon indicating copy to clipboard operation
augur copied to clipboard

Collect_pull_requests : IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"

Open cdolfi opened this issue 6 months ago • 12 comments

We are seeing about 50 of these post duplicate script. I will provide examples but if it is not already, pull_request natural key to be based off of pr_src_id might solve this problem, similar to #3046

Errors:

IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/quarkusio/quarkus-release/pulls/4) already exists.\n')

IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/apache/incubator-kie-drools/pulls/1789) already exists.\n')

IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/apache/incubator-kie-kogito-runtimes/pulls/2787) already exists.\n')

IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/sigstore/sigstore-go/pulls/76) already exists.\n')

IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/osbuild/osbuild.github.io/pulls/23) already exists.\n')

IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/sigstore/ruby-sigstore/pulls/34) already exists.\n')

IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/huggingface/lm-evaluation-harness/pulls/11) already exists.\n')

cdolfi avatar Jun 16 '25 17:06 cdolfi

This issue is certainly a result of ongoing duplicates, and a byproduct of the conservative, "minimize destruction" nature of the the "old augur instance clean up script" for PRs, etc. @cdolfi : Do you think the solution is to take a more aggressive stance with duplicates that remain after the first script? This would involve a simpler approach of simply deleting remaining repositories that do not have repo_src_id populated.

The first script already goes in and tries to populate any NULLs, erroring, of course, if that repo_src_id is already on another repo.

sgoggins avatar Jun 16 '25 17:06 sgoggins

@sgoggins This issue has nothing to do with repo_src_ids from what I can see. From investigation these examples do not have duplicates in our database currently, its that its trying to insert a duplicate because it does not realize it is in there. These all have pr_src_ids in there already, thats why my hypothesis is that if the natural key for prs is changed like it was in #3046 it would solve this problem

cdolfi avatar Jun 16 '25 17:06 cdolfi

@cdolfi : I have not observed a duplicate on the PR table in the absence of a duplicate repo (pre-repo_src_id versions of Augur that started collecting before we addressed the common movement of repos) ... Does every repo in your your instance have a repo_src_id?

Try:

select * from augur_data.pull_requests where pr_url = 'https://api.github.com/repos/sigstore/sigstore-go/pulls/76'

I have not seen a case, yet, where checking for duplicates in this way did not in fact show that:

  1. That URL is already in the pull_requests table and
  2. Not for the repo_id that's throwing the error

If you could check those things, then we can be sure about whether or not this is different from previously observed issues.

sgoggins avatar Jun 16 '25 17:06 sgoggins

@sgoggins yes that is the query I used to test each error to see if there was already more than one PR. For each one including the example you have provided, there is only 1 pull_request result, the problem is that it is trying to insert it again. I just double checked to make sure. It is very reminiscent of the error we hit with pull_request_events a few months ago when all the last_collect_on dates were set to null.

We do have some repo_src_ids that are null, all of them I have individually checked are ones where the repo no longer exist. For these errors all of them have repo_src_ids

cdolfi avatar Jun 16 '25 18:06 cdolfi

https://github.com/chaoss/augur/blob/7f8f04928b00dc69c5b000e6ac288c063f75b3b9/augur/tasks/github/pull_requests/tasks.py#L122 Changing this to be just based off of pr_src_id might fix the issue

cdolfi avatar Jun 17 '25 19:06 cdolfi

this was reported by a contributor today as well:

File "/Users/admin/augur_venv/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
  cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"
DETAIL: Key (pr_url)=(https://api.github.com/repos/z-galaxy/busd/pulls/271) already exists.

MoralCode avatar Oct 29 '25 18:10 MoralCode

Seems like this issue may be related to adding repos via the CLI and/or CSV files

per @sgoggins analysis:

augur/tasks/frontend.py has the method called first. (add_new_github_repos) augur/application/cli/db.py seems to handle the insertion of repos at the command line (add_repos)

A brief examination of the code, [...] suggests to me that you possibly added repositories using the command line tools? I say that because it "looks like" we actually already have the GH ID in the table, and we are checking for duplicates in that interface.

MoralCode avatar Oct 29 '25 18:10 MoralCode

pull-request-insert-unique as a constraint is defined at https://github.com/chaoss/augur/blob/4221cba76e6ce93e8d4e2f2676f1525bfeacc8ca/augur/application/db/models/augur_data.py#L1696

It looks like the underlying issue is likely to be a duplicate URL issue

MoralCode avatar Oct 31 '25 14:10 MoralCode

augur/augur/tasks/github/pull_requests/tasks.py

Line 122 in 7f8f049

pr_natural_keys = ["repo_id", "pr_src_id"] Changing this to be just based off of pr_src_id might fix the issue

@MoralCode good catch, I think changing what you pointed out and this to be based on pr_src_id will solve the issue

cdolfi avatar Oct 31 '25 14:10 cdolfi

the two lines above it are already constraints that ensure uniqueness beased on repo ID and PR source ID (why they are duplicated i dont know, i also suspect we have unnecessary indexes but those are subjects for different issues): https://github.com/chaoss/augur/blob/4221cba76e6ce93e8d4e2f2676f1525bfeacc8ca/augur/application/db/models/augur_data.py#L1693-L1697

MoralCode avatar Oct 31 '25 15:10 MoralCode

@MoralCode but in atleast some of the examples above that is part of the issues with repos that were loaded in pre augur fixes that prevented the same repo getting loaded in twice bc of different name/capitalization

So, a pr could have gotten loaded in under a different repo_id, then when duplicates were merged they are now under a different repo_id but the same src_id and then it tries to insert it but its already there. The error examples listed are actually of augur trying to insert data that already exists in the db

cdolfi avatar Oct 31 '25 15:10 cdolfi

This issue has been Solved for repos added via the frontend/web UI (in #2286 and #3056) It will soon be solved for repos added via the CI (Seans observations are likely a different bug)

If it is still being seen in current/"fixed" versions, that means you were running a version of augur prior to the fixes, meaning that your augur instance inadvertently allowed a duplicate repository to be created in your database (i.e. you have the URL of the repo both before and after the move as separate entries in the repos table of your DB).

We need to provide documentation and guidance on how people can migrate if they face this situation. Fix scripts might be https://github.com/chaoss/augur/issues/3056#issuecomment-2725168060

MoralCode avatar Oct 31 '25 19:10 MoralCode