Collect_pull_requests : IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"
We are seeing about 50 of these post duplicate script. I will provide examples but if it is not already, pull_request natural key to be based off of pr_src_id might solve this problem, similar to #3046
Errors:
IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/quarkusio/quarkus-release/pulls/4) already exists.\n')
IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/apache/incubator-kie-drools/pulls/1789) already exists.\n')
IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/apache/incubator-kie-kogito-runtimes/pulls/2787) already exists.\n')
IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/sigstore/sigstore-go/pulls/76) already exists.\n')
IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/osbuild/osbuild.github.io/pulls/23) already exists.\n')
IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/sigstore/ruby-sigstore/pulls/34) already exists.\n')
IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"\nDETAIL: Key (pr_url)=(https://api.github.com/repos/huggingface/lm-evaluation-harness/pulls/11) already exists.\n')
This issue is certainly a result of ongoing duplicates, and a byproduct of the conservative, "minimize destruction" nature of the the "old augur instance clean up script" for PRs, etc. @cdolfi : Do you think the solution is to take a more aggressive stance with duplicates that remain after the first script? This would involve a simpler approach of simply deleting remaining repositories that do not have repo_src_id populated.
The first script already goes in and tries to populate any NULLs, erroring, of course, if that repo_src_id is already on another repo.
@sgoggins This issue has nothing to do with repo_src_ids from what I can see. From investigation these examples do not have duplicates in our database currently, its that its trying to insert a duplicate because it does not realize it is in there. These all have pr_src_ids in there already, thats why my hypothesis is that if the natural key for prs is changed like it was in #3046 it would solve this problem
@cdolfi : I have not observed a duplicate on the PR table in the absence of a duplicate repo (pre-repo_src_id versions of Augur that started collecting before we addressed the common movement of repos) ... Does every repo in your your instance have a repo_src_id?
Try:
select * from augur_data.pull_requests where pr_url = 'https://api.github.com/repos/sigstore/sigstore-go/pulls/76'
I have not seen a case, yet, where checking for duplicates in this way did not in fact show that:
- That URL is already in the pull_requests table and
- Not for the repo_id that's throwing the error
If you could check those things, then we can be sure about whether or not this is different from previously observed issues.
@sgoggins yes that is the query I used to test each error to see if there was already more than one PR. For each one including the example you have provided, there is only 1 pull_request result, the problem is that it is trying to insert it again. I just double checked to make sure. It is very reminiscent of the error we hit with pull_request_events a few months ago when all the last_collect_on dates were set to null.
We do have some repo_src_ids that are null, all of them I have individually checked are ones where the repo no longer exist. For these errors all of them have repo_src_ids
https://github.com/chaoss/augur/blob/7f8f04928b00dc69c5b000e6ac288c063f75b3b9/augur/tasks/github/pull_requests/tasks.py#L122 Changing this to be just based off of pr_src_id might fix the issue
this was reported by a contributor today as well:
File "/Users/admin/augur_venv/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pull-request-insert-unique"
DETAIL: Key (pr_url)=(https://api.github.com/repos/z-galaxy/busd/pulls/271) already exists.
Seems like this issue may be related to adding repos via the CLI and/or CSV files
per @sgoggins analysis:
augur/tasks/frontend.py has the method called first. (add_new_github_repos) augur/application/cli/db.py seems to handle the insertion of repos at the command line (add_repos)
A brief examination of the code, [...] suggests to me that you possibly added repositories using the command line tools? I say that because it "looks like" we actually already have the GH ID in the table, and we are checking for duplicates in that interface.
pull-request-insert-unique as a constraint is defined at https://github.com/chaoss/augur/blob/4221cba76e6ce93e8d4e2f2676f1525bfeacc8ca/augur/application/db/models/augur_data.py#L1696
It looks like the underlying issue is likely to be a duplicate URL issue
augur/augur/tasks/github/pull_requests/tasks.py
Line 122 in 7f8f049
pr_natural_keys = ["repo_id", "pr_src_id"] Changing this to be just based off of pr_src_id might fix the issue
@MoralCode good catch, I think changing what you pointed out and this to be based on pr_src_id will solve the issue
the two lines above it are already constraints that ensure uniqueness beased on repo ID and PR source ID (why they are duplicated i dont know, i also suspect we have unnecessary indexes but those are subjects for different issues): https://github.com/chaoss/augur/blob/4221cba76e6ce93e8d4e2f2676f1525bfeacc8ca/augur/application/db/models/augur_data.py#L1693-L1697
@MoralCode but in atleast some of the examples above that is part of the issues with repos that were loaded in pre augur fixes that prevented the same repo getting loaded in twice bc of different name/capitalization
So, a pr could have gotten loaded in under a different repo_id, then when duplicates were merged they are now under a different repo_id but the same src_id and then it tries to insert it but its already there. The error examples listed are actually of augur trying to insert data that already exists in the db
This issue has been Solved for repos added via the frontend/web UI (in #2286 and #3056) It will soon be solved for repos added via the CI (Seans observations are likely a different bug)
If it is still being seen in current/"fixed" versions, that means you were running a version of augur prior to the fixes, meaning that your augur instance inadvertently allowed a duplicate repository to be created in your database (i.e. you have the URL of the repo both before and after the move as separate entries in the repos table of your DB).
We need to provide documentation and guidance on how people can migrate if they face this situation. Fix scripts might be https://github.com/chaoss/augur/issues/3056#issuecomment-2725168060