Fixes for repo url update on move detection

Open MoralCode opened this issue 1 month ago • 9 comments

Description

@cdolfi has reported an issue (#3129) where repos that have moved, and that redirect when visited in a browser, are not having their URLs updated to reflect the move.

I used an AI tool to look over the relevant task and flag issues for review. It identified that the hit_api function used for the API calls was internally passing follow_redirects=True to the underlying HTTP library.

This explains why the repo URLs weren't being updated: all GitHub calls were automatically following the redirect, so the check for response_code == 301 later in the code would practically never be reached.
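
As a rough illustration of the check that becomes reachable once redirects are no longer followed, here is a minimal sketch. The helper name `get_move_target` is hypothetical and is not Augur's actual code; it only shows a 301 check with a case-insensitive Location lookup, using the URLs from the log further down as sample data.

```python
from typing import Mapping, Optional

MOVED_PERMANENTLY = 301


def get_move_target(status_code: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the redirect target if the response signals a move, else None.

    Hypothetical helper for illustration only.
    """
    if status_code != MOVED_PERMANENTLY:
        return None
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    location = lowered.get("location")
    if not location:
        raise ValueError("301 response did not include a Location header")
    return location


# When the client follows redirects itself, the caller only sees the final 200.
print(get_move_target(200, {}))
# When redirects are not followed, the 301 and its Location header are visible.
print(get_move_target(301, {"LOCATION": "https://github.com/classclock/API"}))
```

With redirects followed, the first case is the only one the calling code ever observes, which matches the behaviour described above.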

This is related to #3129 (but also slightly exacerbates it because it doesn't yet store the old URL).

Notes for Reviewers

I have yet to test this locally. I'm trying to see if I can replicate the issue, although I have a lot of repos in my local instance now that I should probably clear out.

Signed commits

  • [X] Yes, I signed my commits.

Generative AI disclosure

  • [X] This contribution was assisted or created by Generative AI tools.
    • GPT-5 was used through the chat feature of Cursor to provide an initial summary of problems that were then reviewed by me, leading to the set of fixes present in this PR. The only generated code in this pr is in commit 12d7be2cf03850dcdd24c5bb60cf81a0949f96cf, where Cursor helped suggest some code to log and raise an error if, for some crazy reason, a 301 response comes back without a Location header. This code was reviewed and built upon by me (to make it casing-agnostic) before submitting.

MoralCode avatar Nov 11 '25 18:11 MoralCode

@MoralCode : Your root cause analysis sounds DEAD ON, and explains why this appears not to be working in the case of automatic moving. This will fix that issue.

sgoggins avatar Nov 11 '25 18:11 sgoggins

Still waiting on testing this to 100% confirm that this will update the repo_git url

MoralCode avatar Nov 13 '25 16:11 MoralCode

Still waiting on testing this to 100% confirm that this will update the repo_git url

Will hold off on marking it ready and merging until you green light it then.

sgoggins avatar Nov 13 '25 18:11 sgoggins

Current issues with this: seems to be a dependence on using the repo URL for querying:

[augur]        | 2025-11-18 23:04:30 cd3ac88591d1 augur_collection_monitor[276] INFO Setting github repo core status to collecting for repo: https://github.com/moralcode/classclockapi
[augur]        | [2025-11-18 23:04:30,464: INFO/MainProcess] Task augur.tasks.github.detect_move.tasks.detect_github_repo_move_core[38307c62-7947-495d-ac2f-35d8bd2bb241] received
[augur]        | 2025-11-18 23:04:30 cd3ac88591d1 detect_github_repo_move_core[280] INFO Starting repo_move operation with https://github.com/moralcode/classclockapi
[augur]        | 2025-11-18 23:04:30 cd3ac88591d1 augur_collection_monitor[276] INFO Starting collection on 0 secondary repos
[augur]        | 2025-11-18 23:04:30 cd3ac88591d1 augur_collection_monitor[276] INFO Starting collection on 0 facade repos
[augur]        | 2025-11-18 23:04:30 cd3ac88591d1 detect_github_repo_move_core[280] INFO Pinging repo: https://github.com/moralcode/classclockapi
[augur]        | [2025-11-18 23:04:30,474: INFO/ForkPoolWorker-2] Task augur.tasks.start_tasks.augur_collection_monitor[bb2d2553-8919-4da8-ac60-e628d826b9c3] succeeded in 0.09832821798045188s: None
[augur]        | 2025-11-18 23:04:30 cd3ac88591d1 detect_github_repo_move_core[280] INFO Retrieved 1 github api keys for use
[augur]        | 2025-11-18 23:04:30 cd3ac88591d1 detect_github_repo_move_core[280] DEBUG Key used for request (masked): ghp_EA******kQm
[augur]        | 2025-11-18 23:04:30 cd3ac88591d1 analyze_commits_in_parallel[281] DEBUG Analyzing commit 1f9454ebee957364e9073378ad2562b37f9c0394 for repo_id=1
[augur]        | 2025-11-18 23:04:30 cd3ac88591d1 detect_github_repo_move_core[280] DEBUG Key used for request (masked): ghp_EA******kQm
[augur]        | 2025-11-18 23:04:30 cd3ac88591d1 detect_github_repo_move_core[280] INFO Updated repo for https://github.com/classclock/API
[augur]        | 
[augur]        | [2025-11-18 23:04:30,942: WARNING/ForkPoolWorker-2] 2025-11-18 23:04:30 cd3ac88591d1 core_task_failure[280] ERROR Task 38307c62-7947-495d-ac2f-35d8bd2bb241 raised exception: ERROR: Repo has moved! Resetting Collection!
[augur]        |  Traceback: Traceback (most recent call last):
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 453, in trace_task
[augur]        |     R = retval = fun(*args, **kwargs)
[augur]        |                  ^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 736, in __protected_call__
[augur]        |     return self.run(*args, **kwargs)
[augur]        |            ^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/augur/tasks/github/detect_move/tasks.py", line 27, in detect_github_repo_move_core
[augur]        |     ping_github_for_repo_move(session, key_auth, repo, logger)
[augur]        |   File "/augur/augur/tasks/github/detect_move/core.py", line 89, in ping_github_for_repo_move
[augur]        |     raise Exception("ERROR: Repo has moved! Resetting Collection!")
[augur]        | Exception: ERROR: Repo has moved! Resetting Collection!
[augur]        | [2025-11-18 23:04:30,943: WARNING/ForkPoolWorker-2] 2025-11-18 23:04:30 cd3ac88591d1 core_task_failure[280] INFO Repo git: https://github.com/moralcode/classclockapi
[augur]        | [2025-11-18 23:04:30,955: WARNING/ForkPoolWorker-2] /augur/.venv/lib/python3.11/site-packages/celery/app/trace.py:662: RuntimeWarning: Exception raised outside body: NoResultFound('No row was found when one was required'):
[augur]        | Traceback (most recent call last):
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 453, in trace_task
[augur]        |     R = retval = fun(*args, **kwargs)
[augur]        |                  ^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 736, in __protected_call__
[augur]        |     return self.run(*args, **kwargs)
[augur]        |            ^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/augur/tasks/github/detect_move/tasks.py", line 27, in detect_github_repo_move_core
[augur]        |     ping_github_for_repo_move(session, key_auth, repo, logger)
[augur]        |   File "/augur/augur/tasks/github/detect_move/core.py", line 89, in ping_github_for_repo_move
[augur]        |     raise Exception("ERROR: Repo has moved! Resetting Collection!")
[augur]        | Exception: ERROR: Repo has moved! Resetting Collection!
[augur]        | 
[augur]        | During handling of the above exception, another exception occurred:
[augur]        | 
[augur]        | Traceback (most recent call last):
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 470, in trace_task
[augur]        |     I, R, state, retval = on_error(task_request, exc)
[augur]        |                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 381, in on_error
[augur]        |     R = I.handle_error_state(
[augur]        |         ^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 175, in handle_error_state
[augur]        |     return {
[augur]        |            ^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 233, in handle_failure
[augur]        |     task.on_failure(exc, req.id, req.args, req.kwargs, einfo)
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 107, in on_failure
[augur]        |     self.augur_handle_task_failure(exc, task_id, repo_git, "core_task_failure")
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 90, in augur_handle_task_failure
[augur]        |     repo = session.query(Repo).filter(Repo.repo_git == repo_git).one()
[augur]        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/query.py", line 2798, in one
[augur]        |     return self._iter().one()  # type: ignore
[augur]        |            ^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 1827, in one
[augur]        |     return self._only_one_row(
[augur]        |            ^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 760, in _only_one_row
[augur]        |     raise exc.NoResultFound(
[augur]        | sqlalchemy.exc.NoResultFound: No row was found when one was required
[augur]        | 
[augur]        |   warn(RuntimeWarning(
[augur]        | 
[augur]        | [2025-11-18 23:04:30,983: WARNING/ForkPoolWorker-2] 2025-11-18 23:04:30,983,983ms [PID: 280] core_task_failure [ERROR] Task 38307c62-7947-495d-ac2f-35d8bd2bb241 raised exception: No row was found when one was required
[augur]        |  Traceback: Traceback (most recent call last):
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 453, in trace_task
[augur]        |     R = retval = fun(*args, **kwargs)
[augur]        |                  ^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 736, in __protected_call__
[augur]        |     return self.run(*args, **kwargs)
[augur]        |            ^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/augur/tasks/github/detect_move/tasks.py", line 27, in detect_github_repo_move_core
[augur]        |     ping_github_for_repo_move(session, key_auth, repo, logger)
[augur]        |   File "/augur/augur/tasks/github/detect_move/core.py", line 89, in ping_github_for_repo_move
[augur]        |     raise Exception("ERROR: Repo has moved! Resetting Collection!")
[augur]        | Exception: ERROR: Repo has moved! Resetting Collection!
[augur]        | 
[augur]        | During handling of the above exception, another exception occurred:
[augur]        | 
[augur]        | Traceback (most recent call last):
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 470, in trace_task
[augur]        |     I, R, state, retval = on_error(task_request, exc)
[augur]        |                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 381, in on_error
[augur]        |     R = I.handle_error_state(
[augur]        |         ^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 175, in handle_error_state
[augur]        |     return {
[augur]        |            ^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 233, in handle_failure
[augur]        |     task.on_failure(exc, req.id, req.args, req.kwargs, einfo)
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 107, in on_failure
[augur]        |     self.augur_handle_task_failure(exc, task_id, repo_git, "core_task_failure")
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 90, in augur_handle_task_failure
[augur]        |     repo = session.query(Repo).filter(Repo.repo_git == repo_git).one()
[augur]        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/query.py", line 2798, in one
[augur]        |     return self._iter().one()  # type: ignore
[augur]        |            ^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 1827, in one
[augur]        |     return self._only_one_row(
[augur]        |            ^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 760, in _only_one_row
[augur]        |     raise exc.NoResultFound(
[augur]        | sqlalchemy.exc.NoResultFound: No row was found when one was required
[augur]        | [2025-11-18 23:04:30,983: WARNING/ForkPoolWorker-2] 2025-11-18 23:04:30 cd3ac88591d1 core_task_failure[280] ERROR Task 38307c62-7947-495d-ac2f-35d8bd2bb241 raised exception: No row was found when one was required
[augur]        |  Traceback: Traceback (most recent call last):
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 453, in trace_task
[augur]        |     R = retval = fun(*args, **kwargs)
[augur]        |                  ^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 736, in __protected_call__
[augur]        |     return self.run(*args, **kwargs)
[augur]        |            ^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/augur/tasks/github/detect_move/tasks.py", line 27, in detect_github_repo_move_core
[augur]        |     ping_github_for_repo_move(session, key_auth, repo, logger)
[augur]        |   File "/augur/augur/tasks/github/detect_move/core.py", line 89, in ping_github_for_repo_move
[augur]        |     raise Exception("ERROR: Repo has moved! Resetting Collection!")
[augur]        | Exception: ERROR: Repo has moved! Resetting Collection!
[augur]        | 
[augur]        | During handling of the above exception, another exception occurred:
[augur]        | 
[augur]        | Traceback (most recent call last):
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 470, in trace_task
[augur]        |     I, R, state, retval = on_error(task_request, exc)
[augur]        |                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 381, in on_error
[augur]        |     R = I.handle_error_state(
[augur]        |         ^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 175, in handle_error_state
[augur]        |     return {
[augur]        |            ^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 233, in handle_failure
[augur]        |     task.on_failure(exc, req.id, req.args, req.kwargs, einfo)
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 107, in on_failure
[augur]        |     self.augur_handle_task_failure(exc, task_id, repo_git, "core_task_failure")
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 90, in augur_handle_task_failure
[augur]        |     repo = session.query(Repo).filter(Repo.repo_git == repo_git).one()
[augur]        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/query.py", line 2798, in one
[augur]        |     return self._iter().one()  # type: ignore
[augur]        |            ^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 1827, in one
[augur]        |     return self._only_one_row(
[augur]        |            ^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 760, in _only_one_row
[augur]        |     raise exc.NoResultFound(
[augur]        | sqlalchemy.exc.NoResultFound: No row was found when one was required
[augur]        | [2025-11-18 23:04:30,983: WARNING/ForkPoolWorker-2] 2025-11-18 23:04:30,983,983ms [PID: 280] core_task_failure [INFO] Repo git: https://github.com/moralcode/classclockapi
[augur]        | [2025-11-18 23:04:30,983: WARNING/ForkPoolWorker-2] 2025-11-18 23:04:30 cd3ac88591d1 core_task_failure[280] INFO Repo git: https://github.com/moralcode/classclockapi
[augur]        | [2025-11-18 23:04:30,999: ERROR/MainProcess] Task handler raised error: NoResultFound('No row was found when one was required')
[augur]        | billiard.einfo.RemoteTraceback: 
[augur]        | """
[augur]        | Traceback (most recent call last):
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 453, in trace_task
[augur]        |     R = retval = fun(*args, **kwargs)
[augur]        |                  ^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 736, in __protected_call__
[augur]        |     return self.run(*args, **kwargs)
[augur]        |            ^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/augur/tasks/github/detect_move/tasks.py", line 27, in detect_github_repo_move_core
[augur]        |     ping_github_for_repo_move(session, key_auth, repo, logger)
[augur]        |   File "/augur/augur/tasks/github/detect_move/core.py", line 89, in ping_github_for_repo_move
[augur]        |     raise Exception("ERROR: Repo has moved! Resetting Collection!")
[augur]        | Exception: ERROR: Repo has moved! Resetting Collection!
[augur]        | 
[augur]        | During handling of the above exception, another exception occurred:
[augur]        | 
[augur]        | Traceback (most recent call last):
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 470, in trace_task
[augur]        |     I, R, state, retval = on_error(task_request, exc)
[augur]        |                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 381, in on_error
[augur]        |     R = I.handle_error_state(
[augur]        |         ^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 175, in handle_error_state
[augur]        |     return {
[augur]        |            ^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 233, in handle_failure
[augur]        |     task.on_failure(exc, req.id, req.args, req.kwargs, einfo)
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 107, in on_failure
[augur]        |     self.augur_handle_task_failure(exc, task_id, repo_git, "core_task_failure")
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 90, in augur_handle_task_failure
[augur]        |     repo = session.query(Repo).filter(Repo.repo_git == repo_git).one()
[augur]        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/query.py", line 2798, in one
[augur]        |     return self._iter().one()  # type: ignore
[augur]        |            ^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 1827, in one
[augur]        |     return self._only_one_row(
[augur]        |            ^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 760, in _only_one_row
[augur]        |     raise exc.NoResultFound(
[augur]        | sqlalchemy.exc.NoResultFound: No row was found when one was required
[augur]        | 
[augur]        | During handling of the above exception, another exception occurred:
[augur]        | 
[augur]        | Traceback (most recent call last):
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/billiard/pool.py", line 362, in workloop
[augur]        |     result = (True, prepare_result(fun(*args, **kwargs)))
[augur]        |                                    ^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 651, in fast_trace_task
[augur]        |     R, I, T, Rstr = tasks[task].__trace__(
[augur]        |                     ^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 574, in trace_task
[augur]        |     I, _, _, _ = on_error(task_request, exc)
[augur]        |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 381, in on_error
[augur]        |     R = I.handle_error_state(
[augur]        |         ^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 175, in handle_error_state
[augur]        |     return {
[augur]        |            ^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 233, in handle_failure
[augur]        |     task.on_failure(exc, req.id, req.args, req.kwargs, einfo)
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 107, in on_failure
[augur]        |     self.augur_handle_task_failure(exc, task_id, repo_git, "core_task_failure")
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 90, in augur_handle_task_failure
[augur]        |     repo = session.query(Repo).filter(Repo.repo_git == repo_git).one()
[augur]        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/query.py", line 2798, in one
[augur]        |     return self._iter().one()  # type: ignore
[augur]        |            ^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 1827, in one
[augur]        |     return self._only_one_row(
[augur]        |            ^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 760, in _only_one_row
[augur]        |     raise exc.NoResultFound(
[augur]        | sqlalchemy.exc.NoResultFound: No row was found when one was required
[augur]        | """
[augur]        | 
[augur]        | The above exception was the direct cause of the following exception:
[augur]        | 
[augur]        | Traceback (most recent call last):
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/billiard/pool.py", line 362, in workloop
[augur]        |     result = (True, prepare_result(fun(*args, **kwargs)))
[augur]        |                                    ^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 651, in fast_trace_task
[augur]        |     R, I, T, Rstr = tasks[task].__trace__(
[augur]        |                     ^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 574, in trace_task
[augur]        |     I, _, _, _ = on_error(task_request, exc)
[augur]        |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 381, in on_error
[augur]        |     R = I.handle_error_state(
[augur]        |         ^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 175, in handle_error_state
[augur]        |     return {
[augur]        |            ^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 233, in handle_failure
[augur]        |     task.on_failure(exc, req.id, req.args, req.kwargs, einfo)
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 107, in on_failure
[augur]        |     self.augur_handle_task_failure(exc, task_id, repo_git, "core_task_failure")
[augur]        |   File "/augur/augur/tasks/init/celery_app.py", line 90, in augur_handle_task_failure
[augur]        |     repo = session.query(Repo).filter(Repo.repo_git == repo_git).one()
[augur]        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/orm/query.py", line 2798, in one
[augur]        |     return self._iter().one()  # type: ignore
[augur]        |            ^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 1827, in one
[augur]        |     return self._only_one_row(
[augur]        |            ^^^^^^^^^^^^^^^^^^^
[augur]        |   File "/augur/.venv/lib/python3.11/site-packages/sqlalchemy/engine/result.py", line 760, in _only_one_row
[augur]        |     raise exc.NoResultFound(
[augur]        | sqlalchemy.exc.NoResultFound: No row was found when one was required

MoralCode avatar Nov 19 '25 14:11 MoralCode

@MoralCode Would changing that query to be based on the repo_src_id fix the issue? Or does github require the URL to get to the repo_src_id?

cdolfi avatar Nov 19 '25 16:11 cdolfi

That's what I was thinking, I just have to do it. And I suspect the code for it is buried somewhere in Augur's various functions.

MoralCode avatar Nov 19 '25 16:11 MoralCode

The above stack trace seems to be happening in the error handler for Augur's Celery tasks. The fact that the handler is still repo_url based is a separate tech debt issue. The reason we hit it at all is that we throw an exception to stop collection on repo move or delete, which is the subject of #3166. I'll likely try to solve both in this PR.

MoralCode avatar Nov 19 '25 16:11 MoralCode

@MoralCode : This one appears ready.

sgoggins avatar Dec 09 '25 22:12 sgoggins

This one appears ready.

It largely is. However, I would also like to include a new database table alongside this fix so that, when the repo URL gets updated, the old one is saved in a repo_aliases table. Lookups can then be performed using either the old URL or the new one, which makes it easier to check whether a repo is already in the DB when it is added.

We can merge this, but data will be lost until that secondary change is in as well. Because that secondary change requires a database migration, it's largely blocked on some of the database sync/organizing PRs that are currently being reviewed.
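
To make the idea concrete, here is a minimal sketch using the standard library's sqlite3. The schema is a simplified assumption (only repo_id and repo_git come from this thread; the repo_aliases column names git_url and recorded_at are made up for illustration), and `record_move` is a hypothetical helper, not the code in this PR.

```python
import sqlite3
from datetime import datetime, timezone

# Simplified, assumed schema; the real tables have more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repo (repo_id INTEGER PRIMARY KEY, repo_git TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE repo_aliases ("
    " repo_id INTEGER NOT NULL REFERENCES repo(repo_id),"
    " git_url TEXT NOT NULL,"
    " recorded_at TEXT NOT NULL)"
)
conn.execute("INSERT INTO repo VALUES (1, 'https://github.com/moralcode/classclockapi')")
conn.commit()


def record_move(conn: sqlite3.Connection, repo_id: int, new_url: str) -> None:
    """Save the current URL as an alias, then point the repo at its new URL."""
    (old_url,) = conn.execute(
        "SELECT repo_git FROM repo WHERE repo_id = ?", (repo_id,)
    ).fetchone()
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # one transaction: both statements commit, or neither does
        conn.execute(
            "INSERT INTO repo_aliases (repo_id, git_url, recorded_at) VALUES (?, ?, ?)",
            (repo_id, old_url, now),
        )
        conn.execute("UPDATE repo SET repo_git = ? WHERE repo_id = ?", (new_url, repo_id))


record_move(conn, 1, "https://github.com/classclock/API")
print(conn.execute("SELECT repo_git FROM repo").fetchall())
print(conn.execute("SELECT repo_id, git_url FROM repo_aliases").fetchall())
```

Writing the alias and the URL update in the same transaction is one way to avoid the data loss mentioned above if the task fails halfway through.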

MoralCode avatar Dec 11 '25 14:12 MoralCode

seems to be a dependence on using the repo URL for querying:

OK, so this was basically due to how the retry behavior in Celery works. I think it was retrying the same task with the same URL, but now that the repo table has a new URL it wasn't finding it.
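
The failure mode can be reproduced in isolation. This is a minimal sketch with sqlite3 and a simplified, assumed two-column repo table; it is not the handler code from celery_app.py, only a demonstration of why a lookup keyed on the old URL finds nothing after the move while a lookup keyed on repo_id still works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repo (repo_id INTEGER PRIMARY KEY, repo_git TEXT NOT NULL)")
conn.execute("INSERT INTO repo VALUES (1, 'https://github.com/moralcode/classclockapi')")

# The task (and any retry of it) carries the URL as it was before the move.
task_repo_git = "https://github.com/moralcode/classclockapi"
task_repo_id = 1

# Move detection rewrites the URL on the existing row.
conn.execute(
    "UPDATE repo SET repo_git = ? WHERE repo_id = ?",
    ("https://github.com/classclock/API", task_repo_id),
)

by_url = conn.execute(
    "SELECT repo_id FROM repo WHERE repo_git = ?", (task_repo_git,)
).fetchone()
by_id = conn.execute(
    "SELECT repo_git FROM repo WHERE repo_id = ?", (task_repo_id,)
).fetchone()

print(by_url)  # no row matches the stale URL
print(by_id)   # the row is still reachable by its id
```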

MoralCode avatar Dec 15 '25 22:12 MoralCode

OK this change now contains the new table and the code to populate it on move. Therefore it officially fixes #3129 :tada:

I tested this with mild effort locally. I am seeing values populate the new table when the task runs, and was able to fix a few issues with the code, but please could someone else also test this.

Here is a set of repos I have used that still have active redirects (test one at a time so you can iterate and not struggle to find new ones because you tested them all at once):

  • https://github.com/dbus2/busd
  • https://github.com/apache/incubator-opendal
  • https://github.com/apache/incubator-uniffle
  • https://github.com/MoralCode/ClassClock
  • https://github.com/MoralCode/ClassClockAPI

MoralCode avatar Dec 15 '25 22:12 MoralCode

Maintainers call brought up the concern that, when github is redirecting the old url for a moved repo, another new (and different) repo can be created at the old url, and we would need a way to disambiguate.

I suspect the way that I presented it (i.e. that we would use this table for primary Augur operations) was probably wrong. After talking to Cali, it sounds like the best plan is just to always use repo source ID for operations, especially repo uniqueness checking.

Essentially this would mean that newly added repos can use the GitHub API for most cases (valid repo, moved repo, getting the src id). If that fails (i.e. the repo URL is a 404), we can fall back to a "best effort" strategy where we check repo_aliases to see if there is anything, grab the most recent URL if there is, and fail if there isn't.
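
A minimal sketch of that resolution order follows. Everything here is a stand-in: `resolve_repo_id` and its parameters are hypothetical, the "API" is a plain callable that returns a source ID or None for a 404, and the alias log is a list of tuples rather than a database table.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Alias = Tuple[str, int, str]  # (git_url, repo_id, recorded_at as ISO 8601 text)


def resolve_repo_id(
    url: str,
    fetch_src_id: Callable[[str], Optional[int]],
    repo_id_by_src_id: Dict[int, int],
    aliases: Iterable[Alias],
) -> Optional[int]:
    """Return the known repo_id for url, or None if the repo is new to us.

    Raises LookupError when neither the platform nor the alias log knows the URL.
    """
    src_id = fetch_src_id(url)  # stand-in for a platform API call; None means 404
    if src_id is not None:
        return repo_id_by_src_id.get(src_id)
    # Best-effort fallback: the most recently recorded alias for this URL.
    matches = [alias for alias in aliases if alias[0] == url]
    if not matches:
        raise LookupError(f"cannot resolve {url}")
    return max(matches, key=lambda alias: alias[2])[1]


known = {42: 7}  # src_id -> repo_id
alias_log = [("https://example.com/old/name", 7, "2025-12-15T22:00:00")]
fake_api = {"https://example.com/new/name": 42}.get  # URL -> src_id, or None

print(resolve_repo_id("https://example.com/new/name", fake_api, known, alias_log))
print(resolve_repo_id("https://example.com/old/name", fake_api, known, alias_log))
```

The source ID stays the primary key for uniqueness; the alias log is only consulted when the platform cannot answer.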

@Ulincsys does that work as far as conflict resolution? The goal would be to essentially treat this aliases table as more of a historic log for analysis/not losing data.

MoralCode avatar Dec 16 '25 14:12 MoralCode

I'm sorry if that had not been clear earlier! Absolutely still using the repo src id for operations (my personal agenda is to push for everything that can be based on the src id to be), but the prior URLs are stored for historical reference. In the case of 8knot, that info would be integrated into the search bar at some point. Happy to discuss more, I've thought about this issue a lot.

cdolfi avatar Dec 16 '25 14:12 cdolfi

Part of me is a little worried about "use the src id for everything" since it is fundamentally a git-dependent value. I think it makes a ton of sense to use it when interfacing with the outside world (i.e. someone gives us a git URL and we need to check if we have it, so we ping the API and check the GitHub ID for it), but I think internal stuff (i.e. JOIN queries for data analysis, querying the list of all previously known URLs for a repo) should maybe still JOIN on the repo_id.

At this point I'm not sure whether it makes more sense to also include the src_id in this new aliases table or not.

I'm leaning no, because I think it makes sense to treat this table as essentially an internal log of historical names for analysis/informational purposes (and a last-ditch effort to resolve a user's URL to a repo that makes some sense before showing an error), but not as a primary form of deduplicating repos.

MoralCode avatar Dec 16 '25 18:12 MoralCode

On repo_id: completely agree, I should have been more specific. I meant for checking for uniqueness. On src_id in this new aliases table: I don't think so, it just needs the repo_id.

cdolfi avatar Dec 16 '25 18:12 cdolfi

"use the src id for everything" since it is fundamentally a git-dependent value. I think it makes a ton of sense to use it when interfacing with the outside world (i.e. someone gives us a git url and we need to

the src_id is not Git dependent ... each platform has their own integer identifier that follows a repository even if you change its URL.

sgoggins avatar Dec 17 '25 16:12 sgoggins

I'm sorry if that had not been clear earlier! Absolutely still using the repo src id for operations (my personal agenda is to push for everything that can be based on the src id to be), but the prior URLs are stored for historical reference. In the case of 8knot, that info would be integrated into the search bar at some point. Happy to discuss more, I've thought about this issue a lot.

Agreed: we need to keep the URLs ... they just won't be our primary identifier.

sgoggins avatar Dec 17 '25 16:12 sgoggins

the src_id is not Git dependent ... each platform has their own integer identifier that follows a repository even if you change its URL.

Do we know whether this applies to all the forges we plan to support (i.e. forgejo, cgit/generic git)?

MoralCode avatar Dec 18 '25 07:12 MoralCode

Also, where does this conversation leave us as far as this PR?

If we essentially write to the aliases table as if it is a log of each time a repo URL changes, does that reframing of its purpose prevent the issues that would arise (new different repo reusing an old url that was previously a redirect or something) if we used it for operational deduplication?

CC @Ulincsys

MoralCode avatar Dec 18 '25 07:12 MoralCode

@MoralCode that's how I understand it

cdolfi avatar Dec 18 '25 13:12 cdolfi

Thinking about this again, I think either framing has the same issue, but the difference is essentially who deals with it.

It sounds like the possibility of the aliases table having two entries (two repo_ids) for the same url is likely rare enough that it can probably be dealt with at the time of data analysis using the collection date to differentiate.
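
For what that analysis-time handling might look like, here is a minimal sketch. The table layout and the recorded_at column (standing in for the collection date) are assumptions, and the example.com rows are made-up sample data; the query only surfaces URLs that map to more than one repo_id, ordered by date so an analyst can tell them apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repo_aliases (repo_id INTEGER, git_url TEXT, recorded_at TEXT)")
conn.executemany(
    "INSERT INTO repo_aliases VALUES (?, ?, ?)",
    [
        (1, "https://example.com/org/project", "2024-03-01"),
        (2, "https://example.com/org/project", "2025-06-15"),
        (3, "https://example.com/org/other", "2025-01-10"),
    ],
)

# URLs that have belonged to more than one repo, oldest record first.
ambiguous = conn.execute(
    """
    SELECT git_url, repo_id, recorded_at
    FROM repo_aliases
    WHERE git_url IN (
        SELECT git_url FROM repo_aliases
        GROUP BY git_url HAVING COUNT(DISTINCT repo_id) > 1
    )
    ORDER BY git_url, recorded_at
    """
).fetchall()

for row in ambiguous:
    print(row)
```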

@cdolfi does the collection date seem sufficient for distinguishing possible duplicates for analysis?

MoralCode avatar Dec 18 '25 17:12 MoralCode

@MoralCode yes, I'm not concerned personally about how to handle the situation where two repos had the same URL at different points in time. That is already much less difficult than navigating the current situation.

cdolfi avatar Dec 18 '25 17:12 cdolfi

As far as aliases for moved repos are concerned, I think there is no reason to suspect that every platform would allow such a feature to exist.

Support for such a feature would need to be implemented on a per-platform basis in Augur, possibly with a Factory or Builder design pattern approach. Here is my line of thinking:

  • For platforms that provide a unique global identifier that exists separately from the repo URL in addition to repo URL redirection for those which have moved, we can implement in the collection process the functionality of aliasing as described.
  • For platforms which do not provide both of the above, we do not implement aliasing.

This is because: there is no reason to suspect that a platform which implements URL redirects must also provide a unique global identifier separate from the URL.

  • It is IMO the simplest and most robust way of doing so, but simplicity and robustness are not universally appreciated.
  • Additionally, we cannot assume that all platforms would be willing to expose such global identifiers to external API clients.
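
The per-platform approach described above could be sketched roughly as follows. All names are hypothetical. The "github" entry reflects what this thread describes (redirects plus a source ID that follows the repo); the "plain_git" entry is an assumed example of a host with neither, not a verified statement about any real platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PlatformCapabilities:
    has_stable_source_id: bool   # identifier that survives URL changes
    redirects_moved_repos: bool  # old URLs redirect to the new location

    @property
    def supports_aliasing(self) -> bool:
        # Aliasing is only implemented when both features are available.
        return self.has_stable_source_id and self.redirects_moved_repos


PLATFORMS: Dict[str, PlatformCapabilities] = {
    "github": PlatformCapabilities(True, True),
    "plain_git": PlatformCapabilities(False, False),  # assumed, for illustration
}

AliasLog = List[Tuple[int, str]]


def make_alias_recorder(platform: str, log: AliasLog) -> Callable[[int, str], None]:
    """Factory: return a recorder that logs aliases, or a no-op if unsupported."""
    if PLATFORMS[platform].supports_aliasing:
        return lambda repo_id, old_url: log.append((repo_id, old_url))
    return lambda repo_id, old_url: None


alias_log: AliasLog = []
make_alias_recorder("github", alias_log)(1, "https://github.com/moralcode/classclockapi")
make_alias_recorder("plain_git", alias_log)(2, "https://example.com/some/repo.git")
print(alias_log)
```

Collection code can then call the recorder unconditionally and let the factory decide whether anything is stored for that platform.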

In the event that a repo alias for a supported platform returns a conflicting source ID, that entry can simply be deleted. Though I do consider myself to be a tremendous data hoarder, I see no use in maintaining an infinite changelog for repo location histories.

Please let me know if there are any questions I can answer; @cdolfi, @MoralCode

Ulincsys avatar Dec 19 '25 03:12 Ulincsys

I think the existence of this aliases table is more of a best effort/nice to have/convenience tier solution anyway.

If we wanted to be thorough about URL history, we would need a way to query that history from somewhere like GitHub, since the aliases are only sourced from the URLs people have attempted to load into Augur.

I think the best-effort nature of this generally lines up with John's point that not every platform is likely to even support it. It helps us not actively lose data when we detect a repo move, but I don't think the goal is to be perfectly comprehensive about every URL move, just to provide a basic list of other URLs we have previously seen a particular repo at.

@Ulincsys I guess my core question is: is there anything fundamentally problematic that would prevent us merging this? As Cali mentioned, this will improve the experience of managing duplicate URLs by a lot, even if it's a stepping stone to a better solution later.

MoralCode avatar Dec 19 '25 05:12 MoralCode

@Ulincsys So personally I do see the value in keeping the historical log of the repo URL. Repos can change name/org location but still be known by their prior identifier. It is also helpful when doing data analysis around repo donations to foundations and things like that, and foundations like Apache change the repo name as projects progress through graduation. Having the up-to-date repo URL is definitely the biggest priority, and 8knot has had user issues with it for months now, but the historical record is incredibly useful from an analysis standpoint.

cdolfi avatar Dec 19 '25 18:12 cdolfi

I'd also say that I'd think about it like the contributor alias table. To my knowledge it is not/will not be compatible with every source, but it is still useful in the cases where we can get that information.

cdolfi avatar Dec 19 '25 18:12 cdolfi