Associate new repository URL with one used to originally load in the repo

Open cdolfi opened this issue 8 months ago • 14 comments

As repos move, there should be a way for augur to store the new repository URL (if it does not already). This additional info should be integrated into 8Knot so that if someone types in the new URL, it's associated with the old one.

For example:

I am being asked to analyze https://github.com/redhat-developer/devspaces which is in our augur instance as https://github.com/redhat-developer/codeready-workspaces . There is no direct way (to my knowledge) to discover, from the new URL, the old URL it is stored under in augur.

From my understanding, the augur.tasks.github.detect_move.tasks.detect_github_repo_move_core / augur.tasks.github.detect_move.tasks.detect_github_repo_move_secondary tasks should handle updating the repo_git when the repo moves.

cdolfi avatar Apr 15 '25 22:04 cdolfi

@EngCaioFonseca tagging for visibility, this will be important for the search bar

cdolfi avatar Apr 15 '25 22:04 cdolfi

@ABrain7710 / @Ulincsys : I am not able to easily replicate this exactly. I thought Augur updated the URLs when they changed.

sgoggins avatar Apr 16 '25 13:04 sgoggins

@Ulincsys

sgoggins avatar Apr 23 '25 18:04 sgoggins

If a repo was collected on and moved, it changes the repo_git of the repo table ... (We have tested this and it worked, however we need to sort out why it's not working.) ....

sgoggins avatar Apr 28 '25 14:04 sgoggins

I'll give the new release a few days of running to see if this has been solved.

cdolfi avatar Jun 02 '25 19:06 cdolfi

Update: so far this example still holds

cdolfi avatar Jun 09 '25 16:06 cdolfi

To find examples of this, look at the apache org: they change repo names often, so there are a lot of URL changes. Still seeing the old URLs.

cdolfi avatar Jul 08 '25 15:07 cdolfi

@MoralCode tagging for visibility, definitely an issue that's a priority for us

cdolfi avatar Sep 29 '25 17:09 cdolfi

Let's look at Apache repos to see if we can replicate this issue. @IsaacMilarky ...

sgoggins avatar Sep 30 '25 14:09 sgoggins

I think for the purposes of this issue, having a way to get any known historical/alternate URLs for a repo, or even resolve a repo to a stable content-based identifier of some kind, could be really helpful in allowing apps/researchers to resolve a canonical ID (or list all previously known URLs for a repo) before performing data queries.

Ultimately the goal would be to create some kind of layer that practically allows a git URL to be "resolved" to an augur-stable identifier for the purpose of converting from human interfaces (loading a repo, searching for a repo) into the augur system's stable ID as soon as possible.

My current thinking is along the lines of a new table to store:

  • repo url (or maybe alternate identifiers like github repo id if we want to be broad with it) (required)
  • the augur repo id it maps to (required)
  • effective starting (date, optional) - the date we first knew this URL to be valid, such as when we detect a move
  • effective ending (date, optional) - the date this URL stopped being valid; for example, a moved repo or an old link to a repo that no longer works would have a date in the past
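
A minimal sketch of what such a table could look like, using an in-memory SQLite database so it runs anywhere. The table name `repo_url_alias`, its column names, the helper functions, the example URLs, and the dates are all hypothetical placeholders for illustration; nothing here is part of Augur's actual schema.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not Augur's real tables.
SCHEMA = """
CREATE TABLE repo_url_alias (
    alias_id        INTEGER PRIMARY KEY,
    repo_url        TEXT NOT NULL UNIQUE,  -- required: a URL the repo is or was known by
    repo_id         INTEGER NOT NULL,      -- required: the augur repo id it maps to
    effective_start TEXT,                  -- optional ISO date the URL was first known valid
    effective_end   TEXT                   -- optional ISO date; NULL while still valid
);
"""


def add_alias(conn, repo_url, repo_id, start=None, end=None):
    """Record one URL (current or historical) for a repo."""
    conn.execute(
        "INSERT INTO repo_url_alias (repo_url, repo_id, effective_start, effective_end) "
        "VALUES (?, ?, ?, ?)",
        (repo_url, repo_id, start, end),
    )


def lookup_repo_id(conn, repo_url):
    """Return the repo id a URL maps to, or None if the URL was never seen."""
    row = conn.execute(
        "SELECT repo_id FROM repo_url_alias WHERE repo_url = ?", (repo_url,)
    ).fetchone()
    return row[0] if row else None


def known_urls(conn, repo_id):
    """List every URL recorded for a repo, in insertion order."""
    rows = conn.execute(
        "SELECT repo_url FROM repo_url_alias WHERE repo_id = ? ORDER BY alias_id",
        (repo_id,),
    ).fetchall()
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Placeholder URLs and dates, purely for demonstration.
add_alias(conn, "https://github.com/example-org/old-name", 1, end="2025-01-01")
add_alias(conn, "https://github.com/example-org/new-name", 1, start="2025-01-01")
```

With this shape, both the old and the new URL resolve to the same repo id, and `known_urls` gives the "list all previous URLs" view described above.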

For handling edge cases:

  • if a user enters a URL that github is actively redirecting, we detect and handle that as any other repo move and adjust the new URLs table as part of that
  • if a user enters a url that github isn't currently redirecting (such as the one from the red hat developers org in the original issue description), we check the table of known previous URLs and either resolve using that, or, if we don't have that url, show the user an error as if that repo doesn't exist.
    • This would mean older augur instances have an advantage in being able to resolve URLs since they will have "seen" more move events
    • maybe this opens the door for a future where augur instances can either talk to each other or share information with the community to enable crowdsourcing of these known previous urls
    • Short of being able to use the GitHub API to get a list of previous URLs for a repo, or maybe querying other archival services like archive.org or softwareheritage, or the internet (see techniques like what this tool uses), we really can't do too much better than that
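
The two edge cases above could be sketched roughly as follows. This is only an illustration under stated assumptions: `normalize`, `resolve_repo_id`, the in-memory `aliases` dict standing in for the proposed table, and the `redirect_target` callback standing in for "ask the forge whether this URL redirects" are all hypothetical names, and the example URLs are placeholders.

```python
from typing import Callable, Dict, Optional


def normalize(url: str) -> str:
    """Make equivalent URLs compare equal (assumes case-insensitive hosts/paths)."""
    url = url.strip().rstrip("/").lower()
    if url.endswith(".git"):
        url = url[:-4]
    return url


def resolve_repo_id(
    url: str,
    aliases: Dict[str, int],
    redirect_target: Callable[[str], Optional[str]],
) -> Optional[int]:
    """Resolve a user-entered URL to a stable repo id, or None if unknown.

    redirect_target is a hypothetical hook that returns the URL the forge
    currently redirects to, or None when there is no redirect.
    """
    key = normalize(url)
    target = redirect_target(key)
    if target is not None and normalize(target) != key:
        # Case 1: the forge is actively redirecting, so treat it as a move
        # and make sure both URLs map to the same repo from now on.
        new_key = normalize(target)
        repo_id = aliases.get(key, aliases.get(new_key))
        if repo_id is not None:
            aliases[key] = repo_id
            aliases[new_key] = repo_id
        return repo_id
    # Case 2: no redirect, so fall back to the URLs this instance has seen.
    # None here corresponds to "show an error as if the repo doesn't exist".
    return aliases.get(key)


# Placeholder data for demonstration only.
aliases = {"https://github.com/example-org/old-name": 7}
redirects = {
    "https://github.com/example-org/old-name": "https://github.com/example-org/new-name"
}
repo_id = resolve_repo_id(
    "https://github.com/Example-Org/old-name.git", aliases, redirects.get
)
```

After the first lookup records the move, the new URL resolves from the alias data alone, even if the forge later stops redirecting the old one.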

One edge case I'm not sure how to handle without a MUCH bigger technical lift is inter-forge transfers (a project moves or is mirrored from GH to GL, or moves to forgejo, etc.). Maybe that'll be an application of a more content-based identification system, or potentially the subject of a larger epic task surrounding auditing of identifiers in augur.

MoralCode avatar Oct 03 '25 19:10 MoralCode

@MoralCode I think what you are talking about is actually a separate enhancement vs what this issue is intended for. Right now Augur has a task that is supposed to update the repo_git if the repo changes its name. That currently isn't happening and needs to get fixed. I like the enhancement of having a table of prior known urls but think it should be a separate issue to be taken on after the current task is fixed.

Also, the edge cases you are describing are actually handled if the core task in place is working correctly. There is already a unique identifier for each repo that augur stores from GH and that does not depend on the name: repo_src_id. Also, the core task that currently isn't working (which this issue is for) would update the URL.
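
As a rough illustration of that idea (not the actual detect_github_repo_move task, whose code is not shown in this thread), keying on repo_src_id lets a rename be applied as an in-place update of repo_git. The `Repo` dataclass, the `sync_repo_git` function, and the id value below are hypothetical; the URLs are the busd example reported later in this thread.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    repo_id: int      # augur's internal id
    repo_src_id: int  # forge-assigned id, assumed stable across renames
    repo_git: str     # URL currently stored for the repo


def sync_repo_git(repo: Repo, forge_src_id: int, forge_url: str) -> bool:
    """Update repo.repo_git when the forge reports a new URL for the same repo.

    Returns True if the stored URL changed, False if it was already current.
    """
    if forge_src_id != repo.repo_src_id:
        # Different forge id means a different repo; never overwrite its URL.
        raise ValueError("repo_src_id mismatch: not the same repository")
    if forge_url == repo.repo_git:
        return False
    repo.repo_git = forge_url
    return True


# 12345 is a placeholder id, not the real GitHub id of this repo.
repo = Repo(repo_id=1, repo_src_id=12345, repo_git="https://github.com/dbus2/busd")
changed = sync_repo_git(repo, 12345, "https://github.com/z-galaxy/busd")
```

Because the match is on repo_src_id rather than the URL, the same row is updated after a rename instead of a second repo being created.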

cdolfi avatar Oct 03 '25 19:10 cdolfi

An Augur user in Slack was affected by this on a bare-metal install. Repo https://github.com/dbus2/busd was in augur, got moved to https://github.com/z-galaxy/busd, and https://github.com/z-galaxy/busd was added as part of a new batch of repos being scraped.

The issue manifests itself as an occurrence of #3192, but these are not duplicates.

MoralCode avatar Oct 31 '25 16:10 MoralCode

I suspect this procedure would replicate the issue:

  1. create a public repo on github with something in it
  2. add that repo to augur
  3. allow it to collect
  4. rename the repo on github - the old link should redirect to the new one
  5. load the new url into augur and let it collect

You should observe the unique violation error from issue 3192 when it goes to collect pull requests.

I suspect this is a bug of some kind that either relates to repo move detection, or just isn't detecting that the two URLs are pointing at the same github repo early enough in the process. Ideally it should know/check for this when the repo is first added and simply add it as an alias for an existing repo (per this issue's title) so that either name can be used to grab the metrics for the repo.

Edit: no, this should be fixed already - creating duplicate repos should not be possible. see https://github.com/chaoss/augur/issues/3192#issuecomment-3474549710

MoralCode avatar Oct 31 '25 17:10 MoralCode

@MoralCode that's actually a different issue than the one I am describing. That one was already handled, but I suspect there might be another change that unknowingly impacted this. The one I am describing is not about augur allowing duplicate repos to get collected; it is about updating the repo_git of a specific repo when it changes.

cdolfi avatar Oct 31 '25 18:10 cdolfi