gh-migration
gh-migration copied to clipboard
Some GH PR references were not ported correctly
https://github.com/python/cpython/issues/90908#issuecomment-1093944756 has two incorrect references. The first one goes to GH-75453, which is bpo-31270. It should be GH-31270 (https://bugs.python.org/issue46752#msg413260). Note that the reference is correct in the following 'New changeset' comment.
This is the only case I've seen so far, but I haven't systematically looked for more.
Found two more:
https://github.com/python/cpython/issues/91294#issuecomment-1093947224 references GH-76305 instead of GH-32124. https://github.com/python/cpython/issues/91302#issuecomment-1093947297 references GH-76242 instead of GH-32061.
It seems like GH-* were interpreted as issue number and converted automatically to the new migrated issue during the transfer. This automatic conversion was supposed to be disabled during the transfer, but apparently it wasn't.
Another example. In https://bugs.python.org/issue47075 I explicitly wrote I proposed GH-32112 for that. to get a link to my PR https://github.com/python/cpython/pull/32112 but the converted issue https://github.com/python/cpython/issues/91231 points to https://github.com/python/cpython/issues/76293 i don't understand how GH-32112 was transformed to #76293 which is an unrelated issue (not a PR)
i don't understand how GH-32112 was transformed to #76293 which is an unrelated issue (not a PR)
During the transfer, https://bugs.python.org/issue32112 was transferred to https://github.com/python/cpython/issues/76293. The transfer tool thought that your GH-32112 was a reference to bpo-32112 (rather than referring to a GitHub PR), and since bpo-32112 is now GH-76293, it converted GH-32112 into GH-76293.
The transfer tool thought that your GH-32112 was a reference to bpo-32112 (rather than referring to a GitHub PR)
How could that happen? That seems terrible. :-(
Any update on whether/how/when this could hopefully get fixed?
The transfer tool thought that your GH-32112 was a reference to bpo-32112
Would it be possible to process all data once more time to detect the issue and fix it? This issue is quite annoying.
The current workaround is to go to the original issue, find the message to look for the correct GH issue/PR number.
I agree that it is very serious issue. For now we have incorrect references and the reader can not even know that they are incorrect.
Can we repeat translation which correctly detects GitHub references locally, compare it with the old incorrect translation, and apply the diff? The diff should be orders smaller that the whole data (but it still can be thousands messages).
Since this happened at transfer time, one of the test repos that I kept around might still have the correct references and it might be used to fix the links. Doing it from bpo is a bit trickier because there is quite a bit of processing that happened at import time (even though it should be possible to extract only the references).
The main issue is that this requires a mass-rewrite which will have to go through thousands of issues and comments and edit them, and this is not trivial to write and test. At this point we will also have to use the public API, which is rate limited. If it is done, the same infrastructure could also be used to replace bpo links with GH links in the first message of the PRs, and to @mention again nosy list members in the first message of issues. I think this will also edit the "last updated" field of all the affected issues/PRs, making it more difficult to find old/stale issues/PRs.
How many issues are affected? 100, 1000 or 10000? If it is only few 100s, we can survive editing the "last updated" field. But in long term it would be better to find a way to edit messages without changing the "last updated" field. We may need it to fix references to old Subversion and Mercurial revisions.
I don't have a number, and when we discussed different options to edit the messages after the fact and they all changed the "last updated" field (technically GitHub could rewrite the messages directly in the DB, but it's not something they would like to do and we would need their help).
I can get an estimate of this shortly.
There are about 19270 lines in messages on bpo (as of late March, shortly before the migration) that contain a link with the text GH-*. This number includes the 'New changeset' messages, which were migrated correctly, and I'm not sure how to filter them out (without a complete export of the GitHub issues to compare it directly). I don't see any differences between the two in bpo's HTML. For example, the links mentioned in the original issue are:
<a href="https://github.com/python/cpython/pull/31270" class="closed" title="GitHub PR 31270: [merged] Merged">GH-31270</a>– Incorrect in GH issue<a href="https://github.com/python/cpython/pull/31313" title="GitHub PR 31313">GH-31313</a>– Incorrect in GH issue<a href="https://github.com/python/cpython/pull/31270" class="closed" title="GitHub PR 31270: [merged] Merged">GH-31270</a>– Correct in GH issue
In addition, there are about 5993 lines in bpo messages that contain other links to GH PRs. Roughly half of these are PR # links (example), the others are mostly just plain URLs (example). A brief sampling seems to indicate that those were transferred correctly.
One thing I suspected regarding those correctly migrated links is that perhaps messages that first contain a bpo link and then a GH PR link worked. This does not appear to be the case: exhibit A, exhibit B.
I also found one issue with a weird link on GitHub: this comment was rewritten as python/issues-test-cpython#30631. It should clearly point to https://github.com/python/cpython/pull/30631.
A few things that might help you:
- the migration happened in two stages: the import from bpo to a new GitHub repo, and the transfer from that repo to
python/cpython. - before the import, I did a lot of processing to fix/convert some links, add MarkDown markup, add the first message with the table, etc. In particular, I converted
GH-*andPR-*references (but not plain URLs) using this regex:
and bpo issues withCPYTHON_PR = 'https://github.com/python/cpython/pull/' # PR-123, PR 123, PR123, pull request 123, GH 123, GH123 # use a markdown link to preserve the original text -- the popup works (re.compile(r'(?P<text>(\b(?<![-/_])(PR-?|GH|pull\s*request))\s*' r'(?P<pr_no>\d+))\b', re.I), r'[\g<text>](%s\g<pr_no>)' % CPYTHON_PR),# issue1234, issue 1234, bpo 1234, #1234 (re.compile(r'(?:(?<!\w)\#|\b(?<![-/_])(?:issue|bpo))' r' ?(1?\d{4,6})\b', re.I), r'bpo-\1'), - during the transfer, the tool used by GitHub apparently converted some of the links. I know that one limitation of the tool is that it could only convert issues that had already been migrated. For example, if there was an issue with id
1234that was migrated to e.g.https://github.com/python/cpython/issue/4321, issues that had references like#1234could have been converted to#4321, but only in messages migrated after issue1234. This was the reason why we converted all the issue references tobpo-*instead of relying on their link conversion (that was supposed to be disabled, since it wasn't relyable).
This number includes the 'New changeset' messages, which were migrated correctly, and I'm not sure how to filter them out (without a complete export of the GitHub issues to compare it directly).
Maybe they were migrated correctly because the regex used by GitHub to find links/references didn't match when the reference was surrounded by (...), like in the "New changesets" messages. All those message have the same format, so it shouldn't be difficult to filter them out (a msg.startswith('New changeset') is probably enough).
If you want to see the pre-transfer messages, I can give you access to one of the test repos. If you have other questions let me know.
Hmm, your regex would not match GH-1234 with a hyphen at all. This would suggest that such references were all rewritten by GitHub...?
I know that one limitation of the tool is that it could only convert issues that had already been migrated.
This makes sense. However, it should not affect PR references since all those PRs were already on GitHub beforehand.
Maybe they were migrated correctly because the regex used by GitHub to find links/references didn't match when the reference was surrounded by
(...), like in the "New changesets" messages. All those message have the same format, so it shouldn't be difficult to filter them out (amsg.startswith('New changeset')is probably enough).
I can't easily filter for New changeset (since it's generally on a different line than the PR reference in the HTML served by bpo that I'm processing for this), but the hint about brackets is a good one! Firstly, I found several more of those python/issues-test-cpython examples: https://bugs.python.org/msg404903, https://bugs.python.org/msg364827, https://bugs.python.org/msg410766. I also found this example of the form (GH-# and GH-#) where the former link is correct but the latter goes to that test repo: https://bugs.python.org/msg331261 → https://github.com/python/cpython/issues/53775#issuecomment-1093515121. Overall, there are about 777 GH-# links that are not preceded by (.
Hmm, your regex would not match GH-1234 with a hyphen at all. This would suggest that such references were all rewritten by GitHub...?
You are right: I didn't rewrite the GH-* references because those are already recognized by GitHub, but other refs like GH 1234 or GH1234 or PR-1234 aren't, so I used the regex to rewrite them. Instead of converting them into GH-1234 I converted them into MarkDown links in order to preserve the original text.
This makes sense. However, it should not affect PR references since all those PRs were already on GitHub beforehand.
Note that there is quite a bit of overlapping between the PRs ids and bpo ids, so a bpo message that linked to the PR GH-1234 could have been mistaken by the tool for a link to an issue with id 1234 (since issues and PRs share the same namespace on GitHub), and after migrating bpo-1234 to GH-4321 all the references to PR GH-1234 could have been rewritten to GH-4321.
I can't easily filter for New changeset (since it's generally on a different line than the PR reference in the HTML served by bpo that I'm processing for this)
Instead of scraping the HTML directly (if this is what you are doing), you could use the XMLRPC interface to fetch all the bpo messages. Even if you are scraping HTML, it shouldn't be difficult to isolate the individual messages. If you need more info about this let me know, however it might be better to look at the test repo to see how the messages look like after the import (I gave you read access to one of them [^1]).
but the hint about brackets is a good one!
Glad that helped! I noticed similar problems while testing before the migration.
Firstly, I found several more of those python/issues-test-cpython examples: https://bugs.python.org/msg404903
This might be a separate bug in the transfer tool:
- the original message on bpo contains the text:
we get a resolution in GH-25718 as it will likely conflict. - the imported message contains the text:
we get a resolution in GH-25718 as it will likely conflict.(since my regex didn't touch theGH-*s) - the transferred message contains the text:
we get a resolution in python/issues-test-cpython#25718 as it will likely conflict.(note thatpython/issues-test-cpythonis the repo where we imported the issues before the transfer topython/cpython)
So my export tool kept references to GH-* unchanged and converted other references (e.g. GH *, PR *, etc.) to MarkDown links. Then, the transfer tool:
- converted some (all?)
GH-*references to point to a different id- this happened when the id matched an already migrated issue id and when the
GH-*wasn't preceded by other chars (like(); - the
GH-*was converted intoorg/repo#ID; - sometimes it used the source repo (
python/issues-test-cpython#4321) instead of the target repo (python/cpython#4321) -- not sure why
- this happened when the id matched an already migrated issue id and when the
- left full links alone (e.g. the first two in https://github.com/python/cpython/issues/91294#issuecomment-1093947224 - https://bugs.python.org/issue47138#msg416171)
- maybe left Markdown links alone too
Can you confirm that your observations match these statements?
[^1]: I actually realized that the one I gave you access to (issues-test-cypthon-2) was used to test the transfer, so it had the same issues as the python/cpython repo. I now revoked your permissions to that repo and added you to a different repo (issues-test-2) that had the imported messages before they were transferred.
when we discussed different options to edit the messages after the fact and they all changed the "last updated" field (technically GitHub could rewrite the messages directly in the DB, but it's not something they would like to do and we would need their help).
I don't see the "last updated" change as an issue. It's a good thing that changes are traced and public, no?
Technically yes, however if we mass-update the issues you won't be able to find old or stale issues anymore since they will all show as recently updated. Before the migration we decided it would be better to preserve the original "last updated" date, but we could revisit that decision. Even if we do, we would still need a script to perform the mass update.