cla-bot
cla-bot copied to clipboard
If a whitelisted user deletes their account - an 'imposter' can adopt it
via email from @maoo
What if a GitHub user, that is "whitelisted" by the bot (since covered by a CLA) and therefore part of the "contributorListGithubUrl", deletes the GitHub account and another person creates a GitHub account with the same GitHub ID later on?
This is a very good point, and a bit of a design flaw! Usernames (called login in the GitHub API response) are unique, but can be relinquished when accounts are deleted.
However, users also have a unique ID which is returned by the API, and is visible in your avatar URL for example, https://avatars0.githubusercontent.com/u/1098110?s=460&v=4.
We could list user IDs in the whitelist file, but that would make it much harder to configure. We'd still experience the issues relating to #74, where git commits can be from authors that do not have GitHub accounts.
I can't see a good fix for this one!
Similar to #99
One solution to this is to just push the problem onto the (cla-bot) user. Rather than the cla-bot config file containing a whitelist of GitHub logins, it could contain a list of User IDs. This makes the task of creating a config file a little more arduous, but is more robust.
For FINOS, managing user IDs instead of GitHub logins would be simple, since we auto-generate the whitelist. That said, I'd propose to define a whitelist format that allows to support both, for example:
[{
github_id: "ColinEberhardt",
user_id: 123456
},{
github_id: "maoo",
user_id: 678901
}]
We could simply add a new .config parameter, for example contributorIdListGithubUrl , which reads this format, while keeping contributorListGithubUrl parameter to avoid breaking existing installations.
This format IMO gives 2 advantages, if compared with a flat list of user_id fields:
- GitHub IDs can be read by humans, so administrators could validate the file entries manually (just visiting
github.com/<github_id>or calling GitHub APIs) - CLA bot (or any other application) could periodically check if
github_ids anduser_ids are in sync. If not, it means that a GItHub account was removed, therefore an entry from thecontributorIdListGithubUrldefinition should be removed.
Adding a 👍 to @maoo's comment here of extending the CLA bot to support both the login id and the user id (one for interacting with, and one for data integrity and validation).
The other note I'd share though is to keep the attack vector in mind here:
- User relinquishes their username back to GitHub
- Another user comes along, with malicious intent to
impersonateprevious user - Malicious user commits code into CLA-bot managed repository.
- Code is of high enough quality and value to get accepted
- Malicious user gets this code accepted without actually signing the CLA??
- Malicious user then ??? (sues? claims ownership? something else?) against the repo that includes their code.
Is the CLABot the authoritative source? Would their actively attempting to circumvent the process via the ☝️ not be enough to invalidate the action? I want to make sure we're not trying to solve an edge case that is already covered by legal or maintainer remediation.
Another user comes along, with malicious intent to impersonate previous user
Even if the user is able to get the same GitHub ID, he/she won't get the same user id and therefore the CLA bot would flag it. And when we catch cases where the github id matches but the user id doesn't, we can notify the CLA maintainers (ie FINOS staff), that there's entry that should be removed from the whitelist.
Is the CLABot the authoritative source? Would their actively attempting to circumvent the process via the ☝️ not be enough to invalidate the action?
I believe that the case we're protecting ourselves from is that the malicious code gets shipped into a release and used in production by downstream consumers; although the action on code can be reverted, some repositories (like Maven Central) make it extremely hard to delete a released package. For this reason, I'd like the CLA Bot to be our authoritative source. Adding @copiesofcopies here, in case there's something to add from a legal standpoint.
I believe that the case we're protecting ourselves from is that the malicious code gets shipped into a release and used in production by downstream consumers
My comment is that the CLA bot shouldn't be protecting the project from malicious code, instead commits that ybr user was not authorized to contribute.
In separation of dities, I want to make sure CLAbot isn't being asked to do more than it should.
Welcome thoughts @copiesofcopies
@maoo recall that the FINOS metadata includes all GitHub usernames an individual has ever used, in order to support Bitergia's historical data. Furthermore, the metadata doesn't (currently) differentiate between usernames that are current and those that are no longer used by an individual.
The net effect is that simply emitting GitHub user ids alongside usernames won't provide the level of protection that you're discussing here. For that, the person metadata schema would need to be updated to explicitly distinguish between "current GitHub username(s)" and "no-longer-in-use GitHub username(s)", on a per-person basis. This would ensure that the Bitergia generation process can continue emitting everything (needed because of Bitergia's historical data), while the cla-bot generation process only emits the "current" username data.
This is by no means a complex change to the metadata and metadata-tool, btw, but also not quite as trivial as was mentioned above.