Ideas On Choosing The Perfect Unclaimed Username?
One of the pains in delivering on #37 is that I need to figure out an unclaimed username (to verify that Sherlock can correctly detect that a given username is open).
I have been using "noonewouldeverusethis7" as this username, because...who would use something like that? However, sometime in the last couple of days, someone has claimed that username on BuzzFeed and Canva. The irony is not lost on me.
So, my questions are twofold:
- Does anyone have ideas on how to choose usernames that no one else has claimed?
- Does anyone know how these names might get claimed just from Sherlock doing queries against them? I find it hard to believe that "noonewouldeverusethis7" just randomly got claimed.
I was thinking about generating a random alphabetical string, testing, and repeating this process once or twice.
This way, we get usernames that are very unlikely to be in use, and prevent people from claiming usernames that are in the source of Sherlock.
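A minimal sketch of the random-string idea, assuming lowercase letters only and a length of 20. The helper name `random_username` and its default length are hypothetical and not part of Sherlock.

```python
import secrets
import string


def random_username(length: int = 20) -> str:
    """Return a random lowercase alphabetical string of the given length."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # secrets draws from the OS random source, so the value cannot be
    # read out of the repository or predicted from earlier test runs.
    return "".join(secrets.choice(string.ascii_lowercase) for _ in range(length))


if __name__ == "__main__":
    # Prints a fresh 20-letter candidate on every run.
    print(random_username())
```

Because the string is generated at test time, it never appears in the source, so there is nothing for anyone to look up and register in advance.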
I think I could submit a pull request this weekend for the false positive testing, if no one else wants this issue of course.
Please let me know your thoughts.
Afterthought: perhaps we could also generate random usernames based on the regex that describes valid usernames per website, but that might be a V2 of this functionality?
The current tests only check one username per site query. So, the test for unclaimed detection would have to be changed. If the tests always had to try multiple usernames to find an unclaimed one, it would really extend the test times.
I guess what I am really worried about is how exactly those sites ended up with the "noonewouldeverusethis7" username. If this is something that is happening on the site side, then the entire approach of Sherlock is called into question. The approach only holds up if it was someone in the Sherlock community who claimed the names.
Using a username that can't be registered on the sites could be an option, like a username with fewer than 3 chars, or more than 52 chars on some sites.
@hoadlck You don't have to generate multiple usernames for every site. You can generate one random string and reuse it for every site, just as you already do with the current unclaimed username, so it won't add to the test times. Only a few sites have a different unclaimed username; for the rest it's mostly the same one.
Further, you could explore the Faker package to generate a username and append a random string or digits to make it more unique and testable, but I don't really see any reason to do that, since the random string should be fine on its own.
@TheYahya Unfortunately, some sites will throw a different error for those kinds of usernames, so we can't really rely on that.
To make the random string work more accurately, we could also rely on the regex in the data to generate a string that fits the regex pattern. For example, where a site requires a username of min. 3 and max. 8 chars, anything outside that range will throw a completely different error; this happens with PayPal.
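A rough sketch of fitting the random string to a site's regex, using only the standard library. It draws random candidates and keeps the first one that matches (rejection sampling), which is an assumption for illustration and not how Sherlock does it. The name `random_matching_username`, the alphabet, the length bounds, and the example pattern are all hypothetical.

```python
import re
import secrets
import string

# Assumed character set; a real site may allow more or fewer characters.
ALPHABET = string.ascii_lowercase + string.digits


def random_matching_username(pattern: str, min_len: int = 3, max_len: int = 8,
                             max_tries: int = 1000) -> str:
    """Draw random candidates until one fully matches `pattern`."""
    compiled = re.compile(pattern)
    for _ in range(max_tries):
        # Pick a length in [min_len, max_len], then fill it with random chars.
        length = min_len + secrets.randbelow(max_len - min_len + 1)
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if compiled.fullmatch(candidate):
            return candidate
    raise RuntimeError(f"no candidate matched {pattern!r} in {max_tries} tries")


if __name__ == "__main__":
    # Hypothetical rule: a letter followed by 2 to 7 letters or digits.
    print(random_matching_username(r"[a-z][a-z0-9]{2,7}"))
```

Rejection sampling is simple but only practical when a fair share of random strings satisfy the pattern. For restrictive patterns, a generator that builds strings directly from the regex is the better fit.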
Using a somewhat long username (more than 50 chars or so) may be quite a good approach.
import time; f"user{time.time_ns()}".rstrip('0')  # --> e.g. user1657191108413085
It will be difficult for spammers to guess which nanosecond the test will run.
We have removed unclaimed usernames from data.json. For the sites that need it for the tests, we use exrex to generate a random username based on the regexCheck regex :)