check_language.py misc. problems
Some things I noted trying to run the script on the Vulkan repo:
- Seems to require Python 3, but /usr/bin/env python may resolve to Python 2 in some environments. Asking for 'env python3' might suffice. If run with Python 2, the script exits immediately with no failure code. Python 2 is at EOL now and I am definitely not suggesting backwards compatibility, just that the script execute under the right Python version.
- Matches 'master' inside a link to another repository, e.g. https:...blob/master/README.md, and should not, as that's not under the control of the source document. Also matches the word in certain other contexts where it might not be appropriate to report. For example, we have a Python script containing a list of branch names and metadata for them, which looks like
'master': [ 4000, 4999, 4448 ],
I'll note additional problems here as I encounter them, unless you want separate issues for them. I don't know to what degree you are willing to make this a more general-purpose tool vs. just running on Amber, but we're kinda hoping to leverage off this.
I've run into both of those problems myself.
I don't know the best way to find python3 on a system. Maybe I can check whether we're running under Python 2 and error out in some fashion? That would at least make it more noticeable (@zoddicus any suggestions?)
For the master matches, I see two ways to handle it: either we add some kind of suppression mechanism in the file
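A minimal sketch of the "error out under Python 2" idea, not the script's actual code. The function name require_python3 and the message text are made up for illustration, and this only helps if the file still parses under Python 2, so the check itself avoids Python 3-only syntax:

```python
import sys


def require_python3(version_info=sys.version_info):
    """Return an error message when running under Python 2, else None."""
    if version_info[0] < 3:
        # %-formatting on purpose: this line must also run under Python 2
        return "check_language.py requires Python 3; found %d.%d" % (
            version_info[0], version_info[1])
    return None


if __name__ == "__main__":
    message = require_python3()
    if message:
        sys.stderr.write(message + "\n")
        sys.exit(1)  # non-zero exit code so CI and callers notice
```

Placed near the top of the script, this would turn the silent exit into a visible failure with a non-zero status.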
'master': [4000, 4999, 4448], # NOLANG(master)
or you run the script, fix up the issues you can while leaving ones like master, and then create a suppression file that you diff against.
Would either of those options work for you (or do you see another option?)
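To make the first option concrete, here is a rough sketch of how an inline marker could be honored. The NOLANG(...) format comes from the example above; DENY_RE, NOLANG_RE and find_violations are hypothetical names, and the single pattern stands in for whatever list the script really uses:

```python
import re

# Hypothetical stand-in for the script's real list of patterns
DENY_RE = re.compile(r"(?i)master")
# Marker format taken from the example line above
NOLANG_RE = re.compile(r"NOLANG\(([^)]*)\)")


def find_violations(lines):
    """Return (line_number, word) pairs not covered by a NOLANG marker."""
    violations = []
    for number, line in enumerate(lines, start=1):
        # Words suppressed on this line, e.g. NOLANG(master) -> {"master"}
        suppressed = set()
        for group in NOLANG_RE.findall(line):
            suppressed.update(w.strip().lower() for w in group.split(","))
        # Drop the marker text so the word inside NOLANG(...) is not reported
        text = NOLANG_RE.sub("", line)
        for match in DENY_RE.finditer(text):
            word = match.group(0).lower()
            if word not in suppressed:
                violations.append((number, word))
    return violations
```

The suppression applies per line and per word, so a marker on one line does not hide a new occurrence elsewhere in the file.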
It is tempting to have an allowlist of regexps as well as a denylist, but that could lead to false negatives unless specific expressions in the allowlist are run only for the specific denylist expressions that triggered, and that would significantly complicate the script.
It might be easiest to just edit the regexps. It is hard to make general statements, but any URL containing one of the problematic words is more than likely OK no matter the context. Quoted content is less obviously OK, though.
Maybe collapse the REGEX_LIST / SUPPRESSION_LIST into a single data structure with (allowed_expression, suppression_expression*)* and then only run the suppression expressions specific to the allowed_expression that just matched? Then we could have e.g. [ r"(?i)master", [ r"(?i)[/']master[/']" ] ] for the specific case I noted above.
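Roughly what that combined structure might look like, as a sketch only. RULES, COMPILED and check_line are hypothetical names, not what the script uses today; the single entry is the example above, and a match is dropped only when it falls inside a span matched by one of its own rule's suppressions:

```python
import re

# (expression, [suppression expressions]) pairs; the entry is the example above
RULES = [
    (r"(?i)master", [r"(?i)[/']master[/']"]),
]

COMPILED = [
    (re.compile(expr), [re.compile(s) for s in suppressions])
    for expr, suppressions in RULES
]


def check_line(line):
    """Return matched words not covered by a suppression of their own rule."""
    hits = []
    for pattern, suppressions in COMPILED:
        # Spans that this rule's suppressions consider acceptable
        allowed = [m.span() for s in suppressions for m in s.finditer(line)]
        for match in pattern.finditer(line):
            start, end = match.span()
            # Report only matches that no suppression span fully contains
            if not any(a <= start and end <= b for a, b in allowed):
                hits.append(match.group(0))
    return hits
```

Because each suppression is consulted only for the expression it is paired with, adding a suppression for one word cannot mask matches of another.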