Fix heuristic for extension `.yy` (JSON vs Yacc)
Description
Closes https://github.com/github-linguist/linguist/issues/6517 496k+ files affected
~~Implements fix suggested by @lildude and backed by @RobQuistNL in https://github.com/github-linguist/linguist/issues/6517#issuecomment-1680403065.~~
See this comment for the approach used: https://github.com/github-linguist/linguist/pull/6976#issuecomment-2264230000
Checklist:
- [x] I am fixing a misclassified language
- [x] I have included a new sample for the misclassified language:
- Sample source: https://github.com/Ttanasart-pt/Pixel-Composer/blob/363a1dec53f6785ee58a7b3f3b8407a48dcdd2c2/scripts/VCT/VCT.yy
- Sample License: MIT license
- [x] I have included a change to the heuristics to distinguish my language from others using the same extension.
- [x] I have included a new sample for the misclassified language:
Can we have a new sample and updated heuristic test, if necessary, for this too please.
Sure. I've added a sample JSON file with the .yy extension.
The test doesn't need any changes:
https://github.com/github-linguist/linguist/blob/39fd5e93de98de3434cb3e857c5b1972c418f8c5/test/test_heuristics.rb#L1087-L1093
I was referring to adding a new GMStudio sample that matches the heuristic change.
Those .yy files are produced by GMStudio, but they are (should be) classified as JSON by Linguist. That's why I've added the sample in the JSON folder.
The file added matches the heuristic added with the following line:
"resourceType":"GMScript",
🤦 Whoops. I missed that.
🤦 Whoops. I missed that.
No worries. Let me know if you want me to fix the generated detection as well or if we can keep that for another PR. https://github.com/github-linguist/linguist/blob/916bd8f3df802fa98b0ea85539e67bd0b88ef158/lib/linguist/generated.rb#L672-L683
Yeah, probably best to fix that too. Lets do it in this PR.
Should be all done now.
Note that I've had to relax the constraint that the property has to be on the 3rd line. There seems to be 50k+ files of GMStudio generated JSON where the old modelName property appears after the 3rd line, so even for GMStudio before v2.3, that constraint was too restrictive.
After looking a little more at the syntax for Yacc, I realize that a Yacc file cannot start with a { or [. It has to start with a declaration which begins with a %. This means that we could simply use the heuristic that we used for JSON from here (checking that the first non-whitespace character is either { or [):
https://github.com/github-linguist/linguist/blob/39fd5e93de98de3434cb3e857c5b1972c418f8c5/lib/linguist/heuristics.yml#L353-L354
And for the generated condition inside generated.rb , we could simply have something like:
return lines.first(3).join('').match?(/^\s*[{\[]/) ||
lines[0] =~ /^\d\.\d\.\d.+\|\{/
Thoughts?
Corresponding search query: Still 600K files.
That is actually a much better approach in my opinion :+) Good find! I've seen one other .yy syntax that I don't think is either GM or Yacc, but since this setup only has these 2 included I think this might be a very clean and good solution.
JSON + .yy extension = GM No JSON + .yy extension = Yacc
We should be all good now with the latest changes bb34e54 (#6976).
I've seen one other .yy syntax that I don't think is either GM or Yacc
Good catch! This actually seems like an unusual Yacc file that skips the directives section going against the convention/documentation, so it would still be correct to categorize them as Yacc technically.
However, for this to be a problem, those unconventional Yacc files would need to start with a {, but GitHub Search tells me the number of files where this happens is likely in the single digits as it's struggling to find more than 3.