linguist icon indicating copy to clipboard operation
linguist copied to clipboard

Fix heuristic for extension `.yy` (JSON vs Yacc)

Open DecimalTurn opened this issue 1 year ago • 10 comments

Description

Closes https://github.com/github-linguist/linguist/issues/6517 496k+ files affected

~~Implements fix suggested by @lildude and backed by @RobQuistNL in https://github.com/github-linguist/linguist/issues/6517#issuecomment-1680403065.~~

See this comment for the approach used: https://github.com/github-linguist/linguist/pull/6976#issuecomment-2264230000

Checklist:

  • [x] I am fixing a misclassified language
    • [x] I have included a new sample for the misclassified language:
      • Sample source: https://github.com/Ttanasart-pt/Pixel-Composer/blob/363a1dec53f6785ee58a7b3f3b8407a48dcdd2c2/scripts/VCT/VCT.yy
      • Sample License: MIT license
    • [x] I have included a change to the heuristics to distinguish my language from others using the same extension.

DecimalTurn avatar Jul 31 '24 04:07 DecimalTurn

Can we have a new sample and updated heuristic test, if necessary, for this too please.

Sure. I've added a sample JSON file with the .yy extension.

The test doesn't need any changes:

https://github.com/github-linguist/linguist/blob/39fd5e93de98de3434cb3e857c5b1972c418f8c5/test/test_heuristics.rb#L1087-L1093

DecimalTurn avatar Jul 31 '24 13:07 DecimalTurn

I was referring to adding a new GMStudio sample that matches the heuristic change.

lildude avatar Jul 31 '24 14:07 lildude

Those .yy files are produced by GMStudio, but they are (should be) classified as JSON by Linguist. That's why I've added the sample in the JSON folder.

The file added matches the heuristic added with the following line: "resourceType":"GMScript",

DecimalTurn avatar Jul 31 '24 14:07 DecimalTurn

🤦 Whoops. I missed that.

lildude avatar Jul 31 '24 15:07 lildude

🤦 Whoops. I missed that.

No worries. Let me know if you want me to fix the generated detection as well or if we can keep that for another PR. https://github.com/github-linguist/linguist/blob/916bd8f3df802fa98b0ea85539e67bd0b88ef158/lib/linguist/generated.rb#L672-L683

DecimalTurn avatar Jul 31 '24 15:07 DecimalTurn

Yeah, probably best to fix that too. Lets do it in this PR.

lildude avatar Jul 31 '24 16:07 lildude

Should be all done now. Note that I've had to relax the constraint that the property has to be on the 3rd line. There seems to be 50k+ files of GMStudio generated JSON where the old modelName property appears after the 3rd line, so even for GMStudio before v2.3, that constraint was too restrictive.

DecimalTurn avatar Jul 31 '24 17:07 DecimalTurn

After looking a little more at the syntax for Yacc, I realize that a Yacc file cannot start with a { or [. It has to start with a declaration which begins with a %. This means that we could simply use the heuristic that we used for JSON from here (checking that the first non-whitespace character is either { or [):

https://github.com/github-linguist/linguist/blob/39fd5e93de98de3434cb3e857c5b1972c418f8c5/lib/linguist/heuristics.yml#L353-L354

And for the generated condition inside generated.rb , we could simply have something like:

      return lines.first(3).join('').match?(/^\s*[{\[]/) ||
             lines[0] =~ /^\d\.\d\.\d.+\|\{/

Thoughts?


Corresponding search query: Still 600K files.

DecimalTurn avatar Aug 01 '24 23:08 DecimalTurn

That is actually a much better approach in my opinion :+) Good find! I've seen one other .yy syntax that I don't think is either GM or Yacc, but since this setup only has these 2 included I think this might be a very clean and good solution.

JSON + .yy extension = GM No JSON + .yy extension = Yacc

RobQuistNL avatar Aug 02 '24 13:08 RobQuistNL

We should be all good now with the latest changes bb34e54 (#6976).

I've seen one other .yy syntax that I don't think is either GM or Yacc

Good catch! This actually seems like an unusual Yacc file that skips the directives section going against the convention/documentation, so it would still be correct to categorize them as Yacc technically.

However, for this to be a problem, those unconventional Yacc files would need to start with a {, but GitHub Search tells me the number of files where this happens is likely in the single digits as it's struggling to find more than 3.

DecimalTurn avatar Aug 02 '24 17:08 DecimalTurn