onefetch icon indicating copy to clipboard operation
onefetch copied to clipboard

Incorrect language detected (C++ as C, XML as TypeScript, etc.)

Open aaronfranke opened this issue 5 years ago • 41 comments

https://github.com/github/linguist

Linguist is a tool developed by GitHub for the specific purpose of detecting languages. It's a very mature tool that gets it right the majority of the time by using complex rules.

aaronfranke avatar Feb 16 '19 20:02 aaronfranke

But why ? https://github.com/Aaronepower/tokei is written in Rust and does a great job detecting languages.

o2sh avatar Feb 17 '19 10:02 o2sh

Is that what Onefetch currently uses? It detects C++ as C in the case of Godot, and it didn't detect anything for the repo of a Godot project (while GitHub detects GDScript).

aaronfranke avatar Feb 17 '19 18:02 aaronfranke

it only detects the languages that are currently supported by onefetch (WIP):

C
Clojure
C++
C#
Go
Haskell
Java
Lisp
Lua
Python
R
Ruby
Rust
Scala
Shell
TypeScript
JavaScript
Php

Also tokei ignores all commented lines which is why the language distribution sometimes differs from GH.

Supported languages by tokei --> https://github.com/Aaronepower/tokei#supported-languages

o2sh avatar Feb 17 '19 18:02 o2sh

Upstream issues: https://github.com/Aaronepower/tokei/issues/305 and https://github.com/Aaronepower/tokei/issues/67

We can leave this closed though if you want.

aaronfranke avatar Mar 10 '19 00:03 aaronfranke

Ok, with the new title it makes more sense to keep this open.

We'll wait for tokei to fix it then.

Thx @aaronfranke

o2sh avatar Mar 10 '19 11:03 o2sh

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 21 '20 23:08 stale[bot]

This issue still exists, though it is likely seen by the devs as low priority, so I'll probably have to bump this again later to please the stale bot.

aaronfranke avatar Aug 21 '20 23:08 aaronfranke

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 20 '20 00:11 stale[bot]

This issue still exists, though it is likely seen by the devs as low priority, so I'll probably have to bump this again later to please the stale bot.

aaronfranke avatar Nov 20 '20 01:11 aaronfranke

Hi, I added the c header and cpp header to language.rs file to my fork https://github.com/mapau/onefetch/tree/feature/add-c-cpp-header

https://github.com/Aaronepower/tokei already detects the c header and cpp header only the mapping in onefetch is missing.

Here is a PR https://github.com/o2sh/onefetch/pull/365

mapau avatar Dec 28 '20 17:12 mapau

I'm not very fund of this idea of having separate entries for header files (CHeader and C++Header). I personally prefer the GitHub Linguist approach of extending C and C++ detection scope to include their respective header files:

C++:
  type: programming
  tm_scope: source.c++
  ace_mode: c_cpp
  codemirror_mode: clike
  codemirror_mime_type: text/x-c++src
  color: "#f34b7d"
  aliases:
  - cpp
  extensions:
  - ".cpp"
  - ".c++"
  - ".cc"
  - ".cp"
  - ".cxx"
  - ".h"
  - ".h++"
  - ".hh"
  - ".hpp"
  - ".hxx"
  - ".inc"
  - ".inl"
  - ".ino"
  - ".ipp"
  - ".re"
  - ".tcc"
  - ".tpp"

  C:
  type: programming
  color: "#555555"
  extensions:
  - ".c"
  - ".cats"
  - ".h"
  - ".idc"
  interpreters:
  - tcc
  tm_scope: source.c
  ace_mode: c_cpp
  codemirror_mode: clike
  codemirror_mime_type: text/x-csrc
  language_id: 41

I doubt the people over at tokei would be ready to make that shift...So, either we stick to tokei's detection rules and merge @mapau's PR, or we override the logic in Onefetch or...

o2sh avatar Dec 30 '20 17:12 o2sh

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 02 '21 17:04 stale[bot]

This issue still exists, though it is seen by the devs as low priority, so I'll probably have to bump this again later to please the stale bot. The problem still deserves some kind of solution eventually.

aaronfranke avatar Apr 02 '21 18:04 aaronfranke

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 01 '21 19:07 stale[bot]

This issue still exists, though it is seen by the devs as low priority, so I'll probably have to bump this again later to please the stale bot. The problem still deserves some kind of solution eventually.

aaronfranke avatar Jul 01 '21 19:07 aaronfranke

FWIW I tried this on the Bitcoin Core repo and it thinks it's 50.5%...

a) theres no typescript in the project b) perhaps this is related to C++ header detection?

onefetch 2.11.0

image

github-linguist does return the correct information on this repo:

[20:05] ubuntu:bitcoin (master) | ₿ github-linguist
67.10%  9517488    C++
18.93%  2684890    Python
8.65%   1227100    C
1.66%   235453     M4
1.47%   209083     Shell
0.98%   138998     Makefile
0.40%   56897      Sage
0.21%   29132      CMake
0.20%   28178      Assembly
0.17%   24552      Scheme
0.15%   21833      HTML
0.04%   5497       Objective-C++
0.01%   1721       Dockerfile
0.01%   1256       Cap'n Proto
0.00%   541        Java
0.00%   438        QMake

willcl-ark avatar Feb 21 '22 20:02 willcl-ark

Interesting about the TypeScript issue. Looks like it's caused by the files in src/qt/locale, which all have .ts extensions (guess it's the extension of Qt translation files, too).

b) perhaps this is related to C++ header detection?

Most likely. What it comes down to is that tokei, our dependency for language detection, naively using filenames instead of the actual content for categorization. github-linguist performs heuristics on file contents, helping make it much more accurate when there are conflicts with extensions. You can try creating an issue over there, but looking at the issue history it might be closed as a duplicate of https://github.com/XAMPPRocky/tokei/issues/67.

spenserblack avatar Feb 21 '22 20:02 spenserblack

Right. I did take a quick look at those issues but agree that it seems likely that it will be closed as a duplicate.

Can't really see how this can be improved without either using the github tool (which works).

willcl-ark avatar Feb 22 '22 09:02 willcl-ark

I don't think we should be using github-linguist directly, since it's written in Ruby, and any interfacing with it would likely be hack-y. However, it could be possible to make a Rust port of linguist that uses the same data files (similar to what linguist-js does). But one cool thing that tokei does is categorize lines so that commented lines can be treated separately from actual code, which I think we would lose by migrating to a lingusit-like crate. Removing the tokei dependency would also be a large change on our end at this point.

Another option is to fork tokei to fix this.

Unfortunately, for your use case, it looks like you would have to use a tokei config file for now to ignore "TypeScript," since that is tokei's preferred solution.

spenserblack avatar Feb 22 '22 13:02 spenserblack

I think we should start collecting all "incorrect language detected" issues together, so I'm going to rename this to something more general and pin it.

spenserblack avatar May 12 '22 16:05 spenserblack

Interesting about the TypeScript issue. Looks like it's caused by the files in src/qt/locale, which all have .ts extensions (guess it's the extension of Qt translation files, too).

b) perhaps this is related to C++ header detection?

Most likely. What it comes down to is that tokei, our dependency for language detection, naively using filenames instead of the actual content for categorization. github-linguist performs heuristics on file contents, helping make it much more accurate when there are conflicts with extensions. You can try creating an issue over there, but looking at the issue history it might be closed as a duplicate of XAMPPRocky/tokei#67.

I reported this issue over a year ago: XAMPPRocky/tokei#719.

throwaway1037 avatar May 14 '22 17:05 throwaway1037

Incorrect detection of Verilog using tokei.

tokei uses file extension *.vg for verilog while the more popular choice is *.v tokei *.v is defined as COQ, all this is described in tokei issue 520

When considering a new approach, please consider verilog file identification as a useful test case.

atluft avatar Oct 08 '22 16:10 atluft

Thanks for the report, @atluft. I guess this also means that tokei can have a problem with V, it it is ever implemented.

spenserblack avatar Oct 08 '22 19:10 spenserblack

@o2sh We might want to create a known-issues.md file, documenting workarounds (alternate extensions to use, tokei config file snippets, etc.).

spenserblack avatar Oct 08 '22 19:10 spenserblack

I'd be happy to do so, but do we actually have any workaround for this? 🤔

As far as I know tokei still doesn't provide an option to allows users to override the extensions - as suggested here

o2sh avatar Oct 09 '22 16:10 o2sh

Sorry, I incorrectly assumed that tokei allowed language overrides, but I guess that's not implemented yet.

Well, the only workaround that I know of is renaming all Verilog files to *.vg :sweat_smile:

spenserblack avatar Oct 10 '22 13:10 spenserblack

Coming back to a really old issue to document a potential solution:

It might be worth creating a new crate that acts as a wrapper for tokei. This wrapper would provide its own function for getting languages, adding the following:

  • heuristics using regexes
  • naive bayesian classification using code samples Basically, it should do what github-linguist does.

Also, it should probably re-export the rest of tokei's public interface to make usage easier.

Such a crate should probably be in a separate repository, as I anticipate releases occurring on a very different schedule from onefetch.

Additionally, this crate would probably need a lot of community support to provide the heuristics and code samples.

I might attempt to do this sometime, but I can't promise that it will be soon. If someone else wants to take this on, I'll be happy to help and discuss this further.

spenserblack avatar Feb 20 '23 02:02 spenserblack

that acts as a wrapper for tokei.

You mean exposing the same set of APIs? or is it gonna reuse some of tokei's code?

If I'm understanding correctly, the new project will be similar to github-linguist but implemented in Rust. Does that mean sacrificing some performance for better accuracy?

Regardless, it's definitely an intriguing challenge. If executed well, it could gain a lot of traction, especially given the current state/limitations of existing solutions 😢.

I'd be happy to help 👍

o2sh avatar Feb 20 '23 10:02 o2sh

You mean exposing the same set of APIs? or is it gonna reuse some of tokei's code?

Mostly exposing the same API. Basically a bunch of pub use. For the re-implementation of get_statistics, some code reuse might be necessary.

This could also be a fork of tokei. I was just thinking about "wrapping" tokei since AFAIK get_statistics is the only part we're not satisfied with. But maybe forking makes more sense so we wouldn't have to re-implement as much internal code :shrug:

Also, since we're mentioning github-linguist, I should note that linguist actually analyzes the HEAD (or other rev) of a repository, not the filesystem, which is pretty cool. But I'm not sure if we'd want that since we display pending change stats.

So I guess the first question is: do we want to improve tokei for our purposes, or port linguist to Rust?

spenserblack avatar Feb 20 '23 13:02 spenserblack

I'm seeing this as well on a repository:

$ github-linguist
46.47%  205323     Shell
38.84%  171601     Lua
8.45%   37319      Perl
2.97%   13138      Ruby
2.91%   12836      Python
0.19%   861        Scheme
0.17%   750        Vim Script

Which is correct, while onefetch things I have a lot of Go code:

$ onefetch
           --==============--              Dan Sully ~ git version 2.40.0
  .-==-.===oooo=oooooo=ooooo===--===-      ------------------------------
 .==  =o=oGGGGGGo=oo=oGGGGGGGG=o=  oo-     Project: dotfiles
 -o= oo=G .=GGGGGo=o== .=GGGGG=ooo o=-     HEAD: f76ffe4 (master)
  .-=oo=o==oGGGGG=oo=oooGGGGGo=oooo.       Pending: 12+- 153+
   -ooooo=oooooo=.   .==ooo==oooooo-       Created: 3 years ago
   -ooooooooooo====_====ooooooooooo=       Languages:
   -oooooooooooo==#.#==ooooooooooooo                  ● Go (21.6 %) ● HTML (18.9 %)
   -ooooooooooooo=#.#=oooooooooooooo                  ● Rust (14.8 %) ● Python (9.9 %)
   .oooooooooooooooooooooooooooooooo.                 ● C (7.2 %) ● Vim script (7.1 %)
    oooooooooooooooooooooooooooooooo.                 ● Other (20.5 %)
  ..oooooooooooooooooooooooooooooooo..     Authors: 99% Dan Sully 1397
-=o-=ooooooooooooooooooooooooooooooo-oo.            0% Dan Sully 5
.=- oooooooooooooooooooooooooooooooo-.-             0% Dan Sully 3
   .oooooooooooooooooooooooooooooooo-      Last change: 17 hours ago
   -oooooooooooooooooooooooooooooooo-      Contributors: 7
   -oooooooooooooooooooooooooooooooo-      URL: [email protected]:dsully/dotfiles.git
   -oooooooooooooooooooooooooooooooo-      Commits: 1409
   .oooooooooooooooooooooooooooooooo       Lines of code: 5023922
    =oooooooooooooooooooooooooooooo-       Size: 1.13 MiB (241 files)
    .=oooooooooooooooooooooooooooo-
      -=oooooooooooooooooooooooo=.
     =oo====oooooooooooooooo==-oo=-
    .-==-    .--=======---     .==-

dsully avatar Apr 25 '23 20:04 dsully