Syntax autodetect based on file content

Open ruihe774 opened this issue 1 year ago • 6 comments

Hi! In the current implementation, bat detects the language of a file from its extension and its first line, so it may fail to detect and highlight files without an extension, or input from stdin. A solution for such cases is to guess the language from the file content. This approach is used in editors like VSCode.

I've tried implementing this autodetection feature for bat; see https://github.com/ruihe774/bat/tree/guesslang. In this implementation, if detection by file extension and by first line both fail, bat probes the first few (kilo)bytes and detects the language using the model from guesslang, which is also used in VSCode. It works fairly well, and you are welcome to try it. I'm wondering whether you are interested in this feature and whether it could be merged upstream.
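The probing step described above could look roughly like the following. This is a minimal sketch, not the code from the linked branch: the function name `read_probe` and the 4096-byte `PROBE_BYTES` limit are assumptions made for illustration (the post only says "the first few (kilo)bytes").

```rust
use std::io::{self, Read};

/// Hypothetical probe size; the post does not state the exact number.
const PROBE_BYTES: u64 = 4096;

/// Read at most `limit` bytes from `reader` and return them.
/// Shorter inputs are returned in full.
fn read_probe<R: Read>(reader: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // `take` caps how many bytes `read_to_end` is allowed to pull in.
    reader.take(limit).read_to_end(&mut buf)?;
    Ok(buf)
}
```

For stdin, the probed bytes would presumably have to be kept around and replayed when printing, since the stream cannot be rewound.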

ruihe774 avatar Oct 06 '23 09:10 ruihe774

Looks interesting, thanks for sharing. So, if I understand correctly, the new asset file is 738KiB, so the bat binary would grow at least that much larger? I wonder how much it would affect startup time in the case where guesslang isn't needed, and I'm curious how long it takes when guesslang is used.

Also, how is the onnx file generated? Probably if we were to integrate something like this, we'd want instructions on how to update the model etc - I guess we'd have to read up on it in the guesslang documentation, right?

I see that, the way it is trained, it supports just that static list of LABELS, and when invoking it, it returns indexes from that array with probabilities? It's a little hard for me to mentally map those labels/"tokens" to the relevant syntax - especially as GitHub search doesn't search submodule content. Are they all file extensions? Actually, I think I partially answered my own question: it's taken directly from https://github.com/yoeo/guesslang/blob/f67e7b1bc963d06fc244304dba7f0f0ce39a0d4c/guesslang/data/languages.json. I wonder how many of those bat supports, and how many can/can't often have the syntax detected from the first line with syntect - i.e. how much benefit would it really bring?

keith-hall avatar Oct 20 '23 20:10 keith-hall

So, if I understand correctly, the new asset file is 738KiB, so the bat binary would grow at least that much larger?

After it is compressed by zlib, which is also used by themes.bin and syntaxes.bin, the size of guesslang.onnx will be 549K.

Meanwhile, we can link onnxruntime dynamically or statically. There are three cases. In all of them, ORT_STRATEGY should be set to system at build time (see the ort documentation).

  • A dynamic onnxruntime library built without --use_extensions is installed on the system. In this case, the onnxruntime-extensions library must also be installed, and bat will depend on both dynamic libraries. ORT_LIB_LOCATION should be set to the onnxruntime library directory at build time; bat will be dynamically linked against it. OCOS_LIB_PATH should be set to the path of onnxruntime-extensions at build time; ort will dlopen() it at run time.
  • A dynamic onnxruntime library built with --use_extensions (see the onnxruntime documentation) is installed on the system. In this case, bat will depend only on onnxruntime.
  • A minimal static build of onnxruntime with selected ops is used (see the documentation; it is somewhat complicated). ORT_LIB_LOCATION should be set to the build directory at build time; bat will be statically linked against it. No system-wide dependencies are required. In this case, the bat binary will grow by another 1.7M.

I wonder how much it would affect startup time in the case where guesslang isn't needed, and I'm curious how long it takes when guesslang is used.

I use OnceCell to initialize the ONNX runtime and session on the first call to guesslang(), so nothing is initialized if it is never called. Also, take a look at assets.rs#L286-L294: guesslang() is only called when the other methods cannot infer the language. When guesslang is used, it takes only some tens of milliseconds. We could also provide a way to enable or disable guesslang through the command line and the library interface.
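To illustrate the lazy initialization and the fallback order, here is a minimal sketch. It is not the code from the branch: it uses `std::sync::OnceLock` from the standard library in place of OnceCell, and `Model`, `LOADS`, `model()` and `detect()` are hypothetical names. The "model" is a placeholder that only looks for `def `, and the first-line check is reduced to a single shebang.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for the expensive ONNX runtime and session (hypothetical type).
struct Model;

/// Counts model constructions, only to make the laziness observable.
static LOADS: AtomicUsize = AtomicUsize::new(0);
static MODEL: OnceLock<Model> = OnceLock::new();

impl Model {
    fn load() -> Model {
        LOADS.fetch_add(1, Ordering::SeqCst);
        Model
    }

    /// Placeholder classifier; a real model would run inference here.
    fn guess(&self, content: &str) -> Option<&'static str> {
        if content.contains("def ") {
            Some("py")
        } else {
            None
        }
    }
}

/// The model is built on first use and reused afterwards.
fn model() -> &'static Model {
    MODEL.get_or_init(Model::load)
}

/// Try the cheap methods first; consult the model only as a last resort.
fn detect(extension: Option<&str>, content: &str) -> Option<String> {
    if let Some(ext) = extension {
        return Some(ext.to_string());
    }
    if content.starts_with("#!/bin/sh") {
        return Some("sh".to_string());
    }
    model().guess(content).map(str::to_string)
}
```

With this shape, inputs that are resolved by extension or first line never touch the model, which is consistent with the small startup overhead reported in the benchmarks.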

I have run some benchmarks. On my computer, the results for vanilla 0.24.0 are:

bat benchmark results

| Benchmark | Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|
| Startup time | bat | 5.5 ± 0.3 | 5.0 | 7.0 | 1.00 |
| Startup time with syntax highlighting | bat … small-CpuInfo-file.cpuinfo | 8.6 ± 0.3 | 8.1 | 10.6 | 1.00 |
| Startup time with syntax with dependencies | bat … small-Markdown-file.md | 12.0 ± 0.9 | 11.4 | 20.0 | 1.00 |
| Plain-text speed | bat … --language=txt numpy_test_multiarray.py | 9.3 ± 0.3 | 8.9 | 10.8 | 1.00 |
| Syntax highlighting speed --wrap=character: grep-output-ansi-sequences.txt | bat … grep-output-ansi-sequences.txt | 24.6 ± 4.2 | 23.4 | 69.3 | 1.00 |
| Syntax highlighting speed --wrap=character: jquery.js | bat … jquery.js | 335.2 ± 5.0 | 332.5 | 349.0 | 1.00 |
| Syntax highlighting speed --wrap=character: miniz.c | bat … miniz.c | 28.8 ± 1.2 | 27.7 | 36.1 | 1.00 |
| Syntax highlighting speed --wrap=character: numpy_test_multiarray.py | bat … numpy_test_multiarray.py | 442.7 ± 5.6 | 436.4 | 452.3 | 1.00 |
| Syntax highlighting speed --wrap=never: grep-output-ansi-sequences.txt | bat … grep-output-ansi-sequences.txt | 20.8 ± 0.6 | 20.2 | 25.6 | 1.00 |
| Syntax highlighting speed --wrap=never: jquery.js | bat … jquery.js | 330.6 ± 1.1 | 329.0 | 332.6 | 1.00 |
| Syntax highlighting speed --wrap=never: miniz.c | bat … miniz.c | 28.3 ± 1.0 | 27.6 | 37.8 | 1.00 |
| Syntax highlighting speed --wrap=never: numpy_test_multiarray.py | bat … numpy_test_multiarray.py | 437.7 ± 2.9 | 434.3 | 442.0 | 1.00 |
| Many small files speed (overhead of metadata) | bat … --language=txt *.txt | 6.7 ± 0.4 | 6.2 | 8.5 | 1.00 |

bat with guesslang has the following startup times:

bat benchmark results

| Benchmark | Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|
| Startup time | bat | 6.2 ± 0.3 | 5.8 | 7.5 | 1.00 |
| Startup time with syntax highlighting | bat … small-CpuInfo-file.cpuinfo | 9.4 ± 0.6 | 8.9 | 17.8 | 1.00 |
| Startup time with syntax with dependencies | bat … small-Markdown-file.md | 13.5 ± 3.9 | 12.4 | 64.9 | 1.00 |

I have also benchmarked a modified version that enables guesslang for all inputs. Its results are:

bat benchmark results

| Benchmark | Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|
| Startup time | bat | 6.1 ± 0.3 | 5.7 | 7.4 | 1.00 |
| Startup time with syntax highlighting | bat … small-CpuInfo-file.cpuinfo | 38.4 ± 0.6 | 37.6 | 40.9 | 1.00 |
| Startup time with syntax with dependencies | bat … small-Markdown-file.md | 42.3 ± 0.6 | 41.3 | 46.0 | 1.00 |
| Plain-text speed | bat … --language=txt numpy_test_multiarray.py | 10.2 ± 0.3 | 9.8 | 11.7 | 1.00 |
| Syntax highlighting speed --wrap=character: grep-output-ansi-sequences.txt | bat … grep-output-ansi-sequences.txt | 61.5 ± 0.7 | 60.9 | 65.4 | 1.00 |
| Syntax highlighting speed --wrap=character: jquery.js | bat … jquery.js | 372.6 ± 1.4 | 370.7 | 375.7 | 1.00 |
| Syntax highlighting speed --wrap=character: miniz.c | bat … miniz.c | 65.8 ± 0.5 | 65.3 | 68.2 | 1.00 |
| Syntax highlighting speed --wrap=character: numpy_test_multiarray.py | bat … numpy_test_multiarray.py | 477.2 ± 1.8 | 474.7 | 480.9 | 1.00 |
| Syntax highlighting speed --wrap=never: grep-output-ansi-sequences.txt | bat … grep-output-ansi-sequences.txt | 58.1 ± 0.9 | 57.4 | 63.3 | 1.00 |
| Syntax highlighting speed --wrap=never: jquery.js | bat … jquery.js | 369.3 ± 2.9 | 366.7 | 376.7 | 1.00 |
| Syntax highlighting speed --wrap=never: miniz.c | bat … miniz.c | 65.4 ± 0.5 | 64.9 | 68.6 | 1.00 |
| Syntax highlighting speed --wrap=never: numpy_test_multiarray.py | bat … numpy_test_multiarray.py | 472.7 ± 1.7 | 470.9 | 476.8 | 1.00 |
| Many small files speed (overhead of metadata) | bat … --language=txt *.txt | 7.3 ± 0.3 | 6.9 | 9.0 | 1.00 |

Also, how is the onnx file generated?

You can refer to my script. (I have not polished it yet.)

I see that, the way it is trained, it supports just that static list of LABELS, and when invoking it, it returns indexes from that array with probabilities?

Yes, your link to https://github.com/yoeo/guesslang/blob/f67e7b1bc963d06fc244304dba7f0f0ce39a0d4c/guesslang/data/languages.json is right. We can use the keys for language names, or the values for extensions. I'm a bit lazy and don't want to translate the language names into the ones bat uses, so I just use the extensions.

The model outputs an array of probabilities for the 54 languages (summing to 1). I just pick the language with the largest probability and select it if that probability is greater than 0.5. The threshold can be tuned further.
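The selection rule can be sketched as an argmax followed by a threshold check. The function name `pick_label` and its signature are assumptions for illustration, not taken from the branch; the labels stand for the extensions from languages.json.

```rust
/// Return the label with the highest probability, but only if that
/// probability is strictly greater than `threshold`.
fn pick_label<'a>(probs: &[f32], labels: &[&'a str], threshold: f32) -> Option<&'a str> {
    // Index and value of the largest probability; `None` for empty input.
    let (best, &p) = probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))?;
    if p > threshold {
        labels.get(best).copied()
    } else {
        None
    }
}
```

Because the probabilities sum to 1, a threshold of 0.5 means at most one language can pass, so the result is unambiguous whenever it is not `None`.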

I wonder how many of those bat supports, and how many can/can't often have the syntax detected from the first line with syntect - i.e. how much benefit would it really bring?

Well, it is a difficult question for me to answer. How many kinds of first lines do these 54 languages have...

Or we can ask how many languages do not have a recognizable first line. I think many.

ruihe774 avatar Oct 21 '23 06:10 ruihe774

Thanks for the detailed explanations and benchmarks. It will be interesting to see what the other maintainers think about this.

I wonder how many of those bat supports, and how many can/can't often have the syntax detected from the first line with syntect - i.e. how much benefit would it really bring?

Well, it is a difficult question for me to answer. How many kinds of first lines do these 54 languages have...

It was more like food for thought than something I expected you to answer, sorry for not making that clearer.

keith-hall avatar Oct 27 '23 01:10 keith-hall

This file type detection could also be an optional feature, so users could decide at compile time whether they prefer a smaller binary or a larger one with auto-detection.
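In Rust, such a compile-time switch is usually expressed with a Cargo feature and `#[cfg]` attributes. The sketch below assumes a hypothetical feature named `guesslang` and a hypothetical function `guess_from_content`; neither is confirmed by the thread, and the feature-enabled body is only a placeholder.

```rust
/// Content-based guess, compiled in only when the (hypothetical)
/// `guesslang` Cargo feature is enabled.
#[cfg(feature = "guesslang")]
fn guess_from_content(content: &[u8]) -> Option<&'static str> {
    // Placeholder: a real build would run the model here.
    if content.starts_with(b"#!") {
        Some("sh")
    } else {
        None
    }
}

/// Without the feature, the function compiles to a stub that never
/// guesses, so neither the model nor the runtime ends up in the binary.
#[cfg(not(feature = "guesslang"))]
fn guess_from_content(_content: &[u8]) -> Option<&'static str> {
    None
}
```

Callers can use `guess_from_content` unconditionally; whether it does anything is decided when the binary is built.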

Some more thoughts:

  1. The file-format library could also be used for file type detection. Its footprint is likely much smaller.
  2. Or going in the other direction, there is a new AI library Magika for content type detection.

ppetr avatar Jun 20 '24 12:06 ppetr

I'd very much appreciate this kind of content type auto-detection in any form - even a very primitive one would be very helpful.

ppetr avatar Jun 20 '24 12:06 ppetr