Syntax autodetect based on file content
Hi! In its current implementation, bat detects the language of a file by its extension and its first line, and may fail to detect and highlight files that have no extension or that come from stdin. A solution for such cases is to guess the language from the file content. This approach is used in editors like VSCode.
I've tried implementing this autodetection feature for bat; see https://github.com/ruihe774/bat/tree/guesslang. In this implementation, if extension and first-line detection fail, bat probes the first few kilobytes of the input and detects the language using the model from guesslang, which is also used in VSCode. It works fairly well and you can give it a try. I'm wondering whether you are interested in this feature and whether it could be merged upstream.
Looks interesting, thanks for sharing. So, if I understand correctly, the new asset file is 738 KiB, so the `bat` binary would grow at least that much larger? I wonder how much it would affect startup time in the case where guesslang isn't needed, and I'm curious how long it takes when guesslang is used.
Also, how is the `.onnx` file generated? If we were to integrate something like this, we'd probably want instructions on how to update the model, etc. I guess we'd have to read up on it in the guesslang documentation, right?
I see that, the way it is trained, it supports just that static list of `LABELS`, and when invoked it returns indexes into that array with probabilities? It's a little hard for me to mentally map those labels/"tokens" to the relevant syntaxes, especially as GitHub search doesn't search submodule content. Are they all file extensions? Actually, I think I partially answered my own question: it's taken directly from https://github.com/yoeo/guesslang/blob/f67e7b1bc963d06fc244304dba7f0f0ce39a0d4c/guesslang/data/languages.json. I wonder how many of those `bat` supports, and how many can/can't often have their syntax detected from the first line with `syntect` - i.e. how much benefit would it really bring?
> So, if I understand correctly, the new asset file is 738 KiB, so the `bat` binary would grow at least that much larger?
After compression with zlib, which is also used for themes.bin and syntaxes.bin, guesslang.onnx is 549 KiB.
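For reference, the compression itself is nothing special; here is a minimal sketch using the flate2 crate (the actual asset pipeline may differ slightly, e.g. in the compression level and file names used):

```rust
use std::fs;
use std::io::Write;

use flate2::write::ZlibEncoder;
use flate2::Compression;

// Compress the raw model file with zlib, the same scheme used for the other
// embedded binary assets; this is what takes guesslang.onnx down to ~549 KiB.
fn main() -> std::io::Result<()> {
    let raw = fs::read("guesslang.onnx")?;
    let mut encoder = ZlibEncoder::new(Vec::new(), Compression::best());
    encoder.write_all(&raw)?;
    let compressed = encoder.finish()?;
    fs::write("guesslang.onnx.zlib", compressed)
}
```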
Meanwhile, we can link onnxruntime dynamically or statically. There are three situations. In all cases, `ORT_STRATEGY` should be set to `system` at build time (see the documentation of ort).
- A dynamic onnxruntime library built without `--use_extensions` is installed on the system. In this case, the onnxruntime-extensions library is also required, and bat will depend on these two dynamic libraries. `ORT_LIB_LOCATION` should be set to the library directory of onnxruntime at build time; bat will be dynamically linked against it. `OCOS_LIB_PATH` should be set to the path of onnxruntime-extensions at build time; ort will `dlopen()` it at run time.
- A dynamic onnxruntime library built with `--use_extensions` (see the documentation of onnxruntime) is installed on the system. In this case, bat will depend only on onnxruntime.
- A minimal static build of onnxruntime with selected ops is built (see the documentation; it is somewhat complicated). `ORT_LIB_LOCATION` should be set to the directory of that build at build time; bat will be statically linked with it. No system-wide dependencies are required. In this case, bat will grow by another 1.7M.
> I wonder how much it would affect startup time in the case where guesslang isn't needed, and I'm curious how long it takes when guesslang is used.
I use `OnceCell` to initialize the onnx runtime and session at the first call to `guesslang()`, so if it is never called, nothing is initialized. Also, take a look at assets.rs#L286-L294: `guesslang()` is only called when the other methods cannot infer the language. When guesslang is used, it takes only some tens of milliseconds. We could also provide a way to enable or disable guesslang through the command line and the library interface.
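Roughly, the lazy initialization looks like the following sketch (simplified; the type and helper names here are placeholders rather than the actual code in my branch):

```rust
use once_cell::sync::OnceCell;

// Placeholder for the loaded model session; the real code wraps ort's
// environment and session types.
struct GuesslangSession;

// Stub standing in for decompressing and loading the embedded guesslang.onnx asset.
fn load_model_from_embedded_asset() -> GuesslangSession {
    GuesslangSession
}

// Stub standing in for running the model on the probed bytes and mapping the
// most probable label to a syntax name.
fn run_inference(_session: &GuesslangSession, _first_bytes: &[u8]) -> Option<String> {
    None
}

static SESSION: OnceCell<GuesslangSession> = OnceCell::new();

fn guesslang(first_bytes: &[u8]) -> Option<String> {
    // The session (and hence the onnx runtime) is created on the first call only,
    // so inputs resolved by extension or first-line detection never pay this cost.
    let session = SESSION.get_or_init(load_model_from_embedded_asset);
    run_inference(session, first_bytes)
}
```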
I have done some benchmarks. On my computer, the vanilla 0.24.0 gives:
bat benchmark results

| Benchmark | Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|
| Startup time | `bat` | 5.5 ± 0.3 | 5.0 | 7.0 | 1.00 |
| Startup time with syntax highlighting | `bat … small-CpuInfo-file.cpuinfo` | 8.6 ± 0.3 | 8.1 | 10.6 | 1.00 |
| Startup time with syntax with dependencies | `bat … small-Markdown-file.md` | 12.0 ± 0.9 | 11.4 | 20.0 | 1.00 |
| Plain-text speed | `bat … --language=txt numpy_test_multiarray.py` | 9.3 ± 0.3 | 8.9 | 10.8 | 1.00 |
| Syntax highlighting speed --wrap=character: grep-output-ansi-sequences.txt | `bat … grep-output-ansi-sequences.txt` | 24.6 ± 4.2 | 23.4 | 69.3 | 1.00 |
| Syntax highlighting speed --wrap=character: jquery.js | `bat … jquery.js` | 335.2 ± 5.0 | 332.5 | 349.0 | 1.00 |
| Syntax highlighting speed --wrap=character: miniz.c | `bat … miniz.c` | 28.8 ± 1.2 | 27.7 | 36.1 | 1.00 |
| Syntax highlighting speed --wrap=character: numpy_test_multiarray.py | `bat … numpy_test_multiarray.py` | 442.7 ± 5.6 | 436.4 | 452.3 | 1.00 |
| Syntax highlighting speed --wrap=never: grep-output-ansi-sequences.txt | `bat … grep-output-ansi-sequences.txt` | 20.8 ± 0.6 | 20.2 | 25.6 | 1.00 |
| Syntax highlighting speed --wrap=never: jquery.js | `bat … jquery.js` | 330.6 ± 1.1 | 329.0 | 332.6 | 1.00 |
| Syntax highlighting speed --wrap=never: miniz.c | `bat … miniz.c` | 28.3 ± 1.0 | 27.6 | 37.8 | 1.00 |
| Syntax highlighting speed --wrap=never: numpy_test_multiarray.py | `bat … numpy_test_multiarray.py` | 437.7 ± 2.9 | 434.3 | 442.0 | 1.00 |
| Many small files speed (overhead of metadata) | `bat … --language=txt *.txt` | 6.7 ± 0.4 | 6.2 | 8.5 | 1.00 |
The bat build with guesslang has a startup time of:
bat benchmark results

| Benchmark | Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|
| Startup time | `bat` | 6.2 ± 0.3 | 5.8 | 7.5 | 1.00 |
| Startup time with syntax highlighting | `bat … small-CpuInfo-file.cpuinfo` | 9.4 ± 0.6 | 8.9 | 17.8 | 1.00 |
| Startup time with syntax with dependencies | `bat … small-Markdown-file.md` | 13.5 ± 3.9 | 12.4 | 64.9 | 1.00 |
I have also benchmarked a modified version that enables guesslang for all inputs. Its speed is:
bat benchmark results

| Benchmark | Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|
| Startup time | `bat` | 6.1 ± 0.3 | 5.7 | 7.4 | 1.00 |
| Startup time with syntax highlighting | `bat … small-CpuInfo-file.cpuinfo` | 38.4 ± 0.6 | 37.6 | 40.9 | 1.00 |
| Startup time with syntax with dependencies | `bat … small-Markdown-file.md` | 42.3 ± 0.6 | 41.3 | 46.0 | 1.00 |
| Plain-text speed | `bat … --language=txt numpy_test_multiarray.py` | 10.2 ± 0.3 | 9.8 | 11.7 | 1.00 |
| Syntax highlighting speed --wrap=character: grep-output-ansi-sequences.txt | `bat … grep-output-ansi-sequences.txt` | 61.5 ± 0.7 | 60.9 | 65.4 | 1.00 |
| Syntax highlighting speed --wrap=character: jquery.js | `bat … jquery.js` | 372.6 ± 1.4 | 370.7 | 375.7 | 1.00 |
| Syntax highlighting speed --wrap=character: miniz.c | `bat … miniz.c` | 65.8 ± 0.5 | 65.3 | 68.2 | 1.00 |
| Syntax highlighting speed --wrap=character: numpy_test_multiarray.py | `bat … numpy_test_multiarray.py` | 477.2 ± 1.8 | 474.7 | 480.9 | 1.00 |
| Syntax highlighting speed --wrap=never: grep-output-ansi-sequences.txt | `bat … grep-output-ansi-sequences.txt` | 58.1 ± 0.9 | 57.4 | 63.3 | 1.00 |
| Syntax highlighting speed --wrap=never: jquery.js | `bat … jquery.js` | 369.3 ± 2.9 | 366.7 | 376.7 | 1.00 |
| Syntax highlighting speed --wrap=never: miniz.c | `bat … miniz.c` | 65.4 ± 0.5 | 64.9 | 68.6 | 1.00 |
| Syntax highlighting speed --wrap=never: numpy_test_multiarray.py | `bat … numpy_test_multiarray.py` | 472.7 ± 1.7 | 470.9 | 476.8 | 1.00 |
| Many small files speed (overhead of metadata) | `bat … --language=txt *.txt` | 7.3 ± 0.3 | 6.9 | 9.0 | 1.00 |
> Also, how is the onnx file generated?
You can refer to my script. (I have not polished it yet.)
> I see that, the way it is trained, it supports just that static list of `LABELS`, and when invoked it returns indexes into that array with probabilities?
Yes, your link to https://github.com/yoeo/guesslang/blob/f67e7b1bc963d06fc244304dba7f0f0ce39a0d4c/guesslang/data/languages.json is right. We can use the keys for language names, or the values for extensions. I'm a bit lazy and don't want to translate the language names to what we use in bat, so I just use the extensions.
The model outputs an array of probabilities over the 54 languages (they sum to 1). I pick the one with the largest probability and select it only if that probability is greater than 0.5. The threshold can be tuned further.
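In sketch form, the selection step is just the following (names here are illustrative, not the actual code in my branch):

```rust
/// Confidence threshold below which the guess is discarded; 0.5 for now,
/// but it can be tuned further.
const THRESHOLD: f32 = 0.5;

/// Pick the label with the highest probability and accept it only if the model
/// is confident enough; otherwise the caller falls back to plain text.
/// E.g. pick_language(&["py", "rs"], &[0.7, 0.3]) == Some("py").
fn pick_language<'a>(labels: &[&'a str], probabilities: &[f32]) -> Option<&'a str> {
    let (best_index, best_prob) = probabilities
        .iter()
        .copied()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))?;
    (best_prob > THRESHOLD).then(|| labels[best_index])
}
```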
> I wonder how many of those `bat` supports, and how many can/can't often have their syntax detected from the first line with `syntect` - i.e. how much benefit would it really bring?
Well, that is a difficult question for me to answer. How many kinds of first lines do these 54 languages have?
Or, to turn it around: how many languages do not have a telltale first line? I think many do not.
Thanks for the detailed explanations and benchmarks. It will be interesting to see what the other maintainers think about this.
> I wonder how many of those `bat` supports, and how many can/can't often have their syntax detected from the first line with `syntect` - i.e. how much benefit would it really bring?
>
> Well, that is a difficult question for me to answer. How many kinds of first lines do these 54 languages have?
It was more like food for thought than something I expected you to answer, sorry for not making that clearer.
This file type detection could also be an optional feature, so users could decide at compile time whether they prefer a smaller binary or a larger one with auto-detection.
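For example, the content-based path could sit behind a Cargo feature, roughly like this (the feature and function names are only illustrative, not an actual proposal for bat's code):

```rust
// With the feature disabled, the content-based path compiles away entirely, so the
// binary carries neither the embedded model nor the onnxruntime dependency.

#[cfg(feature = "guesslang")]
fn detect_by_content(first_bytes: &[u8]) -> Option<String> {
    run_guesslang(first_bytes)
}

#[cfg(feature = "guesslang")]
fn run_guesslang(_first_bytes: &[u8]) -> Option<String> {
    None // stub standing in for the model-based detection
}

#[cfg(not(feature = "guesslang"))]
fn detect_by_content(_first_bytes: &[u8]) -> Option<String> {
    None
}

fn main() {
    let sample = b"#!/usr/bin/env python3\nprint('hi')\n";
    println!("{:?}", detect_by_content(sample));
}
```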
Some more thoughts:
- The file-format library could also be used for file type detection. Its footprint is likely much smaller.
- Or, going in the other direction, there is a new AI library, Magika, for content type detection.
I'd very much appreciate this kind of content type auto-detection in any form - even a very primitive one would be very helpful.