bat
bat copied to clipboard
Share sublime syntaxes
Bat and Zola both use syntect for highlighting syntaxes and have their own folder for additional syntaxes on top of the official Sublime repo. It would be a good idea to merge so there is one known source of syntaxes that all of syntect users can contribute to.
The idea would be:
- have a third party repo containing the syntax files
- keep the patches that bat has since they solve issues
- pretty much any .sublime-syntax should be accepted as long as it's not a duplicate of an existing one (Zola highlights .srt for example, which is not strictly a programming language but can be useful in various contexts)
- build step is up to the consumer: bat seems to have a built in method for it while Zola has a small Rust script (https://github.com/getzola/zola/blob/master/components/config/examples/generate_sublime.rs) that prints a blurb pasted in the docs.
A bonus point for syntect is to have a Rust CI setup on that repo that tests every syntax for both rust-onig and fancy-regex, which can be used to report potential bugs/issues.
I've seen that quite often, tmLanguage syntax grammars are manually converted to sublime-syntax format for use in syntect. Have you considered adding support for tmLanguage files directly in syntect, it seems it could save some hassle? They essentially support a subset of sublime-syntax functionality, so the existing data structures can be used, it would just be a case of parsing the plist/xml format...
@Keats Thank you for your suggestion. As initially discussed in https://github.com/trishume/syntect/issues/287, that sounds like a good idea in principle.
Before we go forward with this, I'd first like to think about the possible downsides as well. I'm only interested in following this route, if it really leads to less maintenance work overall. There are also a few questions/problems that would need to be sorted out.
Pros:
- Synergy effect: all projects immediately benefit from contributions to the shared repo.
- Maintenance work for including new syntaxes can be shared (PR reviews, bug reports, patches, etc.)
- Part of the tooling and testing could be shared. CI would have to be set up just once.
Cons (some of them are specific to bat):
- Dealing with submodules (of submodules) is not a great experience and adds friction in the contribution/maintenance process. This is kind of an obvious point, but a really big downside in my point of view. For now, contributors to
batcan easily open a new PR on thebatrepo to get a syntax included. With this change, we need two PRs on two separate repos to get a change in. Similarly, if bug reports against a syntax are reported, they will be reported in the downstream project repositories (yes, ideally users would directly report them upstream, but from experience I know that they will not). All of these issues have to be transported "upstream" to the syntaxes repository, duplicating the amount of issues that need to be managed. ChangeLogs and commit messages are similarly affected by this and often need to be duplicated in two places. - (Downstream) testing and syntax files are disconnected: In
bat, we currently have a few tests in place that ensure certain things about the syntax set that is shipped withbat. We are planning to extend these kinds of tests in the future. If we move the syntaxes to another repository, there is a risk of catching bugs "too late" (only after changes have been merged upstream at the point where the submodule is updated). - I currently don't see this as a problem, but there might very well be situations where different downstream projects have conflicing views about the "design" of the syntax repository.
Potential issues:
- "pretty much any .sublime-syntax should be accepted" => this is not how we approach this in
bat. I'm quite hesitant to add new syntaxes due to more runtime overhead (see https://github.com/sharkdp/bat/issues/925#issuecomment-615629015). We currently have the following "policy" in place: syntax request issue template. How would we deal with something like this? We could probably have a list of "prioritized" categories with (1) syntaxes in included insublimehq/Packages, (2) other popular syntaxes, by some measure (3) less popular syntaxes (4) exotic syntaxes. Downstream projects could then decide on which categories to include? - "pretty much any .sublime-syntax should be accepted as long as it's not a duplicate of an existing one" => in
bat, we currently have a few syntaxes that override base syntaxes inPackages. At the moment we have custom JavaScript and Haskell syntaxes, I believe.
In summary, I would currently actually tend towards keeping things as they are. Maintaining bat is a large amount of work already and I see a certain chance that this change would increase the amount of maintenance work. If someone can convince me otherwise, I'm happy to consider implementing this.
Related discussion on the syntect repo: https://github.com/trishume/syntect/issues/168
Pasting my comment from a different issue:
I think the whole community would benefit if you can extract out the syntax highlighting data and maintain it separately (maybe a git submodule?). I haven't see anyone else in the rust ecosystem do it as good as you and they all have issues like this (it works in bat). Let's try to reduce duplication work and keep the efforts in one place.
I understand the concerns about git submodule. I am not knowledgable about syntect, but I would like to enquire about the feasibility of doing the syntax data as a crate? That way, we can keep it in this repo (using cargo workspaces).
Keats originally opened this up about having a central dumping ground for syntaxes which had some downsides (bloat, worse development experience for bat). We've seen though that there are cases where tools are fine "just doing whatever bat does" (broot, delta). I'm yet another person who wants to copy bat in my own tool, git-dive, a potential replacement for git-blame (so maybe https://github.com/sharkdp/bat/issues/1536 won't be needed if this works out?).
delta is taking the approach of depending on bat. That requires pulling in a lot of stuff that I won't be using. The alternative I wanted to propose was one or multiple of
- A syntax and a theme asset crate
- A
HighlightingAssetscrate(s) (unsure if syntax and theme should be split out into separate crates)- Digging into this code, it feels there is a lot of this is generally useful
These crates would reside in the bat repo and be part of a bat workspace. I know managing multiple crates adds extra overhead. cargo-release has been a big help for me and I would be willing to take input for what, if anything, would be needed for bat to use it. Of course, there are also several other tools in that space.
There is still changelog management (I'm hopeful [cargo-changelog])https://crates.io/crates/cargo-changelog) will help) and selecting versions. For versions, I use conventional commit style so I'm less likely to miss breaking changes when choosing my version but I would even be fine if these crates bumped their major version on every release.
@epage Thank you for your input. I'm willing to consider this now that I have some experience with cargo workspaces from another project.
The alternative I wanted to propose was one or multiple of
* A syntax and a theme asset crate * A `HighlightingAssets ` crate(s) (unsure if syntax and theme should be split out into separate crates) * Digging into this code, it feels there is a lot of this is generally useful
I'm not sure I understand what the difference is? What would the first "syntax and theme asset crate" be? A very low-level crate that basically just includes the binary assets and one or two functions to get access to them (in a &[u8] sense)?
While that would allow even more flexibility, I'd still lean towards keeping the number of crates minimal. If we would have a proper cargo workspace though, we could also cleanly separate the bat crate into a library and the CLI.
Yes, the crates would be
- Low level access to just the themes and syntaxes (decompresses if needed so people can integrate it with syntect)
- The current asset loader (from disk or in-memory from above crate(s))
The reason I was thinking multiple crates
- Access to the in-memory syntaxes and themes is more universal than the strategy for loading them from disk but the disk loader is still likely useful enough
- (lower priority) SSGs might want the syntax support but not care about themes
I'm seemingly in the same position as @epage where I just want to do whatever bat does in terms of providing extra syntaxes (and to a lesser extent themes) without having to pull in all of bat as a lib
I'd be up for putting in the work to either provide it as a separate crate(s) in a workspace here, or just providing it as a separate crate on crates.io if it's deemed too much effort to maintain here
I went ahead and threw together a quick crate that roughly has the core functionality that I would want. All it currently provides:
- Ways to load the embedded set of syntaxes, themes, and acknowledgements (with feature flags for each one to avoid pulling in unneeded embedded assets)
- The
LazyThemeSetimplementation instead of syntect'sThemeSetto avoid having to deserialize and decompress the entire set just to use a single theme
Some misc internal changes include switching the create.sh shell file that bat uses to a cargo-xtask task along with changing the license detection logic from matching on snippets to askalono (The crate that powers cargo-deny's license detection logic)
This doesn't add the asset loader logic that bat currently uses, but that should be a pretty simple addition (I just don't need it for my purposes currently)
Re:
I'm only interested in following this route, if it really leads to less maintenance work overall.
It would be pretty easy to switch things over and integrate some variation of my crate into bat, but I'm pretty doubtful that it will reduce the maintenance work. I'd be happy to just provide the crate separately and quietly keep things up to date with bat's assets in the background