tree-sitter-bash
Corpus of shell scripts
Hi, I just learned of this project through a Reddit comment [1].
I saw that you have some example shell scripts as test cases for the parser. FWIW I have a corpus of 11,568 shell scripts here:
http://www.oilshell.org/blob/wild/
Here's a report from running the Oil parser (which I wrote) over this corpus:
http://www.oilshell.org/release/0.6.pre6/test/wild.wwz/index.html
There is a bunch of background on the blog, for example: http://www.oilshell.org/blog/2017/11/10.html
I put a link to your parser here, along with 3-4 other parsers (although it's not clear how complete they are.)
https://github.com/oilshell/oil/wiki/ExternalResources
[1] https://www.reddit.com/r/commandline/comments/9p6nb2/complete_command_line_flags_in_vimemacs_or/?utm_content=full_comments&utm_medium=message&utm_source=reddit&utm_name=frontpage
Hey, that corpus looks like it would be very useful! Currently, I'm still progressing toward getting every file from bash-it parsing correctly. There are still 29 files that have parse errors, out of 277 files total. You can see the current parse output and timings on our CI. Most files parse in less than a millisecond.
Thanks for adding the link! Oil could also be useful as a reference. As far as completeness, I think this parser is still significantly less complete than Oil's (or shfmt), but thanks to error recovery it's still good enough to be useful in an editor / IDE.
OK interesting, so the reason you're not using shfmt (or Oil) is because you want partial parse results for incomplete code?
I have been thinking about how to add that to Oil. For example, most shells complete variable names like echo ${<TAB>, but they don't use their own parsers to do so! They use ad hoc mechanisms.
I'd be interested in hearing more about the kind of results you need from a bash parser for editor integration.
In other words I think the editor parsing problem occurs in the interactive shell as well!
(also FWIW we are discussing a shell-agnostic autocompletion protocol on https://oilshell.zulipchat.com/, with a zsh dev and Elvish dev. I'm not sure that is completely related to your interests but I thought I would mention it.)
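To make the `echo ${<TAB>` point concrete, here is a minimal Python sketch of what an ad hoc variable completer might look like. It is not taken from any real shell; `complete_variable` and its regex are hypothetical names made up for illustration. It only looks at the text before the cursor and never consults a parser.

```python
import re

# Matches an unfinished "${NAME" at the very end of the typed line.
# The name part is optional so that a bare "${" also matches.
_PARTIAL_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)?$")


def complete_variable(line, variables):
    """Return sorted variable names that could complete a trailing ${prefix.

    `line` is the text typed so far (up to the cursor) and `variables` is
    any iterable of known variable names, for example os.environ.
    """
    match = _PARTIAL_VAR.search(line)
    if match is None:
        return []
    prefix = match.group(1) or ""
    return sorted(name for name in variables if name.startswith(prefix))
```

For instance, `complete_variable("echo ${PA", os.environ)` would list the environment variables whose names start with `PA`. The weakness of this style shows up quickly: the sketch also offers completions for `echo '${PA`, where single quotes prevent any expansion, because a regex on the line tail knows nothing about quoting. That is the kind of case where partial results from a real parser would help.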
OK interesting, so the reason you're not using shfmt (or Oil) is because you want partial parse results for incomplete code?
There are several reasons why we're using this in Atom as opposed to other parsers:
- We need a uniform syntax tree interface that works across many languages, so we can't use single-language parsers like shfmt. We've added support for the Tree-sitter library in the editor, so we can use Tree-sitter parsers for many different languages.
- We need a library interface that we can call from JavaScript without the overhead of running a separate GC'd language like Go or Python.
- We need parsing to be extremely fast because we do it every time you type a key.
- As you said, we need a mostly-complete syntax tree even if the code is invalid.
In other words I think the editor parsing problem occurs in the interactive shell as well!
That's an interesting point! It'd be great to have better autocomplete in my terminal. The fish shell seems to provide this, though I haven't used it much (since it's not bash-compatible) and am not sure how it works.
OK thanks for the explanation, that makes a lot of sense!
Are you going to complete external tools in addition to the shell language (variables, etc.)? That is, when you type ls --<TAB> or git <TAB> in Atom, will it (or does it) complete flags and git subcommands?
Atom doesn't have functionality for that, but it'd be very cool to add. I know that the Fish shell does it by invoking the command with the --help flag and parsing the output.
Yes, I learned recently that bash dynamically greps the output of ls --help to complete flags. It does it every time you hit TAB.
This is in contrast to zsh, which bakes the possible flags and their descriptions into a completion script. I think fish also does it that way.
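A rough Python sketch of the dynamic approach, under stated assumptions: the function names and the regex are mine, not code from bash, fish, or zsh, and real help output is messier than the sample. The idea is just to run the command with --help, scrape anything that looks like a long flag, and filter by what the user has typed.

```python
import re
import subprocess

# A long flag: two dashes not preceded by a word character or another dash,
# then a letter or digit, then any run of letters, digits, or dashes.
_LONG_FLAG = re.compile(r"(?<![\w-])--[A-Za-z0-9][A-Za-z0-9-]*")

# Hand-written sample in the style of GNU ls; not captured from a real run.
SAMPLE_HELP = """\
Usage: ls [OPTION]... [FILE]...
  -a, --all                  do not ignore entries starting with .
  -A, --almost-all           do not list implied . and ..
      --color[=WHEN]         colorize the output
"""


def help_output(command):
    """Run `command --help` and return whatever it printed (stdout + stderr)."""
    result = subprocess.run([command, "--help"], capture_output=True, text=True)
    return result.stdout + result.stderr


def flags_from_help(help_text):
    """Return the unique long flags found in help_text, in first-seen order."""
    seen = {}
    for flag in _LONG_FLAG.findall(help_text):
        seen.setdefault(flag, None)
    return list(seen)


def complete_flag(prefix, help_text):
    """Return the flags from help_text that start with the typed prefix."""
    return [flag for flag in flags_from_help(help_text) if flag.startswith(prefix)]
```

With the sample text, `complete_flag("--al", SAMPLE_HELP)` gives `["--all", "--almost-all"]`. Doing this on every TAB means calling something like `help_output("ls")` each time. The baked-in style would instead store the flag list (and descriptions) ahead of time and skip the subprocess.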
We're working on a language-server like protocol for shell autocompletion:
https://github.com/oilshell/oil/wiki/Shell-Autocompletion
https://github.com/oilshell/oil/wiki/Shellac-Protocol-Proposal (very early, literally wrote this 2 days ago)
If you're interested in that functionality for Atom, you can talk with us at https://oilshell.zulipchat.com/ (it's public but you need to sign in with Github unfortunately). I am going to contact the authors of some other shells as well to get feedback on whether it's feasible to implement a client.
I want to try implementing this protocol in VimScript too.
Found this by accident and I'm curious - how much success are you achieving with a grammar for POSIX Shell and Bash? I can imagine you would end up with a very large and complex grammar to take care of all the edge cases, if it's at all possible.
I'd like for this parser to handle the union of POSIX shell and bash, but I haven't focused on POSIX shell yet, since there are still some problems with our bash parsing.
I think it's doable though. It makes it easier that the parser has error recovery so we don't have to handle every single edge case in order to improve the experience of users editing bash in Atom.
Is it possible to build POSIX/sh as a parser and have bash use that as a foundation? It's probably not that useful from a tree-sitter point of view but it is interesting from a shell developer point-of-view.
This corpus was amazing, I can confirm the entire list parses successfully on master, thanks!
@amaanq Wow I just glanced through the commits, it looks like you fixed like 20 or 50 bugs ??? That's very impressive
https://github.com/tree-sitter/tree-sitter-bash/commits/master
They all look similar to corners I encountered for Oil (now renamed Oils) -- looks like a lot of work!
I'm interested in writing some tree-sitter grammars, but I don't know much about it ... I would have thought it's a more lenient parser to begin with, i.e. it accepts more? Because it's used for syntax highlighting and so forth?
I'd be interested in the script that tests this grammar against the corpus, out of curiosity
Thanks for the update!
Yeah this repo needed some love :)
About grammars, the tree-sitter docs are a good starting point, but they are a bit lacking in certain special/edge cases, and it's really hard to get started imo, it took me a while to grok grammar writing.
It does typically accept more than the language expects, mainly because that produces fewer parser states and a smaller binary - but if being more conforming doesn't add much to the state count, there's no reason not to add it :)
The script is in the repo in script/parse-examples, all I did was unarchive the tarball in /examples.
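For anyone who wants the general shape of such a corpus check without reading the repo script, here is a hedged Python sketch. It is not script/parse-examples; the function names are hypothetical, and the `tree-sitter parse --quiet` command (plus the assumption that it exits non-zero on a parse error) should be checked against your installed CLI.

```python
import subprocess
from pathlib import Path

# Assumption: a tree-sitter CLI is on PATH and this is run from the grammar
# directory. Swap in any parser command that exits non-zero on failure.
DEFAULT_COMMAND = ["tree-sitter", "parse", "--quiet"]


def parses_cleanly(path, command=DEFAULT_COMMAND):
    """Run the parser command on one file; exit status 0 counts as success."""
    result = subprocess.run(
        [*command, str(path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_corpus(root, check=parses_cleanly):
    """Apply `check` to every regular file under root.

    Returns (total_file_count, list_of_paths_that_failed).
    """
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    failures = [p for p in files if not check(p)]
    return len(files), failures
```

Calling `check_corpus("examples")` after unpacking a corpus into that directory would return the file count and the paths that failed, which is enough to print a "N of M files have parse errors" line in CI. Passing a different `check` function makes it easy to try another parser on the same files.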
Hm when I look here it's using a couple git repos like Bash-it?
https://github.com/tree-sitter/tree-sitter-bash/blob/master/script/parse-examples
OK I see that's mentioned in the second comment
Our corpus is here - http://www.oilshell.org/blob/wild/ -- that's over 11,000 scripts :)
Anyway I think this could be a useful reference for writing a TreeSitter grammar for YSH (the shell with data tYpes, inspired by pYthon, etc.)
is it on git anywhere? That'd be preferred to add to our CI
Sure, I imported it here
https://github.com/oilshell/wild-corpus
(and I'm impressed how fast git is at managing all these tiny files)
thanks!