tree-sitter-bash Corpus of shell scripts

Hi, I just learned of this project through a reddit comment [1].

I saw that you have some example shell scripts as test cases for the parser. FWIW I have a corpus of 11,568 shell scripts here:

http://www.oilshell.org/blob/wild/

Here's a report from running the Oil parser (which I wrote) over this corpus:

http://www.oilshell.org/release/0.6.pre6/test/wild.wwz/index.html

There is a bunch of background on the blog, for example: http://www.oilshell.org/blog/2017/11/10.html

I put a link to your parser here, along with 3-4 other parsers (although it's not clear how complete they are.)

https://github.com/oilshell/oil/wiki/ExternalResources

[1] https://www.reddit.com/r/commandline/comments/9p6nb2/complete_command_line_flags_in_vimemacs_or/?utm_content=full_comments&utm_medium=message&utm_source=reddit&utm_name=frontpage

Oct 18 '18 16:10 andychu

Hey, that corpus looks like it would be very useful! Currently, I'm still progressing toward getting every file from bash-it parsing correctly. There are still 29 files that have parse errors, out of 277 files total. You can see the current parse output and timings on our CI. Most files parse in less than a millisecond.

Thanks for adding the link! Oil could also be useful as a reference. As far as completeness goes, I think this parser is still significantly less complete than Oil's (or shfmt), but thanks to error recovery it's still good enough to be useful in an editor / IDE.

Oct 18 '18 17:10 maxbrunsfeld

OK interesting, so the reason you're not using shfmt (or Oil) is because you want partial parse results for incomplete code?

I have been thinking about how to add that to Oil. For example, most shells complete variable names like echo ${<TAB>, but they don't use their own parsers to do so! They use ad hoc mechanisms.
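To make "ad hoc" concrete, here is a minimal sketch of completing a variable name with a regex over the text before the cursor, rather than with the shell's real parser. It is not taken from any actual shell; the function name `complete_variable` and the regex are assumptions made up for illustration.

```python
import re

# Matches an unfinished "$NAME" or "${NAME" at the very end of the input line.
# This is a hypothetical pattern for illustration, not what any real shell uses.
_VAR_PREFIX = re.compile(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)?$")


def complete_variable(line, names):
    """Return the variable names that could complete the text before the cursor."""
    match = _VAR_PREFIX.search(line)
    if match is None:
        return []  # the cursor is not inside a variable reference
    prefix = match.group(1) or ""
    return sorted(name for name in names if name.startswith(prefix))


if __name__ == "__main__":
    import os

    # Complete against the current environment's variable names.
    print(complete_variable("echo ${PA", os.environ.keys()))
```

The weakness shows up quickly: the regex has no idea that `echo '$HO` is inside single quotes, where no expansion happens, so it would still offer completions there. A parser that returns partial results for incomplete input could tell the difference.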

I'd be interested in hearing more about the kind of results you need from a bash parser for editor integration.

In other words I think the editor parsing problem occurs in the interactive shell as well!

(also FWIW we are discussing a shell-agnostic autocompletion protocol on https://oilshell.zulipchat.com/, with a zsh dev and Elvish dev. I'm not sure that is completely related to your interests but I thought I would mention it.)

Oct 18 '18 17:10 andychu

> OK interesting, so the reason you're not using shfmt (or Oil) is because you want partial parse results for incomplete code?

There are several reasons why we're using this in Atom as opposed to other parsers:

We need a uniform syntax tree interface that works across many languages, so we can't use single-language parsers like shfmt. We've added support for the Tree-sitter library in the editor, so we can use Tree-sitter parsers for many different languages.
We need a library interface that we can call from JavaScript without the overhead of running a separate GC'd language like Go or Python.
We need parsing to be extremely fast because we do it every time you type a key.
As you said, we need a mostly-complete syntax tree even if the code is invalid.

> In other words I think the editor parsing problem occurs in the interactive shell as well!

That's an interesting point! It'd be great to have better autocomplete in my terminal. The fish shell seems to provide this, though I haven't used it much (since it's not bash-compatible) and am not sure how it works.

Oct 18 '18 17:10 maxbrunsfeld

OK thanks for the explanation, that makes a lot of sense!

Are you going to complete external tools in addition to the shell language (variables, etc.)? That is, when you type ls --<TAB> or git <TAB> in Atom, will it (or does it) complete flags and git subcommands?

Oct 18 '18 17:10 andychu

Atom doesn't have functionality for that, but it'd be very cool to add. I know that the Fish shell does it by invoking the command with the --help flag and parsing the output.

Oct 18 '18 17:10 maxbrunsfeld

Yes, I learned recently that bash dynamically greps the output of ls --help to complete flags. It does it every time you hit TAB.

This is in contrast to zsh, which bakes the possible flags and their descriptions into a completion script. I think fish also does it that way.
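A rough sketch of the dynamic approach, assuming the command supports a `--help` flag and prints its long options in the usual `--name` form. This is not how bash's completion scripts or fish actually implement it; the function names and the regex are hypothetical.

```python
import re
import subprocess

# A long option: "--" followed by a letter or digit, then letters, digits, or dashes.
# The lookbehind keeps us from matching in the middle of a word or a longer dash run.
_LONG_FLAG = re.compile(r"(?<![\w-])--[A-Za-z0-9][A-Za-z0-9-]*")


def flags_from_help_text(help_text):
    """Extract the distinct long flags mentioned in a block of help output."""
    return sorted(set(_LONG_FLAG.findall(help_text)))


def flags_for_command(command):
    """Run `command --help` and scrape its flags (assumes --help is supported)."""
    result = subprocess.run(
        [command, "--help"], capture_output=True, text=True, check=False
    )
    # Some tools print help to stderr, so look at both streams.
    return flags_from_help_text(result.stdout + result.stderr)


def complete_flag(prefix, help_text):
    """Return the flags in help_text that start with what the user typed."""
    return [flag for flag in flags_from_help_text(help_text) if flag.startswith(prefix)]
```

Calling something like `flags_for_command("ls")` on every TAB is the dynamic variant. The baked-in variant would amount to running the extraction once and shipping the resulting list, along with descriptions, in a completion script.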

We're working on a language-server like protocol for shell autocompletion:

https://github.com/oilshell/oil/wiki/Shell-Autocompletion

https://github.com/oilshell/oil/wiki/Shellac-Protocol-Proposal (very early, literally wrote this 2 days ago)

If you're interested in that functionality for Atom, you can talk with us at https://oilshell.zulipchat.com/ (it's public, but unfortunately you need to sign in with GitHub). I am going to contact the authors of some other shells as well to get feedback on whether it's feasible to implement a client.

I want to try implementing this protocol in VimScript too.

Oct 18 '18 19:10 andychu

Found this by accident and I'm curious - how much success are you achieving with a grammar for POSIX Shell and Bash? I can imagine you would end up with a very large and complex grammar to take care of all the edge cases, if it's at all possible.

Oct 19 '18 13:10 mvdan

I'd like for this parser to handle the union of POSIX shell and bash, but I haven't focused on POSIX shell yet, since there are still some problems with our bash parsing.

I think it's doable though. It makes it easier that the parser has error recovery so we don't have to handle every single edge case in order to improve the experience of users editing bash in Atom.

Oct 19 '18 17:10 maxbrunsfeld

Is it possible to build POSIX/sh as a parser and have bash use that as a foundation? It's probably not that useful from a tree-sitter point of view but it is interesting from a shell developer point-of-view.

Mar 11 '21 19:03 docwhat

This corpus was amazing, I can confirm the entire list parses successfully on master, thanks!

Aug 22 '23 20:08 amaanq

@amaanq Wow I just glanced through the commits, it looks like you fixed like 20 or 50 bugs ??? That's very impressive

https://github.com/tree-sitter/tree-sitter-bash/commits/master

They all look similar to corners I encountered for Oil (now renamed Oils) -- looks like a lot of work!

I'm interested in writing some tree-sitter grammars, but I don't know much about it ... I would have thought it's a more lenient parser to begin with, i.e. it accepts more? Because it's used for syntax highlighting and so forth?

I'd be interested in the script that tests this grammar against the corpus, out of curiosity

Thanks for the update!

Aug 22 '23 21:08 andychu

Yeah this repo needed some love :)

About grammars, the tree-sitter docs are a good starting point, but they are a bit lacking in certain special/edge cases. It's really hard to get started imo; it took me a while to grok grammar writing.

It does typically accept more than the language expects. This is mainly because a more lenient grammar produces fewer parser states and a smaller binary. But if being more conforming doesn't add much to the state count, there's no reason not to do it :)

The script is in the repo in script/parse-examples, all I did was unarchive the tarball in /examples.
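For anyone who wants the general shape of such a check without reading the repo, here is a minimal sketch. It is not the repo's script/parse-examples; it assumes the parser command exits non-zero when a file fails to parse, and the function name and the `*.sh` pattern are made up for illustration.

```python
import subprocess
from pathlib import Path


def parse_corpus(root, parser_cmd, pattern="*.sh"):
    """Run parser_cmd on every matching file under root.

    Returns (total, failed_paths). Assumes parser_cmd exits non-zero
    when it cannot parse the file it is given.
    """
    failed = []
    total = 0
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        total += 1
        result = subprocess.run(
            [*parser_cmd, str(path)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        if result.returncode != 0:
            failed.append(path)
    return total, failed


def summarize(total, failed):
    """Format a one-line summary such as '2/10 files failed (20.0%)'."""
    rate = 100.0 * len(failed) / total if total else 0.0
    return f"{len(failed)}/{total} files failed ({rate:.1f}%)"
```

With the Tree-sitter CLI installed, the call might look like `parse_corpus("examples", ["tree-sitter", "parse", "--quiet"])`. Many scripts in a real-world corpus have no `.sh` extension, so the pattern would likely need widening.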

Aug 23 '23 03:08 amaanq

Hm when I look here it's using a couple git repos like Bash-it?

https://github.com/tree-sitter/tree-sitter-bash/blob/master/script/parse-examples

OK I see that's mentioned in the second comment

Our corpus is here - http://www.oilshell.org/blob/wild/ -- that's over 11,000 scripts :)

Anyway I think this could be a useful reference for writing a TreeSitter grammar for YSH (the shell with data tYpes, inspired by pYthon, etc.)

Aug 24 '23 04:08 andychu

Is it on git anywhere? That would make it easier to add to our CI.

Aug 24 '23 04:08 amaanq

Sure, I imported it here

https://github.com/oilshell/wild-corpus

(and I'm impressed by how fast git is at managing all these tiny files)

Aug 24 '23 17:08 andychu

thanks!

Aug 24 '23 19:08 amaanq
