sh icon indicating copy to clipboard operation
sh copied to clipboard

interp: consider adding some coreutils as builtins

Open mvdan opened this issue 8 years ago • 27 comments

coreutils as of 8.27 has a total of 103 executables, ranging from wc to test and sort.

Plenty of scripts depend on these. If we really want interp to be platform-independent, we will have to implement some of these as builtins as they won't be available in some systems like Windows (and Mac?).

Bash 4.4 already has a few as builtins:

 $ for c in $(pacman -Ql coreutils | grep '\/bin' | sed 's@.*/@@'); do type -a $c; done | grep builtin
[ is a shell builtin
echo is a shell builtin
false is a shell builtin
printf is a shell builtin
pwd is a shell builtin
test is a shell builtin
true is a shell builtin

We already have all of these except for [ and test - see #92.

I obviously won't implement all 103 in one go, but I could start with the most common and simple. Unfortunately, these being GNU programs they all have tons of options and gotchas, like cat having a dozen options.

One open question is whether these should always be builtins like echo. Other options include:

  • Use them as fallback only if the system version is not available
  • Enable them via a flag, like "extended builtins"

For the sake of simplicity, I'm inclined towards them always being builtins. The only downside is scripts that depend on GNU flags that are somewhat obscure. But these scripts wouldn't be portable to begin with.

mvdan avatar Apr 25 '17 13:04 mvdan

A good way to make the list of 103 executable shorter is to filter it through the list of POSIX Shell utilities, which has 160 executables: http://pubs.opengroup.org/onlinepubs/9699919799/idx/utilities.html

Those docs are also a good start, as each command is simpler and has less options - cat only has a single option.

The number of executables that are both in coreutils and in the POSIX utilities list are 60:

$ comm -12 <(pacman -Ql coreutils | grep '\/bin' | sed 's@.*/@@') posix | column -c 70
basename        echo            mv              stty
cat             env             nice            tail
chgrp           expand          nl              tee
chmod           expr            nohup           test
chown           false           od              touch
cksum           fold            paste           tr
comm            head            pathchk         true
cp              id              pr              tsort
csplit          join            printf          tty
cut             link            pwd             uname
date            ln              rm              unexpand
dd              logname         rmdir           uniq
df              ls              sleep           unlink
dirname         mkdir           sort            wc
du              mkfifo          split           who

Some of these like true and echo we already have. And most of the ones we don't yet have are common and seem fairly simple to implement.

mvdan avatar Apr 25 '17 14:04 mvdan

@andreynering this might be of interest to you if you want unix scripts to work on Windows

mvdan avatar Apr 25 '17 14:04 mvdan

@mvdan Yeah, probably it's a lot of work, but would definitely be helpful.

The most basic buildins would already be very helpful:

  • rm
  • cp
  • mv
  • mkdir
  • ls
  • touch
  • chmod

andreynering avatar Apr 25 '17 16:04 andreynering

Also worth noting that there are some fairly common commands that aren't part of coreutils. sed and grep come to mind, for example.

mvdan avatar Apr 25 '17 17:04 mvdan

@andreynering see the issue above - I'm currently looking at piggybacking off someone else's coreutils implementation.

mvdan avatar Jan 12 '18 11:01 mvdan

It looks like that coreutils project has lost traction, so it could be years until anyone completes it and makes it importable. I see two alternatives.

Option one: we add an easy way to bundle a coreutils implementation like BusyBox, https://github.com/uutils/coreutils, or any other implementation that can be bundled as a single static binary. Then, one could include said static binary alongside the Go binary, or inside the Go binary as a compressed variable/constant somewhere.

Option two: we implement the most common coreutils tools ourselves.

Option one is much, much less work. I definitely don't want to reimplement coreutils, especially not with all of GNU's knobs and portability nightmares.

I realise that option one won't be easy to set up, may be slower than native Go implementations, and may make binaries larger. However, it's a much saner option in the long run, one can choose the coreutils implementation, and it still solves the portability issue, which is the main concern.

If anyone would like to work on it, please speak up. The API would likely be something like:

// ExecWithBusybox will use a busybox-like binary at path to execute all programs belonging to a
// busybox implementation. For example, executing {"cp", "foo", "bar"} would actually execute
// {"/path/to/busybox", "cp", "foo", "bar"}.
func ExecWithBusybox(path string, next ModuleExec) ModuleExec

Then it would be up to the application to decide how to bundle the busybox binary itself. I don't think that implementation detail belongs in our API.

mvdan avatar Oct 28 '18 17:10 mvdan

I'd like to throw in another idea which makes implementing both options viable: Allow custom builtins.

API could look something like:

type Builtin func(ctxt context.Context, argv []string) error
func (r *Runner) SetBuiltin(name string, handler Builtin)  (old Builtin)

The handler could then exec busybox or call into an coreutils like API.

ebfe avatar Oct 28 '18 17:10 ebfe

Is BusyBox compatible with Windows? I don't think so.

Another alternative would be forking the go-coreutils project and make some of the already implemented tools (mv, ls, rm, cp, etc...) importable.

andreynering avatar Oct 28 '18 18:10 andreynering

Is BusyBox compatible with Windows? I don't think so.

Good point. However, note that more modern implementations like uutils do support Windows, so that should be a better option if you prioritise portability.

Another alternative would be forking the go-coreutils project

The amount of work required would be non-trivial, and I don't intend to maintain such a project, so I'll give that a pass myself :) It's also not directly related to a shell package, so it doesn't need to live here. Others are welcome to do that work separately, of course.

func (r *Runner) SetBuiltin(name string, handler Builtin) (old Builtin)

I'm starting to think that the current ModuleExec API is enough to do what you want here. For example, all that your wrapper would do is (ignoring the old bit):

func (r *Runner) SetBuiltin(name string, handler Builtin) {
    r.Exec = ModuleExec(func(ctx context.Context, path string, args []string) error {
        if args[0] == name {
            return handler(ctx, args)
        }
        return r.Exec(ctx, path, args)
    }
})

Of course, that doesn't really scale if one has hundreds of overriding layers like these. But one could have a single layer for all the busybox "builtins", which is kinda like what I was suggesting earlier. The earlier version would be like:

func ExecWithBusybox(path string, next ModuleExec) ModuleExec {
    return ModuleExec(func(ctx context.Context, path string, args []string) error {
        switch args[0] {
        case "cp", "all-other-coreutils":
            mc, _ := FromModuleContext(ctx)
            cmd := exec.CommandContext(ctx, path, args)
            cmd.Dir = mc.Dir
            cmd.Stdin = mc.Stdin
            cmd.Stdout = mc.Stdout
            cmd.Stderr = mc.Stderr
            // etc etc
            return cmd.Run()
        }
        return next(ctx, path, args)
    })
}

So, really, both APIs be implemented outside of the interp package. I think ExecWithBusybox is a bit more complex, but they're both just thin wrappers over what one can already do with ModuleExec. So perhaps we shouldn't do anything after all.

mvdan avatar Oct 28 '18 20:10 mvdan

func (r *Runner) SetBuiltin(name string, handler Builtin) (old Builtin)

I'm starting to think that the current ModuleExec API is enough to do what you want here.

I don't think ModuleExec can override the existing builtins. (Though I'm unsure how useful this is outside maybe trap and adding currently unimplemented ones).

Of course, that doesn't really scale if one has hundreds of overriding layers like these.

I was thinking more of something like this, which doesn't stack up multiple layers:

type Runner struct {
	[...]
	builtins map[string]Builtin
	[...]
}

func (r *Runner) SetBuiltin(name, handler Builtin) (Builtin) {
	// Or just expose the builtins map.
	old = r[name]
	r.builtins[name] = handler
	return old
}

and adjusting isBuiltin/builtinCode use the map.

ebfe avatar Oct 28 '18 21:10 ebfe

Ah, I had forgotten that builtins were hard-coded into the interpreter.

At first I wanted to leave that to ModuleExec, but I figured that it would make the module too complex. DefaultExec already does enough stuff - imagine if it had to also take care of builtins. I was worried that it would be too hard for others to replace r.Exec, since I imagined that most people didn't care about builtins.

But perhaps we could layer them too. That is, declare DefaultExec as var DefaultExec = ExecWithBuiltins(ExecStartProcess) (where ExecStartProcess would be the current DefaultExec, just using os/exec).

Then it would be possible to bypass the entirety of the builtins - one just has to do r.Exec = ExecStartProcess or r.Exec = myBuiltins(ExecStartProcess). And overriding a single builtin would still be possible, via r.Exec = myCpBuiltin(ExecWithBuiltins(ExecStartProcess))).

This seems like it exposes less API, and by folding builtins into ModuleExec, makes that module more powerful.

mvdan avatar Oct 28 '18 21:10 mvdan

I'd like to have more time to help this move forward, but as for everyone, time is limited. Anyway, bringing this some discussion may be helpful.

Someone in the Task repo mentioned that https://github.com/u-root/u-root has many builtins implemented in Go. This folder contains the main packages. In some cases the core implementation was done on a separated package here, which means we could import and use them.

As I said before, having just the basic builtins like cp, mv, rm and mkdir would add a huge value for those on Windows as these are the builtins used 90% of the time, at least on automation tools like Task.

@mvdan What would you expect of someone trying to move this forward? Would it be acceptable to import this project and use it directly on interp/builtin.go? Or maybe you prefer to have an abstraction layer using ModuleExec and move that into another package?

andreynering avatar Sep 06 '21 14:09 andreynering

I had seen the u-root coreutils, but I did not know they were exposed as non-main packages too. That sounds wonderful.

Their go.mod is multiple times as big as ours though, so I'm uneasy about adding a direct dependency. Even with Go 1.17's trimmed module graphs, forcing anyone that imports the interpreter to depend on u-root and its dependencies feels overkill.

That said, I wouldn't oppose a subpackage, like mvdan.cc/sh/v3/interp/coreutils exposing a ModuleExec API. That way, it will be possible to import the vanilla interpreter, and thanks to trimmed module graphs, one shouldn't even need to download u-root unless necessary.

mvdan avatar Sep 07 '21 21:09 mvdan

I started to look into u-root as a solution, as described above. See https://github.com/u-root/u-root/issues/2527. I'll wait another week or so to see if they get back to me, as it would be best to coordinate the needed changes with them.

mvdan avatar Oct 19 '22 13:10 mvdan

But perhaps we could layer them too. That is, declare DefaultExec as var DefaultExec = ExecWithBuiltins(ExecStartProcess) (where ExecStartProcess would be the current DefaultExec, just using os/exec).

Unfortunately, layering like that with the existing ExecHandler API won't work, because the shell needs to know the set of builtins for commands like type or builtin. But perhaps a thought for v4 - rethink the "exec handler" so that it can absorb builtins entirely. Then the user can replace or extend the builtins as they please.

mvdan avatar Oct 23 '22 13:10 mvdan

I'm actually not sure why we were talking about builtins above. We can support providing coreutils implemented in Go without treating them as builtins, via ExecHandler. They would not execute a real program, instead running a bit of Go code. They wouldn't work if one ran builtin cp, but I don't think anyone would expect that - they are not actually builtins per the POSIX spec or Bash manual.

Anyway, I made some progress on this over the last couple of weeks. Thanks to @JohnHardy for nudging me along :) I made cp importable in u-root, which unfortunately required a non-trivial refactor, since half of the implementation was in a package main. See https://github.com/mvdan/u-root-coreutils/tree/coreutils.

You can see our side of the changes at https://github.com/mvdan/sh/tree/93-coreutils. The two changes are:

  • A new package, interp/coreutils, which exposes a func Handle of type interp.ExecHandlerFunc.
  • Allow chaining multiple exec handlers via interp.ExecHandlers, where one returning interp.SkipHandler means that we run the next one rather than stopping. This is helpful for coreutils, as one may want to attempt to run a bundled coreutil, falling back to executing a program normally.

I'll probably add support for similar chaining in the other handlers, because "run some logic in some cases, fall back to the default logic in others" is a rather common need.

mvdan avatar Dec 15 '22 23:12 mvdan

I forgot to mention: I am looking for feedback on this design, as well as the added dependency on u-root, before I spend more time modifying u-root to export more of their coreutils. Note that the added dependency is somewhat large, but it's only pulled in if one explicitly imports the new coreutils package.

cc @andreynering for go-task, as well as @zimbatm @riacataquian @ebfe @theclapp

mvdan avatar Dec 15 '22 23:12 mvdan

Err, worth flagging that the u-root Go module currently depends on our module:

https://github.com/u-root/u-root/blob/904692535c70f103396524ae535a2e7bc89cb75a/go.mod#L44

If we upstream our changes in the future, then we'd have a cyclic module dependency. Allowed, but still icky. I might have to make coreutils a nested or separate Go module for the sake of avoiding the cycle. The alternative would be to ask u-root to split their module in two, so that their shell on top of our interpreter isn't in the same module as their coreutils, but that seems somewhat invasive as a request.

mvdan avatar Dec 15 '22 23:12 mvdan

Yet another fun quirk: u-root appear to not support Windows at all: https://github.com/mvdan/sh/actions/runs/3708802693/jobs/6286748332

That's probably fine for the project in general, but it would be nice if their coreutils were somewhat portable. That could be more work in upstream, if they'd allow it. It's not great news given that I think the main driver for this feature request is Windows portability.

mvdan avatar Dec 15 '22 23:12 mvdan

Thanks @mvdan for taking the time to work on this!

I have the impression that bringing support to Windows should be easy, at least with regard to fixing that particular error.

It's looks to me like just a matter of declaring a _windows.go version of this file:

  • https://github.com/mvdan/u-root-coreutils/blob/20b984a9fd5cbfb3f1878f82b8e50fc1ec2d3a07/pkg/cp/cmd_plan9.go#L9
  • https://github.com/mvdan/u-root-coreutils/blob/20b984a9fd5cbfb3f1878f82b8e50fc1ec2d3a07/pkg/cp/cmd_others.go#L12

andreynering avatar Dec 16 '22 11:12 andreynering

Yeah, this particular one is easy to fix. The tricky part will be whether upstream cares about testing and supporting these coreutils on Windows. If they do not, it might fall on us to do that, and they might break the coreutils at any point due to lack of CI for Windows.

mvdan avatar Dec 19 '22 12:12 mvdan

Allow chaining multiple exec handlers via interp.ExecHandlers, where one returning interp.SkipHandler means that we run the next one rather than stopping. This is helpful for coreutils, as one may want to attempt to run a bundled coreutil, falling back to executing a program normally.

This is now https://github.com/mvdan/sh/pull/964. I wanted to get it reviewed and merged before I send the first PR for coreutils.

mvdan avatar Jan 15 '23 15:01 mvdan

The ExecHandlers API is now merged, I improved my u-root soft fork to also pass the current directory, and the 93-coreutils branch is updated with both of those changes. The tests now work :)

Posted another comment at https://github.com/u-root/u-root/issues/2527#issuecomment-1399551639 to hopefully start the upstreaming process. They seem open to it.

mvdan avatar Jan 22 '23 17:01 mvdan

Also, I'm pretty happy with the coreutils API in this module. You can see it at https://github.com/mvdan/sh/compare/master...93-coreutils. It's currently one sub-package in the same module, but it might end up being a separate module (see my thoughts in https://github.com/u-root/u-root/issues/2527#issuecomment-1399563006). In any case, @andreynering, I'd like to hear your thoughts as a future API user.

mvdan avatar Jan 22 '23 18:01 mvdan

The code looks great to me.

If it helps to prevent any problems, I see no problem in making it a separate module. In this case, it's important to add some documentation with the link to the README, so people know it exists.

andreynering avatar Jan 23 '23 12:01 andreynering

A separate package or module is indeed a bit harder to find, so I'd add a link in the interp package godoc. I don't think I want to add more content to the README specific to just one or two packages. The README is large enough as it is :)

mvdan avatar Jan 23 '23 16:01 mvdan

Any news on this? Thanks 🙏

leaanthony avatar Jan 29 '24 09:01 leaanthony