tinygo icon indicating copy to clipboard operation
tinygo copied to clipboard

wasm: improve malloc/free heap tracking

Open aykevl opened this issue 3 years ago • 21 comments
trafficstars

Using a slice requires a lot less in code size than a map. This is visible when compiling a very small "hello world" style program.

Before tracking memory in malloc/free: 2873 bytes With tracking using a map: 6551 bytes With a slice instead of a map: 3532 bytes

Of course, most of this code size increase won't be visible with https://github.com/tinygo-org/tinygo/pull/3142, but it's still a saving of around 3kB in this minimal example.

aykevl avatar Sep 15 '22 10:09 aykevl

Should we be concerned about the change from O(1) to O(n) time for tracking allocations?

dgryski avatar Sep 15 '22 14:09 dgryski

Perhaps. But the GC itself is also rather inefficient. We could of course switch this algorithm to an algorithm with the same complexity as the one in the GC (which is probably only marginally better).

aykevl avatar Sep 15 '22 15:09 aykevl

Fair enough. Let's leave the simple implementation for now and if it becomes an issue we can maybe make a build-tag to drop in a map-based implementation for people who care about it.

dgryski avatar Sep 15 '22 16:09 dgryski

Looks like this might be an actual error? https://github.com/tinygo-org/tinygo/actions/runs/3059652993/jobs/4943124870#step:9:58

deadprogram avatar Sep 15 '22 17:09 deadprogram

Yes, that looks like a real bug. I'm glad #3148 added some tests!

aykevl avatar Sep 15 '22 20:09 aykevl

I think the tension here, correct me if I'm wrong, is basically you can't override this (a part of how GC works) with build tags until tinygo adds them. chicken egg. Then, the next tension is proving the performance is worse vs having the next impl prove it isn't worse.

Strings are routinely used in plugins. ex regular expression matches etc. It would help in the future balance size/perf in a less conjecturey way. Ex place a representative main.go here that exports regex functions, and benchmark using them. I think that would help balance some discussions in general. A bias towards size can exist with guards against severe performance regressions.

my 2p

codefromthecrypt avatar Sep 16 '22 01:09 codefromthecrypt

Is it possible to reference the -opt flag and pick between slice and map?

It is possible, but wouldn't help much when it comes to algorithmic complexity. The heap allocator itself is also O(n) in the worst case, where n is the number of blocks (heap size divided by 16, usually). This is why I expect this PR won't have a lot of effect in practice.

Generally when it comes to performance, I want to act on data. In this case, I do have data on code size (because it is easy to measure) but I do not have data on performance. We can always change the implementation if benchmarks suggest this is a bottleneck.

aykevl avatar Sep 29 '22 11:09 aykevl

Generally when it comes to performance, I want to act on data. In this case, I do have data on code size (because it is easy to measure) but I do not have data on performance.

I do agree with the sentiment but feel that if there are clear pathological cases then those are also similarly clear.

But I didn't know allocation is similarly pathological - in that case indeed this does not need a change.

Is it fair to say TinyGo's wasi target shouldn't be used in cases where performance is a concern given the allocator's limitations?

anuraaga avatar Sep 29 '22 14:09 anuraaga

If performance is a concern, maybe you shouldn't be using WASM...

To be fair, we have the gc disabled for some of our services: -gc=leaking is fine for short-lived processes if their initial heap is set correctly.

dgryski avatar Sep 29 '22 14:09 dgryski

If performance is a concern, maybe you shouldn't be using WASM...

WASM is the only way to extend Envoy for now so we're focusing on it for better or worse 😅 Unfortunately a leaking mode also isn't supported there, but maybe it should be.

For a bit of context, we're trying to bring Coraza WAF to Envoy via WASM - for performance reasons, we've swapped out several libraries from Go to C(++) or Rust

https://github.com/anuraaga/coraza-wasm-filter/tree/main/lib

The biggest was regexp itself with ~5x overall performance improvement from swapping in libre2. But other's like aho-corasick also provide 20% - the numbers are e2e not microbenchmark.

I think this means that while native will of course perform the best, wasm does have potential for reasonable performance. While I am curious what may be causing such drastic performance difference, if TinyGo itself is designed in a way that we expect lower performance than other compilers, it wouldn't be productive to do a deep dive. So I'm just trying to understand the vision there.

If small code size is the end all (as tiny would represent which makes a lot of sense!) then I would suggest that browser rather than wasi would be the target that fits better with the vision. But that's just a straw-man's proposal. Sorry if too much noise

anuraaga avatar Sep 29 '22 15:09 anuraaga

Performance of TinyGo compiled code varies a lot. I've seen cases where TinyGo outperformed standard Go by a wide margin (for integer heavy code that doesn't allocate memory). For other cases, TinyGo will be a lot slower. In general, you can expect C like code to perform well but code that uses the heap, goroutines, etc to perform much worse.

In particular, for WebAssembly, TinyGo has to jump through hoops to get it supported. WebAssembly is just a really weird instruction set that is incredibly difficult for a Go-like language to target whereas other architectures are much simpler to support. This is getting improved slowly (with exception handling, stack switching, and a GC) but this will likely take years. Once these features are supported in WebAssembly and used by TinyGo, I expect TinyGo binaries will perform not very different from C and Rust (perhaps a bit slower but not by much). Until then, we have to keep our workarounds in place which inevitably means code will run slower.

That said, the GC could certainly be improved. It works, but it wasn't optimized for speed (for example, the heap is fully conservative). I believe @dgryski is investigating how to improve this.

aykevl avatar Sep 29 '22 15:09 aykevl

The two performance improvements I see for the current garbage collector (without a large scale rewrite) are:

  1. increase the size of the mark work queue; an overflow requires an additional scan of the entire heap
  2. make note of which allocations do not contain any pointers and this don't require scanning

The first one is easy (although a bit tricky to get right so as not to affect performance on low-end embedded machines) and the second one requires a bit more work to to track the newly required information.

dgryski avatar Sep 29 '22 19:09 dgryski

Thanks for the context everyone - definitely would be nice if the GC spec goes somewhere especially for polyglot! Those GC optimizations, especially the second one seems quite compelling too.

anuraaga avatar Sep 30 '22 01:09 anuraaga

TinyGo is likely to benefit from the stack switching proposal (for goroutines) but not the garbage collection one (because Go supports interior pointers which the proposal doesn't allow.)

dgryski avatar Sep 30 '22 01:09 dgryski

but not the garbage collection one (because Go supports interior pointers which the proposal doesn't allow.)

It can probably still use the WebAssembly GC if all struct methods (etc) are made heap objects. For example, a struct like this:

type Point struct {
    X, Y int
}

could be compiled like this:

type Point struct {
    X, Y *int
}

This way, it is possible to have interior pointers at the cost of a large heap increase. This will be even worse for things like byte slices (where it is possible to construct a pointer to each individual byte) but there may be workarounds for that too, like fat pointers. Still, it'll be rather difficult to get this working.

aykevl avatar Sep 30 '22 12:09 aykevl

@anuraaga one way you could help here is by making a good (realistic) GC benchmark. Having a good benchmark would help a lot to improve the GC.

aykevl avatar Sep 30 '22 12:09 aykevl

Updated the PR. The PR was failing because the tests actually contained a bug: they passed pointers around as uintptr. The uintptr type is an integer, not a pointer, so the GC wasn't tracking it and the memory was freed before all references were gone. The first commit fixes the bug, the second commit is the original commit of this PR.

aykevl avatar Sep 30 '22 12:09 aykevl

One of the gc-heavy benchmarks I've been playing with is the binarytrees program from the Benchmarks Game: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-go-2.html

dgryski avatar Sep 30 '22 16:09 dgryski

The uintptr type is an integer, not a pointer, so the GC wasn't tracking it and the memory was freed before all references were gone.

This was intentional - the tests are calling malloc and the pointer should be valid until free is called even treated as an integer. If this code was c++ it would not be tracked either, hence why the allocator needs to track.

I'm not sure why but I believe I had issues using unsafe.Pointer in trackAlloc, it had to be []byte to work.

anuraaga avatar Sep 30 '22 23:09 anuraaga

This was intentional - the tests are calling malloc and the pointer should be valid until free is called even treated as an integer.

...you are entirely correct. Yes, the pointer should remain valid even if it is treated as uintptr.

I'm not sure why but I believe I had issues using unsafe.Pointer in trackAlloc, it had to be []byte to work.

That's interesting, sounds like a bug actually. I'll need to investigate this.

aykevl avatar Oct 01 '22 00:10 aykevl