go-ipld-prime Condensing codegen output size

Condensing codegen output size

Open warpfork opened this issue 4 years ago • 1 comments

Let's talk about the codegen output size.

Codegen is a useful tool. However, the lighter its requirements, the more useful and widely usable it becomes. Output size, if it grows out of hand, can limit the usability of codegen systems. So, we should keep an eye on output size, and try to minimize it where possible.

what is output size, what does it affect, and how do we prioritize it?

Output size can refer to one of two things:

the size of the generated source code (I'll often refer to this as GSLOC, Generated Source Lines of Code, for short; though bytes may be a better measure than line count).
the size of the final binary assembly (in some architecture; results may vary) (I'll sometimes refer to this as AS, Assembly Size, for short).

The generated source size can affect a couple of things:

user impressions! (things that spit out large amounts of code are more scary)
how version control handles this information
- is the sheer size a concern? (can be mitigated if diffs from changes are minimal)
- are the diffs readable? (not always a concern; but, it's also certainly nice if some of it is readable)
how navigable the output is
- can IDEs reasonably usefully jump the user to definitions? (will the file be too big to open quickly?)

Output size is not our only priority of course: there are a number of priorities we balance in the design of codegen. A HACKME_tradeoffs document in the codegen directory has a nice enumeration of these values (and that document also talks about which ones have been prioritized).

I suspect we're going to find that in this implementation, we mostly care about generated source size, and mostly for its UX reasons. (This implementation is meant to be very featureful; if someone has a hard constraint on assembly output size, it's likely they either won't use codegen at all (schema systems do have runtime implementations too!) or will use a different codegen implementation entirely that was designed with a maximum priority on assembly size[‡].) But we can still try to minimize all dimensions of output size.

There's probably lots of stuff we can do to improve those two traits, and much of it which won't trade down on other values (such as speed) as long as we implement it well. But... what, exactly?

how can we measure this?

Before we go on to ideas for improvements, let's talk about how to measure this.

Total size is fairly easy:

for source size: run gen; look at the output file sizes!
for assembly size: run go build -o, look at the output file size. (This is arguably an approximation, but probably a survivable one. We should probably primarily track linux-amd64 just for consistency; beyond that, as far as I know, I'm willing to just hope other architectures won't have orders-of-mag different results.)
- I see some tools like https://github.com/jondot/goweight exist for this, though I haven't reviewed to see how much value they add over just scripting it in quick bash.

Figuring out where to actually target for improvements: harder:

for source size:
- we do have individual methods responsible for generating most parts, so, we could write custom tests which count their cumulative output size as we generate, and print stats at the end. Sort of limited utility since it presumes all these method boundaries match what we could dedup, which they... don't. But it might be a start!
- I wonder if there's any handy library for doing something "like gzip, but just spit me out the table of what's common".
- simplest, if nonscientific: just throw your scroll wheel around in the consolidated output file and sample what you land on most often.
- ...others?
for binary size:
- I could imagine writing some scripts which take -gcflags -S output, separate by function and type name, and do some simple counting of instruction count. (Does tooling like this exist already? Or are there easier ways to hook for this kind of info?)
  - Bonus: would also be able to smoothly grow this to report things like how much of the weight of a function is actually inlined from other functions, etc. (Would require more work than just basic parsing, but all the info's there.) (Whether inlining happens I know there are many ways to ask about, but this is the way that would also indicate assembly size outcome.)
- I see some tools like https://github.com/bradfitz/shotizam which look pretty neato. Might be worth looking into.
- ...others?

I'd suggest we might use the schema-schema as a corpus for these tests. Other cases and some reduced cases will also be good for study. The schema-schema provides a good holistic example for a large synthesized real system.

ideas for improvements

Some things that come to mind as very likely to produce improvements:

All the state machine switches around ma.state and la.state in assemblers are very redundant right now. Extracting them to helper methods that take params of pointers to the relevant state might be possible and should reduce GSLOC significantly.
- Doing this without a performance hit may require measurement and care, but if it works, it should be easy to implement.
- Unclear if this will result in AS reduction even if it results in GSLOC reduction (compiler may inline a lot of it; and in fact we'll often hope it will, for runtime speed reasons).
- Some of these switches might be more complicated to extract than others: some just validate simple state and panic; others contain a need to call a valueFinishTidy helper function that's type-specific and really ought to be concrete for inlining/performance reasons, and that'll be tricky.
Same as above but with na.m switches. Assembler state transitions based on the 'maybe' state are very redundant right now.
- Similar, but possibly a bit trickier, to the effort for the ma.state swtich extraction. (Sometimes the na.m switches return ErrWrongKind, which contains more specific info, and might complicate efficient extraction.)
A bunch of AssignNode bodies have very duplicated output that should be easy to extract. Especially the non-type-specific fallback paths.
- Much of this is using generic Node interfaces anyway, so extracting it is trivial and should have no performance hit.
A bunch of AssignNull bodies might be in a similar realm as AssignNode: fairly duplicated bodies.
- I've checked this less, but it there's probably something here. (Might also turn out to be already covered by a subset of "extract na.m switches".)
Methods which are present just to suit the Node and NodeAssembler interfaces, but are just stubs that emit a constant ErrWrongKind, are very unfortunate and definitely a large weight.
- Hard to do something about this.
  - We currently have schema type names in these error messages. Reducing the GSLOC without dropping this information is... tricky.
- Dropping the type information from the errors might be something we have to consider if the size cost of these is dire enough.
- Trying to use feature detection interfaces to dodge the needs for these methods entirely is an interesting research problem, but seems unlikely to fly (would add a lot of interface checks; not likely to be tenable for speed).
Many errors could be extracted and given short names (in some cases, extracted as types; in some cases even just becoming consts), where currently the error text is just repeated in full.
- Very low hanging fruit here.
- Even some things that have types extracted already may benefit from functions that make their creation even more terse in the generated source.
  - The WrongKind errors might be an example of this, since they're so darn prevalent.
  - This would reduce GSLOC size but probably not AS; since the extracted functions would be trivial, we'd expect the compiler to often inline them again.

This is a non-scientific list of what's come to mind first as I've scrolled around in some output. It's not an exclusive list; it's very likely there's other ideas that would also make great improvements.

footnotes

[‡] - It does occur to me that we may be approaching the point where it will be a good idea to pull this codegen system out into its own repo, with its own issue tracker, yes. We have a codegen system which has been incubated in this core repo; it is not intended to be the only codegen system ever: competing implementations will be very welcome, especially those with different priorities. I'm not sure if today's the day to bite that bullet, but it's getting closer.

Nov 04 '20 12:11 warpfork

Quick note on something to research more in the future: I'm carooming around the (psuedo-)assembler output of -gcflags -S, and I'm finding a lot more methods with completely autogenerated bodies doing not-so-useful things than I expected. If we can figure out how to reduce the size and number of these, we can probably shrink the final binary output size significantly.

For example: in looking at this output on a schema-schema run, I see a _TypeStr_Builder.AssignLink method. It's doing quite a lot of work -- some tests, some jumps, shoveling some data around, a duffcopy (!) or two (!!), the runtime.morestack check, and a call to _TypeStruct__Assembler.AssignLink.

None of which I really expected to see here. This method being autogenerated, sure. I was hoping to see it be mostly inlined, and then mostly stripped away to become relatively compact, though... this is one of the statically ErrWrongKind-returning methods. The assembly here is both doing a lot of work I didn't expect, and also taking up a lot of binary size. The extra instructions don't bother me in terms of runtime costs, because this is an error path; but the extra size, added up across every method like this on every type that has a builder (which is all of them -- twice, because of type-vs-repr level) is a bit concerning.

(_TypeStruct__Assembler.AssignLink is then, to its credit, inlined as expected, and not generating a CALL to mixin.MapAssembler.AssignLink, which it certainly could have. (So, apparently the go compiler won't inline a function that already contains other inlines? Seems very strange to me, but, is a description that fits what I'm seeing. Alright; moving on.) However, the amount of data being shuffled to set up this ErrWrongKind value is also a bit, ehhm,... there's room for improvement on this error type, let's put it that way. It shouldn't take dozens of instructions of shuffling to set up these errors.)

These observations are made from go1.15.1 and should probably be verified again on a newer compiler version.

Nov 04 '20 14:11 warpfork

go-ipld-prime go-ipld-prime copied to clipboard

Condensing codegen output size

what is output size, what does it affect, and how do we prioritize it?

how can we measure this?

ideas for improvements

footnotes

go-ipld-prime
go-ipld-prime copied to clipboard