universal-headers Textual approach can work

This implements a textual approach to making universal headers. We diff the different versions and add #if/#endif blocks. You end up with stuff like:

extern ssize_t readlinkat (int __fd, const char *__restrict __path,
      char *__restrict __buf, size_t __len)
#if defined x86_64-linux-gnu.2.32 OR _ZIG_UH_TEST
     __THROW __nonnull ((2, 3)) __wur __attr_access ((__read_only__, 3, 4));
#endif //x86_64-linux-gnu.2.32|
#if defined x86_64-linux-gnu.2.33 OR _ZIG_UH_TEST
     __THROW __nonnull ((2, 3)) __wur __attr_access ((__write_only__, 3, 4));
#endif //x86_64-linux-gnu.2.33|

@andrewrk talked about universal headers being needed to solve some problem, but I don't understand what you are looking for besides size reduction. Does this strategy help you?

For size reduction, combining these headers together saves about 90%

dvanderson@librem14:~/code/universal-headers$ du -shc headers/i*
1.1M	headers/i386-linux-musl
3.1M	headers/i486-linux-gnu.2.32
3.2M	headers/i486-linux-gnu.2.33
3.2M	headers/i486-linux-gnu.2.34
3.3M	headers/i486-linux-gnu.2.35
3.3M	headers/i486-linux-gnu.2.36
3.1M	headers/i586-linux-gnu.2.32
3.2M	headers/i586-linux-gnu.2.33
3.2M	headers/i586-linux-gnu.2.34
3.3M	headers/i586-linux-gnu.2.35
3.3M	headers/i586-linux-gnu.2.36
3.1M	headers/i686-linux-gnu.2.32
3.2M	headers/i686-linux-gnu.2.33
3.2M	headers/i686-linux-gnu.2.34
3.3M	headers/i686-linux-gnu.2.35
3.3M	headers/i686-linux-gnu.2.36
3.1M	headers/ia64-linux-gnu.2.32
3.1M	headers/ia64-linux-gnu.2.33
3.2M	headers/ia64-linux-gnu.2.34
3.2M	headers/ia64-linux-gnu.2.35
3.3M	headers/ia64-linux-gnu.2.36
65M	total
dvanderson@librem14:~/code/universal-headers$ du -sh uh_headers/
6.3M	uh_headers/

Combining less similar headers gives less savings (2/3 for these)

dvanderson@librem14:~/code/universal-headers$ du -shc headers/x86*
27M	headers/x86_64-freebsd.12.3-gnu
28M	headers/x86_64-freebsd.13.1-gnu
3.1M	headers/x86_64-linux-gnu.2.32
3.2M	headers/x86_64-linux-gnu.2.33
3.2M	headers/x86_64-linux-gnu.2.34
3.3M	headers/x86_64-linux-gnu.2.35
3.3M	headers/x86_64-linux-gnu.2.36
3.1M	headers/x86_64-linux-gnu-x32.2.32
3.2M	headers/x86_64-linux-gnu-x32.2.33
3.2M	headers/x86_64-linux-gnu-x32.2.34
3.3M	headers/x86_64-linux-gnu-x32.2.35
3.3M	headers/x86_64-linux-gnu-x32.2.36
1.1M	headers/x86_64-linux-musl
6.4M	headers/x86_64-macos.11-none
6.4M	headers/x86_64-macos.12-none
6.5M	headers/x86_64-macos.13-none
27M	headers/x86-freebsd.12.3-gnu
28M	headers/x86-freebsd.13.1-gnu
161M	total
dvanderson@librem14:~/code/universal-headers$ du -sh uh_headers/
54M	uh_headers/

Look at src/textdiff/example.bash for how I'm using/testing it.

One of the nice things about this strategy is testability - we can pull a specific version from the universal headers and we should get back the same files as that original version (modulo whitespace).

The rest of this description is some details about how it works and what problems I ran into.

Strategy is to maintain 2 work-in-progress files for each header. One has all the lines from all the versions. The other keeps the version list for each line in the first file. Here's how it looks in vimdiff

  82669376 extern ssize_t readlinkat (int __fd, const char *__restrict __pat│  x86_64-linux-gnu.2.32|x86_64-linux-gnu.2.33|                         
  82669376       char *__restrict __buf, size_t __len)                      │  x86_64-linux-gnu.2.32|x86_64-linux-gnu.2.33|                         
  82669376      __THROW __nonnull ((2, 3)) __wur __attr_access ((__read_only│  x86_64-linux-gnu.2.32|                                               
  82669376      __THROW __nonnull ((2, 3)) __wur __attr_access ((__write_onl│  x86_64-linux-gnu.2.33|                                               
  82669376 #endif                                                           │  x86_64-linux-gnu.2.32|x86_64-linux-gnu.2.33|

To add a new version to the work-in-progress we

normalize whitespace (convert tabs to spaces)
add context (prepend each line with a hash - see below for why)
diff it against the current work-in-progress (literally call diff -wdN)
update both work-in-progress files

Once all the versions have been folded in, outputHeaders goes through the work-in-progress files and adds #if/#endif blocks where the versions change.

Why the context hash stuff?

Since we are adding our own #if/#endif blocks, we need to ensure ours don't interleave with existing ones. It's okay if everything is nested, but if version A had #endif and version B had #endif /* new comment */ then we can't combine in the naive way:

#if defined A
#endif
#endif /* version A */
#if defined B
#endif /* new comment */
#endif /* version B */

The solution here is to preprocess the file and prepend each line with a hash to keep #if/#endif blocks from separate versions from becoming entangled. This can also happen with comments, because you can't put functioning #if/#endif in the middle of /* */ comments.

Some random things I did not expect:

there are lines inside /* */ comments that start with #ifndef because the comment is giving an example (so the code needs to track if it is inside a comment or not)
there can be whitespace between # and if

Testing

Right now each #if that we add gets an extra _ZIG_UH_TEST added to the end. This allows testHeaders to only evaluate those #if/#endif blocks and keep all the rest.

I've run every header dir successfully in at least one batch.

Stuff that could be improved

context hash probably should grab continuation lines (sometimes the #if condition is split onto multiple lines)
instead of maintaining comments, strip them?
do something intelligent with sym links rather than copying them?

Sep 29 '23 01:09 david-vanderson

Thanks for exploring this! These are some promising results indeed.

To answer your question, yes, size reduction is the main goal here. I want to ship potentially even a lot more sets of headers than these along with the Zig compiler, and want to keep the installation size down to a reasonable amount. At the same time, I don't want to ship files that are compressed with a generic file compression strategy because I want to keep these properties for the Zig installation:

One can edit any source file from lib/ and the changes are reflected immediately.
C compilation errors that point to header file source locations point to a location inside the zig installation rather than inside a cache directory.

Furthermore, I think that providing a set of "universal headers" as a standalone deliverable could be something interesting to offer in and of itself. Other projects might like to benefit from it, and perhaps contribute to this project in return.

This patch is making me think, maybe we could go with this textual approach you have explored, but with an additional preprocessing step to try to normalize them even more? For example, you're already normalizing the whitespace, but what if we tried to normalize the order of top-level declarations, too?

To flesh out the idea a little more, start with some more straightforward normalizations:

remove all parameter names
remove all comments
remove empty lines
remove whitespace between # and if (and similar whitespace)
flatten out all multi-line #define statements into single-line
normalize #if defined(A) and #if !defined(A) to #ifdef A and #ifndef A respectively (since the latter are fewer bytes)
remove extern keyword from function declarations

Finally, the complicated part. Mark every declaration as having a set of input names and output names. For example, #define a b+c has the input set of {b, c} and output set of {a}. A #include would have a set of inputs for all inputs inside it, recursively, and a set of outputs for all outputs inside it, recursively.

Our goal is to perform a sort of the declarations for each file, such that the sorted result has minimal textual differences across header files.

For a given header file, for each named declaration, we will create two sort keys: primary, and then the name itself.

For the primary, create a bitset with a bit for every header file of that name (e.g. a bit for every stdio.h file in the headers), populate that bit based on whether or not the name exists. This bitset, interpreted as an unsigned integer, is now used as the primary sort key.

#ifdef and similar constructs are treated as having the set of inputs for all inputs inside it, recursively, and a set of outputs for all outputs inside it, recursively. Then the same sorting is recursively applied inside of them.

The idea here is that after this aggressive normalization is performed, textual approach could work even better, especially when combining less similar headers.

What do you think about this idea?

Oct 12 '23 00:10 andrewrk

Thanks for the explanation! I think I understand what you are going for and agree with those installation properties.

The main worry I have is ensuring correctness through the transformation - to maintain the ability to partially evaluate a universal header for a given version and verify it's the "same" as the original header. I think there's also some utility to having the universal header be somewhat readable by us in the future debugging things, but that worry is more vague.

My guess is that a few of those normalizations (like removing comments and empty lines) will be both easy to verify and give us most of the savings.

Let me implement the easy ones, think more about the harder ones, and come back with more info.

Oct 12 '23 13:10 david-vanderson

The main worry I have is ensuring correctness through the transformation - to maintain the ability to partially evaluate a universal header for a given version and verify it's the "same" as the original header.

On that note, @marler8997 has been working on a verification tool.

Oct 12 '23 16:10 andrewrk

Progress update. Added stripping comments and newlines. Can now compile all headers together:

$ du -sh headers/
1.8G	headers/
$ du -sh uh_headers/
149M	uh_headers/
$ grep -rh ZIG_UH uh_headers/* | wc -c
52635229

We are down from 1.8G to 150M. And of that, about 1/3 is just the overhead of the huge extra #if lines. They look like this:

#if defined aarch64_be-linux-gnu.2.32 OR defined aarch64_be-linux-gnu.2.33 OR defined aarch64_be-linux-gnu.2.34 OR ...

I'm thinking of doing an extra processing step at the end to collapse some of those. So it could detect that the #if has all of the aarch64_be* and replace it with a single aarch64_be_all. The downside here is that mapping would need to be known to zig so that when you ask for aarch64_be-linux-gnu.2.32 it automatically includes aarch64_be_all.

The code could generate an extra header file that looks like this:

#if defined aarch64_be-linux-gnu.2.32 OR defined aarch64_be-linux-gnu.2.33 OR defined aarch64_be-linux-gnu.2.34 ...
#define aarch64_be_all
#endif

And then add that header to the top of every file I guess. Would that make sense as a way to organize these?

I'll go down that route unless you have a better plan.

Just eyeballing the headers, I don't think doing more sophisticated minimization will buy much savings, and they all require a much deeper understanding of C/C++ syntax. Let's see if this is good enough.

Oct 20 '23 15:10 david-vanderson

Here's a fun point of comparison: Zig's current installation size for libc headers is 91M. So you have already achieved increasing the number from 56 to 451 different libc header sets, with only 1.6x the size. I think this is within shippable range, however, I still think we can do even better.

As for reducing large sets of #if together: It's ok to introduce more requirements on what must be defined by the compiler in order to use these headers. For example, for aarch64, I think it should only check the __aarch64__ preprocessor definition, and rely on the compiler to properly set that.

Maybe we can have a look at some common, large sets of if conditions and see if there are any similar simplifications that can be made?

Oct 20 '23 19:10 andrewrk

This is awesome.

The clause reduction problem is text-book SAT, but you can probably get 90% close to the best theoretical solution just by applying some good old-fashioned human knowledge.

I do think that reordering definitions (what Andrew suggested a few comments ago) might also prove to be a worthwhile reduction (as it acts like a full removal of a #if clause), but I personally am not fully familiar with how much more complicated it would make things.

Oct 20 '23 19:10 kristoff-it

Down to 103M.

Take a look at src/textdiff/reductions.zig - each entry lists all the headers that go together and the last line is what we replace them with. Some headers are in multiple reductions. With this set, there aren't that many remaining huge #if statments (judging from grep -rn ZIG_UH uh_headers/).

The replacements themselves are completely UNTESTED, and I made up some of them. Just want to see if this is the right path before digging into exactly what they should be. There will be a large cleanup pass if this works.

What do you think?

Oct 23 '23 02:10 david-vanderson

Wow, incredible! That's only a 13% increase from status quo!

I'll make some time this week to tinker with this.

Oct 24 '23 18:10 andrewrk

Thanks! I've added src/textdiff/HOWTO.txt to help get you going.

Oct 24 '23 19:10 david-vanderson

This is mindblowing/amazing, depending on how one sees things. I am watching this closely now. Good job, @david-vanderson !

Somewhat unrelated question before this gets rolled out: how far back do we want to go with regards to glibc? RHEL7 ships with 2.17, the oldest "supported" distro I am aware of; it's EOL will be on 2024-06-30. Next one is RHEL 8, which ships glibc 2.28.

Oct 25 '23 05:10 motiejus

:heart: from the past: https://github.com/ziglang/zig/pull/15573

#if (defined __FreeBSD__ AND defined __GLIBC__) OR __GLIBC_MINOR__ == 32 OR __GLIBC_MINOR__ == 33 OR _ZIG_UH_
#define dn_expand  __dn_expand
#define dn_skipname  __dn_skipname
#endif /*_ZIG_UH_*/

Oct 25 '23 06:10 motiejus

universal-headers universal-headers copied to clipboard

Textual approach can work

The rest of this description is some details about how it works and what problems I ran into.

Why the context hash stuff?

Testing

Stuff that could be improved

universal-headers
universal-headers copied to clipboard