Document required features for next bootstrapping breakpoint
I know that mescc is the next C compiler used after this one in the bootstrapping chain but I'm not sure which exact features are required to skip it. Currently I'm just working on whatever I think would be fun but if anybody knows specifics so that I could better target my work to actually help the bootstrapping effort it would be appreciated.
I note https://github.com/cosinusoidally/tcc_bootstrap_alt; this is already effectively able to skip mescc, just not in a super robust way.
Currently we go M2-Planet -> mescc -> tcc.
I suspect tcc has significantly greater requirements than M2-Planet is currently able to provide, or perhaps even than could be added without fairly major rework.
Personally, I think in the long term, a more robust solution would be cutting out mescc and tcc entirely, and replacing them with a ported (or yet-to-exist) compiler and assembler written in M2-Planet.
I suspect tcc has significantly greater requirements than M2-Planet is currently able to provide
Do you think it would be easier to get qbe (https://github.com/oriansj/M2-Planet/issues/58) and cproc (https://github.com/michaelforney/cproc) to compile instead?
I note cosinusoidally/tcc_bootstrap_alt;
I'm not sure I entirely understand this.
The initial bootstrap compiler is actually written in JavaScript. It is a JS port of a cut down very early version of tcc. At this stage we only have a simple C compiler so we must convert this JS code to C in order to compile it.
Why isn't the compiler just written in C without having to do any JavaScript(??) transpiling?
tcc_1_7 is very cut down. eg there's no support for things like goto and the preprocessor is very limited (no ifdefs etc).
M2-Planet has this functionality right now; does that mean we can skip tcc_1_7?
Do you think it would be easier to get qbe (https://github.com/oriansj/M2-Planet/issues/58) and cproc (https://github.com/michaelforney/cproc) to compile instead?
Honestly unsure, but I will spend a bit of time looking through those codebases.
The initial bootstrap compiler is actually written in JavaScript. It is a JS port of a cut down very early version of tcc. At this stage we only have a simple C compiler so we must convert this JS code to C in order to compile it.
Why isn't the compiler just written in C without having to do any JavaScript(??) transpiling?
This project has an odd history, I think that is an artifact of that strange history. There was an attempt to integrate this into live-bootstrap without any JavaScript things.
tcc_1_7 is very cut down. eg there's no support for things like goto and the preprocessor is very limited (no ifdefs etc).
M2-Planet has this functionality right now; does that mean we can skip tcc_1_7?
I think this is saying tcc_1_7 doesn't have support for those things.
Updated to remove already implemented features.
From an initial cursory glance at tcc_1_7 it seems we'll need (in no particular order):
- [ ] Casting between types (pointer and integer)
From looking at tcc_1_7 it seems there are a few categories of casting:
Deref and cast assignment
*(char *)ind++ = c;
*(int *)p->addr = val;
Building too old tcc is probably not super useful for non-x86 arches :( . Maybe we'll be able to get 0.9.27 or perhaps bootstrappable fork tinycc (which is about 6 months before 0.9.27 with some extra patches).
Note that tcc also relies on types being of certain size. E.g. int must be 4 bytes (even though that is not specified in C standard)
At least on tcc 0.9.27 dlopen is probably not needed for bootstrapping. mescc doesn't support it, nor do we need dynamic libraries. mmap is not needed either. That's in tccrun.c which we don't need.
static and inline functions I guess can be mostly done by ignoring those keywords.
Octal codes are also not needed in the new tcc.
Building too old tcc is probably not super useful for non-x86 arches
Ah, true. The list was also more along the lines of "if we implement it and it ends up not being useful we'll at least still be closer to a real C compiler".
Note that tcc also relies on types being of certain size. E.g. int must be 4 bytes (even though that is not specified in C standard)
Oh, that's probably not going to be fun to sort out.
At least on tcc 0.9.27 dlopen is probably not needed for bootstrapping. mescc doesn't support it, nor do we need dynamic libraries. mmap is not needed either. That's in tccrun.c which we don't need.
Is there a list of features which mescc supports that M2-Planet doesn't at the moment? Or are there any hints in the changelogs?
static and inline functions I guess can be mostly done by ignoring those keywords.
Yeah, that's what I'm thinking since we can't really do multiple translation unit compilation.
Octal codes are also not needed in the new tcc.
I already spent the time to implement them. Oops. I'll have a PR up soon.
Note that tcc also relies on types being of certain size. E.g. int must be 4 bytes (even though that is not specified in C standard)
Is that ever not the case on modern unix systems? I know M2-Planet uses 8 bytes on 64 bit, but that is probably unusual, eg gcc uses 4 bytes for ints on 64 bit Linux:
$ cat a.c
main(){
printf("%d\n", sizeof(int));
}
$ gcc a.c
....
$ ./a.out
4
$ file a.out
a.out: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=112ea215162c7fa5607b6f82f685eb3f4b51c7ac, for GNU/Linux 3.2.0, not stripped
I don't think this has been mentioned in this thread, but in terms of live-bootstrap the next compiler in the chain is the bootstrappable fork of tcc https://gitlab.com/janneke/tinycc/ (iirc currently the mes-0.27 branch of that repo). That version can be built by mescc (https://github.com/fosslinux/live-bootstrap/blob/master/steps/tcc-0.9.26/pass1.kaem) . Source of the exact version used in live-bootstrap is linked from here https://github.com/fosslinux/live-bootstrap/blob/master/steps/tcc-0.9.26/sources
When I wired up my tcc_bootstrap_alt project to live-bootstrap I conditionally replaced tcc-mes in steps/tcc-0.9.26/pass1.kaem (current proof of concept branch diff is https://github.com/fosslinux/live-bootstrap/compare/master...cosinusoidally:live-bootstrap:tcc_bootstrap_alt-refactor_nov24 ). This fork will still build a bit-identical copy of the bootstrappable fork of tcc. My PoC is no longer mergeable but may be useful as a reference.
At least on tcc 0.9.27 dlopen is probably not needed for bootstrapping. mescc doesn't support it, nor do we need dynamic libraries. mmap is not needed either. That's in tccrun.c which we don't need.
Yep, for tcc_bootstrap_alt mmap/dlsym are only needed for the very early versions of tcc (since those are pure jit compilers, but I did add the ability to output a custom executable format to tcc_1_7/tcc_js). I polyfill mmap/dlsym when using cc_x86 (and a couple of other places): https://github.com/cosinusoidally/tcc_bootstrap_alt/blob/master/tcc_js/loader_support_cc_x86.c#L558 https://github.com/cosinusoidally/tcc_bootstrap_alt/blob/master/tcc_js/loader_support_cc_x86.c#L667
There are lots of quirks and rough edges in tcc_bootstrap_alt. I am currently working on a cleaned up version https://github.com/cosinusoidally/tcc_simple that relies purely on heavily cut down versions of tcc-0.9.27, but it'll probably be a while before I finish that (and I'm currently a bit stalled as the next step is to port around 12kloc to a simplified dialect of C).
Building too old tcc is probably not super useful for non-x86 arches :( . Maybe we'll be able to get 0.9.27 or perhaps bootstrappable fork tinycc (which is about 6 months before 0.9.27 with some extra patches).
Ah it was mentioned, sorry. The links I posted to it may be useful either way.
and I'm currently a bit stalled as the next step is to port around 12kloc to a simplified dialect of C.
That sounds quite a lot of work though :(. It's probably easier to fix at least some of the rough edges of M2-Planet. We might not be able to fix everything, but could perhaps bring down 12kloc to some reasonable number.
Is that ever not the case on modern unix systems? I know M2-Planet uses 8 bytes on 64 bit, but that is probably unusual, eg gcc uses 4 bytes for ints on 64 bit Linux:
Most (x86 and x64) systems have exactly 4 byte ints and it's so common that people come to rely on them being exactly 4 bytes even though the standard just says 2 bytes or larger. The way I understand it the problem will come from TCC expecting them to be exactly 4 bytes and e.g. relying on overflow or expecting casting to int to mask off exactly 4 bytes.
Source of the exact version used in live-bootstrap is linked from here https://github.com/fosslinux/live-bootstrap/blob/master/steps/tcc-0.9.26/sources
Thanks, this is very useful. With the bootstrapping effort split across several different repos it's a little difficult to figure out exactly what's going on.
There are lots of quirks and rough edges in tcc_bootstrap_alt. I am currently working on a cleaned up version https://github.com/cosinusoidally/tcc_simple that relies purely on heavily cut down versions of tcc-0.9.27, but it'll probably be a while before I finish that (and I'm currently a bit stalled as the next step is to port around 12kloc to a simplified dialect of C).
It seems like this bootstrap chain deliberately avoids M2-Planet, is there a greater reason other than writing compilers is fun?
If M2-Planet was able to compile tcc_27_refactor from the new repo would we then be able to avoid mescc (for x86 only?)? Does tcc_27_refactor require fewer features than what I listed above for tcc_1_7?
I also still don't understand how the javascript transpiling works in a bootstrapping context or why it's there in the first place. Does the new repo also use javascript transpiling?
I polyfill mmap/dlsym when using cc_x86 (and a couple of other places):
Would "real" dlopen and mmap work with M2-Planet? I guess we would need to be able to build PIE executables using the existing tooling?
Also worth looking at mescc tests: https://git.savannah.gnu.org/cgit/mes.git/tree/lib/tests/scaffold
Presumably they were written as a stepping stone to build tinycc.
There are lots of quirks and rough edges in tcc_bootstrap_alt. I am currently working on a cleaned up version https://github.com/cosinusoidally/tcc_simple that relies purely on heavily cut down versions of tcc-0.9.27, but it'll probably be a while before I finish that (and I'm currently a bit stalled as the next step is to port around 12kloc to a simplified dialect of C).
It seems like this bootstrap chain deliberately avoids M2-Planet, is there a greater reason other than writing compilers is fun?
To further minimise dependencies, but all the early stages are still buildable with cc_x86/M2-Planet/tcc/gcc. I did use M2-Planet during development though since it is more robust than cc_x86, generates better error messages, and the codebase is much easier to understand.
If M2-Planet was able to compile tcc_27_refactor from the new repo would we then be able to avoid mescc (for x86 only?)? Does tcc_27_refactor require fewer features than what I listed above for tcc_1_7?
If M2-Planet could compile tcc_27_refactor then the bootstrap path would be:
M2-Planet -> tcc_27_refactor -> tcc_27
Note that tcc_27_refactor will only generate .o files though, so a linker would be needed (tcc_bootstrap_alt has its own elf loader/linker for that purpose). For testing purposes I'm currently just dynamically linking to glibc, but that is not possible to do directly with M2-Planet generated code since M2-Planet uses a non-standard calling convention (plus M2-Planet does not generate .o elf files).
tcc_27_refactor is implemented in the same dialect of C as tcc_27. This is a larger subset of C than tcc_1_7 uses. The intention with tcc_27_refactor was to cut out any code not needed to compile tcc_27. I've been using the --coverage gcc compiler flag to figure out which code is not needed. I may have got a bit carried away stripping down the built in assembler though (probably not a problem though as assembly is only used in a couple of places for support code).
From the README: tcc_boot_min - tcc_simple_c ported to the M2_simple_asm.c dialect. Will also be self hosted. tcc_boot_max - tcc_27_refactor ported to the M2_simple_asm.c dialect. Will also be self hosted.
tcc_boot_min is complete (see https://github.com/cosinusoidally/tcc_simple/blob/master/tcc_boot_min/tcc_boot_min.c ). It can be compiled by M2-Planet, but I have not yet done the required plumbing to get the M2-Planet generated code to run.
tcc_boot_max is stalled at an early stage (https://github.com/cosinusoidally/tcc_simple/blob/master/tcc_boot_max/tcc_boot_max.c ). If I eventually complete it all the other tcc source files in that directory will shrink to nothing and I will be left with just tcc_boot_max.c
For both tcc_boot_min and tcc_boot_max I am able to incrementally port them to a simplified dialect. The tcc_boot_min.c and tcc_boot_max.c files are both compiled with tcc_simple_c (which only implements that simple dialect).
I also still don't understand how the javascript transpiling works in a bootstrapping context or why it's there in the first place. Does the new repo also use javascript transpiling?
It's not "transpiling" as such. It's more like simple text transformations (eg substitute instances of the word "var" with the word "int", and substitute "int" in place of "function"). https://github.com/cosinusoidally/tcc_bootstrap_alt/blob/master/tcc_js/js_to_c.c is the translator. There's a bit more to it than that, but it's a fairly simple translation process.
The reason behind doing this is because I got interested in bootstrapping because of another project I created https://github.com/cosinusoidally/mishmashvm . In mishmashvm I added tcc as a C JIT compiler to a Javascript VM. I needed to be able to bootstrap tcc using only a JS VM. Originally I was using an Emscripten compiled version of tcc for this purpose. I felt the Emscripten compiled version was difficult to audit so I set about creating tcc_bootstrap_alt as an alternative bootstrap path based on hand modified versions of tcc, starting the bootstrap process with a pure hand written js version of tcc (tcc_js).
In tcc_simple I do not use the JS approach, though in theory it should be fairly simple for me to auto translate the simplified C code into JS.
I polyfill mmap/dlsym when using cc_x86 (and a couple of other places):
Would "real" dlopen and mmap work with M2-Planet? I guess we would need to be able to build PIE executables using the existing tooling?
dlsym/dlopen won't really work for M2-Planet since it does not support dynamic linking. There's also the question of what people would want to dynamically link to from M2-Planet? In general it's not possible to call into preexisting .so files (because of the calling conventions mismatch) plus M2-Planet itself cannot generate .so files. When I was calling between M2-Planet generated code and tcc generated code I had to create a bunch of "trampolines" that switched between calling conventions.
mmap just needs the use of a syscall, but using the mmap syscall would likely break builder-hex0 (which is why I just polyfilled it in tcc_bootstrap_alt).
@cosinusoidally Sorry for the late reply, I've been busy adding features to M2-Planet and forgot to reply.
We're almost at the point where we can compile tcc_1_7 using M2-Planet but I can't really figure out how it's supposed to be used.
If I compile tcc_1_7 using GCC it immediately SEGFAULTs. Is there a relatively easy way for us to assert that tcc_1_7 behaves correctly when built with M2-Planet?
It doesn't need to be running code, outputting a hexdump of the generated assembly would also be enough to ensure that we're behaving correctly.
If there is no easy testing solution, what would be the next tcc_* step that provides an easy way of verifying that M2-Planet works as expected?
Good to hear you have made so much progress.
I've just checked with a fresh master checkout of https://github.com/cosinusoidally/tcc_bootstrap_alt and I am able to build tcc_1_7 with gcc:
$ cd tcc_1_7
$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ gcc -m32 -I . tcc.c -ldl
$ ./a.out
tcc 1_7 start
glo: f7cea000 f7cea000
prog: f7caa000
argc 1
tcc version 0.9.2 - Tiny C Compiler - Copyright (C) 2001 Fabrice Bellard
usage: tcc [-Idir] [-Dsym] [-llib] [-i infile] infile [infile_args...]
I'm on an Ubuntu 22.04 amd64 system with gcc multilib installed (so I can build 32 bit binaries). It is also possible to build from within a 32 bit bionic buildd chroot (see the README, it contains instructions for setting up). It should also build with a stock tcc-0.9.27.
Early versions of tcc are jit compilers, but I have added a flag that allows tcc_1_7 to output a custom executable format. If you use the -r flag it will compile the code and emit a binary called tcc_boot.o. Note it will also still jit and run the program at the same time:
$ ./a.out -r dlsym_wrap.c
...
reloc at: 0xf7c1a3f8 to: 0xf7cca5d0
mk_reloc_global: 1470430224
resolve_extern_syms: strtof 1
Generating object file
tcc 1_7 start
glo: f7bcc000 f7bcc000
prog: f7b8c000
argc 1
tcc version 0.9.2 - Tiny C Compiler - Copyright (C) 2001 Fabrice Bellard
usage: tcc [-Idir] [-Dsym] [-llib] [-i infile] infile [infile_args...]
$ ls -l tcc_boot.o
-rw-rw-r-- 1 foo foo 88522 Apr 8 20:39 tcc_boot.o
$ sha256sum tcc_boot.o
e5144b7b28a63470a107fd95c527cc0da8c5c80abd46426c147015417a8e149d tcc_boot.o
If you are successful you should get the same hash. Note that this should be the same as the hash you would get if you ran ./mk_from_bootstrap_seed from the base directory of the project, eg:
$ pwd
/tmp/blah/tcc_bootstrap_alt
$ time ./mk_from_bootstrap_seed &> /dev/shm/log
real 0m48.218s
user 0m20.147s
sys 0m28.071s
$ sha256sum tcc_10/tcc_boot.o
e5144b7b28a63470a107fd95c527cc0da8c5c80abd46426c147015417a8e149d tcc_10/tcc_boot.o
Note the name is hard coded to tcc_boot.o. You can load the file either with loader.c or with tcc_1_7 and the -R dummy flag, eg with tcc_1_7:
$ cat /tmp/hello.c
main(){puts("hello world");}
$ ./a.out -R dummy /tmp/hello.c
...
Reloc type RELOC_ADDR32 at f7c37298
dlsym: strtod
global_reloc: strtod 6 f7d67650 global_reloc_num: 1
Reloc type RELOC_ADDR32 at f7c373e2
dlsym: strtof
global_reloc: strtof 6 f7d675d0 global_reloc_num: 1
Reloc type RELOC_ADDR32 at f7c373f8
argc 4
running loader
tcc 1_7 start
glo: f7be9000 f7be9000
prog: f7ba9000
argc 2
tcc 1_7 compile done
dlsym: puts
hello world
This will load and run tcc_boot.o, and then jit compile and run hello.c.