ocaml-solo5 icon indicating copy to clipboard operation
ocaml-solo5 copied to clipboard

Support of OCaml 5.0 (cleaned version)

Open dinosaure opened this issue 2 years ago • 14 comments

This is the cleaned version of #122. From what I saw, I have some questions:

  • It seems that we ~export~ stub usleep but we don't implement it. We should probably implement it when Solo5 has a function to sleep (see solo5_yield)
  • Some functions are just exported (like qsort) but no implementation are needed - at least, at this level of our usual compilation path. I wonder what happens to these functions. If they are exported, is it because they are required? By whom? For what?

Finally, this PR only offers something that compiles. The project has not yet tested with a unikernel. It should be reconfirmed that qubes-firewall works and if we can get

  • dns-resolver (small unikernel)
  • unipi (HTTP & Git)
  • opam-mirror (block device)

This would be very good to confirm that this PR can be merged and that the MirageOS world can move to OCaml 5.

dinosaure avatar Jan 11 '23 13:01 dinosaure

IMHO this is far from ready, we didn't really review much, we just got it to work.

haesbaert avatar Jan 11 '23 13:01 haesbaert

A precision about words used in my message:

  • stub is when we actually provide an implementation which raises an error
  • export is when we just describe the function (no implementation is done here, only strcpy is a special case when it is provided by Solo5)
  • define is about values

dinosaure avatar Jan 11 '23 13:01 dinosaure

From the cirrus CI, there is one warning and one error:

Entering directory '/tmp/cirrus-ci-build/test'
(cd _build/solo5 && /.opam/5.0.0/bin/x86_64-solo5-none-static-cc -I/.opam/5.0.0/solo5-sysroot/include/nolibc/ -include _solo5/overrides.h -O2 -fno-strict-aliasing -fwrapv -pthread -D_FILE_OFFSET_BITS=64 -I/.opam/5.0.0/.opam-switch/build/ocaml-solo5.0.8.1/nolibc/include -include _solo5/overrides.h -I/.opam/5.0.0/.opam-switch/build/ocaml-solo5.0.8.1/openlibm/include -I/.opam/5.0.0/.opam-switch/build/ocaml-solo5.0.8.1/openlibm/src -nostdlib -I/.opam/5.0.0/solo5-sysroot/include/nolibc/ -include _solo5/overrides.h -O2 -fno-strict-aliasing -fwrapv -pthread -g -I /.opam/5.0.0/solo5-sysroot/lib/ocaml -o startup.o -c startup.c)
startup.c:16:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
1 warning generated.
File "dune", line 2, characters 7-11:
2 |  (name main)
           ^^^^
(cd _build/solo5 && /.opam/5.0.0/solo5-sysroot/bin/ocamlopt -w @[email protected]@30..39@[email protected]@[email protected] -strict-sequence -strict-formats -short-paths -keep-locs -g -o main.exe manifest.o startup.o .main.eobjs/native/dune__exe__Main.cmx -g -w +A-4-41-42-44 -bin-annot -strict-sequence -principal -safe-string -color always -cclib '-z solo5-abi=hvt')
ld.lld: error: /.opam/5.0.0/solo5-sysroot/lib/ocaml/libasmrun.a(domain.n.o) has an STT_TLS symbol but doesn't have an SHF_TLS section
cc: error: linker command failed with exit code 1 (use -v to see invocation)
File "caml_startup", line 1:
Error: Error during linking (exit code 1)

Would be great if someone more into TLS could take a look at the TLS issue (I suspect this is due to usage of clang/llvm both as compiler and as linker).

hannesm avatar Jan 12 '23 10:01 hannesm

startup.c:16:1: warning: control reaches end of non-void function [-Wreturn-type]

Indeed the solo5_app_main (in test/startup.c and also example/startup.c) does not return anything (and probably never reach that point), mirage/mirage-xen returns 0, and the tests suites in Solo5 return SOLO5_EXIT_SUCCESS or SOLO5_EXIT_FAILURE. I guess return 0; should be fine here.

ld.lld: error: /.opam/5.0.0/solo5-sysroot/lib/ocaml/libasmrun.a(domain.n.o) has an STT_TLS symbol but doesn't have an SHF_TLS section

It's interesting that the error was already spotted in Solo5 in https://github.com/Solo5/solo5/blob/7ba74eb9f645e07570cd35a833eddec4918eaf64/tests/test_tls/test_tls.c. And the test is deactivated when compiled with lld. The check in lld seems to be https://github.com/llvm-mirror/lld/blob/64b024a57c56c3528d6be3d14be5e3da42614a6f/ELF/Symbols.cpp#L112.

palainp avatar Jan 12 '23 11:01 palainp

Thanks for your feedback all, I integrated most of what you said, only the TLS problem remains now.

dinosaure avatar Jan 13 '23 11:01 dinosaure

Thanks for all the work @palainp @kit-ty-kate @dinosaure @haesbaert

About the thread-local storage, I see some issues:

  • as @haesbaert said the prefered failure behaviour would be to fail loudly or work (i.e. no silent failure / overwrite of memory)
  • to get to that point, I suspect we'll need to understand how the OCaml allocates stuff in the thread local storage area (but additionally, arbitrary C bindings may use __thread variable as well)
  • the linker error with llvm is something we should address (as pointed out by @palainp this is likely work needed in solo5 (adjusting linker scripts - see https://github.com/Solo5/solo5/pull/436 and https://github.com/Solo5/solo5/pull/420)

I started to read https://maskray.me/blog/2021-02-14-all-about-thread-local-storage to get a better understanding of thread local storage, but won't have much time this month to work on a more thorough review or patches.

hannesm avatar Jan 17 '23 09:01 hannesm

EDIT: The following was due to a faulting linker script. This is now fixed with the latest release.

Along with the solo5 PR (https://github.com/Solo5/solo5/pull/542), I tried to correctly use the TLS variant II for my processor (https://github.com/Solo5/solo5/blob/8c3a744f998b9977cfc6cd29e0cc40ae2efba167/tests/test_tls/test_tls.c#L52 TLS point to an address after the area) on OpenBSD. And it works (at least a few steps further)!

However I still have a page fault error:

  __|  _ \  |  _ \ __ \
\__ \ (   | | (   |  ) |
____/\___/ _|\___/____/
Solo5: Bindings version v0.7.1-66-gf1daea2
Solo5: Memory map: 512 MB addressable:
Solo5:   reserved @ (0x0 - 0xfffff)
Solo5:       text @ (0x100000 - 0x21bfff)
Solo5:     rodata @ (0x21c000 - 0x22cfff)
Solo5:       data @ (0x22d000 - 0x2e3fff)
Solo5:       heap >= 0x2e4000 < stack < 0x20000000
Solo5: trap: type=#PF ec=0x0 rip=0x1f8f70 rsp=0x1ffffe98 rflags=0x10002
Solo5: trap: cr2=0x0
Solo5: ABORT: cpu_x86_64.c:181: Fatal trap

The 0x1f8f70 address is something really strange here:

00000000001f8f70 <caml_alloc_final_info>:
  1f8f70:       4c 8b 1d 89 70 e0 ff    mov    -2068343(%rip),%r11        # 0 <domain_field_caml_young_limit>
  1f8f77:       4c 33 1c 24             xor    (%rsp),%r11
  1f8f7b:       55                      push   %rbp
  1f8f7c:       48 89 e5                mov    %rsp,%rbp
  1f8f7f:       41 53                   push   %r11
  1f8f81:       41 56                   push   %r14
  1f8f83:       bf 70 00 00 00          mov    $0x70,%edi
  1f8f88:       e8 b3 22 01 00          callq  20b240 <caml_stat_alloc_noexc>
  1f8f8d:       49 89 c6                mov    %rax,%r14
  1f8f90:       48 85 c0                test   %rax,%rax
  1f8f93:       74 0f                   je     1f8fa4 <caml_alloc_final_info+0x34>
  1f8f95:       ba 70 00 00 00          mov    $0x70,%edx
  1f8f9a:       4c 89 f7                mov    %r14,%rdi
  1f8f9d:       31 f6                   xor    %esi,%esi
  1f8f9f:       e8 7c 3b f8 ff          callq  17cb20 <memset>
...

The top instruction is related to stack protection and for all other function I inspected it uses a retguard symbol (as example the previous function in the same C file):

00000000001f8f20 <caml_final_release>:
  1f8f20:       4c 8b 1d c1 2f 0e 00    mov    929729(%rip),%r11        # 2dbee8 <__retguard_3085>
  1f8f27:       4c 33 1c 24             xor    (%rsp),%r11
  1f8f2b:       55                      push   %rbp
  1f8f2c:       48 89 e5                mov    %rsp,%rbp
...

I don't understand why for that specific function the behavior has changed :(

palainp avatar Feb 01 '23 08:02 palainp

I'm curious, now that solo5 0.8 hit the release road, what is the status of this PR? Does it work on some (all?) platforms? Should we proceed (with review, more testing) in order to get MirageOS working with OCaml 5?

A brief message would be welcome, maybe @palainp knows the details (since he was involved in the most recent commits).

Thanks a lot.

hannesm avatar May 19 '23 08:05 hannesm

Thanks @hannesm for the reminder. From my POV it would be great to restart the CI test (the last commit was pushed when solo5 was not released in its last version) and eventually catch any error. With this PR pinned I currently use Ocaml 5 with Xen and spt targets really fine :) Also testing on other systems should be great!

palainp avatar May 19 '23 09:05 palainp

Hi, thank you again for this work ! I managed to build eio-solo5 on top of that and assemble the network stack up to dream.(https://github.com/TheLortex/eio-solo5)

Using this proof of concept I've been able to test solo5-hvt with ocaml 5 / eio and didn't encounter any issue.

TheLortex avatar May 19 '23 10:05 TheLortex

Hi, thank you again for this work ! I managed to build eio-solo5 on top of that and assemble the network stack up to dream.(https://github.com/TheLortex/eio-solo5)

Using this proof of concept I've been able to test solo5-hvt with ocaml 5 / eio and didn't encounter any issue.

Great to hear. From my point of view, we should first review this PR, test it thoroughly and release it (without switching the effective layer in MirageOS), also evaluate performance etc.

I understand that developing the cool new stuff is exciting, but if nobody cares about upstreaming and pushing the PR to something that we're happy to maintain, there'll be all these vendoring / pin to random branches (that are a nightmare to update).

So, TL;DR: my approach is still "upstream first". Since I'm a bit short on time (and my OCaml 5 knowledge is barely existant), I asked whether there are others (including you :) that would go through the code and thoroughly review it. :)

hannesm avatar May 19 '23 10:05 hannesm

@fabbing @shym, thank you for the review!

I'm only speaking for myself, but I totally approve if the building could be reworked (even if it doesn't work with 5.0.0, as it doesn't seem possible to deal with both 4.14.1 and 5+, it doesn't matter if 5.0 is skipped in the process).

One thing that still needs work is the current absence of munmap (currently STUB_IGNORE) which leads to memory leaks (probably only visible with bigger tests than the toy examples). So far, I've written code for that which needs to be reviewed and tested in the pending #129 PR to this PR :sweat_smile: or in #131 for the 4.14 backport.

palainp avatar Jan 30 '24 12:01 palainp

Why not use only one sed command per file in the Makefile to improve readability and to avoid multiple command executions and temporary files?

The advantage of having several seds is that you can contextualise them in relation to their commits and the PRs that have integrated them. These modifications can be difficult to understand in their current form, and historical information helps us to better understand the ins and outs of these arbitrary choices.

Shouldn't a comment be added to nolibc/stubs‧c to explain the choice between preferring one of STUB_ABORT, STUB_IGNORE, STUB_WARN_ONCE?

More comments is better :+1:.

We think it would be interesting to target OCaml 5.1.1 directly, as it brings bug and performance fixes. The build system has also changed a lot since then, which would make it difficult to support both 5.0.0 and 5.1.1.

This PR is always set aside because these versions (5.0 and 5.1) lack compaction. The unikernels we are developing are mainly services. We therefore need a runtime capable of using memory that can be defragmented (and that the GC compacts) in order to provide these services for as long as possible.

As @palainp pointed out, we may need to test allocation logic other than dlmalloc. Here again, we need to do some testing (in particular with unikernels such as dns-resolver or tlstunnel) with such an allocator to see whether it corresponds to our use or not.

We will be delighted to have an extension of this PR with OCaml 5.2. And I'd be delighted to see more details of a possibility to experiment with OCaml 5 while keeping OCaml 4.14 support (I'll be more available for February).

dinosaure avatar Jan 31 '24 10:01 dinosaure

Dear @fabbing @shym,

your review is highly welcome. I do agree with @dinosaure that likely we should target OCaml 5.2 (which allows compaction), and skip 5.0, 5.1 series.

What I'm curious about is that we have some review comments in https://github.com/mirage/ocaml-solo5/pull/122 (or are they all addressed?) -- would be great if you could take a look at them (and comment those which are resolved, maybe re-raise them here if they are not yet resolved).

And then there is https://github.com/mirage/ocaml-solo5/pull/129 -- If I understand correctly, that's mostly additional mmap stuff. In the end, it would be nice to agree on a single PR (this one?), and put the mmap changes here as well.

In terms of ocaml-solo5 and OCaml 4 / OCaml 5 support: I don't think it is worth to have ocaml-solo5 that supports both versions of OCaml at the same time, so we should move the OCaml 4 support to a dedicated branch (in order to keep it up to date when OCaml upstream changes), and support only OCaml 5 in the main branch.

hannesm avatar Feb 01 '24 16:02 hannesm