tool-conventions
Linking.md: Handling relocation against external symbols (e.g. imported data symbols)
In attempting to switch emscripten over to using the llvm wasm backend + lld we've hit a fairly significant roadblock. The problem is this: Emscripten allows functions to be imported from JS, but it also allows data addresses to be imported. See src/library.js:2118. Here we see emscripten allocating some static data space at runtime and then passing the address to the wasm module as a global import.
Currently in Linking.md and in lld, when a data symbol is undefined at static link time any relocations of type R_WEBASSEMBLY_MEMORY_ADDR will write zeros to the target location.
However, both s2wasm and asm2wasm can handle undefined data symbols at link time and convert these into get_global XXX rather than the i32.const XXX which is used for defined globals.
I talked through some options with @jgravelle-google and @binji and I also spoke to Roland McGrath, who has worked on the GNU tools for linking and loading. The options I see are the following:
1. Modify emscripten to remove the need to import data symbols into otherwise statically linked binaries.
2. Have a special relocation type that the linker can use to handle both defined globals (via i32.const) and undefined globals (via get_global).
3. Ask users to annotate their external global data (with some kind of dll_import attribute), and have the code gen in llvm know to use get_global for such symbols rather than i32.const.
@NWilson who has been working with me on the lld stuff.
After going back and forth a lot I've become pretty convinced that (2) is the best option.
(1) might be possible as a short-term fix, but the emscripten-side fix could be fairly involved. We also don't know if other users of emscripten are doing similar things. The other downside of this approach is that we know we have to handle this case for the dynamic linking case anyway, so we can't ignore it.
(3) is not great since only Windows DLL code currently contains such annotations, and emscripten is built on the principle that you should be able to throw existing code at it with minimal mods. It is also more error prone. Codegen time also seems like the wrong time to make the decision, since it is really the linker that knows whether a given symbol is defined or not.
The downside of (2) is that we are kind of going back on ourselves: having just moved away from using wasm globals to model data addresses in the object format, we are now moving back towards them in the executable format. I'm ok with this, but it's unfortunate that we have two models for data symbols.
The other downside of (2) is that it only works for relocations in the code section. Data section relocations will need to be applied at runtime. There is however already code in emscripten for generating and calling a helper function to apply these fixups. In the long run I hope this will be re-usable for the dynamic linking case too.
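To make the runtime fixup idea concrete, here is a minimal sketch in C of what such a helper might do once the imported addresses are known. The record layout and the names data_fixup and apply_data_fixups are assumptions made up for illustration; this is not emscripten's actual helper or any format lld emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: "store (address of symbol + addend) at offset". */
struct data_fixup {
    uint32_t offset;  /* where in linear memory the pointer lives */
    uint32_t symbol;  /* index into the table of imported addresses */
    uint32_t addend;  /* constant added to the symbol's address */
};

/* Patch every recorded location once the imported addresses are known.
 * Wasm linear memory is little-endian, so bytes are written low to high. */
void apply_data_fixups(uint8_t *memory, size_t memory_size,
                       const struct data_fixup *fixups, size_t count,
                       const uint32_t *symbol_addresses)
{
    for (size_t i = 0; i < count; i++) {
        const struct data_fixup *f = &fixups[i];
        assert((size_t)f->offset + 4 <= memory_size);
        uint32_t value = symbol_addresses[f->symbol] + f->addend;
        for (int b = 0; b < 4; b++)
            memory[f->offset + b] = (uint8_t)(value >> (8 * b));
    }
}
```

The point of the sketch is only that data-section pointers cannot be expressed as a get_global, so something has to write the resolved address into memory before user code runs.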
Hum. Some thoughts.
(1) Modify emscripten. In the short-term that would make a certain amount of sense. Anything we put in the toolchain is a longer-term commitment than the particular internal implementation details of Emscripten's library. In fact, the example you cite (https://github.com/kripken/emscripten/blob/incoming/src/library.js#L2118) is tzname, which is an especially silly thing Emscripten is doing. There are chunks of libc that Emscripten randomly implements in JavaScript, rather than using the Musl implementation and going via the syscall interface (for timezone handling, the Wasm port of Musl needs a couple of Wasm-specific syscalls to do timezone handling in the "kernel"). Not only would it be possible to fix tzname, it would in fact be desirable to rip it out along with other bits in library.js, and just use the stock Musl port. I appreciate that's effort for the Emscripten folks, but really the more of Musl they use the better.
(2) Special relocation to allow i32.const to be replaced with get_global. There are really big downsides to this, as you say. I don't mind going back to globals for data symbols in principle (these are after all "true globals" that will actually be used by the code) - but the issue of not being able to handle the MEMORY_ADDR_I32 relocations in the same way is not nice.
(3) Annotate symbols loaded from JS. It sort-of goes against the static linking model. And again you have the same problem as in (2) - any symbol with that annotation can't have its address taken to put in a MEMORY_ADDR_I32 relocation.
I have a question - how on earth does Emscripten "allocate memory" in the first place? When we statically link the Wasm file, there's no spare room! We have the 1024-byte empty guard page at the bottom, then the defined data space and the stack, and then the __heap_base starts right after that where malloc has control. Presumably they've hacked up their malloc so that they can carve out chunks from the heap-space, and prevent malloc from treading on the allocations they made on the JS side?
I'm leaning towards (1) - if Emscripten used Musl's tzname, and Musl's malloc, and only implemented the syscalls (rather than replacing bits of Musl) then we'd be OK. The problem is only arising because when they moved to the "syscalls" model they didn't quite finish it off (not that I'm blaming them for it - it's just what happened).
This seems related to something I was thinking about a while ago. I wondered if I could use the llvm wasm backend + lld without emscripten to cross compile any program, maybe even a whole distribution, to wasm files. To do that, I would need to use any makefile unchanged and just use a small wrapper for cc, ld and ar. I got stuck at 2 places:
- A few options passed to ld, sometimes even by clang, like rpath for example, weren't supported. I'm not exactly sure what all of them do, so I just filtered them out or substituted them with other options in yet another wrapper script with mixed results.
- I couldn't figure out how to generate a file that behaves like an .so file. Creating .a and .o files wasn't a problem, I can just pass them to lld and it'll put them into a single wasm file. But creating an .so-like file was a problem, since I don't want a shared library to be merged into the executable. I want it to just allow the symbols provided in the shared library to be resolved or imported by a dynamic runtime linker later, but I couldn't figure out how I could get lld to do that.
When a symbol from JS is imported, the JS acts like a library. I think such a JS library should provide a list of symbols in a text file or something, and lld should be able to take shared libraries and such lists and treat them the same, as symbols that can be imported. So I think that would be option 4: let the user provide a list of which symbols from a JS library have to be imported; if it is a shared library, use it as such a list, but don't use annotations in the program that needs the library. If a symbol isn't defined in a shared library or such a list, do the same as other linkers and just produce an error that there are unresolved symbols. And if the option to allow undefined symbols is used, require that all unresolved symbols get resolved at runtime.
Wasm currently doesn't have anything like a .so. Dynamic libraries are a way off, and don't have many use-cases? Currently the mental model is statically-linked executable (although it's maybe more like statically-linked .so that's callable from the web page).
It is possible to use Wasm without Emscripten, I've put together a minimal toolchain on GitHub called "Minscripten" that just includes a port of Musl, the JS syscalls side of Musl, and a JavaScript linker that takes the LLD output and some JS files, and creates a module loader incorporating the JS input and linking it to the Wasm file (which isn't modified at all once LLD has written it).
@NWilson
> Not only would it be possible to fix tzname, it would in fact be desirable to rip it out along with other bits in library.js, and just use the stock Musl port. I appreciate that's effort for the Emscripten folks, but really the more of Musl they use the better.
> I'm leaning towards (1) - if Emscripten used Musl's tzname, and Musl's malloc, and only implemented the syscalls (rather than replacing bits of Musl) then we'd be OK. The problem is only arising because when they moved to the "syscalls" model they didn't quite finish it off (not that I'm blaming them for it - it's just what happened).
That is not historically correct. We moved to syscalls where it seemed to make sense (e.g. printing), but thought that it was better for code size to use as much of the browser's date/time functionality as possible.
Perhaps we should remeasure and rethink that, of course, a lot has changed since then. But the linux syscall interface is probably not going to be optimal for the web use case in all things.
Right - the actual timezone handling stuff itself needs to be done by the browser in fact, but that can be done via adding new Wasm-specific syscalls rather than replacing portions of Musl with a JS implementation. What I mean by "didn't finish off the conversion to syscalls" is just that Emscripten isn't using the syscall model for all libc-to-JS bridging. I agree that you stopped where you wanted to - but didn't go so far as to convert all the bridging to use the syscall interface.
I don't see how (1) solves the problem though -- is it just an issue with tzname? I imagine this functionality is used by other libraries built with emscripten as well, right? If so, we can't realistically tell them all to change. Or am I misunderstanding?
Ah, just realized @sbc100 mentioned this above.
I would have thought that it is pretty rare. If you have some static variables you want to allocate storage for upfront, how many people are going to use the complex JavaScript stuff that Emscripten uses to allocate space for it - only Emscripten devs will know how to do that. I would have thought 99% of people will just declare their static variable in a C/C++ file instead. And that's all that Emscripten has to do - just move the definition of those variables into a C file to include in libc.a, rather than allocate storage for them behind LLD's back.
So a minimal fix in Emscripten ought not to be too complex, nor will it mess with much (if any) third-party code.
For dynamic linking, the design is different (even though some parts might be similar) - you simply need the entire module's address space to be offsetable relative to a base address that's not known at compile-time. It wouldn't necessarily be quite the same as allowing individual symbols to be included into the linear memory at runtime, which is in fact very different from how most dynamic linking is done. In ELF for example, everything is done at the module level in terms of calculation of base address and so on; you never import individual symbols, only ever whole shared libraries/modules which are relocated en bloc. Hence libc will have a per-module base address that symbol relocations go through, rather than allowing individual undefined symbols to be pulled in from some external source and individually relocated.
> Wasm currently doesn't have anything like a .so. Dynamic libraries are a way off, and don't have many use-cases?
There are a lot of use-cases for shared libraries. The most common one is when a lot of programs use the same library. For example, a lot of programs use zlib1g; if you want to have more than one of them on the same webpage and you statically link it, the browser essentially has to download it twice. With a shared library, it only needs to download it once. Additionally, such libraries could be put on CDNs, similar to what's done with regular popular JS libs. Another use case would be loading different modules at runtime. A lot of programs use shared libraries as modules and load & unload modules only if needed, or load all modules in a directory, or something similar.
Sure, it isn't that important yet, but I hope that they may become possible someday.
> It is possible to use Wasm without Emscripten, I've put together a minimal toolchain on GitHub called "Minscripten"
I know. I've been playing around with https://github.com/yurydelendik/wasmception; it's a bit incomplete, but it's a really easy way to quickly get a working toolchain.
To clarify my idea on how to solve this issue: wasm-ld already has a --allow-undefined-file= option. How about allowing multiple such files, maybe even dropping the option and specifying an extension or header so that they can be used as input files? These files could be used instead of a dll_import attribute to decide if get_global or i32.const should be used, or is that not possible in lld? Normally, a linker can get this information from the shared libraries it links against, so why not require the JS libraries & runtimes to specify which symbols they provide using such a list file? And before the program can run, all strong symbols have to be resolved; how would lld check that all necessary symbols have been resolved if it isn't told which symbols don't need to be resolved at link time?
@Daniel-Abrecht, I think we already support what you want in terms of --allow-undefined. The linker will look for a .imports file alongside any .a file it finds. This means --allow-undefined-file should not be needed, and instead each library can provide its own list. This is how the musl port we use on the waterfall (https://build.chromium.org/p/client.nacl.sdk/waterfall) works. However, I see this as a stopgap. Really this kind of information should be stored in symbol attributes in the .o file itself, not in a side-file.
The real problem here is that any dynamic linker (currently emscripten acts as the dynamic linker) needs to be able to supply data symbols (not just function symbols) to a module. The module cannot know statically the addresses of symbols in other modules, and we don't want the dynamic linker to write to the code section... so we can't use i32.const for these addresses, we must use something like get_global. And you are correct that lld knows exactly where we need one vs the other. The problem is that this is not known by the compiler. We don't know until static link time. This is why I think we need to do what is called "linker relaxation" in this case, where the linker can modify the instructions as part of applying a relocation entry. There is precedent for this in the AMD64 SysV ABI apparently.
@NWilson, regarding the current state of emscripten, the reason this works is that emscripten itself creates the wasm memory object on the JS side, and controls where the stack and heap go. When it uses lld it always runs with --import-memory. All it needs to know is where the wasm module's static data lives. Anything outside that range is fair game controlled by emscripten.
> The linker will look for a .imports file alongside any .a file it finds. This means --allow-undefined-file should not be needed, and instead each library can provide its own list. This is how the musl port we use on the waterfall (https://build.chromium.org/p/client.nacl.sdk/waterfall) works.
This is close to what I want, but I think it's the wrong way around. Musl needs some symbols like the __syscall symbols, but I don't think it should be concerned with where these symbols come from. Instead, I think emscripten and co. should provide a list of symbols they provide and that list should be passed to lld explicitly.
> However, I see this as a stopgap. Really this kind of information should be stored in symbol attributes in the .o file itself, not in a side-file.
I don't see how the authors of c programs and libraries should know in advance how other people provide these symbols. For example, there is no reason the __syscall symbols have to be provided by js, they could also come from an .o file. How things are linked depends on how they are linked, not on the source code.
@Daniel-Abrecht I see what you mean and I think we share the same objective. I see this working via some kind of special library. So for example emscripten could provide, libemscripten.so, which tells the linker about the availability of certain symbols at runtime (allowing them to be undefined at static link time). But I think we are getting a bit off topic for this particular issue. Suffice to say I think we agree that we need to be able to import global addresses as well as functions from the environment (e.g. other dlls or the loader/host/emscripten).
So is the idea for (2) that for defined symbols the linker would just rewrite the immediate field of the i32.const and for undefined symbols it would rewrite the whole instruction to get_global? I agree that's a little ugly, but it's not unprecedented; linkers on other platforms also rewrite instruction sequences (e.g. for TLS model relaxation).
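As a rough illustration of what that rewrite could look like, here is a minimal sketch in C. It relies on the wasm opcodes 0x41 (i32.const) and 0x23 (get_global), and it assumes the relocation site reserves a 5-byte padded LEB128 immediate so that either instruction fits in the same six bytes. The struct symbol layout and the function names are hypothetical, not lld's.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_GET_GLOBAL = 0x23, OP_I32_CONST = 0x41 };

/* Hypothetical view of a resolved (or unresolved) data symbol. */
struct symbol {
    int defined;            /* non-zero if the static link defines it */
    uint32_t address;       /* linear-memory address, if defined */
    uint32_t global_index;  /* index of the imported global, if undefined */
};

/* Unsigned LEB128, always padded to 5 bytes. */
void write_padded_uleb32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        p[i] = (uint8_t)((v & 0x7f) | 0x80);
        v >>= 7;
    }
    p[4] = (uint8_t)(v & 0x7f);
}

/* Signed LEB128, always padded to 5 bytes. */
void write_padded_sleb32(uint8_t *p, int32_t v)
{
    uint32_t u = (uint32_t)v;
    for (int i = 0; i < 4; i++) {
        p[i] = (uint8_t)((u & 0x7f) | 0x80);
        u >>= 7;
    }
    /* Four value bits remain; copy the sign into the other payload bits. */
    p[4] = (uint8_t)((u & 0x0f) | (v < 0 ? 0x70 : 0x00));
}

/* insn points at the opcode byte of an i32.const emitted by the compiler. */
void relax_memory_addr(uint8_t *insn, const struct symbol *sym)
{
    assert(insn[0] == OP_I32_CONST);
    if (sym->defined) {
        /* Defined: keep the instruction, patch only the immediate. */
        write_padded_sleb32(insn + 1, (int32_t)sym->address);
    } else {
        /* Undefined: rewrite the whole instruction to read an import. */
        insn[0] = OP_GET_GLOBAL;
        write_padded_uleb32(insn + 1, sym->global_index);
    }
}
```

Because both encodings occupy the same number of bytes under that padding assumption, the rewrite does not shift any following code.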
I also don't think (3) is necessarily terribly bad either though. I agree that for static linking, code shouldn't need to specify exactly where a dependency will come from, but we are talking here about a symbol that will not be defined at all in the static link. That means its import model is "outside" of the static link environment and it's in the model(s) used by syscalls and/or dynamic libraries. I don't think it's a problem if the syscall layer (i.e. the linking model "underneath" the static link) doesn't work exactly like the static link layer.
And I don't think it's written here anywhere but we've been assuming that we'll have a 2-level namespace (as opposed to an ELF-style flat namespace) for dynamic linking. We can debate that but in any case it's not necessarily a problem if the dynamic library model is different from the static linking model, as it is for non-ELF platforms.
So overall I think my preference might be to first decide whether we really need to have the ability for symbols to be transparently downgraded from "must be resolved in the static link" to "imported into the final binary", or if it's OK to require them to be explicitly marked (either in the code or via some file passed to the linker) as importable in the final binary. If the latter, then we can just go with option 3. In making that decision, I guess we should take any performance implications into account (how many symbols will use different conventions and how much slower that might be) and whether we want to have any difference between functions (where it's easy to auto-downgrade) and data (where we have this problem).
In the short term though, (1) is by far the simplest. Emscripten really doesn't need to do these odd static allocations. It really wouldn't be hard to wean Emscripten off that strategy, given that it simply involves adding a .c file to libc.a that declares the storage for the variables, rather than declaring their storage in JavaScript.
- environ. Should be just replaced with Musl's file https://git.musl-libc.org/cgit/musl/tree/src/env/__environ.c rather than allocated in library.js. This one's odd because I can't immediately see why Emscripten isn't using it already.
- stdin/stderr/stdout. I'm using Musl's versions of these without any trouble in my toolchain. Surely Emscripten doesn't need to implement its own stdio? The emrun support and browser console support can all be done through the syscall interface (ie some fds direct their output to the console). But even using Emscripten's existing stdio implementation, rather than switching to Musl's, it would be trivial to just move the static memory allocations into an object file.
- tzname/daylight/timezone. OK, I agree that Musl's __tz.c source file can't be used in Emscripten. Again though there's no difficulty in moving the memory allocations into a .c file.
In case it's not clear what I'm suggesting - currently Emscripten has code that looks like this:
// In library.js "allocate" some storage that LLD doesn't know anything about
// for these three libc symbols:
tzname: '{{{ makeStaticAlloc(2*Runtime.QUANTUM_SIZE) }}}',
daylight: '{{{ makeStaticAlloc(1) }}}',
timezone: '{{{ makeStaticAlloc(1) }}}',
// Replace with a three-line C file that's added to libc.a, so that the storage
// declaration is visible to the linker. Hence the linking model stays as a nice
// simple static-linkage one, rather than adding lots of fancy dynamic stuff just
// for these few symbols.
char *tzname[2] = { 0, 0 };
long timezone = 0;
int daylight = 0;
Everything else is internal (eg __tm_current) and could be allocated lazily with malloc, or again moved into an object file rather than a JS file.
Adding toolchain support for these "external data symbols" would be an overly-general workaround to maintain ABI compatibility with a historical internal implementation detail from Emscripten's asm.js implementation. It's a big thing to add to the Wasm linking ABI, because it involves standardising a relocation format (for data relocations) that will be used by a runtime linker. I personally don't see the need for it, and I'm afraid that it's as likely to get in the way as it is to help when you come to add a fully-featured dynamic linking implementation.
I agree that (1) could be the simplest way to go. I'm just a little worried about modifying emscripten at this low level, mostly because I'm not experienced at working in that codebase. And I was tempted by (3) because we will for sure need this for DLLs. I think we can be fairly confident that llvm will need to be able to generate get_global code for external data references, so I don't think that part would be wasted effort just for the static linking case. But I will try to implement (1) first since you make some good points here.
@NWilson "environ. Should be just replaced with Musl's file https://git.musl-libc.org/cgit/musl/tree/src/env/__environ.c rather than allocated in library.js. This one's odd because I can't immediately see why Emscripten isn't using it already."
Emscripten lets you set env vars from JS as well as C. It was more convenient to do that from JS, but probably it's possible to do it in C as well.
In general, it sounds like (1) proposes removing the ability to do static allocations from JS (or to only allow such allocations if they are never passed to compiled C). It's probably a good idea to discuss this on the emscripten mailing list first to see if people are using that feature outside of emscripten's own system libraries. Personally I'm skeptical it's worth doing just to get around a temporary limitation of lld (assuming when lld can do dynamic linking it will support this anyhow, iiuc), but if no one on the mailing list is opposed, I'm not opposed either.
I've been investigating replacing makeStaticAlloc() and simply putting the globals in a C file.
This has proved more complicated than initially thought, since emscripten doesn't have any way to export a C global to JS (there is no EXPORTED_GLOBALS anymore). So we would need to create an accessor function for each global on the C side and then export that. I suppose this is why it was done this way in the first place.
Is it still worth it? Or should I move onto importing globals on the wasm side instead?
Even if I create such wrapper functions I would then need to implicitly add them to EXPORTED_FUNCTIONS, otherwise they would be GC'd at link time. At least that is what my experiments show.
> emscripten doesn't have any way to export C global to JS
Not normally, yeah. But for dynamic linking emscripten does export the addresses of all the C globals to JS (the backend metadata contains namedGlobals, which maps global names to their offset in the binary, that's then emitted in the JS glue). That's how globals are linked in dynamic linking in asm.js and asm2wasm currently. Not sure if that helps here, but in principle we could do the same thing, after linking the wasm we know the addresses of each global? I may not fully understand all of this though, sorry.
> This has proved more complicated than initially thought, since emscripten doesn't have any way to export a C global to JS (there is no EXPORTED_GLOBALS anymore). So we would need to create an accessor function for each global on the C side and then export that. I suppose this is why it was done this way in the first place.
I'm confused, I thought that the "fake globals" were for this purpose, so that JS could get the address of data symbols and read/write to those addresses from JS. Sorry for late reply. I agree we definitely need a way to allow for that, accessing globals from JS is a much more basic use case than allocating them!
Well, from what I could tell emscripten doesn't really support this feature. I will investigate what alon mentioned about the namedGlobals used in dynamic linking. But from my investigations so far it looks like the approach of changing emscripten to match the requirements of lld would require some significant emscripten changes that I'm not very confident about working on myself.
I can help with the emscripten side of accessing named globals from wasm, if you want.
Do you mean giving JS access to the named globals that are defined by wasm? That would be great. Is that a feature that would be useful in general to emscripten users? Right now it looks like namedGlobals is only used in the side module case (unless I'm missing something).
Right now s2wasm doesn't add the namedGlobals to the METADATA but that would be pretty easy to fix.
I think it would be useful in general, yeah - better interop between JS and compiled code is good and I expect we'll see more of this with lld etc. It's also nice for debugging actually.
We could do something like set Module.namedGlobals to a JS map between names and addresses. For dynamic linking that could be done dynamically I guess, but for static linking it could be just emitted in the JS based on the info from lld I think.
My main concern is not emitting a huge map with tons of stuff in every build. Perhaps it needs to only be stuff that JS libraries actually need, for example?
Yes, I imagine we could do something like EXPORTED_FUNCTIONS to choose which globals to export. In fact it looks like there used to be something called EXPORTED_GLOBALS in emscripten that got removed.
In fact, perhaps you could really help me out and try to move the time globals into the C side? e.g. have library.js import the tzname global from C/C++ rather than declaring it. This global is needed on both the JS and native sides. It would be a good example because it would involve JS library code that depends on a native global, so meta DCE might even come into play.
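A rough sketch of that kind of filtering, written in C purely for illustration: given a full name-to-address table (as a linker might report it) and a list of requested names, keep only the requested entries. The struct, the function name and the idea of a "wanted" list are assumptions here; emscripten has no such API in the form shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry in a name -> address table for data symbols. */
struct named_global {
    const char *name;
    uint32_t address;
};

/* Copy into out only the globals whose names appear in wanted.
 * Returns the number of entries written, in the order they appear in all. */
size_t filter_named_globals(const struct named_global *all, size_t all_count,
                            const char *const *wanted, size_t wanted_count,
                            struct named_global *out, size_t out_capacity)
{
    size_t n = 0;
    for (size_t i = 0; i < all_count; i++) {
        for (size_t j = 0; j < wanted_count; j++) {
            if (strcmp(all[i].name, wanted[j]) == 0) {
                assert(n < out_capacity);
                out[n++] = all[i];
                break;
            }
        }
    }
    return n;
}
```

Only the filtered table would then need to be emitted into the JS glue, which keeps the map small in builds that export few globals.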
Is that something you might be able to take a look at?
Sure, happy to help with that. The simplest thing is probably to
- define tztime in C
- define a get_tztime() function in C
- export it to JS when necessary
- call that from JS to get the address
Later, as an optimization, we could remove that function once we have a plan for directly accessing C globals from JS. Thoughts?
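The C side of that plan could look roughly like the sketch below. All names carry a demo_ prefix and are hypothetical, chosen so the file does not clash with a host libc; the setter is only a stand-in for what the JS library code would do through the returned address.

```c
#include <assert.h>
#include <stddef.h>

/* Storage declared in C, so the static linker sees the definitions. */
char *demo_tzname[2] = { 0, 0 };
long demo_timezone = 0;
int demo_daylight = 0;

/* Accessor that JS could call to learn the address of the global.
 * As noted earlier in the thread, it would have to be exported
 * (e.g. via EXPORTED_FUNCTIONS) or it gets GC'd at link time. */
char **get_demo_tzname(void)
{
    return demo_tzname;
}

/* Stand-in for the JS side: write through the address the accessor returns. */
void demo_set_tzname(char *std_name, char *dst_name)
{
    assert(std_name != NULL && dst_name != NULL);
    char **slot = get_demo_tzname();
    slot[0] = std_name;
    slot[1] = dst_name;
}
```

The accessor costs one exported function per global, which is the overhead the later optimization would remove.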