graphene
graphene copied to clipboard
Compiled functions are repeated due to use of `static inline`
Our header files define functions as static inline
. That means than when a function is not inlined by compiler, is treated as static
, i.e. separate for a compilation unit. As a result, if it's used in multiple compilation units, it gets repeated in the binary:
$ nm libsysdb.so | cut -d' ' -f3 | sort | uniq -cd | sort -nr
54
53 get_cur_thread
50 get_cur_tid
49 __lock
45 __unlock
20 create_lock
19 qstrgetstr
17 locked
17 die_or_inf_loop
15 qstrsetstr
15 qstrfree
10 qstrempty
10 clear_lock
8 thread_wakeup
8 shim_get_tcb
8 set_handle_fs
8 destroy_lock
7 size_align_up
7 __set_free_mem_area
7 __ref_inc
7 __ref_dec
7 memory_migrated
7 get_mem_obj_from_mgr_enlarge
7 get_ipc_msg_size
7 create_mem_mgr_in_place
7 create_mem_mgr
6 init_align_up
6 free_mem_obj_to_mgr
5 __shim_tcb_init
5 set_tls
5 LINUX_PROT_TO_PAL
5 is_internal
5 get_thread_handle_map
5 debug_setbuf
4 thread_sleep
4 shim_tcb_init
4 set_cur_thread
4 qstrcopy
4 pal_get_tcb
4 dentry_get_path_into_qstr
4 create_event
3 shim_sigismember
3 set_handle_map
3 set_event
3 __range_not_ok
3 pal_context_set_retval
3 lock_created
3 is_internal_tid
3 get_ipc_msg_with_ack_size
3 enable_locking
3 create_lock_runtime
3 access_ok
2 thread_setwait
2 spinlock_unlock
2 spinlock_lock
2 _spinlock_is_locked
2 spinlock_is_locked
2 shim_sigdelset
2 set_default_tls
2 pal_context_get_sp
2 pal_context_get_retval
2 pal_context_copy
2 LINUX_OPEN_FLAGS_TO_PAL_OPTIONS
2 install_new_event
2 event_handle
2 dentry_get_name
2 debug_spinlock_take_ownership
2 debug_spinlock_giveup_ownership
2 clear_event
2 __bytes2hexstr
That's compiled with DEBUG. Without DEBUG the list is much shorter:
23 lock.isra.0
12 lock.isra.0.constprop.0
4 __ref_dec.part.0
4 die_or_inf_loop
2 install_new_event
(I'm not sure if the first two are due to this issue of a quirk of GCC compilation).
I haven't checked but (at least with DEBUG) these repeated functions might account for a significant part of the LibOS binary (and some of libpal as well).
These functions should be defined as inline
. However, that currently breaks the build during linking. For example, after changing create_event
from static inline
to inline
:
ld: shim_async.o: in function `init_async':
/home/pawel/graphene/LibOS/shim/src/shim_async.c:130: undefined reference to `create_event'
ld: bookkeep/shim_handle.o: in function `rs_handle':
/home/pawel/graphene/LibOS/shim/src/bookkeep/shim_handle.c:760: undefined reference to `create_event'
ld: ipc/shim_ipc_helper.o: in function `init_ipc_helper':
/home/pawel/graphene/LibOS/shim/src/ipc/shim_ipc_helper.c:181: undefined reference to `create_event'
ld: sys/shim_epoll.o: in function `shim_do_epoll_create1':
/home/pawel/graphene/LibOS/shim/src/sys/shim_epoll.c:53: undefined reference to `create_event'
I suspect that the definitions are static inline
because the above did not work, and that's probably because we have a custom linker script.
Proposed solution
Change all instances of static inline
in header files to inline
, and fix the build process so that we can use inline
.
More information
Wikipedia, emphasis mine:
Some recommend an entirely different approach, which is to define functions as
static inline
instead of inline in header files.[2] Then, no unreachable code will be generated. However, this approach has a drawback in the opposite case: Duplicate code will be generated if the function could not be inlined in more than one translation unit. The emitted function code cannot be shared among translation units because it must have different addresses. This is another drawback; taking the address of such a function defined asstatic inline
in a header file will yield different values in different translation units. Therefore,static inline
functions should only be used if they are used in only one translation unit, which means that they should only go to the respective .c file, not to a header file.
Ad. the wiki quote: all of this is just a workaround for us not using Link Time Optimization and having all code in C files. We should be doing this, but AFAIK our hackery in the linker scripts needs first to be removed (and ideally all our linker scripts altogether), but this is (again: AFAIK) blocked on cleaning up IPC/checkpointing, which uses some linker magic.
CC: @boryspoplawski
When compiled with DEBUG, -O2 isn't specified. See Scripts/Makefile.configs. So this isn't surprise. Can you please check DEBUG + -O2?
Agreed with @mkow, but is there a particular problem with the current situation? Is it just annoying for debugging, or something else is blocked by this?
When compiled with DEBUG, -O2 isn't specified. See Scripts/Makefile.configs. So this isn't surprise. Can you please check DEBUG + -O2?
Roughly similar to just -O2:
8 __lock.constprop.3
8 __lock
5 __lock.constprop.4
4 die_or_inf_loop
4 __set_free_mem_area.part.2
4 __lock.constprop.6
3 __lock.constprop.5
2 install_new_event
2 free_mem_obj_to_mgr.isra.5.part.6
2 __unlock.part.1.constprop.5
2 __unlock
2 __ref_dec.part.3
Agreed with @mkow, but is there a particular problem with the current situation? Is it just annoying for debugging, or something else is blocked by this?
Since it turns out the issue is mostly with DEBUG=1
, yes, it's mostly annoying for debugging - the file size is blown up.
As for blocking... well, we were hesitant to enable inlining for a frequently-used function in #2177 because of this (because in debug-mode it would appear in 50+ copies, one for each .c
file it's used in).
Anyway, yes, I agree we should do both: fix the issue with inline
, and enable LTO. Assuming LTO works well with gcc - I don't have much experience with it.
(...) our hackery in the linker scripts needs first to be removed (and ideally all our linker scripts altogether), but this is (again: AFAIK) blocked on cleaning up IPC/checkpointing, which uses some linker magic.
Yes, we need to rework checkpointing first (it uses section sorting, don't ask me why it's done this way).
Regarding this issue: tbh I don't see the problem, we will get rid of this once we have LTO. For now I don't mind having 5 functions repeated couple of times (especially considering they are very small).
I guess it's not too serious, but having fifty-three copies of get_cur_thread
in the binary is painful to look at :)