illumos-joyent icon indicating copy to clipboard operation
illumos-joyent copied to clipboard

svc.startd hangs forever while ld.so attempts to link shared library in

Open stevenwilliamson opened this issue 7 years ago • 3 comments

We are seeing an issue where starting a service that uses a project (MySQL) at zone boot hangs forever. The svc.startd process is forked and is under the new contract but never gets as far as calling the services start method.

We are hanging in setting the project, for the to be launched process. Looking at a core file of svc.startd then, Were attempting to set the project of the current process before we exec the actual start method.

This hangs in ld.so.1 presumably because we are lazy loading a shared object and setting up bindings.

::stack ld.so.1___lwp_cond_wait+7(fbe501fc, fbe5020c, f8843800, fefd2526, 3, fbe5020c) ld.so.1is_dep_init+0x87(fbe50118, fbe10a20) ld.so.1elf_bndr+0x38c(fbe10a20, 238, fe1e3dd0) ld.so.1elf_rtbndr+0x14(238, fe1e3dd0, f8843994, fe1e4264, fe1e4262, 1000) 0xfbe10a20(82018c8, 82b3008, 0, ffffffff, 0, 0) libproject.so.1setproject+0x27(82018c8, 82b3008, 0, fe8f6183) librestart.so.1restarter_set_method_context+0x4e6(829e008, f8844e2c, f8844e48, 806df19, f8844f34, 0) exec_method+0x89(829ea00, 0, 81c4cb0, 829e008, f8, 1) method_run+0x61e(f8844fac, 0, f8844fa4, 8276080, fef32000, fbf60a40) method_thread+0x16c(8201650, 0, 0, 0) libc.so.1_thrp_setup+0x88(fbf60a40) libc.so.1_lwp_start(fbf60a40, 0, 0, 0, 0, 0) We're setting the mysql project using the mysql user

82018c8/s 0x82018c8: mysql

82b3008/s 0x82b3008: mysql Knowing where we are we can figure out that the function with no symbol is actually setproject_proc.

It looks like this function hasn't been mapped in yet and so the linker is now loading the libproject shared library and attempting to call it's init section. We can see this looking at the Rt_map objects:

{
    rt_public = {
        l_addr = 0xf59d0000
        l_name = 0xfef80389 "/usr/lib/libproject.so.1"
        l_ld = 0xf59d00b4
        l_next = 0
        l_prev = 0xfbf004f0
        l_refname = 0
    }
    rt_pathname = 0xfef80389 "/usr/lib/libproject.so.1"
    rt_padstart = 0xf59d0000
    rt_padimlen = 0x151d4
    rt_msize = 0x151d4
    rt_flags = 0x10420204
    rt_flags1 = 0x482
    rt_tlsmodid = 0
    rt_alias = 0
    rt_fpnode = 0xfbf004a0
    rt_runpath = 0
    rt_runlist = 0
    rt_depends = 0xfbf00f40
    rt_callers = 0xfbf00e80
    rt_handles = 0
    rt_groups = 0
    rt_fct = ufs.82.1.2812`elf_fct
    rt_priv = 0xfbf00b68
    rt_list = ufs.82.1.2812`lml_main
    rt_objfltrndx = 0xffffffff
    rt_symsfltrcnt = 0
    rt_symafltrcnt = 0
    rt_mode = 0x1901
    rt_sortval = 0x2
    rt_cycgroup = 0
    rt_stdev = 0x900040
    rt_stino = 0x3018
    rt_origname = 0xfef80389 "/usr/lib/libproject.so.1"
    rt_dirsz = 0x8
    rt_lmsize = 0x44c
    rt_copy = {
        rtc_r = 0
        rtc_s = 0
    }
    rt_auditors = 0
    rt_audinfo = 0
    rt_syminfo = 0xf59d0258
    rt_initarray = 0
    rt_finiarray = 0
    rt_preinitarray = 0
    rt_mmaps = 0xfbf00920
    rt_mmapcnt = 0x2
    rt_initarraysz = 0
    rt_finiarraysz = 0
    rt_preinitarraysz = 0
  ...snip
                 }
            }
        }
    }
    rt_init_thread = 0xb3
    rt_init = libproject.so.1`_init
    rt_fini = libproject.so.1`_fini
    rt_symintp = ufs.82.1.2812`elf_find_sym
}

The place we are stuck appears to be waiting on a condition variable here: https://github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/sgs/rtld/common/util.c#L636

Im not familiar with the dynamic linking process, im guessing there should be another thread that handles the init section of the to be loaded object and then signals this first thread via the condition variable that it can continue ?

Either way we see this hang approx 1 in ever 10 restarts on platform version 20170817T070613Z. We have not seen it on platform version 20170105T023718Z.

The only commit that I can see that seems to be around this area is this one https://github.com/joyent/illumos-joyent/commit/f7ba5f06c4b2ec24c3ac7cc02f134b3101cc3add but it's not clear if that could cause this behaviour or not.

I have a core file of svc.startd that can be provided if helpful.

stevenwilliamson avatar Sep 21 '17 12:09 stevenwilliamson

From a quick glance at the code, I think the thread that should be doing the initialization is 0xb3, and the function that should be doing the wakeup in that thread looks like it should be call_init(). No idea (yet) why that's not happening. Someone should be sending you a PM on uploading the core to see if we can figure out what's going on.

jasonbking avatar Sep 26 '17 18:09 jasonbking

I have the corefile. Apparently there's only one thread:

nowhere(~)[0]% mdb svc.startd.core
Loading modules: [ svc.startd libumem.so.1 libc.so.1 libuutil.so.1 libnvpair.so.1 libsysevent.so.1 ld.so.1 ]
> $L
lwpid 178 is the only lwp in core of process 28775.
> 

So nobody else can ever cv_signal this blocked thread.

The above dump of Rt_map suggests that thread 179 (0xb3, in rt_init_thread) was supposed to set the FLG_RT_INITDONE, cv_broadcast, and set the init_thread back to 0.

danmcd avatar Sep 28 '17 14:09 danmcd

Indeed, on a previous one i had a poke around with mdb -k to examine the threads (for my own sanity in case something really odd was going off and gcore was misbehaving), and there was only the one thread, stuck waiting on the cv.

stevenwilliamson avatar Sep 28 '17 14:09 stevenwilliamson