Regression 3d573aa: vDSO handler hash_first segmentation fault
Commit 3d573aacc introduced a regression: programs that link against the vDSO crash with a segmentation fault. The previous commit, 2562534de9ab4cb472929, does not have this bug.
The test environment has glibc-2.23 (Gentoo Prefix) and linux-2.6.32 (CentOS 6.5). The bug cannot be reproduced on newer systems, such as a standalone Gentoo installation with glibc-2.24 and linux-4.7.0.
Core dump for python:
$ dmtcp_launch python
[1] 30978 segmentation fault (core dumped) dmtcp_launch python
$ gdb python core-python.30978
Reading symbols from python...(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New LWP 30978]
[New LWP 30983]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/fefs/disk/usr100/gentoo/lib64/libthread_db.so.1".
Core was generated by `python'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fcbaf1893b4 in hash_first (use_gnu_hash=<optimized out>, hash_table=<optimized out>, name=0x7fcbaf1bd721 " &rlim) == 0) failed'\n") at dmtcp_dlsym.cpp:149
149 return bucket[elf_hash(name) % nbucket]; // return index into symbol table
[Current thread is 1 (Thread 0x7fcbb0058740 (LWP 30978))]
(gdb) bt
#0 0x00007fcbaf1893b4 in hash_first (use_gnu_hash=<optimized out>, hash_table=<optimized out>, name=0x7fcbaf1bd721 " &rlim) == 0) failed'\n") at dmtcp_dlsym.cpp:149
#1 dlsym_default_internal_library_handler (handle=<optimized out>, symbol=0x7fcbaf1bd721 " &rlim) == 0) failed'\n", version=0x0, tags_p=0x7fff697e60c0, default_symbol_index_p=0x7fff697e60bc)
at dmtcp_dlsym.cpp:326
#2 0x00007fcbaf189954 in dlsym_default_internal_flag_handler (handle=0x0, libname=0x7fcbaf1bd736 "\n", symbol=0x7fcbaf1bd721 " &rlim) == 0) failed'\n", version=0x0, addr=<optimized out>, tags_p=0x7fff697e60c0,
default_symbol_index_p=0x7fff697e60bc) at dmtcp_dlsym.cpp:397
#3 0x00007fcbaf189b9b in dmtcp::DmtcpMessage::DmtcpMessage (this=0x7fff697e63b0, t=<optimized out>) at dmtcpmessagetypes.cpp:32
#4 0x0000000000000000 in ?? ()
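Both quoted source lines dereference the hash table: line 145 reads nbucket from it, and line 149 indexes the bucket array by the classic SysV ELF hash of the symbol name. For reference, here is a minimal sketch of that lookup, reconstructed around the two quoted lines (the standalone elf_hash helper and the simplified signature are illustrative; DMTCP's actual hash_first also takes a use_gnu_hash flag, as the frames show):

#include <elf.h>

// Classic System V ELF hash, as specified in the ELF ABI.
static unsigned long elf_hash(const char *name) {
  unsigned long h = 0, g;
  while (*name) {
    h = (h << 4) + (unsigned char)*name++;
    g = h & 0xf0000000UL;
    if (g) h ^= g >> 24;
    h &= ~g;
  }
  return h;
}

// DT_HASH layout: [nbucket][nchain][bucket[nbucket]][chain[nchain]].
// Returns the symbol-table index of the first symbol in name's chain.
static Elf32_Word hash_first(const char *name, Elf32_Word *hash_table) {
  Elf32_Word nbucket = *hash_table++;       // line 145: first dereference
  hash_table++;                             // skip nchain
  Elf32_Word *bucket = hash_table;
  return bucket[elf_hash(name) % nbucket];  // line 149: second dereference
}

A hash_table pointer into unmapped memory therefore faults on either line, which matches the two traces in this report (line 149 here, line 145 in the R trace below).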
Core dump for R (started via R's built-in shell wrapper):
$ dmtcp_launch R
WARNING: ignoring environment value of R_HOME
[1] 32604 segmentation fault (core dumped) bin/dmtcp_launch R
$ gdb sh core-R.32604
Reading symbols from sh...(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New LWP 32604]
[New LWP 32605]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/fefs/disk/usr100/gentoo/lib64/libthread_db.so.1".
Core was generated by `/fefs/disk/usr100/gentoo/bin/sh /disk/usr100/gentoo/usr/bin/R'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 hash_first (use_gnu_hash=0, hash_table=0xffffffffff700124, name=0x7fb286486721 "__vdso_clock_gettime") at dmtcp_dlsym.cpp:145
145 Elf32_Word nbucket = *hash_table++;
[Current thread is 1 (Thread 0x7fb287321740 (LWP 32604))]
(gdb) bt
#0 hash_first (use_gnu_hash=0, hash_table=0xffffffffff700124, name=0x7fb286486721 "__vdso_clock_gettime") at dmtcp_dlsym.cpp:145
#1 dlsym_default_internal_library_handler (handle=handle@entry=0x7fb28735d6c0, symbol=symbol@entry=0x7fb286486721 "__vdso_clock_gettime", version=version@entry=0x0, tags_p=tags_p@entry=0x7fff965c69b0,
default_symbol_index_p=default_symbol_index_p@entry=0x7fff965c69ac) at dmtcp_dlsym.cpp:326
#2 0x00007fb286452954 in dlsym_default_internal_flag_handler (handle=handle@entry=0x0, libname=libname@entry=0x7fb286486736 "linux-vdso", symbol=symbol@entry=0x7fb286486721 "__vdso_clock_gettime",
version=version@entry=0x0, addr=<optimized out>, tags_p=tags_p@entry=0x7fff965c69b0, default_symbol_index_p=0x7fff965c69ac) at dmtcp_dlsym.cpp:427
#3 0x00007fb286452b9b in dmtcp_dlsym_lib_fnc_offset (libname=libname@entry=0x7fb286486736 "linux-vdso", symbol=symbol@entry=0x7fb286486721 "__vdso_clock_gettime") at dmtcp_dlsym.cpp:560
#4 0x00007fb286468251 in dmtcp::ProcessInfo::serialize (this=0x7fb28483c408, o=...) at processinfo.cpp:615
#5 0x00007fb286469f04 in dmtcp_ProcessInfo_EventHook (event=event@entry=DMTCP_EVENT_PRE_EXEC, data=data@entry=0x7fff965c6d50) at processinfo.cpp:67
#6 0x00007fb286430dcb in dmtcp::DmtcpWorker::eventHook (event=DMTCP_EVENT_PRE_EXEC, data=0x7fff965c6d50) at dmtcpworker.cpp:577
#7 0x00007fb28643dea0 in dmtcpPrepareForExec (path=path@entry=0x8ebe00 "/disk/usr100/gentoo/usr/bin/uname", argv=argv@entry=0x8e6b30, filename=filename@entry=0x7fff965c7380,
newArgv=newArgv@entry=0x7fff965c7388) at execwrappers.cpp:351
#8 0x00007fb2864407a3 in execve (filename=0x8ebe00 "/disk/usr100/gentoo/usr/bin/uname", argv=0x8e6b30, envp=0x8eb270) at execwrappers.cpp:550
#9 0x00007fb286b1a062 in execve (filename=0x8ebe00 "/disk/usr100/gentoo/usr/bin/uname", argv=0x8e6b30, envp=0x8eb270) at ipc/ssh/ssh.cpp:504
#10 0x000000000041e4a2 in ?? ()
#11 0x000000000041ffd6 in ?? ()
#12 0x0000000000420b92 in ?? ()
#13 0x0000000000466f6f in ?? ()
#14 0x000000000043cc02 in ?? ()
#15 0x000000000044316c in ?? ()
#16 0x000000000044423c in ?? ()
#17 0x000000000044439a in ?? ()
#18 0x000000000043d72e in ?? ()
#19 0x000000000043dbdf in ?? ()
#20 0x00000000004460f5 in ?? ()
#21 0x000000000041eb0f in ?? ()
#22 0x0000000000420b92 in ?? ()
#23 0x000000000042276e in ?? ()
#24 0x000000000042086a in ?? ()
#25 0x000000000042276e in ?? ()
#26 0x000000000042086a in ?? ()
#27 0x000000000042276e in ?? ()
#28 0x00000000004216c0 in ?? ()
#29 0x000000000042276e in ?? ()
#30 0x0000000000420d27 in ?? ()
#31 0x000000000042276e in ?? ()
#32 0x000000000040ac02 in ?? ()
#33 0x00000000004096c7 in ?? ()
#34 0x00007fb285c32720 in __libc_start_main () from /fefs/disk/usr100/gentoo/lib64/libc.so.6
#35 0x0000000000409e69 in ?? ()
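For context on the hash_table value in this trace: 0xffffffffff700124 falls in the 0xffffffffff700000 region, which older x86-64 kernels (including linux-2.6.32) use as the prelink address of the vDSO even though the actual mapping lands elsewhere, so the d_ptr entries in the vDSO's dynamic section are absolute, unmapped prelink addresses. Newer kernels (including linux-4.7.0) link the vDSO at base 0, so d_ptr holds an offset to be added to the runtime base. Below is a standalone sketch that prints the raw DT_HASH d_ptr next to the runtime vDSO base (illustrative, not DMTCP code; it assumes only the standard getauxval/AT_SYSINFO_EHDR interface):

#include <elf.h>
#include <link.h>
#include <stdio.h>
#include <sys/auxv.h>

int main() {
  // Runtime base of the vDSO as actually mapped into this process.
  ElfW(Ehdr) *ehdr = (ElfW(Ehdr) *)getauxval(AT_SYSINFO_EHDR);
  ElfW(Phdr) *phdr = (ElfW(Phdr) *)((char *)ehdr + ehdr->e_phoff);

  for (int i = 0; i < ehdr->e_phnum; i++) {
    if (phdr[i].p_type != PT_DYNAMIC) continue;
    // The vDSO is a complete ELF image in memory, so base + p_offset
    // reaches the dynamic section under either linking convention.
    for (ElfW(Dyn) *dyn = (ElfW(Dyn) *)((char *)ehdr + phdr[i].p_offset);
         dyn->d_tag != DT_NULL; dyn++) {
      if (dyn->d_tag == DT_HASH) {
        // On linux-2.6.32 this prints a d_ptr near 0xffffffffff700124,
        // far from the vdso base; on linux-4.7.0 it prints a small
        // offset instead.
        printf("DT_HASH d_ptr = %#lx, vdso base = %p\n",
               (unsigned long)dyn->d_un.d_ptr, (void *)ehdr);
      }
    }
  }
  return 0;
}

If the lookup code assumes a single convention for d_ptr, it works on new kernels but dereferences an unmapped prelink address on linux-2.6.32, which would explain why the crash reproduces only on the older system.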
I believe the issue was fixed by the later commits 05450f32 and eb31cfce. Are you able to reproduce this with the latest 2.5 branch?
Yes, it can still be reproduced at commit c611a052b6ed3ade2b. I have run git bisect between c611a052b6 (bad) and tag 2.5.0 (good).
Thanks, I suspect I know what's going on. I don't think we'll learn anything new from running a git-bisect. As you noted, the bug was introduced in commit 3d573aa (after the 2.5.0 release). I'll try to reproduce this locally and provide a fix.
@heroxbd I am trying to follow the instructions described here on a CentOS-6.8 machine. Please let me know if there's anything specific required in order to reproduce this issue.
@rohgarg Thank you very much for your effort.
If this does not work, I can prepare a Docker/LXC image for you.
@heroxbd Is there a way to short-circuit the installation process? I left the installation running overnight, and it still hadn't finished when I checked in the morning; I had to kill the job in the middle, so I'm not sure how much progress it made. The other question I have is: if I restart the process, will it continue from where it left off?
@rohgarg Yes, it will continue from where it left off.
Okay, I have been able to diagnose the issue now. Hopefully, I'll be able to provide a fix soon.
So the issue is a little more involved than I had anticipated, and a general fix would require more changes to the existing design. I think we'd have to defer this to a later release. We can keep this issue open, or open another one specifically to track progress on vDSO-related changes. In the short term, for the next release, I propose that we hide the changes behind a configure option or a run-time option.
Thank you for the update. I think your proposal is the best way to move forward. I'd vote for a configure option.
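For illustration, the proposed configure-time guard could look like the sketch below. The option name (--enable-vdso-dlsym), the macro ENABLE_VDSO_DLSYM, and the wrapper function are all hypothetical; only dmtcp_dlsym_lib_fnc_offset and its arguments come from the backtrace above, and its return type is assumed here.

// configure.ac could define the macro roughly as:
//   AC_ARG_ENABLE([vdso-dlsym],
//     [AS_HELP_STRING([--enable-vdso-dlsym],
//        [resolve vDSO symbol offsets at pre-exec (experimental)])],
//     [AC_DEFINE([ENABLE_VDSO_DLSYM], [1], [Enable vDSO dlsym lookup])])

#include <stdint.h>

// Declared in DMTCP; return type assumed for this sketch.
uint64_t dmtcp_dlsym_lib_fnc_offset(const char *libname, const char *symbol);

static uint64_t vdsoClockGettimeOffset() {
#ifdef ENABLE_VDSO_DLSYM
  // The call that currently crashes (frame #3 of the R trace), now opt-in.
  return dmtcp_dlsym_lib_fnc_offset("linux-vdso", "__vdso_clock_gettime");
#else
  return 0;  // vDSO lookup disabled by default; callers handle offset 0
#endif
}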