machinekit-hal: halcmd doesn't stop on hal_init fail

Issue by Fyleo Mon May 29 16:58:15 2017 Originally opened as https://github.com/machinekit/machinekit/issues/1213

Hi,

I got a segfault in halcmd that was caused by a library (a misconfigured Xenomai 3) needed by ulapi-xenomai.so.

The dlopen() call fails in ulapi_load(). The failure is handled correctly up to hal_comp.c:hal_xinit(), which returns _halerrno when the subroutine's return value is NULL (which was my case). However, halcmd.c:halcmd_startup() only checks whether the return value of hal_init() is negative (which it is not here) and then continues by calling hal_ready(). hal_ready() then uses the mutex in hal_data, which was never initialized because of the premature exit, and this produces a segfault on an access to address 0x4.
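
The failure mode described above can be illustrated with a minimal sketch. This is not the Machinekit code: `fake_hal_init()` and `startup_would_continue()` are hypothetical stand-ins, and the values 38 and 72 are assumed for illustration only. The sketch shows just one thing, namely that a caller testing only for a negative return treats a positive error code as a valid component id.

```c
/* Hypothetical stand-in for an init routine whose inner call failed.
 * Assumption: the failure is reported as a positive code rather than
 * a negative one. */
int fake_hal_init(int inner_call_failed)
{
    const int assumed_error_code = 38; /* illustrative value only */
    const int valid_component_id = 72; /* illustrative value only */

    return inner_call_failed ? assumed_error_code : valid_component_id;
}

/* Mirrors the caller-side test described above: only a negative
 * return value is treated as a failure. Returns 1 if startup would
 * carry on, 0 if it would stop. */
int startup_would_continue(int comp_id)
{
    if (comp_id < 0)
        return 0;
    return 1;
}
```

With these stand-ins, `startup_would_continue(fake_hal_init(1))` evaluates to 1, so the caller carries on even though initialization failed.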

I don't know what the best way to fix this is.

Aug 03 '18 15:08 ArcEye

Comment by ArcEye Mon May 29 17:30:22 2017

We need code that can reliably reproduce the error to take it any further. It is not clear if you are talking about a standard start-up sequence or your own code. No idea what platform, image version, git version etc etc.

In other words more info please.

The hal_init should return a negative value, so it will be crucial to get the exact sequence that ends up with a test not returning a valid result.

Thanks for the report.

Comment by Fyleo Mon May 29 17:56:50 2017

Sorry, I forgot to include some information. I'm on a BeagleBone Black Wireless, which runs on the Octavo SoC. As the official 3.8 Xenomai kernel does not seem to boot on this SoC, I'm trying to run Machinekit on the official 4.4 Xenomai kernel (currently running r105).

This kernel is a Xenomai v3 RT kernel, so I recompiled Machinekit from git against the Xenomai v3 library.

I made only minor changes in rtapi (see the current diff file in the attachment). I will publish these changes once everything works and I'm satisfied with the code (which I'm not right now).

This happens during a standard start-up sequence:

machinekit@beaglebone:~/machinekit-build$ machinekit -l
MACHINEKIT - 0.1
Machine configuration directory is '/home/machinekit/machinekit/configs/sim.axis'
Machine configuration file is 'axis_mm.ini'
Starting Machinekit...
/home/machinekit/machinekit-build/scripts/realtime: line 174: 30677 Segmentation fault      halcmd ping
io started
/home/machinekit/machinekit-build/scripts/linuxcnc: line 726: 30683 Segmentation fault      $HALCMD loadusr -Wn iocontrol $EMCIO -ini "$INIFILE"
halcmd loadusr io started
/home/machinekit/machinekit-build/scripts/linuxcnc: line 737: 30686 Segmentation fault      $HALCMD loadusr -Wn halui $HALUI -ini "$INIFILE"
/home/machinekit/machinekit-build/scripts/linuxcnc: line 753: 30693 Segmentation fault      $HALCMD -i "$INIFILE" -f $CFGFILE
Shutting down and cleaning up Machinekit...
/home/machinekit/machinekit-build/scripts/linuxcnc: line 530: 30715 Segmentation fault      $HALCMD stop
/home/machinekit/machinekit-build/scripts/linuxcnc: line 530: 30718 Segmentation fault      $HALCMD unload all
/home/machinekit/machinekit-build/scripts/realtime: line 260: 30804 Segmentation fault      halcmd shutdown
Cleanup done
Machinekit terminated with an error.  You can find more information in the log:
    /home/machinekit/linuxcnc_debug.txt
and
    /home/machinekit/linuxcnc_print.txt
as well as in the output of the shell command 'dmesg' and in the terminal

The stack trace of halcmd at the segfault is:

Program received signal SIGSEGV, Segmentation fault.
0xb6fa062a in rtapi_test_and_set_bit (nr=0, bitmap=0x4)
    at rtapi/rtapi_bitops.h:81
81	    return (__atomic_fetch_or(bitmap + RTAPI_BIT_WORD(nr),
(gdb) i s
#0  0xb6fa062a in rtapi_test_and_set_bit (nr=0, bitmap=0x4)
    at rtapi/rtapi_bitops.h:81
#1  0xb6fa0706 in rtapi_mutex_get (mutex=0x4) at rtapi/rtapi.h:553
#2  0xb6fa14e0 in halg_ready (use_hal_mutex=1, comp_id=38)
    at hal/lib/hal_comp.c:351
#3  0x00013688 in hal_ready (comp_id=38) at hal/lib/hal.h:413
#4  0x000138fe in halcmd_startup (quiet=0, uri=0x0,
    svc_uuid=0x3f890 "a42c8c6b-4025-4f83-ba28-dad21114744a")
    at hal/utils/halcmd.c:168
#5  0x0002050c in main (argc=2, argv=0xbefff074) at hal/utils/halcmd_main.c:278
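
One plausible reading of the `0x4` address in frames #0 and #1 is a member offset applied to a NULL base pointer. The sketch below assumes a hypothetical `fake_hal_data` layout (not the real `hal_data` structure) in which the mutex word sits 4 bytes into the structure. It also shows a guard that fails cleanly instead of touching the mutex when the base pointer is NULL. All names are invented for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT the real hal_data structure: a 4-byte
 * field followed by the mutex word. */
struct fake_hal_data {
    int32_t version;
    int32_t mutex;
};

/* Offset of the mutex inside the structure. With a NULL base pointer,
 * the address of the mutex member would equal this offset. */
size_t fake_mutex_offset(void)
{
    return offsetof(struct fake_hal_data, mutex);
}

/* Guarded variant of a "ready" step: return an error instead of
 * dereferencing a NULL shared-data pointer. */
int fake_ready(struct fake_hal_data *data)
{
    if (data == NULL)
        return -EINVAL;
    data->mutex = 1; /* stand-in for taking the mutex */
    return 0;
}
```

Under the assumed layout, `fake_mutex_offset()` is 4, which would match the address in the trace if the shared-data pointer was NULL when the mutex was taken.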

ulapi_load (ulapi_switch=0x3ca44 <rtapi_switch>) at rtapi/ulapi_autoload.c:169
169	    char *instance = getenv("MK_INSTANCE");
170	    char *debug_env = getenv("ULAPI_DEBUG");
171	    int size = 0;
175	    if (instance != NULL)
178	    if (debug_env)
181	    rtapi_set_msg_level(ulapi_debug);
184	    rtapi_set_logtag("ulapi");
198	    shm_common_init();
200	    globalkey = OS_KEY(GLOBAL_KEY, rtapi_instance);
201	    retval = shm_common_new(globalkey, &size,
204	    if (retval == -ENOENT) {
213	    if (retval < 0) {
221	    if (size < sizeof(global_data_t)) {
229	    if (global_data->magic != GLOBAL_READY) {
241	    ringbuffer_init(shm_ptr(global_data, global_data->rtapi_messages_ptr),
246	    global_heap = &global_data->heap;
249	    flavor = flavor_byid(global_data->rtapi_thread_flavor);
250	    if (flavor == NULL) {
257	    snprintf(ulapi_lib_fname,PATH_MAX,"%s/%s-%s%s",
258		     EMC2_RTLIB_DIR, ulapi_lib, flavor->name, flavor->so_ext);
257	    snprintf(ulapi_lib_fname,PATH_MAX,"%s/%s-%s%s",
258		     EMC2_RTLIB_DIR, ulapi_lib, flavor->name, flavor->so_ext);
257	    snprintf(ulapi_lib_fname,PATH_MAX,"%s/%s-%s%s",
261	    if ((ulapi_so = dlopen(ulapi_lib_fname, RTLD_GLOBAL|RTLD_NOW))  == NULL) {
262		errmsg = dlerror();
(gdb) p ulapi_lib_fname 
$5 = "/home/machinekit/machinekit-build/rtlib/ulapi-xenomai.so\000\067-A\000\006\n\aA\b\001\t\002\n\003\f\001\022\004\023\001\024\001\025\001\027\003\030\001\032\002\033\003\034\001\"\001\000\000\000\000\000\000\000\220\345\377\276", '\000' <repeats 40 times>, "\v\000\000\000\a\000\000\000\002\000\000\000t\001\000\000t\001\000\000$", '\000' <repeats 11 times>, "\004\000\000\000\000\000\000\000\036\000\000\000\a\000\000\000\002\000\000\000\230\001\000\000"...
(gdb) n
263		rtapi_print_msg(RTAPI_MSG_ERR,
267		return -ENOENT;
344	}
(gdb) p errmsg
$7 = 0x4c9b0 "libalchemy.so.0: shared object cannot be dlopen()ed"

The Xenomai (3.0.4) deb package was built with the default configure flags plus --enable-smp.

Tell me if you need more information.

Comment by Fyleo Mon May 29 18:12:02 2017

Continuation of the trace through the end of hal_init:

_ulapi_init (modname=0xbeffea3c "HAL_hal_lib31647") at rtapi/ulapi_autoload.c:86
86		return -ENOSYS;
97	}
halg_xinitfv (use_hal_mutex=0, type=4, userarg1=0, userarg2=0, ctor=0x0, dtor=0x0, fmt=0xb6fad564 "hal_lib%ld", ap=...) at hal/lib/hal_comp.c:122
122		if (comp_id < 0) {
123		    HALFAIL_NULL(comp_id, "rtapi init(%s) failed", rtapi_name);
(gdb) p comp_id
$1 = -38
(gdb) n
97		WITH_HAL_MUTEX_IF(use_hal_mutex && (hal_data != NULL));
277	}
halg_xinitf (use_halmutex=0, type=4, userarg1=0, userarg2=0, ctor=0x0, dtor=0x0, fmt=0xb6fad564 "hal_lib%ld") at hal/lib/hal_comp.c:58
58	    return comp;
59	}
halg_xinitfv (use_hal_mutex=1, type=2, userarg1=0, userarg2=0, ctor=0x0, dtor=0x0, fmt=0xb6fad450 "%s", ap=...) at hal/lib/hal_comp.c:108
108		    if (hallib == NULL)
(gdb) p hallib
$2 = (struct hal_comp *) 0x0
(gdb) n
109			return NULL;
97		WITH_HAL_MUTEX_IF(use_hal_mutex && (hal_data != NULL));
277	}
halg_xinitf (use_halmutex=1, type=2, userarg1=0, userarg2=0, ctor=0x0, dtor=0x0, fmt=0xb6fad450 "%s") at hal/lib/hal_comp.c:58
58	    return comp;
59	}
hal_xinit (type=2, userarg1=0, userarg2=0, ctor=0x0, dtor=0x0, name=0x3d918 <comp_name> "halcmd31647") at hal/lib/hal_comp.c:36
36	    return c == NULL ? _halerrno : hh_get_id(&c->hdr);
37	}
hal_init (name=0x3d918 <comp_name> "halcmd31647") at hal/lib/hal.h:334
334	}
halcmd_startup (quiet=0, uri=0x0, svc_uuid=0x3f890 "a42c8c6b-4025-4f83-ba28-dad21114744a") at hal/utils/halcmd.c:157
157	    if (quiet) rtapi_set_msg_level(msg_lvl_save);
159	    hal_flag = 0;
161	    if (comp_id < 0) {
(gdb) p comp_id
$3 = 38

Comment by ArcEye Tue May 30 11:52:34 2017

> Sorry, I forgot to include some information. I'm on a BeagleBone Black Wireless, which runs on the Octavo SoC. As the official 3.8 Xenomai kernel does not seem to boot on this SoC, I'm trying to run Machinekit on the official 4.4 Xenomai kernel (currently running r105). This kernel is a Xenomai v3 RT kernel, so I recompiled Machinekit from git against the Xenomai v3 library. I made only minor changes in rtapi

It is a rather specialised corner case, which is some relief. I have never hit anything quite the same before in various 'off piste' poking around sessions.

I have however previously seen errors where the return value from a failed call seems to be used as a data address for a subsequent call, with predictably fatal consequences.

Will do some poking around in the next few days.

Comment by ArcEye Wed May 31 14:31:45 2017

@Fyleo

> I made only minor changes in rtapi (see the current diff file in the attachment).

Nothing is attached. Can you link to the diff or insert it, please?

Comment by Fyleo Wed May 31 20:21:09 2017

Sorry, here is the changeset: xenomaiv3.patch.txt

Comment by ArcEye Mon Jun 5 15:14:33 2017

Whilst the situation that brought about your segfault is unlikely to reoccur often, if at all, it does highlight some changes to function calls that I have misgivings about.

The old code more often used to pass a recipient pointer in the function call and the return value was for error checking.

An example of that is the hal_pin_xxx_newf() functions, which take the address of the pointer to be initialised in the inst_struct; a non-zero return value indicates an error.

halx_pin_xxx_newf() returns a pin_pointer directly, and an error return is generated from _halerrno only when that pointer is tested and found to be NULL.
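
The two calling conventions contrasted above can be sketched side by side. The functions below are hypothetical and far simpler than the real hal_pin_xxx_newf() and halx_pin_xxx_newf() signatures; `fake_last_error` is an invented analogue of _halerrno, and its sign convention is an assumption.

```c
#include <errno.h>
#include <stddef.h>

static int fake_storage; /* stands in for the allocated pin data */
int fake_last_error;     /* hypothetical analogue of _halerrno */

/* Older style: the result goes through an out-parameter and the
 * return value carries only the status (0 or a negative errno). */
int fake_pin_new_status(int should_fail, int **pin_out)
{
    if (should_fail)
        return -ENOMEM;
    *pin_out = &fake_storage;
    return 0;
}

/* Newer style: the pointer is the return value; on failure it is NULL
 * and the reason is stored in a separate error variable. */
int *fake_pin_new_ptr(int should_fail)
{
    if (should_fail) {
        fake_last_error = -ENOMEM;
        return NULL;
    }
    return &fake_storage;
}
```

In the first style the status cannot be confused with the result. In the second, the caller has to remember both the NULL test and the separate error variable.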

_halerrno is also never cleared, so if it was not set for some strange reason, there is no absolute guarantee that its value relates to that error. (There is only one usage of hal_errorcount(1), which resets the errors, in the entire codebase.)
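
A small sketch of that risk, using invented names (`fake_errcode` stands in for _halerrno; nothing here is Machinekit API): a failing call that forgets to record its reason makes the wrapper return whatever the error variable held beforehand, which may be a stale code or even 0. Clearing the variable before the call and substituting a generic code when nothing was recorded is one possible mitigation, shown only as an assumption about how this could be handled.

```c
#include <errno.h>
#include <stddef.h>

int fake_errcode; /* hypothetical analogue of _halerrno */

/* A failing lookup that, by mistake, never records why it failed. */
const char *fake_lookup_forgets_errcode(void)
{
    return NULL;
}

/* Wrapper in the "return the error variable when the pointer is NULL"
 * style. If clear_first is nonzero, the error variable is reset before
 * the call and a generic code is substituted when nothing was recorded. */
int fake_checked_lookup(int clear_first)
{
    if (clear_first)
        fake_errcode = 0;

    if (fake_lookup_forgets_errcode() == NULL) {
        if (clear_first && fake_errcode == 0)
            return -EINVAL;  /* nothing recorded: report a generic failure */
        return fake_errcode; /* may be stale, or 0 */
    }
    return 0;
}
```

Without the reset, a leftover `-ENOENT` from an unrelated earlier failure is returned as if it described this one, and a leftover 0 makes the failure look like success.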

If a call goes down several layers and ends up aborting with a NULL return, which gets passed back as a value, the if(comp_id < 0) test, which looks for a negative value, will probably fail to detect the error, as in your case.

AFAICT there is no likelihood that the test in halcmd.c:halcmd_startup() will ever encounter a comp_id of 0, which translates as the first entry in the module register. This is the output with a debugging fprintf() inserted at the same test point; note that 72 modules (0-71) have already been registered prior to the first call.

MACHINEKIT - 0.1
Machine configuration directory is '/usr/src/machinekit/configs/sim/axis'
Machine configuration file is 'axis_mm.ini'
Starting Machinekit...
Component halcmd9017 has comp_id 72
io started
Component halcmd9022 has comp_id 76
halcmd loadusr io started
Component halcmd9026 has comp_id 98
Component halcmd9034 has comp_id 281
Component halcmd9040 has comp_id 289
Component halcmd9045 has comp_id 612
Component halcmd9050 has comp_id 669
Component halcmd9057 has comp_id 684
Component halcmd9064 has comp_id 737
task pid=9067
emcTaskInit: using builtin interpreter
Shutting down and cleaning up Machinekit...
Component halcmd9107 has comp_id 831
Component halcmd9110 has comp_id 835
Component halcmd9115 has comp_id 839
Component halcmd9121 has comp_id 843
Component halcmd9127 has comp_id 847
Component halcmd9133 has comp_id 851
Component halcmd9139 has comp_id 855
Component halcmd9145 has comp_id 859
Component halcmd9151 has comp_id 863
Component halcmd9157 has comp_id 867
Component halcmd9163 has comp_id 871
Component halcmd9169 has comp_id 875
Component halcmd9195 has comp_id 879
Cleanup done

Trying to make absolutely sure of it at source is complicated by the way that rtapi_init() differs between kernel threads and userland and between kernel flavours.

Simply substituting if(comp_id < 0) with if(comp_id < 1) within halcmd.c:halcmd_startup() would prevent any similar errors from a freak NULL return being interpreted, with an int cast, as a return of 0.
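
The difference between the two tests can be checked in isolation. The helper names below are invented for illustration; only the two comparisons come from the text above.

```c
/* Original form of the test: only negative values count as failure. */
int rejects_with_lt0(int comp_id)
{
    return comp_id < 0;
}

/* Suggested form: zero is rejected as well. */
int rejects_with_lt1(int comp_id)
{
    return comp_id < 1;
}
```

Both forms reject -38 and accept an id such as 72; only the second rejects 0. A positive value, such as the 38 printed for comp_id in the earlier trace, is accepted by both forms.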

That is a possible 'fix' for your situation, leaving the general question of functions which may return a valid pointer or something else indicating an error in combination with _halerrno, moot.
