machinekit-hal
halcmd doesn't stop on hal_init fail
Issue by Fyleo
Mon May 29 16:58:15 2017
Originally opened as https://github.com/machinekit/machinekit/issues/1213
Hi,
I've got a segfault in halcmd that was caused by a library (a misconfigured Xenomai 3) needed by ulapi-xenomai.so.
The dlopen call fails in ulapi_load(). The failure is handled correctly up to hal_comp.c:hal_xinit(), which returns _halerrno when the subroutine's return value is NULL (as in my case). However, halcmd.c:halcmd_startup() only checks whether the return value of hal_init is negative (which it is not here) and then continues by calling hal_ready(). In hal_ready, the mutex in hal_data is used, but hal_data is not initialized because of the premature exit. This produces a segfault on an access to address 0x4.
I don't know what the best way to fix this is.
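The fault address 0x4 is consistent with a struct member being reached through a NULL base pointer. A minimal sketch of that arithmetic follows, assuming a hypothetical layout in which the mutex is the second 32-bit field of the shared segment. The real hal_data layout is not shown in this report, so `fake_hal_data` and its field names are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared HAL segment. The field names and
 * their order are assumptions, not the real hal_data_t definition. */
struct fake_hal_data {
    uint32_t version;  /* assumed first field, 4 bytes */
    uint32_t mutex;    /* assumed second field, so it sits at offset 4 */
};

/* Address that &base->mutex would evaluate to, computed without
 * dereferencing base. With base == 0 (NULL) the result is 0x4. */
static uintptr_t mutex_address(uintptr_t base)
{
    return base + offsetof(struct fake_hal_data, mutex);
}
```

Calling `mutex_address(0)` gives the same value that gdb reports as `mutex=0x4` and `bitmap=0x4` in the backtrace, which would fit a NULL hal_data pointer.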
Comment by ArcEye
Mon May 29 17:30:22 2017
We need code that can reliably reproduce the error to take it any further. It is not clear if you are talking about a standard start-up sequence or your own code. No idea what platform, image version, git version etc etc.
In other words more info please.
The hal_init should return a negative value, so it will be crucial to get the exact sequence that ends up with a test not returning a valid result.
Thanks for the report.
Comment by Fyleo
Mon May 29 17:56:50 2017
Sorry, I forgot to give some information. I'm on a BeagleBone Black Wireless, which runs on the Octavo SoC. As the official 3.8 Xenomai kernel does not seem to boot on this SoC, I'm trying to run Machinekit on the official 4.4 Xenomai kernel (currently running r105).
This kernel has the Xenomai v3 RT kernel, so I recompiled Machinekit from git with the Xenomai v3 library.
I made only minor changes in rtapi (see current diff file in attachment). I will publish these changes when it's working and I'm satisfied with the code (which I'm not right now).
This happens during a standard start-up sequence:
machinekit@beaglebone:~/machinekit-build$ machinekit -l
MACHINEKIT - 0.1
Machine configuration directory is '/home/machinekit/machinekit/configs/sim.axis'
Machine configuration file is 'axis_mm.ini'
Starting Machinekit...
/home/machinekit/machinekit-build/scripts/realtime: line 174: 30677 Segmentation fault halcmd ping
io started
/home/machinekit/machinekit-build/scripts/linuxcnc: line 726: 30683 Segmentation fault $HALCMD loadusr -Wn iocontrol $EMCIO -ini "$INIFILE"
halcmd loadusr io started
/home/machinekit/machinekit-build/scripts/linuxcnc: line 737: 30686 Segmentation fault $HALCMD loadusr -Wn halui $HALUI -ini "$INIFILE"
/home/machinekit/machinekit-build/scripts/linuxcnc: line 753: 30693 Segmentation fault $HALCMD -i "$INIFILE" -f $CFGFILE
Shutting down and cleaning up Machinekit...
/home/machinekit/machinekit-build/scripts/linuxcnc: line 530: 30715 Segmentation fault $HALCMD stop
/home/machinekit/machinekit-build/scripts/linuxcnc: line 530: 30718 Segmentation fault $HALCMD unload all
/home/machinekit/machinekit-build/scripts/realtime: line 260: 30804 Segmentation fault halcmd shutdown
Cleanup done
Machinekit terminated with an error. You can find more information in the log:
/home/machinekit/linuxcnc_debug.txt
and
/home/machinekit/linuxcnc_print.txt
as well as in the output of the shell command 'dmesg' and in the terminal
The stack trace of halcmd at the segfault is:
Program received signal SIGSEGV, Segmentation fault.
0xb6fa062a in rtapi_test_and_set_bit (nr=0, bitmap=0x4)
at rtapi/rtapi_bitops.h:81
81 return (__atomic_fetch_or(bitmap + RTAPI_BIT_WORD(nr),
(gdb) i s
#0 0xb6fa062a in rtapi_test_and_set_bit (nr=0, bitmap=0x4)
at rtapi/rtapi_bitops.h:81
#1 0xb6fa0706 in rtapi_mutex_get (mutex=0x4) at rtapi/rtapi.h:553
#2 0xb6fa14e0 in halg_ready (use_hal_mutex=1, comp_id=38)
at hal/lib/hal_comp.c:351
#3 0x00013688 in hal_ready (comp_id=38) at hal/lib/hal.h:413
#4 0x000138fe in halcmd_startup (quiet=0, uri=0x0,
svc_uuid=0x3f890 "a42c8c6b-4025-4f83-ba28-dad21114744a")
at hal/utils/halcmd.c:168
#5 0x0002050c in main (argc=2, argv=0xbefff074) at hal/utils/halcmd_main.c:278
ulapi_load (ulapi_switch=0x3ca44 <rtapi_switch>) at rtapi/ulapi_autoload.c:169
169 char *instance = getenv("MK_INSTANCE");
170 char *debug_env = getenv("ULAPI_DEBUG");
171 int size = 0;
175 if (instance != NULL)
178 if (debug_env)
181 rtapi_set_msg_level(ulapi_debug);
184 rtapi_set_logtag("ulapi");
198 shm_common_init();
200 globalkey = OS_KEY(GLOBAL_KEY, rtapi_instance);
201 retval = shm_common_new(globalkey, &size,
204 if (retval == -ENOENT) {
213 if (retval < 0) {
221 if (size < sizeof(global_data_t)) {
229 if (global_data->magic != GLOBAL_READY) {
241 ringbuffer_init(shm_ptr(global_data, global_data->rtapi_messages_ptr),
246 global_heap = &global_data->heap;
249 flavor = flavor_byid(global_data->rtapi_thread_flavor);
250 if (flavor == NULL) {
257 snprintf(ulapi_lib_fname,PATH_MAX,"%s/%s-%s%s",
258 EMC2_RTLIB_DIR, ulapi_lib, flavor->name, flavor->so_ext);
257 snprintf(ulapi_lib_fname,PATH_MAX,"%s/%s-%s%s",
258 EMC2_RTLIB_DIR, ulapi_lib, flavor->name, flavor->so_ext);
257 snprintf(ulapi_lib_fname,PATH_MAX,"%s/%s-%s%s",
261 if ((ulapi_so = dlopen(ulapi_lib_fname, RTLD_GLOBAL|RTLD_NOW)) == NULL) {
262 errmsg = dlerror();
(gdb) p ulapi_lib_fname
$5 = "/home/machinekit/machinekit-build/rtlib/ulapi-xenomai.so\000\067-A\000\006\n\aA\b\001\t\002\n\003\f\001\022\004\023\001\024\001\025\001\027\003\030\001\032\002\033\003\034\001\"\001\000\000\000\000\000\000\000\220\345\377\276", '\000' <repeats 40 times>, "\v\000\000\000\a\000\000\000\002\000\000\000t\001\000\000t\001\000\000$", '\000' <repeats 11 times>, "\004\000\000\000\000\000\000\000\036\000\000\000\a\000\000\000\002\000\000\000\230\001\000\000"...
(gdb) n
263 rtapi_print_msg(RTAPI_MSG_ERR,
267 return -ENOENT;
344 }
(gdb) p errmsg
$7 = 0x4c9b0 "libalchemy.so.0: shared object cannot be dlopen()ed"
The xenomai (3.0.4) deb package was built with the default configure flags, with --enable-smp added.
Tell me if you need more information.
Comment by Fyleo
Mon May 29 18:12:02 2017
Continuation of the trace through end of hal_init
_ulapi_init (modname=0xbeffea3c "HAL_hal_lib31647") at rtapi/ulapi_autoload.c:86
86 return -ENOSYS;
97 }
halg_xinitfv (use_hal_mutex=0, type=4, userarg1=0, userarg2=0, ctor=0x0, dtor=0x0, fmt=0xb6fad564 "hal_lib%ld", ap=...) at hal/lib/hal_comp.c:122
122 if (comp_id < 0) {
123 HALFAIL_NULL(comp_id, "rtapi init(%s) failed", rtapi_name);
(gdb) p comp_id
$1 = -38
(gdb) n
97 WITH_HAL_MUTEX_IF(use_hal_mutex && (hal_data != NULL));
277 }
halg_xinitf (use_halmutex=0, type=4, userarg1=0, userarg2=0, ctor=0x0, dtor=0x0, fmt=0xb6fad564 "hal_lib%ld") at hal/lib/hal_comp.c:58
58 return comp;
59 }
halg_xinitfv (use_hal_mutex=1, type=2, userarg1=0, userarg2=0, ctor=0x0, dtor=0x0, fmt=0xb6fad450 "%s", ap=...) at hal/lib/hal_comp.c:108
108 if (hallib == NULL)
(gdb) p hallib
$2 = (struct hal_comp *) 0x0
(gdb) n
109 return NULL;
97 WITH_HAL_MUTEX_IF(use_hal_mutex && (hal_data != NULL));
277 }
halg_xinitf (use_halmutex=1, type=2, userarg1=0, userarg2=0, ctor=0x0, dtor=0x0, fmt=0xb6fad450 "%s") at hal/lib/hal_comp.c:58
58 return comp;
59 }
hal_xinit (type=2, userarg1=0, userarg2=0, ctor=0x0, dtor=0x0, name=0x3d918 <comp_name> "halcmd31647") at hal/lib/hal_comp.c:36
36 return c == NULL ? _halerrno : hh_get_id(&c->hdr);
37 }
hal_init (name=0x3d918 <comp_name> "halcmd31647") at hal/lib/hal.h:334
334 }
halcmd_startup (quiet=0, uri=0x0, svc_uuid=0x3f890 "a42c8c6b-4025-4f83-ba28-dad21114744a") at hal/utils/halcmd.c:157
157 if (quiet) rtapi_set_msg_level(msg_lvl_save);
159 hal_flag = 0;
161 if (comp_id < 0) {
(gdb) p comp_id
$3 = 38
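The two values printed in this trace, -38 inside halg_xinitfv and 38 in halcmd_startup, suggest that the error code loses its sign somewhere on the way up. A minimal sketch of that effect follows. The names `fake_halerrno`, `fake_rtapi_init`, `fake_xinit` and `fake_hal_init` are hypothetical, and the place where the sign is dropped is an assumption, since the trace does not show where this happens in the real code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

static int fake_halerrno;            /* hypothetical stand-in for _halerrno */

/* Stand-in for the failing rtapi init: returns -ENOSYS as in the trace. */
static int fake_rtapi_init(void) { return -ENOSYS; }

struct fake_comp { int id; };

/* Returns NULL on failure and records the error.
 * ASSUMPTION: the magnitude is stored without its sign. */
static struct fake_comp *fake_xinit(void)
{
    static struct fake_comp comp = { 72 };
    int rc = fake_rtapi_init();
    if (rc < 0) {
        fake_halerrno = abs(rc);     /* sign lost here (assumed) */
        return NULL;
    }
    return &comp;
}

/* Mirrors the shape of line 36 of hal_comp.c shown above. */
static int fake_hal_init(void)
{
    struct fake_comp *c = fake_xinit();
    return c == NULL ? fake_halerrno : c->id;
}

/* The caller-side test as quoted from halcmd_startup(). */
static int caller_sees_failure(int comp_id) { return comp_id < 0; }
```

In this sketch `fake_hal_init()` returns a positive ENOSYS, so `caller_sees_failure()` reports success and the caller would go on to use uninitialized state, as halcmd does when it calls hal_ready().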
Comment by ArcEye
Tue May 30 11:52:34 2017
Sorry, I forgot to give some information. I'm on a BeagleBone Black Wireless, which runs on the Octavo SoC. As the official 3.8 Xenomai kernel does not seem to boot on this SoC, I'm trying to run Machinekit on the official 4.4 Xenomai kernel (currently running r105). This kernel has the Xenomai v3 RT kernel, so I recompiled Machinekit from git with the Xenomai v3 library. I made only minor changes in rtapi
It is a rather specialised corner case, which is some relief. I have never hit anything quite the same before in various 'off piste' poking around sessions.
I have however previously seen errors where the return value from a failed call seems to be used as a data address for a subsequent call, with predictably fatal consequences.
Will do some poking around in the next few days.
Comment by ArcEye
Wed May 31 14:31:45 2017
@Fyleo
I made only minor changes in rtapi (see current diff file in attachment).
Nothing is attached. Can you link to the diff or insert it, please?
Comment by ArcEye
Mon Jun 5 15:14:33 2017
Whilst the situation that brought about your segfault is unlikely to reoccur often, if at all, it does highlight some changes to function calls that I have some misgivings about.
The old code more often passed a recipient pointer in the function call, and the return value was used for error checking.
An example of that was the hal_pin_xxx_newf() functions, which pass the address of the pointer to be initialised in the inst_struct; the return value, if not 0, indicated an error.
halx_pin_xxx_newf() returns a pin_pointer directly, and an error return is generated from _halerrno only when that pointer is tested and found to be NULL.
_halerrno is also never cleared, so it is not absolutely guaranteed that the _halerrno code relates to that error, if _halerrno was not set for some strange reason.
(There is only one usage of hal_errorcount(1), which resets the errors, in the entire codebase.)
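The difference between the two calling conventions described above can be sketched as follows. All names here (`fake_pin`, `old_style_new`, `new_style_new`, `new_style_status`, `fake_errno`) are hypothetical. They only model the shape of the hal_pin_xxx_newf() and halx_pin_xxx_newf() families, not their real signatures.

```c
#include <errno.h>
#include <stddef.h>

struct fake_pin { int value; };

static struct fake_pin pin_storage;  /* single static pin for the sketch */
static int fake_errno;               /* hypothetical _halerrno stand-in, never cleared */

/* Old style: the result goes through an out-pointer and the return
 * value is the status (0 on success, negative on error). */
static int old_style_new(struct fake_pin **out, int fail)
{
    if (fail)
        return -EINVAL;
    *out = &pin_storage;
    return 0;
}

/* New style: the pointer is returned and the error code lives in a
 * side channel that the error path may or may not update. */
static struct fake_pin *new_style_new(int fail, int set_errno)
{
    if (fail) {
        if (set_errno)
            fake_errno = -EINVAL;
        return NULL;
    }
    return &pin_storage;
}

/* Wrapper that turns a NULL result into whatever the side channel holds. */
static int new_style_status(int fail, int set_errno)
{
    return new_style_new(fail, set_errno) == NULL ? fake_errno : 0;
}
```

When the side channel is left unset, a failing call to `new_style_status()` returns whatever `fake_errno` last held, which may be 0 or an unrelated earlier code. The old style cannot fail in that way because the status is produced by the same call that detects the error.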
If a call goes down several layers and ends up aborting in a NULL return, which gets passed back as a value, the test if(comp_id < 0), which looks for a negative value, will probably fail to detect the error, as in your case.
AFAICT there is no likelihood that the test in halcmd.c:halcmd_startup() will ever encounter a comp_id of 0, which translates as the first entry in the module register. This is a print with a debugging fprintf() inserted at the same test point; note that 72 modules have already been registered prior to the first call (0-71).
MACHINEKIT - 0.1
Machine configuration directory is '/usr/src/machinekit/configs/sim/axis'
Machine configuration file is 'axis_mm.ini'
Starting Machinekit...
Component halcmd9017 has comp_id 72
io started
Component halcmd9022 has comp_id 76
halcmd loadusr io started
Component halcmd9026 has comp_id 98
Component halcmd9034 has comp_id 281
Component halcmd9040 has comp_id 289
Component halcmd9045 has comp_id 612
Component halcmd9050 has comp_id 669
Component halcmd9057 has comp_id 684
Component halcmd9064 has comp_id 737
task pid=9067
emcTaskInit: using builtin interpreter
Shutting down and cleaning up Machinekit...
Component halcmd9107 has comp_id 831
Component halcmd9110 has comp_id 835
Component halcmd9115 has comp_id 839
Component halcmd9121 has comp_id 843
Component halcmd9127 has comp_id 847
Component halcmd9133 has comp_id 851
Component halcmd9139 has comp_id 855
Component halcmd9145 has comp_id 859
Component halcmd9151 has comp_id 863
Component halcmd9157 has comp_id 867
Component halcmd9163 has comp_id 871
Component halcmd9169 has comp_id 875
Component halcmd9195 has comp_id 879
Cleanup done
Trying to make absolutely sure of it at source is complicated by the way that rtapi_init() differs between kernel threads and userland and between kernel flavours.
Simply substituting if(comp_id < 0) with if(comp_id < 1) within halcmd.c:halcmd_startup() would prevent any similar errors from a freak NULL return being interpreted with an int cast as being a return of 0.
That is a possible 'fix' for your situation, leaving moot the general question of functions which may return a valid pointer or something else indicating an error in combination with _halerrno.
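A minimal sketch comparing the two tests on a few sample values. The helper names are hypothetical. The inputs worth trying are a 0 such as a NULL cast to int would give, the 38 from Fyleo's trace, a valid ID of 72 from the log above, and a conventional negative error.

```c
/* Current test in halcmd_startup() as quoted in this thread. */
static int fails_lt0(int comp_id) { return comp_id < 0; }

/* Proposed replacement test. */
static int fails_lt1(int comp_id) { return comp_id < 1; }
```

Under this sketch the stricter test additionally rejects 0 and still accepts 72. A positive value such as the 38 seen in the trace passes both tests, so that case would probably need handling where the error code is produced rather than at this check.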