syslog-ng
syslog-ng copied to clipboard
Race condition between AH_CONFIG_CHANGED and threaded sources
syslog-ng
Version of syslog-ng
current master (191c4fb5617c84fe420284fc993ba6b2ad907644)
Platform
Linux
The threaded source and threaded fetcher unit tests started to crash occasionally. I can't reproduce the issue in my environment, but this occurs in different GitHub Actions and GitLab CI builds as well. For example: https://github.com/syslog-ng/syslog-ng/runs/5236319117
Fortunately, the core file was available:
Thread 3 (Thread 0x7fc6e3dd4640 (LWP 9661)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007fc6e597475c in g_mutex_lock_slowpath (mutex=0x7fc6e5b34e30 <stats_mutex>) at ../../../glib/gthread-posix.c:1493
#2 0x00007fc6e5aa9edb in stats_lock () at ../lib/stats/stats-registry.c:58
#3 0x00007fc6e5abdbf1 in log_tags_get_by_name (name=0x4086d6 "citrom") at ../lib/logmsg/tags.c:91
#4 0x00007fc6e5ab3c4a in log_msg_set_tag_by_name (self=0x7fc6d4004350, name=0x4086d6 "citrom") at ../lib/logmsg/logmsg.c:899
#5 0x0000000000404af6 in create_empty_message () at ../libtest/cr_template.c:89
#6 0x0000000000404b96 in create_sample_message () at ../libtest/cr_template.c:107
#7 0x0000000000404037 in _run_simple (s=0x19cb4d0) at ../lib/logthrsource/tests/test_logthrsourcedrv.c:181
#8 0x00007fc6e5ac66a4 in log_threaded_source_worker_run (s=0x19cc940) at ../lib/logthrsource/logthrsourcedrv.c:169
#9 0x00007fc6e5a5f6bc in _worker_thread_func (st=0x19cc940) at ../lib/mainloop-threaded-worker.c:79
#10 0x00007fc6e594bf5d in g_thread_proxy (data=0x19ae760) at ../../../glib/gthread.c:827
#11 0x00007fc6e56edd80 in start_thread (arg=0x7fc6e3dd4640) at pthread_create.c:481
#12 0x00007fc6e5802bdf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 2 (Thread 0x7fc6e4de8640 (LWP 9659)):
#0 0x00007fc6e5802f06 in epoll_wait (epfd=5, events=0x7fc6e5ba9790 <self+144>, maxevents=32, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007fc6e5b66470 in nn_poller_wait () from /usr/lib/libcriterion.so.3
#2 0x00007fc6e5b5b4d4 in nn_worker_routine () from /usr/lib/libcriterion.so.3
#3 0x00007fc6e5b5dd36 in nn_thread_main_routine () from /usr/lib/libcriterion.so.3
#4 0x00007fc6e56edd80 in start_thread (arg=0x7fc6e4de8640) at pthread_create.c:481
#5 0x00007fc6e5802bdf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 1 (Thread 0x7fc6e4e0ba80 (LWP 9658)):
#0 g_str_hash (v=0x21) at ../../../glib/ghash.c:2333
#1 0x00007fc6e5aa91a1 in stats_cluster_hash (self=0x7ffc6c2df0e0) at ../lib/stats/stats-cluster.c:210
#2 0x00007fc6e590e976 in g_hash_table_lookup_node (hash_return=<synthetic pointer>, key=0x7ffc6c2df0e0, hash_table=0x19ae180 = {...}) at ../../../glib/ghash.c:472
#3 g_hash_table_lookup (hash_table=0x19ae180 = {...}, key=0x7ffc6c2df0e0) at ../../../glib/ghash.c:1511
#4 0x00007fc6e5aaa5f5 in stats_unregister_counter (sc_key=0x7ffc6c2df0e0, type=1, counter=0x19c5eb0) at ../lib/stats/stats-registry.c:304
#5 0x00007fc6e5abdeca in log_tags_reinit_stats () at ../lib/logmsg/tags.c:184
#6 0x00007fc6e5a3daf1 in run_application_hook (type=7) at ../lib/apphook.c:124
#7 0x00007fc6e5a3de0b in app_config_changed () at ../lib/apphook.c:260
#8 0x0000000000403e21 in start_test_threaded_source (s=0x19cb4d0) at ../lib/logthrsource/tests/test_logthrsourcedrv.c:131
#9 0x00000000004043fd in logthrsourcedrv_test_threaded_source_suspend_impl () at ../lib/logthrsource/tests/test_logthrsourcedrv.c:231
#10 0x00007fc6e5b41ec4 in criterion_internal_test_main () from /usr/lib/libcriterion.so.3
#11 0x00000000004043aa in logthrsourcedrv_test_threaded_source_suspend_jmp () at ../lib/logthrsource/tests/test_logthrsourcedrv.c:223
#12 0x00007fc6e5b4039b in run_test_child () from /usr/lib/libcriterion.so.3
#13 0x00007fc6e5b8d925 in bxfi_main () from /usr/lib/libcriterion.so.3
#14 0x00007fc6e572d7ed in __libc_start_main (main=0x7fc6e5b4d863 <main>, argc=1, argv=0x7ffc6c2df688, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc6c2df678) at ../csu/libc-start.c:332
#15 0x00000000004034ea in _start ()
I think the solution here is not to add more locks, but to run tags_reinit() at the right moment.
With dedicated worker threads now common, the app_config_changed hook is not right for reinitializing global state that is based on the configuration.
I just started to rework on_config_inited (by renaming it to post_config_init) and introduced a hook that runs after parsing but before any other init call (pre_config_init).
Threads are started in post_config_init() so log_tags_reinit() should run before that but only after the new config initialization was successful.
If we had a hook right before post_config_init() that could host these kind of reinit functions.
The app_config_changed() call was unnecessary in these tests (we just forgot to remove them when we started using on_config_inited to spawn threads), so I removed the wrong calls, which fixed the issue. I added the missing locks only to be consistent with how we access those data structures (init ()and deinit() already locks them too).
Don't we have the threads running at APP_CONFIG_CHANGED? Are the locks enough in production? There are other similar init functions.
On Fri, Feb 18, 2022, 08:12 László Várady @.***> wrote:
The app_config_changed() call was unnecessary in these tests (we just forgot to remove them when we started using on_config_inited to spawn threads), so I removed the wrong calls, which fixed the issue. I added the missing locks only to be consistent with how we access those data structures (init ()and deinit() already locks them too).
— Reply to this email directly, view it on GitHub https://github.com/syslog-ng/syslog-ng/issues/3920#issuecomment-1044041755, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFOK5TQDJU2A4MBIWKHPQ3U3XWPRANCNFSM5OV7LDZQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you commented.Message ID: @.***>
I think currently there are no threads running at the end of the reload process (when app-config-changed is triggered), but I will doublecheck.
As I see they do. APP_CONFIG_CHANGED is called at the very end of the reload process, e.g. on_config_inited() is called by this time, this means that LogThreadedSourceDriver already started its thread and there's nothing that would prevent this from posting messages to the pipeline. So I think there's a race with all reinit-style functions which changes global state based on the config.
Most of these are called by cfg_init() and not through the APP_CONFIG_CHANGED hook, like this:
gboolean
cfg_init(GlobalConfig *cfg)
{
...
if (!rcptid_init(cfg->state, cfg->use_uniqid))
return FALSE;
stats_reinit(&cfg->stats_options);
dns_caching_update_options(&cfg->dns_cache_options);
hostname_reinit(cfg->custom_domain);
host_resolve_options_init_globals(&cfg->host_resolve_options);
...
}
So at the moment only these are affected:
bazsi@bzorp:~/src/syslog-ng/syslog-ng$ git grep AH_CONFIG_CHANGED
lib/afinter.c: register_application_hook(AH_CONFIG_CHANGED, afinter_register_posted_hook, NULL, AHM_RUN_ONCE);
lib/apphook.c: run_application_hook(AH_CONFIG_CHANGED);
lib/apphook.h: AH_CONFIG_CHANGED, /* configuration changed, threads are running again */
lib/logmsg/tags.c: register_application_hook(AH_CONFIG_CHANGED, (ApplicationHookFunc) log_tags_reinit_stats, NULL, AHM_RUN_REPEAT);
lib/mainloop-control.c: register_application_hook(AH_CONFIG_CHANGED, _respond_config_reload_status, args, AHM_RUN_ONCE);
lib/msg-stats.c: register_application_hook(AH_CONFIG_CHANGED, (ApplicationHookFunc) stats_syslog_reinit, NULL, AHM_RUN_REPEAT);
lib/tests/test_apphook.c: register_application_hook(AH_CONFIG_CHANGED, _hook_counter, (gpointer)&triggered_count, AHM_RUN_ONCE);
lib/tests/test_apphook.c: register_application_hook(AH_CONFIG_CHANGED, _hook_counter, (gpointer)&triggered_count, AHM_RUN_ONCE);
I think all the static calls in cfg_init() should be migrated to use the apphook, and maybe the apphook should be called before on_config_inited(), so threads are not yet running.
OR we might simply introduce further hooks (pre_config_init, post_config_init, config_changed) and register every one of the above to right hook.
Yep, I was completely wrong. I'm marking this issue as a bug, updating the description as well.