syslog-ng icon indicating copy to clipboard operation
syslog-ng copied to clipboard

Race condition between AH_CONFIG_CHANGED and threaded sources

Open MrAnno opened this issue 3 years ago • 6 comments

syslog-ng

Version of syslog-ng

current master (191c4fb5617c84fe420284fc993ba6b2ad907644)

Platform

Linux

The threaded source and threaded fetcher unit tests started to crash occasionally. I can't reproduce the issue in my environment, but this occurs in different GitHub Actions and GitLab CI builds as well. For example: https://github.com/syslog-ng/syslog-ng/runs/5236319117

Fortunately, the core file was available:

Thread 3 (Thread 0x7fc6e3dd4640 (LWP 9661)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fc6e597475c in g_mutex_lock_slowpath (mutex=0x7fc6e5b34e30 <stats_mutex>) at ../../../glib/gthread-posix.c:1493
#2  0x00007fc6e5aa9edb in stats_lock () at ../lib/stats/stats-registry.c:58
#3  0x00007fc6e5abdbf1 in log_tags_get_by_name (name=0x4086d6 "citrom") at ../lib/logmsg/tags.c:91
#4  0x00007fc6e5ab3c4a in log_msg_set_tag_by_name (self=0x7fc6d4004350, name=0x4086d6 "citrom") at ../lib/logmsg/logmsg.c:899
#5  0x0000000000404af6 in create_empty_message () at ../libtest/cr_template.c:89
#6  0x0000000000404b96 in create_sample_message () at ../libtest/cr_template.c:107
#7  0x0000000000404037 in _run_simple (s=0x19cb4d0) at ../lib/logthrsource/tests/test_logthrsourcedrv.c:181
#8  0x00007fc6e5ac66a4 in log_threaded_source_worker_run (s=0x19cc940) at ../lib/logthrsource/logthrsourcedrv.c:169
#9  0x00007fc6e5a5f6bc in _worker_thread_func (st=0x19cc940) at ../lib/mainloop-threaded-worker.c:79
#10 0x00007fc6e594bf5d in g_thread_proxy (data=0x19ae760) at ../../../glib/gthread.c:827
#11 0x00007fc6e56edd80 in start_thread (arg=0x7fc6e3dd4640) at pthread_create.c:481
#12 0x00007fc6e5802bdf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fc6e4de8640 (LWP 9659)):
#0  0x00007fc6e5802f06 in epoll_wait (epfd=5, events=0x7fc6e5ba9790 <self+144>, maxevents=32, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007fc6e5b66470 in nn_poller_wait () from /usr/lib/libcriterion.so.3
#2  0x00007fc6e5b5b4d4 in nn_worker_routine () from /usr/lib/libcriterion.so.3
#3  0x00007fc6e5b5dd36 in nn_thread_main_routine () from /usr/lib/libcriterion.so.3
#4  0x00007fc6e56edd80 in start_thread (arg=0x7fc6e4de8640) at pthread_create.c:481
#5  0x00007fc6e5802bdf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fc6e4e0ba80 (LWP 9658)):
#0  g_str_hash (v=0x21) at ../../../glib/ghash.c:2333
#1  0x00007fc6e5aa91a1 in stats_cluster_hash (self=0x7ffc6c2df0e0) at ../lib/stats/stats-cluster.c:210
#2  0x00007fc6e590e976 in g_hash_table_lookup_node (hash_return=<synthetic pointer>, key=0x7ffc6c2df0e0, hash_table=0x19ae180 = {...}) at ../../../glib/ghash.c:472
#3  g_hash_table_lookup (hash_table=0x19ae180 = {...}, key=0x7ffc6c2df0e0) at ../../../glib/ghash.c:1511
#4  0x00007fc6e5aaa5f5 in stats_unregister_counter (sc_key=0x7ffc6c2df0e0, type=1, counter=0x19c5eb0) at ../lib/stats/stats-registry.c:304
#5  0x00007fc6e5abdeca in log_tags_reinit_stats () at ../lib/logmsg/tags.c:184
#6  0x00007fc6e5a3daf1 in run_application_hook (type=7) at ../lib/apphook.c:124
#7  0x00007fc6e5a3de0b in app_config_changed () at ../lib/apphook.c:260
#8  0x0000000000403e21 in start_test_threaded_source (s=0x19cb4d0) at ../lib/logthrsource/tests/test_logthrsourcedrv.c:131
#9  0x00000000004043fd in logthrsourcedrv_test_threaded_source_suspend_impl () at ../lib/logthrsource/tests/test_logthrsourcedrv.c:231
#10 0x00007fc6e5b41ec4 in criterion_internal_test_main () from /usr/lib/libcriterion.so.3
#11 0x00000000004043aa in logthrsourcedrv_test_threaded_source_suspend_jmp () at ../lib/logthrsource/tests/test_logthrsourcedrv.c:223
#12 0x00007fc6e5b4039b in run_test_child () from /usr/lib/libcriterion.so.3
#13 0x00007fc6e5b8d925 in bxfi_main () from /usr/lib/libcriterion.so.3
#14 0x00007fc6e572d7ed in __libc_start_main (main=0x7fc6e5b4d863 <main>, argc=1, argv=0x7ffc6c2df688, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc6c2df678) at ../csu/libc-start.c:332
#15 0x00000000004034ea in _start ()

MrAnno avatar Feb 17 '22 21:02 MrAnno

I think the solution here is not to add more locks, but to run tags_reinit() at the right moment.

With dedicated worker threads now common, the app_config_changed hook is not right for reinitializing global state that is based on the configuration.

I just started to rework on_config_inited (by renaming it to post_config_init) and introduced a hook that runs after parsing but before any other init call (pre_config_init).

Threads are started in post_config_init() so log_tags_reinit() should run before that but only after the new config initialization was successful.

If we had a hook right before post_config_init() that could host these kind of reinit functions.

bazsi avatar Feb 18 '22 06:02 bazsi

The app_config_changed() call was unnecessary in these tests (we just forgot to remove them when we started using on_config_inited to spawn threads), so I removed the wrong calls, which fixed the issue. I added the missing locks only to be consistent with how we access those data structures (init ()and deinit() already locks them too).

MrAnno avatar Feb 18 '22 07:02 MrAnno

Don't we have the threads running at APP_CONFIG_CHANGED? Are the locks enough in production? There are other similar init functions.

On Fri, Feb 18, 2022, 08:12 László Várady @.***> wrote:

The app_config_changed() call was unnecessary in these tests (we just forgot to remove them when we started using on_config_inited to spawn threads), so I removed the wrong calls, which fixed the issue. I added the missing locks only to be consistent with how we access those data structures (init ()and deinit() already locks them too).

— Reply to this email directly, view it on GitHub https://github.com/syslog-ng/syslog-ng/issues/3920#issuecomment-1044041755, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFOK5TQDJU2A4MBIWKHPQ3U3XWPRANCNFSM5OV7LDZQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you commented.Message ID: @.***>

bazsi avatar Feb 18 '22 07:02 bazsi

I think currently there are no threads running at the end of the reload process (when app-config-changed is triggered), but I will doublecheck.

MrAnno avatar Feb 18 '22 07:02 MrAnno

As I see they do. APP_CONFIG_CHANGED is called at the very end of the reload process, e.g. on_config_inited() is called by this time, this means that LogThreadedSourceDriver already started its thread and there's nothing that would prevent this from posting messages to the pipeline. So I think there's a race with all reinit-style functions which changes global state based on the config.

Most of these are called by cfg_init() and not through the APP_CONFIG_CHANGED hook, like this:

gboolean
cfg_init(GlobalConfig *cfg)
{
  ...
  if (!rcptid_init(cfg->state, cfg->use_uniqid))
    return FALSE;

  stats_reinit(&cfg->stats_options);

  dns_caching_update_options(&cfg->dns_cache_options);
  hostname_reinit(cfg->custom_domain);
  host_resolve_options_init_globals(&cfg->host_resolve_options);
  ...
}

So at the moment only these are affected:

bazsi@bzorp:~/src/syslog-ng/syslog-ng$ git grep AH_CONFIG_CHANGED
lib/afinter.c:  register_application_hook(AH_CONFIG_CHANGED, afinter_register_posted_hook, NULL, AHM_RUN_ONCE);
lib/apphook.c:  run_application_hook(AH_CONFIG_CHANGED);
lib/apphook.h:  AH_CONFIG_CHANGED,   /* configuration changed, threads are running again */
lib/logmsg/tags.c:  register_application_hook(AH_CONFIG_CHANGED, (ApplicationHookFunc) log_tags_reinit_stats, NULL, AHM_RUN_REPEAT);
lib/mainloop-control.c:  register_application_hook(AH_CONFIG_CHANGED, _respond_config_reload_status, args, AHM_RUN_ONCE);
lib/msg-stats.c:  register_application_hook(AH_CONFIG_CHANGED, (ApplicationHookFunc) stats_syslog_reinit, NULL, AHM_RUN_REPEAT);
lib/tests/test_apphook.c:  register_application_hook(AH_CONFIG_CHANGED, _hook_counter, (gpointer)&triggered_count, AHM_RUN_ONCE);
lib/tests/test_apphook.c:  register_application_hook(AH_CONFIG_CHANGED, _hook_counter, (gpointer)&triggered_count, AHM_RUN_ONCE);

I think all the static calls in cfg_init() should be migrated to use the apphook, and maybe the apphook should be called before on_config_inited(), so threads are not yet running.

OR we might simply introduce further hooks (pre_config_init, post_config_init, config_changed) and register every one of the above to right hook.

bazsi avatar Feb 18 '22 08:02 bazsi

Yep, I was completely wrong. I'm marking this issue as a bug, updating the description as well.

MrAnno avatar Feb 19 '22 15:02 MrAnno