dpvs
This looks like a keepalived bug.
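Steps to reproduce in Full NAT mode:
- Edit the keepalived configuration file: remove the existing real servers and add a new real server;
- Run sre_scripts/reload_keepalived.sh to reload the configuration;
- Run ipvsadm -ln to list the real servers; the old real servers are still there.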
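The last step can be automated. The following is a minimal sketch, assuming the real-server rows printed by `ipvsadm -ln` start with `->` followed by an IPv4 `ip:port`, as in stock ipvsadm; the dpvs build may print additional columns. The `SAMPLE` text and the helper names are illustrative, not captured from a real system.

```python
import re

# Matches real-server rows such as "  -> 10.0.0.1:80   FullNat 100 0 0".
# The column header row ("-> RemoteAddress:Port ...") does not match.
RS_LINE = re.compile(r"^\s*->\s+(\d{1,3}(?:\.\d{1,3}){3}:\d+)(?:\s|$)")


def real_servers(ipvsadm_output):
    """Collect the ip:port of every real-server row in `ipvsadm -ln` output."""
    found = set()
    for line in ipvsadm_output.splitlines():
        match = RS_LINE.match(line)
        if match:
            found.add(match.group(1))
    return found


def stale_real_servers(ipvsadm_output, configured):
    """Return real servers that are still installed but no longer configured."""
    return real_servers(ipvsadm_output) - set(configured)


# Illustrative output only (hypothetical addresses and version line).
SAMPLE = """\
IP Virtual Server version 0.0.0 (size=0)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.0.2.10:80 rr
  -> 10.0.0.1:80                  FullNat 100    0          0
  -> 10.0.0.2:80                  FullNat 100    0          0
"""

if __name__ == "__main__":
    # Suppose the new configuration only lists 10.0.0.2:80.
    print(sorted(stale_real_servers(SAMPLE, {"10.0.0.2:80"})))
```

On a live host the text would come from running `ipvsadm -ln` (for example via `subprocess.run(["ipvsadm", "-ln"], capture_output=True, text=True).stdout`); a non-empty result after a reload indicates the bug described here.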
service keepalived reload
Does this behave the same way? Does it reproduce every time? @mscbg, please take a look.
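What kind of script is sre_scripts/reload_keepalived.sh? Changing real servers happens very frequently in our production environment, so your problem does not seem easy to reproduce. That said, there is a known issue where dpvs sometimes fails to process a keepalived reload, and every subsequent reload then fails as well. @icymoon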
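Solution: after modifying the Keepalived configuration file, run sync, then reload Keepalived.
Cause: Keepalived was started immediately after dpvs. Because dpvs initializes slowly (10-15 s), the Keepalived check child process initially fails to connect to dpvs and exits right away. The Keepalived main process notices that the check child has exited and restarts it immediately, producing the following log:
Jun 29 17:45:00 Keepalived_healthcheckers[169131]: IPVS: Can't initialize ipvs: Connection refused
Jun 29 17:45:00 Keepalived[168271]: Healthcheck child process(169131) died: Respawning
Jun 29 17:45:00 Keepalived[168271]: Starting Healthcheck child process, pid=169132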
Because of the continuous retries, log lines are produced too quickly, and after a while only the following is logged:
Jun 29 17:45:00 Keepalived[168271]: Healthcheck child process(169570) died: Respawning
Jun 29 17:45:00 Keepalived[168271]: Healthcheck child process(169571) died: Respawning
Jun 29 17:45:00 Keepalived[168271]: Healthcheck child process(169572) died: Respawning
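Once dpvs finishes initializing, the Keepalived check child process starts and is configured successfully, but a large number of log messages are still being suppressed at this point:
Jun 29 17:45:28 journal: Suppressed 42826 messages from /user.slice/user-1001.slice
Jun 29 17:45:30 journal: Suppressed 20345 messages from /user.slice/user-1001.slice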
A delete + add of real servers at this point triggers the bug. Because the modified file has not been flushed to disk in time (the log below shows the wrong configuration file size; it should be 17647 Bytes), the file opened by the check child process is in an abnormal state and the child exits immediately. The Keepalived main process then restarts the check child process, which goes through the start logic, so the old RS configuration is never deleted. The log is as follows:
Jun 29 17:46:46 Keepalived_healthcheckers[191507]: Got SIGHUP, reloading checker configuration
Jun 29 17:46:46 journal: Suppressed 20344 messages from /user.slice/user-1001.slice
Jun 29 17:46:46 Keepalived[168271]: Healthcheck child process(191507) died: Respawning
Jun 29 17:46:46 Keepalived[168271]: Starting Healthcheck child process, pid=192557
Jun 29 17:46:46 Keepalived_healthcheckers[192557]: Registering Kernel netlink reflector
Jun 29 17:46:46 Keepalived_healthcheckers[192557]: Registering Kernel netlink command channel
Jun 29 17:46:46 Keepalived_healthcheckers[192557]: Opening file '/etc/keepalived/keepalived.conf'.
Jun 29 17:46:46 Keepalived_healthcheckers[192557]: Configuration is using : 18831 Bytes
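The "flush the file, then reload" workaround could be scripted roughly as follows. This is only a sketch under assumptions: it goes slightly beyond a plain `sync` by writing to a temporary file, calling `fsync`, and renaming it over the target (a swapped-in technique, not something the thread prescribes), so that a reader never opens a half-written file. The function names are hypothetical, and the reload command is passed in by the caller rather than hard-coded.

```python
import os
import subprocess
import tempfile


def write_config_durably(path, text):
    """Write `text` to `path` so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file lives in the same directory so os.replace() is a plain rename.
    # Note: mkstemp creates the file with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".conf.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to disk
        os.replace(tmp_path, path)  # atomic rename over the old config
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Also flush the directory entry so the rename itself is durable (Linux).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def update_and_reload(path, text, reload_cmd):
    """Persist the new configuration first, and only then trigger the reload."""
    write_config_durably(path, text)
    subprocess.run(reload_cmd, check=True)
```

A call might look like `update_and_reload("/etc/keepalived/keepalived.conf", new_text, ["service", "keepalived", "reload"])`, where `new_text` is whatever configuration the caller has generated.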
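Regarding "Keepalived was started immediately after dpvs; because dpvs initializes slowly (10-15 s), the check child process initially fails to connect to dpvs and exits right away, and the Keepalived main process restarts it immediately": would it make sense to restart the check child process after a timer delay, or to add a retry limit? We once added a few lines of code to check that caused a core dump, and because the main process kept restarting the check child immediately, the core dump files filled up the disk.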
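The delayed-restart-with-retry-limit idea might look like the sketch below. It is not Keepalived code: `supervise`, `run_child`, and the default limits are hypothetical, and exponential backoff is one possible choice of delay policy.

```python
import time


def supervise(run_child, max_restarts=5, base_delay=1.0, max_delay=30.0,
              sleep=time.sleep):
    """Run `run_child` until it exits cleanly or the restart budget is spent.

    `run_child()` must block until the child exits and return its exit
    status (0 means a clean exit). Returns the number of restarts performed;
    raises RuntimeError once `max_restarts` restarts have all failed.
    """
    restarts = 0
    delay = base_delay
    while True:
        status = run_child()
        if status == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"child failed {restarts + 1} times, giving up")
        sleep(delay)                       # wait instead of respawning at once
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
        restarts += 1
```

With a cap on restarts, a child that crashes on every start (as in the core dump incident) produces a bounded number of dumps and log lines instead of filling the disk.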
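Perhaps the keepalived code could be changed so that it continues initializing only after dpvs has started and is in a ready state. dpvs might need some mechanism to announce readiness to the outside, for example writing something like a pid file. Alternatively, keepalived could actively probe whether dpvs is ready, for example by trying a get command.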
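Right. The current implementation uses DPVS_SO_GET_INFO in the check child process to determine whether dpvs is ready; if it is not, the child exits on its own, which raises SIGCHLD. The Keepalived main process handles SIGCHLD by restarting check immediately. We could consider doing the dpvs readiness check in the Keepalived main process instead.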
Doing it in the main process looks better: if dpvs is not up, there is no point in going any further.
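The "check readiness in the main process before starting check" proposal could be sketched like this. The `probe` callable is a hypothetical stand-in for whatever readiness query is used (for instance a DPVS_SO_GET_INFO request); no real dpvs or Keepalived API is called here, and the timeout values are arbitrary.

```python
import time


def wait_until_ready(probe, timeout=60.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it reports ready or `timeout` seconds have passed.

    `probe()` returns True when the backend is ready. An OSError (which
    includes ConnectionRefusedError) is treated as "not ready yet".
    Returns True if ready, False if the deadline passed first.
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # e.g. connection refused while the backend is still starting
        if clock() >= deadline:
            return False
        sleep(interval)
```

The main process would call something like `wait_until_ready(dpvs_probe)` once, and start the check child only if it returns True, so that a slow dpvs startup costs a handful of probes rather than thousands of respawns.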