aibrix redis is not stable and exits after receiving SIGTERM

🐛 Describe the bug

ubuntu@158-101-17-114:~$ kubectl logs -f aibrix-redis-master-84769768cb-j5rfb -p -n aibrix-system
1:C 16 Feb 2025 18:46:20.187 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 16 Feb 2025 18:46:20.187 * Redis version=7.4.2, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 16 Feb 2025 18:46:20.187 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
1:M 16 Feb 2025 18:46:20.187 * monotonic clock: POSIX clock_gettime
1:M 16 Feb 2025 18:46:20.189 * Running mode=standalone, port=6379.
1:M 16 Feb 2025 18:46:20.189 * Server initialized
1:M 16 Feb 2025 18:46:20.189 * Ready to accept connections tcp
1:signal-handler (1739731666) Received SIGTERM scheduling shutdown...
1:M 16 Feb 2025 18:47:46.562 * User requested shutdown...
1:M 16 Feb 2025 18:47:46.562 * Saving the final RDB snapshot before exiting.
1:M 16 Feb 2025 18:47:46.564 * DB saved on disk
1:M 16 Feb 2025 18:47:46.564 # Redis is now ready to exit, bye bye...

Steps to Reproduce

Deploy on Lambda Cloud.

Expected behavior

Redis should be very stable. I've never seen this issue before.

Environment

nightly

Feb 16 '25 18:02 Jeffwan

Same here. It only happens on a Lambda instance + nvkind.

Feb 18 '25 06:02 Jeffwan

The problem still exists.

Apr 29 '25 05:04 Jeffwan

Actually, most of the containers crashed:

metadata-service

gpu-optimizer

gateway-plugin

redis-master

controller-manager

Apr 29 '25 05:04 Jeffwan

Three categories:

Solid software like redis/controller/gateway-plugin: exitCode is 0. They all have error handling.
Our own components, like gpu-optimizer and metadata-service, show other exit codes.
The kuberay pod is not affected, which is weird.
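
For reading the exit codes mentioned above, here is a minimal sketch of the usual Linux convention: 0 is a clean exit (for example, a process that handled SIGTERM and shut down on its own, as the Redis log shows), and codes above 128 conventionally mean "killed by signal number code - 128". The helper name `describe_exit_code` and the small signal table are illustrative assumptions, not part of AIBrix or Kubernetes, and the actual exit codes of gpu-optimizer and metadata-service are not given in this thread.

```python
# Hypothetical helper: interpret a container exit code using the
# common "128 + signal number" convention on Linux.

# Linux signal numbers for a few signals commonly seen in container exits.
COMMON_SIGNALS = {
    2: "SIGINT",
    6: "SIGABRT",
    9: "SIGKILL",
    11: "SIGSEGV",
    15: "SIGTERM",
}


def describe_exit_code(code: int) -> str:
    """Return a short human-readable guess at what an exit code means."""
    if code == 0:
        # The process returned 0 itself, e.g. after handling SIGTERM gracefully.
        return "clean exit (process returned 0)"
    if code > 128:
        signum = code - 128
        name = COMMON_SIGNALS.get(signum, f"signal {signum}")
        return f"likely terminated by {name}"
    # Codes 1..128 are chosen by the application itself.
    return f"application error (exit status {code})"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, "->", describe_exit_code(code))
```

Under this convention, a process that does not handle SIGTERM would typically show 143, and one that was force-killed would show 137. An exitCode of 0 only tells us the process shut down cleanly after being asked to stop, not who sent the signal.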

We are pretty sure it's due to the kind setup.

Apr 29 '25 05:04 Jeffwan

It looks like the worker node has enough resources.

Apr 29 '25 05:04 Jeffwan

I cannot easily figure this out. It's kind of hard to debug the kind problem here.

Apr 29 '25 06:04 Jeffwan