
Redis is not that stable and quits on SIGTERM

Open Jeffwan opened this issue 10 months ago • 6 comments

🐛 Describe the bug

Image

ubuntu@158-101-17-114:~$ kubectl logs -f aibrix-redis-master-84769768cb-j5rfb -p -n aibrix-system
1:C 16 Feb 2025 18:46:20.187 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 16 Feb 2025 18:46:20.187 * Redis version=7.4.2, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 16 Feb 2025 18:46:20.187 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
1:M 16 Feb 2025 18:46:20.187 * monotonic clock: POSIX clock_gettime
1:M 16 Feb 2025 18:46:20.189 * Running mode=standalone, port=6379.
1:M 16 Feb 2025 18:46:20.189 * Server initialized
1:M 16 Feb 2025 18:46:20.189 * Ready to accept connections tcp
1:signal-handler (1739731666) Received SIGTERM scheduling shutdown...
1:M 16 Feb 2025 18:47:46.562 * User requested shutdown...
1:M 16 Feb 2025 18:47:46.562 * Saving the final RDB snapshot before exiting.
1:M 16 Feb 2025 18:47:46.564 * DB saved on disk
1:M 16 Feb 2025 18:47:46.564 # Redis is now ready to exit, bye bye...
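To see how long this Redis instance stayed up, the timestamps in the log above can be compared directly. The following is a minimal sketch, not part of the original report: the helper names are hypothetical, and it assumes the container clock is UTC, which is consistent with the epoch value on the signal-handler line matching the next log line.

```python
from datetime import datetime, timezone

# Two lines copied verbatim from the log above.
READY_LINE = "1:M 16 Feb 2025 18:46:20.189 * Ready to accept connections tcp"
SIGTERM_LINE = "1:signal-handler (1739731666) Received SIGTERM scheduling shutdown..."


def parse_redis_timestamp(line: str) -> datetime:
    """Parse the 'DD Mon YYYY HH:MM:SS.mmm' stamp that follows the 'pid:role' prefix."""
    parts = line.split()
    stamp = " ".join(parts[1:5])  # e.g. "16 Feb 2025 18:46:20.189"
    # Assumption: the container clock is UTC.
    parsed = datetime.strptime(stamp, "%d %b %Y %H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def parse_signal_epoch(line: str) -> datetime:
    """Parse the Unix epoch printed in parentheses on a signal-handler line."""
    epoch = int(line.split("(", 1)[1].split(")", 1)[0])
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


ready_at = parse_redis_timestamp(READY_LINE)
sigterm_at = parse_signal_epoch(SIGTERM_LINE)
uptime_seconds = (sigterm_at - ready_at).total_seconds()
print(f"Redis ran for about {uptime_seconds:.0f} s before SIGTERM")
```

For this log the gap is roughly 86 seconds. The log only shows that Redis received SIGTERM and shut down cleanly; it does not show who sent the signal.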

Steps to Reproduce

Deploy on Lambda Cloud.

Expected behavior

Redis should be very stable. I've never seen such an issue before.

Environment

nightly

Jeffwan avatar Feb 16 '25 18:02 Jeffwan

Same here. It only happens on a Lambda instance with nvkind.

Jeffwan avatar Feb 18 '25 06:02 Jeffwan

Image

The problem still exists.

Jeffwan avatar Apr 29 '25 05:04 Jeffwan

Actually, most of the containers crashed.

metadata-service

Image Image

gpu-optimizer

Image Image

gateway-plugin

Image Image

redis-master

Image Image

controller-manager

Image Image

Jeffwan avatar Apr 29 '25 05:04 Jeffwan

There are three categories:

  • Solid software like redis, controller, and gateway-plugin: the exitCode is 0. They all have error handling.
  • Components we wrote ourselves, like gpu-optimizer and metadata-service, show other exit codes.
  • The kuberay pod is not affected, which is weird.

We are pretty sure it's due to the kind setup.
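The split between exitCode 0 and other exit codes can be read with the usual convention that a process killed by signal N is reported as 128 + N, while a process that catches the signal and shuts down itself (as the Redis log shows) exits with 0. A minimal sketch of that classification follows; the function name is hypothetical and the sample values are illustrative, not the codes from the screenshots.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Classify a container exit code using the 128 + signal-number convention."""
    if exit_code == 0:
        # The process handled shutdown itself, e.g. Redis saving its RDB and exiting.
        return "clean exit"
    if exit_code > 128:
        signum = exit_code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    # Anything else is an error the application reported on its own.
    return f"application error (exit code {exit_code})"


# Illustrative values only: 0 (graceful), 143 (128 + SIGTERM), 1 (generic failure).
for code in (0, 143, 1):
    print(code, "->", describe_exit_code(code))
```

Under this reading, components that do not install a SIGTERM handler would be expected to show 143, so a mix of 0 and non-zero codes is compatible with every container receiving the same signal.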

Jeffwan avatar Apr 29 '25 05:04 Jeffwan

Image Image

It looks like the worker node has enough resources.

Jeffwan avatar Apr 29 '25 05:04 Jeffwan

Image

I cannot easily figure this out. It is kind of hard to debug the kind problem here.

Jeffwan avatar Apr 29 '25 06:04 Jeffwan