Add a debug mode configuration to the current k8s deployment stack
Feature request
Is your feature request related to a problem? Please describe.
Can we add a debug mode configuration to the current k8s deployment stack? In debug mode, when the BE process crashes, it would not fail the liveness/readiness check that triggers deleting the current container and recreating/restarting a new one. That way, we could log into the container to troubleshoot further and restart the BE process manually. We found this debug mode very necessary during today's large PCU crash debugging and would like it to be better supported in the future.
Describe the solution you'd like
One option is to use starrocks_wrapper as the entry point instead of starrocks_be. starrocks_wrapper is a program that monitors BE, restarts it if necessary, and kills itself if it wants the pod to restart.
Describe alternatives you've considered
Additional context
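The starrocks_wrapper idea from the issue description could be sketched as a small bash supervisor. This is a minimal sketch under stated assumptions, not how StarRocks or the operator implements it: the script name starrocks_wrapper.sh and the MAX_RESTARTS and RESTART_DELAY variables are hypothetical, it assumes start_be.sh run without --daemon keeps BE in the foreground, and "kills itself if it wants the pod to restart" is interpreted here as exiting after a fixed number of failed restarts.

```bash
#!/usr/bin/env bash
# Hypothetical "starrocks_wrapper" sketch: run the BE command passed as
# arguments in the foreground, restart it when it fails, and exit the wrapper
# (so Kubernetes restarts the container) once the restart budget is spent.
# MAX_RESTARTS and RESTART_DELAY are illustrative knobs, not real settings.

MAX_RESTARTS=${MAX_RESTARTS:-3}
RESTART_DELAY=${RESTART_DELAY:-5}

# supervise CMD [ARGS...]: run CMD, restarting it after a non-zero exit up to
# MAX_RESTARTS times; return the status of the last run.
supervise() {
  local restarts=0 status
  while true; do
    "$@"
    status=$?
    if (( status == 0 )); then
      echo "wrapper: $1 exited cleanly" >&2
      return 0
    fi
    echo "wrapper: $1 exited with status $status" >&2
    if (( restarts >= MAX_RESTARTS )); then
      echo "wrapper: restart budget exhausted, exiting" >&2
      return "$status"
    fi
    restarts=$((restarts + 1))
    echo "wrapper: restart $restarts/$MAX_RESTARTS in ${RESTART_DELAY}s" >&2
    sleep "$RESTART_DELAY"
  done
}

# Hypothetical entrypoint usage: starrocks_wrapper.sh ./start_be.sh
# (assumes start_be.sh without --daemon keeps BE in the foreground).
# With no arguments the file only defines functions, so it can be sourced.
if (( $# > 0 )); then
  supervise "$@"
  exit $?
fi
```

Exiting is enough to get a restart: when a container's main process exits, the kubelet restarts the container according to the pod's restartPolicy, which is always Always for StatefulSet pods.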
@yandongxiao
Great suggestion. In debug mode, we should first remove the liveness probe, and second, ensure that a BE crash doesn't cause the entire pod to restart. For the second part, we might need to modify the entrypoint.
The BE entrypoint still needs to handle termination signals (such as SIGTERM) properly.
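For the entrypoint half of debug mode, here is a hypothetical bash sketch: when an illustrative DEBUG_MODE=true variable is set (not an existing StarRocks or operator setting), a BE exit no longer ends the container's main process, so the pod keeps running and one can exec into it. While holding, it traps SIGTERM, because the kernel does not deliver SIGTERM to a container's PID 1 unless that process has installed a handler, which would leave pod deletion waiting for the full grace period. The probe half still belongs in the pod spec: only a failing liveness probe makes the kubelet restart a container, while a failing readiness probe only removes the pod from Service endpoints. Forwarding signals to BE while it is still running is left out of this sketch.

```bash
#!/usr/bin/env bash
# Hypothetical debug-mode entrypoint sketch. DEBUG_MODE is an illustrative
# variable, not an existing StarRocks or operator setting.

# is_debug_mode: succeed only when DEBUG_MODE is exactly "true".
is_debug_mode() {
  [[ "${DEBUG_MODE:-false}" == "true" ]]
}

# hold_container: keep this process alive indefinitely, but exit promptly on
# SIGTERM so pod deletion is not stuck waiting for the grace period.
hold_container() {
  local pid=""
  trap 'kill "$pid" 2>/dev/null; exit 143' TERM
  while true; do
    # Sleep in the background and `wait` on it: bash defers traps until a
    # foreground command finishes, but `wait` is interrupted immediately.
    sleep 3600 &
    pid=$!
    wait "$pid"
  done
}

# run_be CMD [ARGS...]: run BE in the foreground. Normally propagate its exit
# status (the container exits and Kubernetes restarts it); in debug mode keep
# the container alive so one can exec in, inspect logs, and restart BE by hand.
run_be() {
  "$@"
  local status=$?
  if is_debug_mode; then
    echo "BE exited with status $status; debug mode, keeping container alive" >&2
    hold_container
  fi
  return "$status"
}

# With no arguments the file only defines functions, so it can be sourced.
if (( $# > 0 )); then
  run_be "$@"
  exit $?
fi
```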
Before the operator was launched, we deployed our own BE StatefulSet with a while loop at the end and didn't set any liveness/readiness probes. When the BE process crashed, the pod/container stayed alive, and we could log into the pod to troubleshoot and restart the BE process. Below is the pseudo code.
# start BE in the background as a daemon
./start_be.sh --daemon
# keep the container's main process alive even if BE exits
while sleep 10; do
...
done
I think a better way is to use the entrypoint process to trap SIGTERM (SIGKILL cannot be caught) and forward it to the BE process. Here is an example you can follow: https://unix.stackexchange.com/questions/146756/forward-sigterm-to-child-in-bash
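The trap-and-forward approach from the linked StackExchange question could look roughly like this in bash. It is a sketch, not the operator's actual entrypoint: run_forwarding and entrypoint.sh are hypothetical names, and it assumes BE runs in the foreground as the script's direct child (start_be.sh --daemon would detach it, so a forwarded signal would not reach BE). Since SIGKILL cannot be caught, only SIGTERM and SIGINT are trapped, and both are forwarded as SIGTERM because asynchronous commands started by a shell without job control ignore SIGINT.

```bash
#!/usr/bin/env bash
# Sketch of an entrypoint that forwards termination signals to the BE child.
# Hypothetical usage as the container command: entrypoint.sh ./start_be.sh
# (assumes start_be.sh without --daemon keeps BE in the foreground).
# SIGKILL cannot be trapped, so only SIGTERM and SIGINT are handled.

child=""
signaled=0

# forward: note that a signal arrived and send SIGTERM to the child.
# Background jobs of a non-interactive shell ignore SIGINT, so SIGINT is
# forwarded as SIGTERM as well.
forward() {
  signaled=1
  if [[ -n "$child" ]]; then
    kill -TERM "$child" 2>/dev/null
  fi
}

# run_forwarding CMD [ARGS...]: run CMD as a background child, forward
# TERM/INT to it, and return the child's exit status.
run_forwarding() {
  local status
  trap forward TERM INT
  "$@" &
  child=$!
  while true; do
    signaled=0
    wait "$child"
    status=$?
    # A trapped signal makes `wait` return early (status > 128) while the
    # child may still be shutting down, so wait again for its real status.
    if (( signaled && status > 128 )); then
      continue
    fi
    break
  done
  trap - TERM INT
  child=""
  return "$status"
}

# With no arguments the file only defines functions, so it can be sourced.
if (( $# > 0 )); then
  run_forwarding "$@"
  exit $?
fi
```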
We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!
Reactivating the issue.
Once COREDUMP_ENABLED=true is set, it automatically enters debug mode.