Add liveness/readiness probes to web/task - fixes #414

Open erz4 opened this issue 2 years ago • 11 comments

SUMMARY

Added liveness & readiness probes to the awx-web container.

fixes #414

ISSUE TYPE
  • New or Enhanced Feature
ADDITIONAL INFORMATION

erz4 avatar Jan 16 '23 12:01 erz4

Hi @shanemcd how can we advance that PR?

erz4 avatar Jan 18 '23 12:01 erz4

Can we also consider a liveness probe for the awx-task container?

A command like awx-manage run_dispatcher --running | grep '\[\]' returns 0 if awx-task is propagating messages properly, and returns 1 if there is a communication problem between task and Postgres (for example, after a connection interruption).

We are still discussing this on Matrix with @TheRealHaoLiu, also to find a way to restore the connection to Postgres.

tanganellilore avatar Jan 31 '23 13:01 tanganellilore

@tanganellilore

Added for task. What do you think about the defaults?

erz4 avatar Feb 01 '23 09:02 erz4

I'm not sure about the period. The command takes a few seconds to run (about 2 or 3), so I think we can use something like 10 to 15 seconds for the task. When the task container is not working, nothing behind the UI works, and you see all tasks pending (or failing). A 10-second period with 3 consecutive failures means the container is restarted about 35 to 40 seconds after a disconnection from the DB, which seems fine to me. In any case, users can customize these options on the operator side.
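
As a rough illustration of the settings discussed above, the task liveness check could be written as a standard Kubernetes exec probe. This is only a sketch: the container name and the use of /bin/sh to run the pipeline are assumptions, and the values are the ones floated in this thread, not the operator's final defaults.

```yaml
# Sketch only: fragment of a Pod spec, not the operator's actual template.
containers:
  - name: awx-task            # assumed container name
    livenessProbe:
      exec:
        command:
          - /bin/sh
          - -c
          # A pipeline needs a shell; the probe fails when grep exits non-zero.
          - awx-manage run_dispatcher --running | grep '\[\]'
      periodSeconds: 10       # run the check every 10 seconds
      timeoutSeconds: 10      # the command itself takes about 2-3 seconds
      failureThreshold: 3     # restart after 3 consecutive failures
      successThreshold: 1     # Kubernetes requires 1 for liveness probes
```

With these values the container is restarted only after three failed checks in a row, so a single slow run of the command does not trigger a restart.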

tanganellilore avatar Feb 01 '23 12:02 tanganellilore

To avoid Molecule destroying the environment run: molecule test --destroy=never

gundalow avatar Feb 14 '23 15:02 gundalow

@erz4 from the community meeting

  • we would like the readiness and liveness probe parameters to be nested under a top-level parameter and hidden
  • give the ability to disable the readiness and liveness probes

TheRealHaoLiu avatar Feb 16 '23 02:02 TheRealHaoLiu

I will help troubleshoot the CI failure.

TheRealHaoLiu avatar Feb 16 '23 02:02 TheRealHaoLiu

@TheRealHaoLiu so every probe should have two parameters in the CRD:

  1. enable/disable - enabled by default
  2. parameters for the probe - with the defaults we already set
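
A minimal sketch of what those two parameters could look like for one probe. Every key name here is hypothetical (in particular `enabled`), and the layout is not the actual AWX CRD schema.

```yaml
# Hypothetical layout for a single probe; key names are illustrative only.
web:
  readiness:
    enabled: true             # 1. on/off switch, enabled by default (assumed name)
    # 2. probe parameters, defaulting to the values already set in the PR
    initial_delay: 3
    period: 3
    timeout: 5
    failure_threshold: 3
    success_threshold: 1
```

Setting `enabled: false` would then let a user drop the probe entirely without touching the other values.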

erz4 avatar Feb 16 '23 07:02 erz4

@erz4 Re: nesting the variables, currently it shows like this:

    task_liveness_failure_threshold: 3
    task_liveness_initial_delay: 3
    task_liveness_period: 3
    task_liveness_success_threshold: 1
    task_liveness_timeout: 10
    task_privileged: false
    task_readiness_failure_threshold: 3
    task_readiness_initial_delay: 3
    task_readiness_period: 3
    task_readiness_success_threshold: 1
    task_readiness_timeout: 10
    web_liveness_failure_threshold: 3
    web_liveness_initial_delay: 3
    web_liveness_period: 3
    web_liveness_success_threshold: 1
    web_liveness_timeout: 10
    web_readiness_failure_threshold: 3
    web_readiness_initial_delay: 3
    web_readiness_period: 3
    web_readiness_success_threshold: 1
    web_readiness_timeout: 5

We are hoping to nest these variables to declutter the AWX CR a bit.

    task.liveness.failure_threshold: 3
    task.liveness.initial_delay: 3
    task.liveness.period: 3
    task.liveness.success_threshold: 1
    task.liveness.timeout: 10
    task.readiness.failure_threshold: 3
    task.readiness.initial_delay: 3
    task.readiness.period: 3
    task.readiness.success_threshold: 1
    task.readiness.timeout: 10
    web.liveness.failure_threshold: 3
    web.liveness.initial_delay: 3
    web.liveness.period: 3
    web.liveness.success_threshold: 1
    web.liveness.timeout: 10
    web.readiness.failure_threshold: 3
    web.readiness.initial_delay: 3
    web.readiness.period: 3
    web.readiness.success_threshold: 1
    web.readiness.timeout: 5
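
For comparison, the dotted keys above could be written as nested YAML mappings. This is a sketch of the idea only: the exact nesting is an assumption, and the values are copied from the listing above.

```yaml
# Sketch: the dotted keys above as nested mappings (layout is assumed).
task:
  liveness:
    failure_threshold: 3
    initial_delay: 3
    period: 3
    success_threshold: 1
    timeout: 10
  readiness:
    failure_threshold: 3
    initial_delay: 3
    period: 3
    success_threshold: 1
    timeout: 10
web:
  liveness:
    failure_threshold: 3
    initial_delay: 3
    period: 3
    success_threshold: 1
    timeout: 10
  readiness:
    failure_threshold: 3
    initial_delay: 3
    period: 3
    success_threshold: 1
    timeout: 5
```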

When testing this out, it fails on the "Apply deployment resources" task, presumably because the probe timed out. The timeout may be too low. The timeout is 10 seconds and the database migrations take much longer than that to run. Probably 60-70 seconds if I had to guess.

rooftopcellist avatar Mar 01 '23 19:03 rooftopcellist

Hi @erz4, we are prioritizing getting this in next.

Due to the recent change to the AWX deployment (web-task split), the PR needs some heavy rebasing and updating.

Would you be able to get to this?

TheRealHaoLiu avatar Apr 05 '23 18:04 TheRealHaoLiu

There is an open PR actively being worked on here to implement this:

  • https://github.com/ansible/awx-operator/pull/1674

rooftopcellist avatar Jan 17 '24 19:01 rooftopcellist

This feature has been merged as part of https://github.com/ansible/awx-operator/pull/1674

rooftopcellist avatar Mar 07 '24 20:03 rooftopcellist