[BUG] Kernel oops that happens during provisioning is marked as panic in the last task

Open bgoncalv opened this issue 3 years ago • 8 comments

Describe the bug
This was initially reported on https://bugzilla.redhat.com/show_bug.cgi?id=1623729

If a kernel oops occurs during provisioning, it is reported as a kernel panic in the last Beaker task. This makes it hard to understand why that task is shown as panic.

Version-Release number
28.2

To Reproduce
Steps to reproduce the behavior:

  1. Provision an OS that triggers a kernel oops during provisioning, but where provisioning still completes successfully
  2. Run several tasks
  3. The last task will show as panic because of the oops that happened during provisioning.

Actual behavior
The last task shows as panic

Expected behavior
The last task shouldn't show as panic

bgoncalv avatar Aug 06 '21 15:08 bgoncalv

Panic detection can be configured with PANIC_REGEX, which is set per lab controller (LC): https://github.com/beaker-project/beaker/blob/3a1155308899a301dbf042c55694680f03051346/LabController/src/bkr/labcontroller/default.conf#L23. By default it includes Oops.

You can disable panic detection altogether by including <watchdog panic="None"/> in your recipe.
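To illustrate how a regex-based check of this kind behaves, here is a minimal sketch. The pattern is a simplified, hypothetical stand-in, not the actual default from default.conf, and `find_panic` is an illustrative helper rather than a Beaker function.

```python
import re

# Hypothetical, simplified pattern for illustration only. The real default
# lives in the lab controller's default.conf and covers more cases.
PANIC_REGEX = re.compile(r"Kernel panic|Oops")


def find_panic(console_line):
    """Return the matched text if the console line looks like a panic, else None."""
    match = PANIC_REGEX.search(console_line)
    return match.group(0) if match else None


# An oops line matches even though the system may keep running afterwards.
print(find_panic("[   12.345678] Oops: 0002 [#1] SMP"))
# An ordinary boot message does not match.
print(find_panic("[   13.000000] systemd[1]: Started Journal Service."))
```

Because a pattern like this only looks at console text, it cannot tell whether the oops happened during provisioning or while a task was running.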

mdujava avatar Aug 09 '21 12:08 mdujava

Note, I don't want to ignore Oops in general. My only problem is with an Oops that happens during provisioning and causes the last task in the recipe to be marked as Panic. That task shouldn't be marked as panic, because there was no Oops while it was executing.

bgoncalv avatar Aug 09 '21 12:08 bgoncalv

@bgoncalv is right. I'm aware of this behavior. The problem is that we can't really mark a Panic during installation at the moment, so most of the time it lands on the first task. But it may happen that we fail to detect it at first, and then it lands on another task (the one that was running when we found the Panic).

Yeah. This needs to remain open, and we will need to find a better way to manage this.

StykMartin avatar Aug 09 '21 12:08 StykMartin

Thanks @StykMartin, do you have any suggestion for how we can work around this? For example, we could create a dummy task to trigger the panic detection and run it as the first task. Is there a way for a task to force the detection to run?

bgoncalv avatar Aug 09 '21 15:08 bgoncalv

Hello @bgoncalv. My answer never made it to GitHub. I'm sorry about that. The situation is quite a bit more complicated, but I feel like we can reach some compromise if you still need it.

Panic detection runs in the background in each region. A detected panic is then proxied to the main server for processing. The main server checks the tasks assigned to the given recipe and iterates over them. If the iteration is exhausted, we mark the last item, because that's what ends up in the variable.

@bgoncalv, if you have any idea how we can make this saner for you, feel free to shoot.
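The fallback described above can be sketched as follows. This is a minimal illustration, not Beaker's actual server code: the `Task` class, its `status` values, and `mark_panic` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    status: str  # hypothetical states: "new", "running", "completed"
    result: str = "none"


def mark_panic(tasks):
    """Mark the running task as panic; fall back to the last task if none is running."""
    task = None
    for task in tasks:
        if task.status == "running":
            break
    # If no task was running, the loop ran to exhaustion and `task`
    # still refers to the last item in the list.
    if task is not None:
        task.result = "panic"
    return task


# During provisioning no task has started yet, so the last one gets marked.
recipe_tasks = [Task("install-check", "new"), Task("my-test", "new"), Task("cleanup", "new")]
print(mark_panic(recipe_tasks).name)  # prints "cleanup"
```

In this sketch the panic lands on the last task purely as a side effect of the loop variable, which matches the behavior reported in this issue.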

StykMartin avatar Nov 23 '21 17:11 StykMartin

@StykMartin thanks for the reply. That feature would indeed be helpful for us. If panic detection always assigned a panic detected during provisioning to the last task of the recipe, that would help us.

bgoncalv avatar Nov 23 '21 17:11 bgoncalv

This is what happens at the moment. All panics are assigned to the last task if there is no running task. So basically, if we didn't register any task start (for example, during provisioning), we will always mark the last task. However, there is a catch. If restraint reports task n-2 as finished and task n-1 has not yet been reported as started, a panic in this window may be reported on the last task as well. But the chances are quite small that restraint will crash at a moment like this.

StykMartin avatar Nov 23 '21 18:11 StykMartin

I would suggest you put a dummy task at the end and then collect panics from there.

I will try to redesign this feature so that we can report panics during provisioning to a dedicated space instead of marking tasks.

StykMartin avatar Nov 23 '21 18:11 StykMartin