
Flood of "Unable to obtain node for" error messages

Open • terrytangyuan opened this issue 1 year ago • 9 comments

Pre-requisites

  • [X] I have double-checked my configuration
  • [X] I can confirm the issue exists when I tested with :latest
  • [ ] I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Seeing a lot of these errors:

time="2023-11-02T12:45:44.922Z" level=error msg="was unable to obtain node for " namespace=argo workflow=hello-world-h7bpn
time="2023-11-02T12:45:44.922Z" level=error msg="was unable to obtain node for " namespace=argo workflow=hello-world-7vspx

This was introduced in https://github.com/argoproj/argo-workflows/pull/11665

cc @isubasinghe

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Hello world example

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

terrytangyuan, Nov 03 '23

latest

This was introduced in #11665

Rather #11451 if you're on 3.5, since #11665 was the backport.

Seeing a lot of these errors:

Is there a specific circumstance under which you see them? Since #11451 should mostly just reveal any existing bugs that were previously silently missed due to zero values.

Also, is anything going wrong in Workflow evaluation, or is it just printing errors while still operating correctly?

More details would be very useful as the error itself is not a root cause.

agilgur5, Nov 03 '23

The workflow is fine. We should check whether these errors are necessary. Maybe the node was just not initialized yet.

terrytangyuan, Nov 03 '23

We should check whether these errors are necessary. Maybe the node was just not initialized yet.

So if the node was not initialized yet, then there shouldn't be a call to Nodes.Get at all. The previous code already did that though, meaning the previous code was incorrect. The error message is an indication of a bug somewhere. If we fix the root cause of that bug, then the error should no longer show up.

The error message is not the problem, the underlying bug is.
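
For illustration, here is a rough, self-contained sketch of the accessor change being discussed; the type and method names (NodeStatus, Nodes, Get) are simplified stand-ins rather than the exact Argo Workflows code:

package main

import "fmt"

// Simplified stand-ins for the workflow node map; the real Argo types differ.
type NodeStatus struct{ Name string }

type Nodes map[string]NodeStatus

// Get surfaces a lookup miss instead of silently returning a zero value.
// Callers that log the returned error produce the
// "was unable to obtain node for" lines seen above.
func (n Nodes) Get(id string) (*NodeStatus, error) {
    node, ok := n[id]
    if !ok {
        return nil, fmt.Errorf("was unable to obtain node for %s", id)
    }
    return &node, nil
}

func main() {
    nodes := Nodes{"hello-world-h7bpn-1": {Name: "hello-world-h7bpn-1"}}

    // Old behaviour: a missing key silently yields a zero-value NodeStatus.
    fmt.Printf("old-style lookup: %+v\n", nodes["missing"])

    // New behaviour: the miss is reported, so the caller can log or handle it.
    if _, err := nodes.Get("missing"); err != nil {
        fmt.Println("new-style lookup:", err)
    }
}

Nothing in this shape changes which lookups happen; it only makes a miss visible, which is why the log line points at a pre-existing bug rather than being one itself.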

agilgur5, Nov 04 '23

@terrytangyuan and @agilgur5 I know exactly what is going on here: it is speculative execution in the DAG section of the code.

It most likely is looking for the boundaryNode here. The boundaryNode is represented by the empty string. To me, this logging is a good thing: it indicates something wrong on the code side (speculation in the DAG code).

We can of course do an if statement on the boundaryID and not print if it is the empty string; I opted not to do this because it would mean that any serious bug would also not get logged.
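
As a rough sketch of that guard, using a hypothetical helper rather than the actual Argo call site:

package main

import "log"

type NodeStatus struct{ Name string }

// getNode is a hypothetical helper illustrating the guard: the boundary
// node is represented by the empty string, so a miss on "" is expected
// and would not be logged.
func getNode(nodes map[string]NodeStatus, id string) (*NodeStatus, bool) {
    if node, ok := nodes[id]; ok {
        return &node, true
    }
    if id != "" {
        // Only a missing non-empty ID is logged. The trade-off: a genuine
        // bug involving an empty ID would also be silenced.
        log.Printf("was unable to obtain node for %s", id)
    }
    return nil, false
}

func main() {
    nodes := map[string]NodeStatus{}
    getNode(nodes, "")           // boundary lookup: silent
    getNode(nodes, "dag-task-1") // genuine miss: logged
}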

The unfortunate thing is that NodeIDs are not quite strings, at least from a logical perspective. They are much closer to a sum type.

data NodeID
  = NodeID String
  | BoundaryNode

or

pub enum NodeID {
  NodeID(String),
  BoundaryNode,
}
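
Go has no native sum types, so as a purely illustrative sketch (not a proposal for the actual NodeID API), the same distinction could be encoded along these lines:

package main

import "fmt"

// NodeID here makes the boundary case explicit instead of encoding it as
// the empty string. Illustrative only; these names are hypothetical.
type NodeID struct {
    id         string
    isBoundary bool
}

func NamedNodeID(id string) NodeID { return NodeID{id: id} }
func BoundaryNodeID() NodeID       { return NodeID{isBoundary: true} }

// String prints the boundary case distinctly so it can never be confused
// with a real, but empty, node ID.
func (n NodeID) String() string {
    if n.isBoundary {
        return "<boundary>"
    }
    return n.id
}

func main() {
    fmt.Println(NamedNodeID("hello-world-h7bpn-1")) // hello-world-h7bpn-1
    fmt.Println(BoundaryNodeID())                   // <boundary>
}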

I see this as two issues:

  1. A more typesafe, enum-like representation for node IDs
  2. Refactor the DAG code to remove speculation.

Of course, I'm happy to do the if statement as a temporary solution if the logging is too annoying.

isubasinghe, Nov 04 '23

So if the node was not initialized yet, then there shouldn't be a call to Nodes.Get at all. The previous code already did that though, meaning the previous code was incorrect.

Very much agreed

isubasinghe, Nov 04 '23

It most likely is looking for the boundaryNode here.

Might've missed a code link in "here"?

We can of course do an if statement on the boundaryID and not print if it is the empty string

So #12020 did something similar, but in that case the boundaryID is an option in every caller and so it indeed may be unset. I'm not sure which case this one is since a code link is missing, per above.

agilgur5, Nov 04 '23

@isubasinghe

tooptoop4, Jan 19 '24

I suspect this arises from the DAG code, as mentioned previously. I am now going to find out why we can ignore the error, and I will document this. If we can successfully ignore the error, I will remove the log for that particular section.

isubasinghe, Jan 19 '24

I don't really know how to fix this without rewriting the dag.go logic. The core problem is that the DAG logic is entirely speculative, which means it is impossible to differentiate between a missing node and one that simply doesn't exist yet.

I'm going to do something that I don't particularly like and open a PR to remove the logging in this section. The argument I am going to use is that, while this is problematic, the code does seem to work.

@agilgur5 what are your thoughts? I'd presume you'd also ideally want the speculation itself to be removed.

A correction to my previous comment: this is not just for the boundaryNode; this speculation exists for every node. (Confirmed by running a DAG and viewing the DAG and logs in real time.)

isubasinghe, Jan 20 '24

May be worth seeing if this still reproduces after https://github.com/argoproj/argo-workflows/pull/12233

drawlerr, Jan 31 '24

@drawlerr the DAG code will always produce this logging, unfortunately.

isubasinghe, Jan 31 '24

Bumping this issue.

After a while of getting warning-level logs, a fatal-level log is emitted for the same workflow, which kills the whole controller.

time="2024-04-02T16:39:45.097Z" level=fatal msg="was unable to obtain node for xx-1712057400-2227577750" namespace=xx workflow=xx-cron-1712057400

This happens on version 3.5.5.

I had to roll back to argo-workflows 3.4.10, which is not a real option either, since it has an issue where DAG step output results are not resolved in lifecycle hooks.

k-kaminskas, Apr 04 '24

Also getting a flood of these log messages on Argo Workflows v3.5.7.

level=warning msg="was unable to obtain the node for foo-8c78r-235979071, taskName beforetest"

ryancurrah, Jun 04 '24

Per my previous comments, this message is generally* indicative of a bug somewhere. It can appear in many different places, so context is critical, e.g. if it happens on a certain Workflow every time. Without those details, comments are not very helpful.

* = as Isitha wrote above, sometimes it's not a bug, but an unconditional log; in which case, the code should be modified to not log and have a comment about why the error is ignored, or not use the code that errors in the first place (e.g. it could be skipped entirely if the node will always be uninitialized).
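
As a sketch of that pattern, with hypothetical names and a toy node map standing in for the real Argo types:

package main

import (
    "errors"
    "fmt"
)

// errNodeNotFound and getNode are toy stand-ins for the real accessors.
var errNodeNotFound = errors.New("node not found")

func getNode(nodes map[string]string, id string) (string, error) {
    if n, ok := nodes[id]; ok {
        return n, nil
    }
    return "", fmt.Errorf("%w: %s", errNodeNotFound, id)
}

// resolveTask shows the suggested shape: at a call site where the node may
// legitimately not exist yet, the error is dropped deliberately with a
// comment explaining why, instead of being logged unconditionally.
func resolveTask(nodes map[string]string, taskNodeID string) {
    node, err := getNode(nodes, taskNodeID)
    if err != nil {
        // Expected during the speculative DAG pass: the task's node has not
        // been created yet, so the miss is intentionally ignored here rather
        // than logged as "was unable to obtain node for".
        return
    }
    fmt.Println("resolved:", node)
}

func main() {
    resolveTask(map[string]string{}, "dag-task-1") // no error output, no log flood
}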

agilgur5, Jun 04 '24

in which case, the code should be modified to not log and have a comment about why the error is ignored, or not use the code that errors in the first place (e.g. it could be skipped entirely if the node will always be uninitialized).

Agreed with this approach. Does anyone want to pick this up?

terrytangyuan, Aug 02 '24