argo-workflows
Flood of "Unable to obtain node for" error messages
Pre-requisites
- [X] I have double-checked my configuration
- [X] I can confirm the issue exists when I tested with `:latest`
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
Seeing a lot of these errors:
time="2023-11-02T12:45:44.922Z" level=error msg="was unable to obtain node for " namespace=argo workflow=hello-world-h7bpn
time="2023-11-02T12:45:44.922Z" level=error msg="was unable to obtain node for " namespace=argo workflow=hello-world-7vspx
This was introduced in https://github.com/argoproj/argo-workflows/pull/11665
cc @isubasinghe
Version
latest
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Hello world example
Logs from the workflow controller
```bash
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
```
Logs from in your workflow's wait container
```bash
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
```
> This was introduced in #11665
Rather #11451 if you're on 3.5, since #11665 was the backport.
> Seeing a lot of these errors:
Is there a specific circumstance under which you see them? Since #11451 should mostly just reveal any existing bugs that were previously silently missed due to zero values.
Also, is anything going wrong in Workflow evaluation, or is it just printing errors while still operating correctly?
More details would be very useful as the error itself is not a root cause.
The workflow is fine. We should check whether these errors are necessary. Maybe the node was just not initialized yet.
> We should check whether these errors are necessary. Maybe the node was just not initialized yet.
So if the node was not initialized yet, then there shouldn't be a call to Nodes.Get at all. The previous code already did that though, meaning the previous code was incorrect.
The error message is an indication of a bug somewhere. If we fix the root cause of that bug, then the error should no longer show up.
The error message is not the problem; the underlying bug is.
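A minimal sketch of the distinction being drawn here, with stand-in types (the real Nodes map and its Get/Has accessors live in the controller's v1alpha1 types and differ in detail):

```go
package main

import (
	"errors"
	"fmt"
)

// NodeStatus stands in for the controller's node status type.
type NodeStatus struct{ Name string }

// Nodes stands in for the node map kept in Workflow.Status.
type Nodes map[string]NodeStatus

var ErrNodeNotFound = errors.New("node not found")

// Get mirrors the post-#11451 accessor: it returns an error instead of a
// silent zero value when the node is absent.
func (n Nodes) Get(id string) (*NodeStatus, error) {
	node, ok := n[id]
	if !ok {
		return nil, ErrNodeNotFound
	}
	return &node, nil
}

// Has reports existence without producing an error.
func (n Nodes) Has(id string) bool {
	_, ok := n[id]
	return ok
}

func main() {
	nodes := Nodes{"hello-world-h7bpn": {Name: "hello-world-h7bpn"}}

	// Buggy pattern: calling Get on a node that may legitimately not be
	// initialized yet, then logging the failure as an error.
	if _, err := nodes.Get("hello-world-h7bpn-123"); err != nil {
		fmt.Println(`level=error msg="was unable to obtain node for hello-world-h7bpn-123"`)
	}

	// If absence is expected, there should be no Get (and no error log);
	// the caller should check existence explicitly instead.
	if nodes.Has("hello-world-h7bpn-123") {
		node, _ := nodes.Get("hello-world-h7bpn-123")
		fmt.Println("node exists:", node.Name)
	}
}
```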
@terrytangyuan and @agilgur5 I know exactly what is going on here: it is speculative execution in the DAG section of the code.
It most likely is looking for the boundaryNode here. The boundaryNode is represented by the empty string. To me this logging is a good thing: it indicates something wrong on the code side (speculation in the DAG code).
We can of course do an if statement on the boundaryID and not print if it is the empty string; I opted not to do this because it would mean that any serious bug would also not get logged.
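For reference, the declined workaround would look roughly like this (illustrative only, reusing the stand-in Nodes/NodeStatus types from the sketch above, not the actual controller code):

```go
// getNodeForBoundary suppresses the log only for the empty-string boundary
// ID; any other unresolved ID is still treated as a real bug. The downside,
// as noted above, is that a genuinely broken empty ID would go unlogged.
func getNodeForBoundary(nodes Nodes, boundaryID string) *NodeStatus {
	node, err := nodes.Get(boundaryID)
	if err != nil {
		if boundaryID != "" {
			fmt.Printf("was unable to obtain node for %s\n", boundaryID)
		}
		return nil
	}
	return node
}
```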
The unfortunate thing is that NodeIDs are not quite strings, at least from a logical perspective. They are much closer to a sum type.
```haskell
data NodeID
  = NodeID String
  | BoundaryNode
```

or

```rust
pub enum NodeID {
    NodeID(String),
    BoundaryID,
}
```
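Go, the controller's language, has no native sum types, but the same idea could be approximated with a small wrapper type whose constructors keep invalid states unrepresentable (a sketch only, not a proposal for the actual code):

```go
package main

import "fmt"

// NodeID is either a real node ID or the distinguished boundary marker;
// the unexported fields force construction through the two helpers below.
type NodeID struct {
	id         string
	isBoundary bool
}

func RealNodeID(id string) NodeID { return NodeID{id: id} }
func BoundaryNodeID() NodeID      { return NodeID{isBoundary: true} }

func (n NodeID) String() string {
	if n.isBoundary {
		return "<boundary>"
	}
	return n.id
}

func main() {
	fmt.Println(RealNodeID("hello-world-h7bpn-123")) // hello-world-h7bpn-123
	fmt.Println(BoundaryNodeID())                    // <boundary>
}
```

The point of either representation is the same: a caller can no longer pass the boundary marker where a real node ID is expected without the type system noticing.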
I see this as two issues:
- Something more typesafe for the enum representation
- Refactor the DAG code to remove speculation.
Of course, happy to do the if statement as a temporary solution if the logging is too annoying.
> So if the node was not initialized yet, then there shouldn't be a call to Nodes.Get at all. The previous code already did that though, meaning the previous code was incorrect.
Very much agreed
> It most likely is looking for the boundaryNode here.
Might've missed a code link in "here"?
> We can of course do an if statement on the boundaryID and not print if it is the empty string
So #12020 did something similar, but in that case the boundaryID is an option in every caller and so it indeed may be unset.
I'm not sure which case this one is, since a code link is missing, per above.
@isubasinghe
I suspect this arises from the DAG as mentioned previously. I am now going to find out why we can ignore the error and will document this. If we can successfully ignore the error, I will remove the log for that particular section.
I don't really know how to fix this without rewriting the dag.go logic. The core problem is that the DAG logic is entirely speculative, which means it is impossible to differentiate between a missing node and one that doesn't exist yet.
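A toy illustration of that ambiguity (not the real dag.go logic): during speculative evaluation, every pass asks for the nodes of tasks that may simply not have run yet, so the failed lookup alone cannot distinguish a bug from a normal in-progress DAG.

```go
package main

import "fmt"

func main() {
	// Only task A has executed so far; task B exists in the DAG spec, but
	// its node has not been created yet.
	nodes := map[string]string{"dag-task-a": "Succeeded"}

	for _, dep := range []string{"dag-task-a", "dag-task-b"} {
		if _, ok := nodes[dep]; !ok {
			// "Not yet created" and "should exist but is missing" are
			// indistinguishable here, so logging at this point fires on
			// every healthy, still-running DAG as well.
			fmt.Println("was unable to obtain node for", dep)
		}
	}
}
```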
I'm going to do something that I don't particularly like and open a PR to remove the logging in this section. The argument I am going to use for this is simply that, while this is problematic, the code does seem to work.
@agilgur5 what are your thoughts? I'd presume you'd also ideally want the speculation itself to be removed?
A correction on the previous comment: this is not just for the boundaryNode, it's for every node for which this speculation exists. (Confirmed by running a DAG and viewing the DAG and logs in real time.)
May be worth seeing if this still reproduces after https://github.com/argoproj/argo-workflows/pull/12233
@drawlerr the DAG code will always produce this logging, unfortunately.
Bumping this issue.
After a while of warning-level logs, a fatal-level log is received for the same workflow, which kills the whole controller.
time="2024-04-02T16:39:45.097Z" level=fatal msg="was unable to obtain node for xx-1712057400-2227577750" namespace=xx workflow=xx-cron-1712057400
It happens on version 3.5.5.
I had to roll back to argo-workflows 3.4.10, which is not a viable option either, since it has an issue where DAG step output results are not resolved in lifecycle hooks.
Also getting a flood of these log messages on Argo Workflows v3.5.7.
```
level=warning msg="was unable to obtain the node for foo-8c78r-235979071, taskName beforetest"
```
Per my previous comments, this message is generally* indicative of a bug somewhere. It can appear in many different places, so context is critical, e.g. if it happens on a certain Workflow every time. Without those details, comments are not very helpful.
\* as Isitha wrote above, sometimes it's not a bug, but an unconditional log; in which case, the code should be modified to not log and have a comment about why the error is ignored, or not use the code which errors in the first place (e.g. it could be skipped entirely if the node will always be uninitialized).
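A sketch of that suggested pattern (illustrative names, not actual controller code): swallow the lookup failure where absence is expected, and leave a comment explaining why, instead of logging it.

```go
// resolveOptionalNode returns false without logging when the node is absent.
func resolveOptionalNode(nodes map[string]string, id string) (string, bool) {
	phase, ok := nodes[id]
	if !ok {
		// Deliberately not logged: during DAG evaluation this node may
		// simply not have been initialized yet, which is expected and is
		// handled by the caller.
		return "", false
	}
	return phase, true
}
```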
> in which case, the code should be modified to not log and have a comment about why the error is ignored, or not use the code which errors in the first place (e.g. it could be skipped entirely if the node will always be uninitialized).
Agreed with this approach. Does anyone want to pick this up?