argo-workflows icon indicating copy to clipboard operation
argo-workflows copied to clipboard

feat: store podname in nodestatus

Open isubasinghe opened this issue 1 year ago • 14 comments

Fixes #12528

Motivation

Modifications

The changes are relatively simple, I reduced the pod name generation code as much as possible and stored the PodnName in the node itself.

Verification

Verified via testing, this change introduces a slight change to existing behaviour however. The change here is that we must have created a pod before a podName is assigned.

isubasinghe avatar Jan 13 '24 03:01 isubasinghe

@isubasinghe why do we need the pod name in node status? Can you provide more details about usecase/scenario? This will increase the workflow object size that will reduce the number of steps. if the usecase is to find the pod for that particular node, node name can be added as label in the pod object.

sarabala1979 avatar Jan 16 '24 06:01 sarabala1979

@sarabala1979 the issue isn't created yet, but this PR addressed this comment by Alex where we decided to store the podName in the Node status: https://github.com/argoproj/argo-workflows/issues/10267#issuecomment-1630308619

isubasinghe avatar Jan 16 '24 10:01 isubasinghe

I only took a quick glance at this, but doesn't this need to modify the UI code as well?

I thought I replied to this, must have just not pressed the comment button I suppose, this was on purpose, wanted to make the backend changes first and follow it up with the frontend changes, but happy to make the UI changes here as well if you'd like me to.

Also, for backward-compatibility, we'd still have to generate the pod names for Workflows created before this change.

Hmmm yeah you are right. What I might do is also keep the previous pod name generation code. We can than then also use that to test if these changes kept the original behaviour.

isubasinghe avatar Jan 30 '24 10:01 isubasinghe

Bumping this, @isubasinghe do you think you can rebase and resolve the conflicts?

@terrytangyuan this is something we use in kubeflow pipelines to retrieve pod names (we are currently planning to use a workaround you prescribed here), we are very much interested in seeing this merged, so we can rely on Argo to provide this value, what can we do to get this some more traction?

HumairAK avatar Apr 14 '24 18:04 HumairAK

We can discuss this in the upcoming contributors meeting to prioritize.

terrytangyuan avatar Apr 14 '24 18:04 terrytangyuan

@HumairAK this is currently a feature, so it wouldn't land until the next minor (currently 3.6). If KFP is only bumping to 3.4, this PR wouldn't solve your problem

agilgur5 avatar Apr 15 '24 01:04 agilgur5

@agilgur5 that is fine, hence why we plan to go with the work around for now, until we can upgrade to the version that (hopefully) has this change included, at which point we'll switch to relying on argo for this.

HumairAK avatar Apr 15 '24 17:04 HumairAK

@HumairAK ok. That does affect our prioritization though, as that means KFP won't be using this functionality in the short or likely near-term.

agilgur5 avatar Apr 16 '24 15:04 agilgur5

Effectively, this is a trade-off between storage and determinism.

To be clear, I don't have a strong opinion on that per se; but I think we should be explicit about that -- larger workflows would be affected by this feature.

I probably lean toward agreeing with this feature, as it would simplify areas of the codebase where we do have some bugs / edge-cases currently that are handled in different ways in different places.

(not during a backward-compat window though)

The deprecation window is actually a bit complicated now that I think about it. If we ever want to remove the pod name generation code, we would break backward-compat with Workflows that were created in a version that doesn't contain the pod name in the status.

For instance, say 3.6 puts pod names into the status by default and has backward-compat to derive pod names when not present in the status. That backward-compat means the POD_NAMES env var still has to be set.

Then say, in 3.7 we were to remove the backward-compat -- all Workflows created with Argo <3.6 would no longer be processable, e.g. for retries or Pod logs in the UI and possibly other features. We may perhaps want to keep this backward-compat around for more than 1 minor version as such (although I think Emissary only had 1 minor version of backward-compat as well: default in 3.3, only option in 3.4 -- so there may be precedent to only do 1 minor of backward-compat).

With that in mind, the pod name derivation code is going to still have to be around for a while. Defaulting to this feature means that'd we could eliminate bugs for newer Workflows, but might still have them present in older Workflows. So the frequency of the old bugs would decrease, but we may want to still fix them. Or in other words, the backward-compat requirement decreases some of the benefit of this feature.

agilgur5 avatar Apr 16 '24 17:04 agilgur5

Discussed in today's Contributor Meeting. The storage concern that I brought up earlier was still relevant, especially as we've had more users recently reporting etcd getting full (particularly on managed k8s providers where you cannot configure your etcd space), such as #12802.

But given that I brought it up and I personally think the trade-off is worthwhile in this case -- determinism over storage and treat storage as a separate problem to optimize independent of any one field -- this has the green light to go.

We do need to implement the backward compat in this PR as well as decide whether pod names will be on all nodes or just pod nodes.

agilgur5 avatar Jun 19 '24 02:06 agilgur5

This PR has been automatically marked as stale because it has not had recent activity and needs further changes. It will be closed if no further activity occurs.

github-actions[bot] avatar Jul 23 '24 02:07 github-actions[bot]

This PR has been closed due to inactivity and lack of changes. If you would like to still work on this PR, please address the review comments and re-open.

github-actions[bot] avatar Aug 06 '24 02:08 github-actions[bot]

this has the green light to go.

hmm would still be interested in seeing this change

@isubasinghe do you have the bandwidth to continue work on this?

HumairAK avatar Aug 06 '24 16:08 HumairAK

@HumairAK will continue on this when I get the chance

isubasinghe avatar Aug 09 '24 05:08 isubasinghe