Ensure single source of truth for generated values.

isubasinghe opened this issue 2 years ago · 18 comments

Summary

We need to ensure that when we generate facts, say for example the pod name, we only do so once.

Use Cases

This would significantly reduce the number of bugs we have. For example, when investigating this issue: https://github.com/argoproj/argo-workflows/issues/10124

Something I noticed was that the output from ./argo get -json and from the UI differed slightly.

I've had a similar problem with a Mutex issue linked here.

Generating these facts every time we need them has been quite problematic in my opinion, and we should refactor the codebase to eliminate this behaviour. I believe doing so will eliminate an entire class of bugs from our codebase.

isubasinghe avatar Dec 21 '22 01:12 isubasinghe

@isubasinghe can we have a holistic approach to solving the POD_NAME issue for all scenarios? Could you propose some solutions?

sarabala1979 avatar Jun 15 '23 16:06 sarabala1979

Related issues https://github.com/argoproj/argo-workflows/issues/7646, https://github.com/argoproj/argo-workflows/issues/9906, https://github.com/argoproj/argo-workflows/issues/10107, https://github.com/argoproj/argo-workflows/issues/10124, https://github.com/argoproj/argo-workflows/issues/11250

terrytangyuan avatar Jun 22 '23 00:06 terrytangyuan

@sarabala1979 we really need some kind of data store; configmaps can be used, but they obviously have the 1MB limit imposed by etcd.

Even dealing with the configmap limit is still a better option than what we are doing now, because there are quite a lot of bugs relating to the node names alone.

Then there is also the mutex issue, which is ultimately the same problem.

Either we make MySQL or Postgres mandatory and rely on that to store everything, or we use something like https://github.com/dgraph-io/badger running as an embedded data store on the controller side.

Given the way badger works and the way our workflows work, I do not foresee any scaling problems running this as an embedded datastore.

I personally like the badger option because it doesn't require any architectural changes or new requirements, and I am fairly experienced with these key-value datastores, having used RocksDB in the past when building a custom database.

That is my conclusion: use badger or (postgres|mysql) and store everything we need in it.

I do not think patching these bugs as they appear is a good solution; we need to extinguish this problem at the source.
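
To make the idea concrete, here is a rough sketch of storing a generated value in an embedded badger store (assuming the badger v4 API; the path, key layout and values are placeholders, not actual Argo code):

    package main

    import (
        "log"

        badger "github.com/dgraph-io/badger/v4"
    )

    func main() {
        // Open an embedded badger store in the controller's data directory
        // (the path is a placeholder).
        db, err := badger.Open(badger.DefaultOptions("/var/run/argo/kv"))
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Write a generated value exactly once, e.g. a pod name keyed by
        // workflow and node ID (key layout is hypothetical).
        key := []byte("wf/my-workflow/node/my-node/podName")
        if err := db.Update(func(txn *badger.Txn) error {
            return txn.Set(key, []byte("my-workflow-template-1234567890"))
        }); err != nil {
            log.Fatal(err)
        }

        // Every later consumer reads the stored value instead of re-deriving it.
        if err := db.View(func(txn *badger.Txn) error {
            item, err := txn.Get(key)
            if err != nil {
                return err
            }
            val, err := item.ValueCopy(nil)
            if err != nil {
                return err
            }
            log.Printf("pod name: %s", val)
            return nil
        }); err != nil {
            log.Fatal(err)
        }
    }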

isubasinghe avatar Jun 22 '23 01:06 isubasinghe

Also pinging @terrytangyuan and @juliev0 so that they can have a read.

isubasinghe avatar Jun 22 '23 01:06 isubasinghe

Actually, correct me if I am wrong about Postgres/MySQL not being mandatory; if it is mandatory, it makes more sense to just use that.

isubasinghe avatar Jun 22 '23 01:06 isubasinghe

Then there is also the mutex issue, which is ultimately the same problem.

Can you summarize the Mutex issues? Is the node ID/name mismatch causing the issue, or is it something else? I believe that if we fix the node ID/name issue, that will fix all the mutex and status mismatch issues.

sarabala1979 avatar Jun 22 '23 16:06 sarabala1979

Actually, correct me if I am wrong about Postgres/MySQL not being mandatory; if it is mandatory, it makes more sense to just use that.

It's only required when persistence/workflow archiving is enabled.

Then there is also the mutex issue, which is ultimately the same problem.

Could you clarify what you meant by this? Any context?

I personally like the badger option because it doesn't require any architectural changes or new requirements, and I am fairly experienced with these key-value datastores, having used RocksDB in the past when building a custom database.

We should stay away from any external dependencies like DB unless really necessary and leverage any cloud-native solutions whenever possible.

terrytangyuan avatar Jun 23 '23 01:06 terrytangyuan

So, this is related to re-generating information over and over, right? Is there a reason we prefer a ConfigMap over using the Workflow Status? Is it because the information is too verbose and more likely to hit the size limit?

juliev0 avatar Jun 23 '23 18:06 juliev0

@sarabala1979 this gives an example of why mutex release fails: https://github.com/argoproj/argo-workflows/blob/49865b1783b481ba7600441999559821d1a03a18/docs/proposals/mutex-improvements.md

I do not think the fix is that easy: the information available at acquire time is different from the information available at release time, which is why two functions exist for getting the key. The two functions happen to generate different keys, so you end up with the bug.

We can probably use a configmap for this particular issue because the keys aren't going to be large, and they will be released after the workflow is completed. Additionally, a workflow being paused because it is waiting for space in the configmap is probably a better scenario than the mutex not really working.

Also see the linked doc, @terrytangyuan; it shows an example of the mutex failing and should be easy to understand.
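
To illustrate the shape of the bug (this is not the actual Argo code, just a sketch of the pattern): the key is re-derived from different inputs at acquire and release time, so the two derivations can drift apart, whereas a stored key cannot:

    package main

    import "fmt"

    // Sketch of the bug class: the lock key is re-derived at acquire time and
    // at release time from different inputs, so the two derivations can
    // disagree and the release silently fails.
    func acquireKey(namespace, mutexName string) string {
        return fmt.Sprintf("%s/Mutex/%s", namespace, mutexName)
    }

    func releaseKey(holderName, mutexName string) string {
        // Different information is available here, so a different key comes out.
        return fmt.Sprintf("%s/Mutex/%s", holderName, mutexName)
    }

    func main() {
        acquired := acquireKey("argo", "my-mutex")

        // The single-source-of-truth fix argued for in this issue: persist the
        // key that was actually used at acquire time and read it back on release.
        stored := map[string]string{"my-workflow": acquired}

        fmt.Println(releaseKey("my-workflow", "my-mutex") == acquired) // false: derivations drift
        fmt.Println(stored["my-workflow"] == acquired)                 // true: stored value always matches
    }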

isubasinghe avatar Jun 24 '23 01:06 isubasinghe

We should stay away from any external dependencies like DB unless really necessary and leverage any cloud-native solutions whenever possible.

@terrytangyuan I generally agree with this, but I think it is a tradeoff: storing items will give higher confidence than generating them. I prefer the higher confidence, even though I know introducing any sort of persistence on k8s is asking for pain (especially in the context of multi-zone clusters).

I do know a workaround for etcd that would allow us to store items of any size in it, though it will require additional work on my end. We can essentially break a single entry into multiple keys and use etcd directly; this is safe because etcd supports transactional writes/reads. It obviously increases the maintenance cost on our end.
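
Roughly, the chunking idea looks like this (a sketch against the etcd clientv3 API; the key layout and chunk size are made up, and a real implementation would also have to respect etcd's per-transaction operation limit):

    package main

    import (
        "context"
        "fmt"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    // putChunked splits a large value across several keys and writes them in a
    // single etcd transaction, so readers never observe a partial entry. The
    // key layout and chunk size are illustrative only.
    func putChunked(ctx context.Context, cli *clientv3.Client, prefix string, data []byte) error {
        const chunkSize = 512 * 1024
        var ops []clientv3.Op
        for i := 0; len(data) > 0; i++ {
            n := chunkSize
            if n > len(data) {
                n = len(data)
            }
            ops = append(ops, clientv3.OpPut(fmt.Sprintf("%s/chunk/%05d", prefix, i), string(data[:n])))
            data = data[n:]
        }
        // All chunks become visible atomically.
        _, err := cli.Txn(ctx).Then(ops...).Commit()
        return err
    }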

isubasinghe avatar Jun 24 '23 01:06 isubasinghe

@juliev0 Yeah, I believe this is correct: we generate information over and over to avoid dealing with the configmap limitation (1MB).

isubasinghe avatar Jun 24 '23 01:06 isubasinghe

As a concluding remark: I think we can improve the current re-generation of facts and reduce the number of these types of bugs.

However, I do not think we can reduce these bugs to absolute zero; for example, the TypeScript and Go codebases have their own implementations of node names.

We can eliminate this class of bugs from the project entirely if we store the generated values somewhere. To me that is the superior solution, despite it having many downsides as well (especially in the context of multi-zone deployments of k8s).

It is better to tackle the root problem, because I think it will bring greater stability to the project; fixing these sorts of bugs is frankly quite demoralizing (for me at least). And I am not sure the advantages of being stateless are worth the costs, as evidence of which I will point to this issue: https://github.com/argoproj/argo-workflows/issues/8684

isubasinghe avatar Jun 24 '23 01:06 isubasinghe

I agree with this. For pod names, we should store them in status.node[*].podName rather than looking at the annotation.

The pod name can be fixed when the NodeStatus is created. There is only one piece of code that does this, so it will be quite robust.

This has been the cause of a number of (admittedly minor) annoying issues.
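
Something along these lines (a sketch, not the actual Argo types): the field is set once when the NodeStatus is created, and every consumer, including the CLI and UI, reads it back from the status:

    package wfv1

    // Sketch only, not the actual Argo types: PodName is filled in exactly
    // once, when the NodeStatus is created, and never re-derived afterwards.
    type NodeStatus struct {
        ID      string `json:"id"`
        Name    string `json:"name"`
        PodName string `json:"podName,omitempty"`
    }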

alexec avatar Jul 11 '23 07:07 alexec

@JPZ13 do you want to create a separate issue for pod names?

alexec avatar Jul 11 '23 07:07 alexec

@alexec We're going to have @isubasinghe handle it holistically once he gets back from vacation

JPZ13 avatar Jul 11 '23 20:07 JPZ13

Adding my comment here from the other issue:

We use Argo heavily at ZG, and when upgrading to v3.4 of Argo Workflows we noticed the breaking changes this causes beyond the nodeId in the workflow. The v2 naming convention also breaks the upstream Kubernetes HOSTNAME env variable for the pod. For instance, when we get the HOSTNAME in the workflow pod and call the Kubernetes API's read_namespaced_pod with that HOSTNAME, the pod name returned by the Kubernetes API server does not match the pod name in the actual pod's metadata.name. I am not sure if there is some weird concatenation going on that is not persisting to etcd, but the Downward API does not match what is in etcd. I reverted the env var POD_NAMES to v1 and everything works in v3.4. I feel that, with all these bugs, the v2 pod name should be reverted, because the scope of breaking changes extends beyond Argo itself and into Kubernetes.

On another note, I suggest using labels to store pod <-> workflow metadata instead of trying to generate a name that breaks so many scenarios in and out of Argo. Labels would be a cleaner approach and would still be searchable by anyone using Argo, via either the CLI or the UI.
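
For example, anything that needs to map a workflow to its pods could filter on labels rather than reconstructing names. A sketch using client-go (the namespace and workflow name are placeholders; it assumes the workflows.argoproj.io/workflow label that Argo already applies to workflow pods):

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        // In-cluster config, for illustration only.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        clientset, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // Find a workflow's pods by label instead of re-deriving pod names.
        pods, err := clientset.CoreV1().Pods("argo").List(context.TODO(), metav1.ListOptions{
            LabelSelector: "workflows.argoproj.io/workflow=my-workflow",
        })
        if err != nil {
            panic(err)
        }
        for _, p := range pods.Items {
            fmt.Println(p.Name)
        }
    }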

aaron-arellano avatar Aug 21 '23 19:08 aaron-arellano

@aaron-arellano could you post the pod name to host name mappings that you're seeing? I've also sent you an email and copied @isubasinghe. If you would prefer to keep those values private, you can share them with us via email. This will help clue us in to concatenation issues.

JPZ13 avatar Aug 22 '23 20:08 JPZ13

So here is an example of the error I saw with the v2 pod name. You can see in the attached log file that we call the Kubernetes Python client function read_namespaced_pod. In the workflow itself we call the code below:

      # Assumes `os` is imported and self.v1 is a kubernetes.client.CoreV1Api instance.
      current_pod_name = os.environ.get("HOSTNAME", None)
      current_pod_namespace = os.environ.get("POD_NAMESPACE", None)

      if current_pod_name and current_pod_namespace:
          # Look up this pod via the Kubernetes API using its own HOSTNAME.
          return self.v1.read_namespaced_pod(
              namespace=current_pod_namespace, name=current_pod_name
          )

HOSTNAME, as you know, is set by default Kubernetes functionality in the container to the name of the pod it is hosted on (ref: https://kubernetes.io/docs/concepts/containers/container-environment/#container-information). With pod name v1 everything works fine, but with the v2 name this fails with the logs attached. The actual pod name was aip-toleration-integration-test-zlntf-nvidia-tesla-v100-4115184276.

When getting HOSTNAME from within the v2-name pod, we see that the last 3 characters of the real pod name are not there. This consistently happens when running read_namespaced_pod in a workflow pod generated under the v2 name feature. Again, when I revert POD_NAMES in the workflow-controller and argo-server to v1, this scenario works fine.

aaron-arellano avatar Aug 23 '23 19:08 aaron-arellano