[docs] - Google Cloud deployment with Helm charts + GKE Workload Identity
Summary
Update the Google Cloud deployment docs to include information about running the Helm chart when using GKE Workload Identity. See the conversation excerpt and comments below for more detail.
Dagster Documentation Gap
This issue was generated from the slack conversation at: https://dagster.slack.com/archives/C01U954MEER/p1639781536357200?thread_ts=1639781536.357200&cid=C01U954MEER
Conversation excerpt
U02QUNEGP0V: Hey all! I'm having an issue deploying on k8s:
Traceback (most recent call last):
  File "/usr/local/bin/dagit", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/dagit/cli.py", line 239, in main
    cli(auto_envvar_prefix="DAGIT") # pylint:disable=E1120
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagit/cli.py", line 119, in ui
    **kwargs,
  File "/usr/local/lib/python3.7/site-packages/dagit/cli.py", line 136, in host_dagit_ui
    with get_instance_for_service("dagit") as instance:
  File "/usr/local/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/site-packages/dagster/cli/utils.py", line 12, in get_instance_for_service
    with DagsterInstance.get() as instance:
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 387, in get
    return DagsterInstance.from_config(dagster_home_path)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 402, in from_config
    return DagsterInstance.from_ref(instance_ref)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 423, in from_ref
    run_launcher=instance_ref.run_launcher,
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/ref.py", line 264, in run_launcher
    return self.run_launcher_data.rehydrate() if self.run_launcher_data else None
  File "/usr/local/lib/python3.7/site-packages/dagster/serdes/config_class.py", line 85, in rehydrate
    return klass.from_config_value(self, result.value)
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/launcher.py", line 209, in from_config_value
    return cls(inst_data=inst_data, **config_value)
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/launcher.py", line 130, in __init__
    kubernetes.config.load_incluster_config()
  File "/usr/local/lib/python3.7/site-packages/kubernetes/config/incluster_config.py", line 121, in load_incluster_config
    try_refresh_token=try_refresh_token).load_and_set(client_configuration)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/config/incluster_config.py", line 54, in load_and_set
    self._load_config()
  File "/usr/local/lib/python3.7/site-packages/kubernetes/config/incluster_config.py", line 73, in _load_config
    raise ConfigException("Service token file does not exist.")
Am I missing a value in values.yaml somewhere?
U016C4E5CP8: Hi Charles - this is when using the helm chart?
U02QUNEGP0V: Hey Daniel! Thanks for responding - yes, that is when we're using the helm chart.
U016C4E5CP8: I see - how did you configure your run launcher in your values.yaml? Looks like this is related to the loadInclusterConfig / kubeconfigFile settings.
U016C4E5CP8: These are the relevant bits from the values.yaml: https://github.com/dagster-io/dagster/blob/master/helm/dagster/values.yaml#L384-L388. What exactly you put here likely depends on your cluster, but your dagit pod needs to be able to load the correct kubeconfig in order to be able to launch pods for runs.
U02QUNEGP0V: Hey Daniel, I just left it as the default K8sRunLauncher; trying to load the in-cluster config is where it fails.
U016C4E5CP8: And just to double-check, those are logs from your dagit pod, I assume?
U02QUNEGP0V: Yes.
U02QUNEGP0V: I am using Workload Identity for GKE. I am wondering if it has to do with this: https://github.com/kubernetes-client/python-base/issues/159 - if this is the case, are there other people trying to use Workload Identity with Dagster?
U016C4E5CP8: Ah, I was just searching around and found a similar error related to GKE Workload Identity (from Airflow, not Dagster, but both systems try to load the cluster config, I think): https://stackoverflow.com/questions/57312376/unable-to-execute-airflow-kubernetesexecutor#comment117873811_57398011
U016C4E5CP8: I'm looking through https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl now to see if there's some way to generate a kubeconfig file that you could use with the kubeconfigFile setting.
U02QUNEGP0V: Hm, I apologize - I'm kind of new to k8s. Will the kubeconfigFile hold the service token details too?
U02QUNEGP0V: I appreciate all the help you're giving me :slightly_smiling_face:
U02QUNEGP0V: Hey Daniel! I figured it out, actually, thanks to the link you gave me.
U02QUNEGP0V: I created the workload identity, and all it took was updating the configuration on the service account.
U02QUNEGP0V:
module "k8s-workload-identity" {
  depends_on = [kubernetes_namespace.namespace]
  source     = "terraform-google-modules/kubernetes-engine/google//modules/workload-identity"
  name       = "${var.namespace_name}-${var.environment}-ksa"
  namespace  = var.namespace_name
  project_id = var.project_id
  # Mounting the service account token is what resolves the
  # "Service token file does not exist" error from load_incluster_config().
  automount_service_account_token = true
}
U016C4E5CP8: Ahhh, great! We should add this to the docs under deploying on Google Cloud.
U016C4E5CP8: <@U018K0G2Y85> docs document how to run the helm chart when using GKE Workload Identity
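For context, the loadInclusterConfig / kubeconfigFile settings discussed above live under runLauncher.config.k8sRunLauncher in the chart's values.yaml. The following is only a rough sketch, with field names taken from the linked values.yaml; the exact layout may differ between chart versions:

runLauncher:
  type: K8sRunLauncher
  config:
    k8sRunLauncher:
      # When true (the default), the launcher calls
      # kubernetes.config.load_incluster_config() inside the dagit pod;
      # that is the call failing in the traceback above when no service
      # account token is mounted.
      loadInclusterConfig: true
      # Alternatively, point at a kubeconfig file mounted into the pod.
      kubeconfigFile: ~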
Message from the maintainers:
Are you looking for the same documentation content? Give it a :thumbsup:. We factor engagement into prioritization.
From Stack Overflow:
In case anyone is still running into "Service token file does not exist" using GKE Workload Identity and the Kubernetes Executor with in_cluster = True: make sure you have both the serviceAccountName specified in the Deployment YAML and automountServiceAccountToken set to True. According to the docs, that last one should default to True, but I didn't see the token volume appear until I set it explicitly. After setting those, everything behaved as expected.
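Putting the Terraform module and the Stack Overflow note together, the fix comes down to two fields on the Kubernetes side. Below is a rough sketch of the service account only; the resource name and the bound GCP service account are hypothetical placeholders, not values from the thread:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: dagster-ksa  # hypothetical name for the Kubernetes service account
  annotations:
    # Workload Identity binding to a (hypothetical) GCP service account.
    iam.gke.io/gcp-service-account: dagster@my-project.iam.gserviceaccount.com
# Mounting the token is what lets load_incluster_config() succeed;
# this mirrors automount_service_account_token = true in the Terraform above.
automountServiceAccountToken: true

The dagit Deployment's pod spec then needs serviceAccountName pointing at this service account and, per the answer above, can also set automountServiceAccountToken: true explicitly.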