Allow for NodeSelector as Admin Defined and User Option?
I'm currently evaluating Coder and so far it's great! Definitely beats manually provisioning workspaces.
I had a few questions and some minor issues.
Environment
- Provider: aws-eks
- K8s Version: 1.21
- Coder Helm Version: 1.29.1
In our cluster we use ASGs; for GPUs specifically, we separate them by instance-type size as well as GPU type.
Example
ASG 1: T4-XL
- g4dn.xlarge
- Node Labels: compute-role:gpu, compute-size:xlarge, gpu-type:t4
ASG 2: A10G-XL
- g5.xlarge
- Node Labels: compute-role:gpu, compute-size:xlarge, gpu-type:a10g
ASG 3: Mixed-XL
- g4dn.xlarge
- g5.xlarge
- Node Labels: compute-role:gpu, compute-size:xlarge, gpu-type:mixed
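For reference, pinning a workspace pod to one of these groups uses a standard Kubernetes nodeSelector over those labels; a minimal sketch targeting ASG 1 (the pod name and image are illustrative):
# Pod spec fragment: schedule onto the T4-XL group (ASG 1).
# Label keys/values come from the ASG definitions above.
apiVersion: v1
kind: Pod
metadata:
  name: workspace-example        # hypothetical name
spec:
  nodeSelector:
    compute-role: gpu
    compute-size: xlarge
    gpu-type: t4
  containers:
    - name: workspace
      image: nvidia/cuda:11.4.2-base-ubuntu20.04   # example CUDA image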
Questions:
- Are there any future plans to allow the admin to specify node-selectors/taints based on images? For CUDA enabled images, we would pre-select the node-selectors and taints to ensure that the image gets properly provisioned with a GPU node, rather than a CPU node. (A sketch of the kind of defaulting we have in mind follows this list.)
- Follow-on, would it be possible to allow users to specify the node-selectors/taints when creating workspaces without using a template? (if option is enabled by admin)
- Is there a way to adjust/specify the session-timeout for OIDC? Currently it seems like the limit is 60 mins before refresh kicks in and requires reauth.
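To illustrate the first two questions, this is the kind of per-image scheduling default we have in mind. It's a hypothetical sketch of the desired behavior, not an existing Coder option, and the taint key is an assumption about how the GPU nodes are tainted:
# Hypothetical scheduling defaults an admin could attach to a CUDA image,
# injected into any workspace pod built from it (not a real Coder feature):
nodeSelector:
  compute-role: gpu              # label from the ASG definitions above
tolerations:
  - key: nvidia.com/gpu          # assumed taint key on the GPU nodes
    operator: Exists
    effect: NoSchedule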
Issues:
I was trying to have the node-selector modified by using a template that specified compute-type:gpu, compute-role:coder, but within the provider settings, only compute-role:coder is defined.

However, after testing the template, and subsequently deleting it, several workspaces that were provisioned afterwards retained the nodeSelectors that were defined only in the template itself, rather than sticking strictly with the provider specified one.

In Template Policy, I do have write enabled for node-selector, so I wonder if that's what's causing the issue.
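For clarity, here's the expected vs. observed nodeSelector on the affected workspace pods (label values taken from the template and provider settings above; easiest to check with kubectl get pod -o yaml):
# Expected after deleting the template (provider settings only):
nodeSelector:
  compute-role: coder

# Observed on workspaces provisioned afterwards (template values retained):
nodeSelector:
  compute-type: gpu
  compute-role: coder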
Thanks!
Hi @trisongz. Would encourage you to join us on Slack so we can discuss these in more detail.
Are there any future plans to allow the admin to specify node-selectors/taints based on images? For CUDA enabled images, we would pre-select the node-selectors and taints to ensure that the image gets properly provisioned with a GPU node, rather than a CPU node.
Follow-on, would it be possible to allow users to specify the node-selectors/taints when creating workspaces without using a template? (if option is enabled by admin)
Unfortunately, the answer is no on both accounts. As you may have noticed, a workspace template + template policy allows you to set NodeSelectors at the workspace level. However, a workspace must be created "from template," not "from image."
Is there a way to adjust/specify the session-timeout for OIDC? Currently it seems like the limit is 60 mins before refresh kicks in and requires reauth.
Checking with the team now on this one.
However, after testing the template, and subsequently deleting it, several workspaces that were provisioned afterwards retained the nodeSelectors that were defined only in the template itself, rather than sticking strictly with the provider specified one.
Follow up question: How were you provisioning these additional workspaces? Directly from the image or are these workspaces created from the old template? This may be a bug, but I want to make sure I'm understanding correctly.
On a slightly different note, we are working on Coder v2, which allows an admin to define templates for workspaces using an entirely custom pod spec, including NodeSelectors. It uses Terraform to define templates. It's not ready for production use, but let me know if you're interested in giving feedback and shaping the roadmap. Here's how it would work for the developer:
$ coder workspace create ben1
Choose a template
> Data science 1
  Data science 2
  Frontend development
  Backend development
Creating workspace...
SSH with `coder ssh ben1`
These parameters are admin-defined and have an underlying specification in Terraform.
Is there a way to adjust/specify the session-timeout for OIDC? Currently it seems like the limit is 60 mins before refresh kicks in and requires reauth.
When coder.oidc.enableRefresh is set to true, refresh and expiration intervals are defined by the upstream provider. When the access token expires, we use the refresh token to ensure the user still has access. If your provider doesn't return refresh tokens, this could be the cause of the 60m timeout. It may be necessary to add additional redirect options for your provider to return refresh tokens. For example, Google requires the following:
coderd:
  oidc:
    enableRefresh: true
    redirectOptions:
      access_type: offline
      prompt: consent
Would encourage you to join us on Slack so we can discuss these in more detail.
Requested to join!
Follow up question: How were you provisioning these additional workspaces? Directly from the image or are these workspaces created from the old template? This may be a bug, but I want to make sure I'm understanding correctly.
These new workspaces were provisioned from images only, not templates, as there's currently no in-UI option to select from pre-defined templates within the dashboard (I believe). My theory is that the changes may not have fully persisted in the backend/database before the new workspace was created.
On a slightly different note, we are working on Coder v2 which allows an admin to define templates for workspaces using an entirely custom pod spec, including NodeSelectors. It uses Terraform to define templates. It's not ready for production use, but let me know if you're interested in giving feedback and shaping the roadmap. Here's how it would work for the developer
Would be more than happy to!
When coder.oidc.enableRefresh is set to true, refresh and expiration intervals are defined by the upstream provider. When the access token expires, we use the refresh token to ensure the user still has access. If your provider doesn't return refresh tokens, this could be the cause of the 60m timeout. It may be necessary to add additional redirect options for your provider to return refresh tokens.
Will update the helm specs with this and follow up if the behavior persists.
Another bug we found relates to GPU nodes.
When the admin options Enable Caching and Enable auto loading of 'shiftfs' kernel module are both enabled, GPU-based nodes simply won't allow the workspace to access the GPU itself.
The pod has the proper resource allocation and is properly scheduled, but whenever the user goes into the workspace and tries to access the GPU via nvidia-smi, there is no GPU present (behavior present in 1.29-1.30). Disabling these options resolves the issue (something I found out the hard way, unfortunately).
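For context, "proper resource allocation" here means the workspace pod requests the GPU through the standard NVIDIA device-plugin resource, so scheduling succeeds even though the device is invisible inside the workspace; a minimal sketch of the container fragment:
# Workspace container fragment: the GPU is requested via the standard
# NVIDIA device-plugin resource name, so the scheduler places the pod
# on a GPU node even when nvidia-smi later shows no device.
resources:
  limits:
    nvidia.com/gpu: 1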