Allow for NodeSelector as Admin Defined and User Option?
I'm currently evaluating Coder and so far it's great! Definitely beats manually provisioning workspaces.
I had a few questions and some minor issues.
Environment
- Provider: aws-eks
- K8s Version: 1.21
- Coder Helm Version: 1.29.1
In our cluster we use ASGs; for GPUs specifically, we separate them by instance-type size as well as GPU type.
Example
ASG 1: T4-XL
- g4dn.xlarge
- Node Labels: compute-role:gpu, compute-size:xlarge, gpu-type:t4
ASG 2: A10G-XL
- g5.xlarge
- Node Labels: compute-role:gpu, compute-size:xlarge, gpu-type:a10g
ASG 3: Mixed-XL
- g4dn.xlarge
- g5.xlarge
- Node Labels: compute-role:gpu, compute-size:xlarge, gpu-type:mixed
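For reference, pinning a workspace pod to one of these groups uses a standard Kubernetes nodeSelector over those labels; a minimal sketch targeting ASG 1 (the pod name and image are illustrative):
# Pod spec fragment: schedule onto the T4-XL group (ASG 1).
# Label keys/values come from the ASG definitions above.
apiVersion: v1
kind: Pod
metadata:
  name: workspace-example        # hypothetical name
spec:
  nodeSelector:
    compute-role: gpu
    compute-size: xlarge
    gpu-type: t4
  containers:
    - name: workspace
      image: nvidia/cuda:11.4.2-base-ubuntu20.04   # example CUDA image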
Questions:
- Are there any future plans to allow the admin to specify node-selectors/taints based on images? For CUDA enabled images, we would pre-select the node-selectors and taints to ensure that the image gets properly provisioned with a GPU node, rather than a CPU node. (A sketch of the kind of defaulting we have in mind follows this list.)
- Follow-on, would it be possible to allow users to specify the node-selectors/taints when creating workspaces without using a template? (if option is enabled by admin)
- Is there a way to adjust/specify the session-timeout for OIDC? Currently it seems like the limit is 60 mins before refresh kicks in and requires reauth.
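To illustrate the first two questions, this is the kind of per-image scheduling default we have in mind. It's a hypothetical sketch of the desired behavior, not an existing Coder option, and the taint key is an assumption about how the GPU nodes are tainted:
# Hypothetical scheduling defaults an admin could attach to a CUDA image,
# injected into any workspace pod built from it (not a real Coder feature):
nodeSelector:
  compute-role: gpu              # label from the ASG definitions above
tolerations:
  - key: nvidia.com/gpu          # assumed taint key on the GPU nodes
    operator: Exists
    effect: NoSchedule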
Issues:
I was trying to have the node-selector modified by using a template that specified compute-type:gpu, compute-role:coder, but within the provider settings, only compute-role:coder is defined.

However, after testing the template, and subsequently deleting it, several workspaces that were provisioned afterwards retained the nodeSelectors that were defined only in the template itself, rather than sticking strictly with the provider specified one.

In Template Policy, I do have write enabled for node-selector, so I wonder if that's what's causing the issue.
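For clarity, here's the expected vs. observed nodeSelector on the affected workspace pods (label values taken from the template and provider settings above; easiest to check with kubectl get pod -o yaml):
# Expected after deleting the template (provider settings only):
nodeSelector:
  compute-role: coder

# Observed on workspaces provisioned afterwards (template values retained):
nodeSelector:
  compute-type: gpu
  compute-role: coder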
Thanks!
Hi @trisongz. Would encourage you to join us on Slack so we can discuss these in more detail.
Are there any future plans to allow the admin to specify node-selectors/taints based on images? For CUDA enabled images, we would pre-select the node-selectors and taints to ensure that the image gets properly provisioned with a GPU node, rather than a CPU node.
Follow-on, would it be possible to allow users to specify the node-selectors/taints when creating workspaces without using a template? (if option is enabled by admin)
Unfortunately, the answer is no on both accounts. As you may have noticed, a workspace template + template policy allows you to set NodeSelectors at the workspace level. However, a workspace must be created "from template," not "from image."
Is there a way to adjust/specify the session-timeout for OIDC? Currently it seems like the limit is 60 mins before refresh kicks in and requires reauth.
Checking with the team now on this one.
However, after testing the template, and subsequently deleting it, several workspaces that were provisioned afterwards retained the nodeSelectors that were defined only in the template itself, rather than sticking strictly with the provider specified one.
Follow up question: How were you provisioning these additional workspaces? Directly from the image or are these workspaces created from the old template? This may be a bug, but I want to make sure I'm understanding correctly.
On a slightly different note, we are working on Coder v2, which allows an admin to define templates for workspaces using an entirely custom pod spec, including NodeSelectors. It uses Terraform to define templates. It's not ready for production use, but let me know if you're interested in giving feedback and shaping the roadmap. Here's how it would work for the developer:
$ coder workspace create ben1
Choose a template
> Data science 1
  Data science 2
  Frontend development
  Backend development
Creating workspace...
SSH with `coder ssh ben1`
These parameters are admin-defined and have an underlying specification in Terraform.
Is there a way to adjust/specify the session-timeout for OIDC? Currently it seems like the limit is 60 mins before refresh kicks in and requires reauth.
When coder.oidc.enableRefresh is set to true, refresh and expiration intervals are defined by the upstream provider. When the access token expires, we use the refresh token to ensure the user still has access. If your provider doesn't return refresh tokens, this could be the cause of the 60m timeout. It may be necessary to add additional redirect options for your provider to return refresh tokens. For example, Google requires the following:
coderd:
  oidc:
    enableRefresh: true
    redirectOptions:
      access_type: offline
      prompt: consent
Would encourage you to join us on Slack so we can discuss these in more detail.
Requested to join!
Follow up question: How were you provisioning these additional workspaces? Directly from the image or are these workspaces created from the old template? This may be a bug, but I want to make sure I'm understanding correctly.
These new workspaces were provisioned from images only, not templates, as there's currently no in-UI option to select from pre-defined templates within the dashboard (I believe). My theory is that the changes may not have fully persisted in the backend/database before the new workspace was created.
On a slightly different note, we are working on Coder v2 which allows an admin to define templates for workspaces using an entirely custom pod spec, including NodeSelectors. It uses Terraform to define templates. It's not ready for production use, but let me know if you're interested in giving feedback and shaping the roadmap. Here's how it would work for the developer
Would be more than happy to!
When coder.oidc.enableRefresh is set to true, refresh and expiration intervals are defined by the upstream provider. When the access token expires, we use the refresh token to ensure the user still has access. If your provider doesn't return refresh tokens, this could be the cause of the 60m timeout. It may be necessary to add additional redirect options for your provider to return refresh tokens.
Will update the helm specs with this and follow up if the behavior persists.
Another bug we found relates to GPU nodes.
When the admin options Enable Caching and Enable auto loading of 'shiftfs' kernel module are both enabled, GPU-based nodes simply won't allow the workspace to access the GPU itself.
The pod has the proper resource allocation and is properly scheduled, but whenever the user goes into the workspace and tries to access the GPU via nvidia-smi, there is no GPU present (behavior present in 1.29-1.30). Disabling these options resolves the issue (something I found out the hard way, unfortunately).
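For context, "proper resource allocation" here means the workspace pod requests the GPU through the standard NVIDIA device-plugin resource, so scheduling succeeds even though the device is invisible inside the workspace; a minimal sketch of the container fragment:
# Workspace container fragment: the GPU is requested via the standard
# NVIDIA device-plugin resource name, so the scheduler places the pod
# on a GPU node even when nvidia-smi later shows no device.
resources:
  limits:
    nvidia.com/gpu: 1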