
Support for Windows AMI family

Open jodafm opened this issue 5 years ago • 5 comments

Hi, do you plan to support node groups using a "windows server" AMI family?

jodafm avatar Aug 07 '20 20:08 jodafm

It seems that there's nothing specifically preventing the Cluster type from having windows nodes; it's the NodeGroup type that has linux behavior hard-coded, most notably the user data for the launch config used to join the cluster (a hard-coded bash script). It would be really useful if it were possible to override the launch config's user data; with that, I think it would be possible (not saying trivial though) to add a windows node group.

gunniwho avatar Aug 18 '20 09:08 gunniwho

To support Windows node groups we'll have to refactor how the cluster and nodegroup classes function.

Specifically, we'll need a cluster-level toggle indicating that Windows is in use, so that the cluster and nodegroups can be prepped accordingly: an EKS cluster with Windows support must have both Linux and Windows node groups to function.

Per AWS:

  • The cluster class will need to install the following at provisioning time, along with having openssl and jq available on the client machine to run the scripts:
    • [ ] AWS VPC resource controller (manifest image varies by region)
    • [ ] AWS VPC admission controller webhooks (manifest image varies by region)
      • [ ] Use cert signing shell script to create and submit a CSR to the cluster, approve it through kubectl, and create a Secret from its contents
      • [ ] Use cert bundle patching shell script to patch webhook manifest with the cluster's CA bundle.
      • [ ] Add a new cluster role binding for kube-proxy
      • [ ] Deploy the patched admission webhook
    • [ ] Requires a Windows & Linux hybrid cluster. Specifically, at least one Linux node must be available to run system Pods like coredns and the VPC resource controller.
  • The nodegroup class will need to adjust and account for:
    • [ ] Distinct default root disk volume sizes
    • [ ] Distinct default latest AMIs
    • [ ] User data based on PowerShell to configure the Windows nodes
  • Related, but not required:
    • Not all instance types work for both Linux and Windows, though we do not enforce this today for Linux.
    • For more details, see the Linux and Windows CloudFormation stacks for comparison.
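
The cert-signing step in the checklist above can be sketched in code. This is an illustration, not the pulumi-eks implementation: it builds the CertificateSigningRequest manifest that AWS's cert-signing shell script submits for the VPC admission webhook's serving certificate. The service and namespace names are assumptions based on the AWS walkthrough, and the base64-encoded CSR itself would be produced out of band with openssl.

```typescript
// Sketch: build the CSR manifest that gets submitted to the cluster and then
// approved via kubectl. The resulting signed cert is stored in a Secret that
// the admission webhook Deployment mounts.
function buildWebhookCsrManifest(service: string, namespace: string, base64Csr: string) {
  return {
    apiVersion: "certificates.k8s.io/v1beta1", // "certificates.k8s.io/v1" on newer clusters
    kind: "CertificateSigningRequest",
    metadata: { name: `${service}.${namespace}` },
    spec: {
      request: base64Csr, // base64-encoded PEM CSR produced with openssl
      usages: ["digital signature", "key encipherment", "server auth"],
    },
  };
}
```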

metral avatar Aug 19 '20 23:08 metral

This looks accurate. I have been able to do all this using pulumi. The only thing that I had to do manually was to create the CSR and approve it. I then supplied the resulting key pair as secret config to my pulumi program. The following is what I did using pulumi:

  • created the required secret for the VPC admission webhook using the key pair config
  • created the VPC resource controller and admission webhook by downloading the correct yaml files for my region from AWS, patching them and provisioning them using ConfigFile instances
  • created the required cluster role binding
  • created roles and instance profiles for both my linux and windows node groups
  • created the Cluster instance
    • skipDefaultNodeGroup: true
    • roleMappings: [ { roleArn: linuxInstanceRole.apply((r) => r.arn), username: "system:node:{{EC2PrivateDNSName}}", groups: ["system:bootstrappers", "system:nodes"], }, { roleArn: windowsInstanceRole.apply((r) => r.arn), username: "system:node:{{EC2PrivateDNSName}}", groups: ["system:bootstrappers", "system:nodes", "eks:kube-proxy-windows"], }, ]
  • created a node security group using createNodeGroupSecurityGroup from the eks package
  • created a new WindowsNodeGroup class that's a copy of the pulumi supplied NodeGroup class except
    • it has the required powershell user data
    • it requires the AMI to be supplied via args (no lookup)
  • created one NodeGroup instance for the linux nodes and one WindowsNodeGroup instance for the windows nodes. Each is injected with its own instance profile, but they share the same security group

This is definitely not ideal, but it was required because there was no way for me to override the hard-coded bash script for user data. Had I been able to inject the user data, I could have made this work with pulumi out of the box. It is not a particularly nice experience, but I think it would be relatively easy to supply a nice API on top of the existing APIs to compose the required pieces for a mixed-OS cluster (e.g. a MixedOsCluster class). The only remaining "not-so-nice" requirement would be creating the CSR out of band and supplying the key pair as config. I don't think pulumi can approve CSRs (please correct me if I'm wrong), but doing it out of band also removes the requirement that the client machine have openssl.
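
The roleMappings shape in the steps above can be captured in a small helper. This is a sketch with plain strings; in a real Pulumi program roleArn would be a pulumi.Output<string> (e.g. linuxInstanceRole.arn), and the helper name is illustrative.

```typescript
interface RoleMapping {
  roleArn: string;
  username: string;
  groups: string[];
}

// Sketch: build the aws-auth role mappings for a mixed-OS cluster. The Windows
// mapping additionally carries eks:kube-proxy-windows so kube-proxy can run on
// the Windows nodes.
function buildMixedOsRoleMappings(linuxRoleArn: string, windowsRoleArn: string): RoleMapping[] {
  const username = "system:node:{{EC2PrivateDNSName}}";
  return [
    { roleArn: linuxRoleArn, username, groups: ["system:bootstrappers", "system:nodes"] },
    {
      roleArn: windowsRoleArn,
      username,
      groups: ["system:bootstrappers", "system:nodes", "eks:kube-proxy-windows"],
    },
  ];
}
```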

gunniwho avatar Aug 20 '20 09:08 gunniwho

Thanks for the validation and detailed walkthrough @gunniwho, this insight is very helpful!

We've opened #428 to track overriding the nodegroup user data script as a start toward supporting Windows nodegroups.
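
For reference, the Windows user data that such an override would allow injecting typically just invokes the bootstrap script shipped on EKS-optimized Windows AMIs. A minimal sketch, assuming the standard Start-EKSBootstrap.ps1 location (any extra kubelet flags would be additions to this):

```typescript
// Sketch: generate PowerShell user data that joins a Windows node to the
// cluster. EKS-optimized Windows AMIs ship the bootstrap script under
// %ProgramFiles%\Amazon\EKS.
function windowsUserData(clusterName: string): string {
  return [
    "<powershell>",
    `& "$env:ProgramFiles\\Amazon\\EKS\\Start-EKSBootstrap.ps1" -EKSClusterName "${clusterName}"`,
    "</powershell>",
  ].join("\n");
}
```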

metral avatar Aug 20 '20 20:08 metral

@gunniwho Thanks for detailing these steps. I did roughly the same, but a little differently: I created a linux cluster via pulumi and installed the prerequisites per the AWS documentation (certificates, secrets, etc.). Then, in the next apply, I added a new windows nodegroup with different user data containing a powershell script.

But my windows nodes are not registering with the control plane. My code can be found here: https://github.com/bit-cloner/poke

Would it be possible to share your pulumi code?

Thank you

bit-cloner avatar Mar 11 '21 10:03 bit-cloner