
CAPI - Define CI & Test approach

Open moshloop opened this issue 6 years ago • 19 comments

We need to be able to run tests on pull requests / periodically, as well as provide an automated build & promotion process for images.

/assign @figo @codenrhoden

moshloop avatar Aug 14 '19 16:08 moshloop

So I've been starting to think about this, and am interested in figuring out next steps.

From the CAPI perspective, we are currently generating images for three providers -- AWS, vSphere, and GCE. Those providers use images built from this repo to deploy their clusters.

So how do we go about doing E2E tests? The only way to really prove out tests is by using the image with its associated provider. In other words, if we build a CAPA image, we need to use CAPA to deploy it to test it -- which will really be testing two things: the CAPA image, and whatever build of CAPA we used to deploy it. They are tightly coupled.

So that makes me wonder what testing we can/should do in this repo as part of CI, versus E2E tests found in the various providers' repos.

I guess the first step, no matter what, is just an automated pipeline for building and publishing the images themselves. After that we can define what/how to test.

So I think that brings up the following questions:

  • Where do we publish the images? (e.g. GCS buckets?)
  • How often do we build/publish?
    • I feel like every PR could be a bit much -- these are large images with full OS installations. Would doing this on a periodic basis be enough?
  • What support matrix do we target?
    • What K8s versions do we need to build for? The latest patch release for n-2?
  • How do we want to handle securing credentials?
    • We'll need AWS keys for AMI building. VMC (most likely) keys for OVA building, etc.

I'm just thinking out loud here to start gathering information.

/cc @figo @timothysc @akutz @detiber ^ feel free to ping anyone else that might be interested.

codenrhoden avatar Sep 10 '19 19:09 codenrhoden

I really like this approach and I think it is a good step in putting a plan in place.

In terms of testing images, there is some level of testing that should be done in this repo outside of E2E testing in CAPI/KOPS/Kubespray, etc.:

  • Basic sanity tests to ensure kubeadm init produces a functional control plane (a rough sketch of such a check follows this list)
  • Reproducibility and bill-of-materials variance. Byte-level repeatable builds are out of scope, but we should have some level of confidence that repeated executions of the same build produce the same result - and, critically, that this holds even with the passage of time: we shouldn't find out that the build for K8s N-2 on OS N-2 is broken when a critical CVE fix needs to be released.
  • Security testing: do the images have any known vulnerabilities? Do they score below a threshold on CIS or similar? @randomvariable I know you were doing some work in this area?
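To make the first bullet concrete, a minimal check could boot the built image, run kubeadm init on it, and poll until the API server answers and the node registers. The sketch below is only an illustration of what such a test might look like - the health checks chosen, the timeout, and the assumption that it runs inside the booted image are mine, not anything this repo does today:

```python
#!/usr/bin/env python3
"""Minimal kubeadm sanity check, intended to run inside a booted test image.

Assumes kubeadm, kubelet, kubectl and a container runtime were installed by
the image build; the health checks, timeout and polling interval are
illustrative only.
"""
import subprocess
import sys
import time

KUBECONFIG = "/etc/kubernetes/admin.conf"  # written by `kubeadm init`


def kubectl(*args):
    return subprocess.run(
        ["kubectl", "--kubeconfig", KUBECONFIG, *args],
        capture_output=True, text=True,
    )


def main():
    # Stand up a single-node control plane from the image under test.
    subprocess.run(["kubeadm", "init"], check=True)

    # Wait up to 10 minutes for the API server to report healthy and the node
    # to register.  (Waiting for the node to be Ready would additionally
    # require installing a CNI plugin, which is out of scope for this check.)
    deadline = time.time() + 600
    while time.time() < deadline:
        healthz = kubectl("get", "--raw", "/healthz")
        nodes = kubectl("get", "nodes", "--no-headers")
        if (healthz.returncode == 0 and healthz.stdout.strip() == "ok"
                and nodes.returncode == 0 and nodes.stdout.strip()):
            print("control plane is up:\n" + nodes.stdout)
            return 0
        time.sleep(15)
    print("timed out waiting for a functional control plane", file=sys.stderr)
    return 1


if __name__ == "__main__":
    sys.exit(main())
```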

Where do we publish the images? (e.g. GCS buckets?)

GitHub releases are a good option as well: they have a limit of 2GB per file, but otherwise offer unlimited storage and bandwidth. They are also easily discoverable - see here

How often do we build/publish?

Every tag and then periodically might be a good middle ground

moshloop avatar Sep 10 '19 20:09 moshloop

Thanks for the feedback, @moshloop!

In terms of testing images, there is some level of testing that should be done in this repo outside of E2E testing in CAPI/KOPS/Kubespray, etc.:

  • Basic sanity tests to ensure kubeadm init produces a functional control plane

Really good point. I've been thinking in such a CAPI-centric way lately that I failed to realize that the images built from here should work with a simple kubeadm init. It's obvious in hindsight, but I still didn't catch that. That would be a great way to test...

  • Reproducibility and bill-of-materials variance. Byte-level repeatable builds are out of scope, but we should have some level of confidence that repeated executions of the same build produce the same result - and, critically, that this holds even with the passage of time: we shouldn't find out that the build for K8s N-2 on OS N-2 is broken when a critical CVE fix needs to be released.

Agreed. I think this comes back to what type of BOM we want to produce, and whether we need to store it for builds over time for future comparison.
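For illustration: if each build emitted a simple manifest of installed packages, comparing two builds for bill-of-materials variance reduces to a small diff. The manifest format below (one tab-separated name/version pair per line, e.g. captured with `dpkg-query -W` during the build) is purely an assumption on my part, not something image-builder emits today:

```python
#!/usr/bin/env python3
r"""Compare two bill-of-materials manifests from separate image builds.

Assumed manifest format: one "name<TAB>version" line per installed package,
e.g. produced during the build by `dpkg-query -W` or
`rpm -qa --qf '%{NAME}\t%{VERSION}\n'`.  This format is illustrative only.
"""
import sys


def load_bom(path):
    packages = {}
    with open(path) as f:
        for line in f:
            name, _, version = line.strip().partition("\t")
            if name:
                packages[name] = version
    return packages


def main(old_path, new_path):
    old, new = load_bom(old_path), load_bom(new_path)
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(n for n in set(old) & set(new) if old[n] != new[n])

    for name in added:
        print(f"added:   {name} {new[name]}")
    for name in removed:
        print(f"removed: {name} {old[name]}")
    for name in changed:
        print(f"changed: {name} {old[name]} -> {new[name]}")

    # A non-empty diff is exactly the "variance" a periodic rebuild job
    # would want to flag (or at least record) over time.
    return 1 if (added or removed or changed) else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```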

codenrhoden avatar Sep 11 '19 14:09 codenrhoden

One test that I've done in the past for images is to validate Node Conformance on an image: https://github.com/kubernetes-sigs/image-builder/tree/master/images/capi/packer/ami#run-the-e2e-node-conformance-tests

We should be able to assume that the general binaries are functional based on the testing that is done for producing them, so I'm not sure we get much value out of kubeadm init, but Node Conformance should tell us that all the required prerequisites for a functional kubelet are present and that things are in a functional state.

detiber avatar Sep 11 '19 19:09 detiber

Where do we publish the images? (e.g. GCS buckets?)

GitHub releases are a good option as well: they have a limit of 2GB per file, but otherwise offer unlimited storage and bandwidth. They are also easily discoverable - see here

What type of images are we talking about here? For public cloud-based images I would expect consumption in a native way rather than requiring import for use.

detiber avatar Sep 11 '19 19:09 detiber

How do we want to handle securing credentials? We'll need AWS keys for AMI building. VMC (most likely) keys for OVA building, etc.

If we can build using prow, then we can leverage the secrets management there as well as a checkout system such as boskos. That may get a bit trickier for vSphere, OpenStack, etc., though.

detiber avatar Sep 11 '19 19:09 detiber

Security testing: do the images have any known vulnerabilities? Do they score below a threshold on CIS or similar? @randomvariable I know you were doing some work in this area?

Are we actually planning on having a real support statement around these images now? Previously we said this would only be a tool for users to be able to build images. For example, the CAPA images are meant for testing and demo purposes only; production users are encouraged to build and manage their own images.

detiber avatar Sep 11 '19 19:09 detiber

validate Node Conformance on an image

Does this not depend on having a working go installation inside the image?

For public cloud-based images I would expect consumption in a native way rather than requiring import for use.

There is an argument to be made that the images should all be produced offline and then imported as AMIs / GCP images / OVAs / QCOWs - it makes producing reproducible builds far simpler and makes it easier to ensure consistency between images across cloud providers.

production users are encouraged to build and manage their own images.

I think the purpose of this project is to ultimately provide production-ready images across all cloud platforms, that can be customized easily without re-inventing the wheel.

moshloop avatar Sep 11 '19 19:09 moshloop

validate Node Conformance on an image

Does this not depend on having a working go installation inside the image?

No, the tests are distributed as binaries. The link I used above basically downloads a tarball, extracts it, and then executes the node conformance tests.
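For illustration, that flow might look roughly like the sketch below. The per-platform test tarball (kubernetes-test-linux-amd64.tar.gz, containing kubernetes/test/bin/e2e_node.test) and the --ginkgo.focus/--ginkgo.skip flags are real, but the default version and the omission of container-runtime/kubelet flags are assumptions; the README linked above is the authoritative procedure.

```python
#!/usr/bin/env python3
"""Fetch the Kubernetes test tarball and run the node conformance suite.

Sketch only: a real run would usually also pass container-runtime and
kubelet related flags appropriate to the image under test.
"""
import subprocess
import sys
import tarfile
import urllib.request

VERSION = sys.argv[1] if len(sys.argv) > 1 else "v1.18.6"  # illustrative default
URL = f"https://dl.k8s.io/{VERSION}/kubernetes-test-linux-amd64.tar.gz"


def main():
    # Download and unpack the test binaries shipped with every Kubernetes release.
    print(f"downloading {URL}")
    urllib.request.urlretrieve(URL, "kubernetes-test.tar.gz")
    with tarfile.open("kubernetes-test.tar.gz") as tar:
        tar.extractall(".")

    # e2e_node.test is a ginkgo-built binary, so it accepts --ginkgo.* flags
    # directly; focus on the NodeConformance-tagged specs only.
    cmd = [
        "./kubernetes/test/bin/e2e_node.test",
        r"--ginkgo.focus=\[NodeConformance\]",
        r"--ginkgo.skip=\[Flaky\]",
    ]
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    sys.exit(main())
```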

There is an argument to be made that the images should all be produced offline and then imported as AMIs / GCP images / OVAs / QCOWs - it makes producing reproducible builds far simpler and makes it easier to ensure consistency between images across cloud providers.

Sure, but that also means that we may need to do things such as modify kernels based on the targeted cloud provider, install per-provider required drivers and agents, and also own other per-cloud tweaks that come for free when starting from an existing image optimized for the target cloud.

production users are encouraged to build and manage their own images.

I think the purpose of this project is to ultimately provide production-ready images across all cloud platforms, that can be customized easily without re-inventing the wheel.

This has been something that the greater Kubernetes project has resisted in the past. I'm not saying that it isn't something we should strive for, but it's something that we'd need to get more than just SIG Cluster Lifecycle on board for.

detiber avatar Sep 11 '19 19:09 detiber

Sure, but that also means that we may need to do things such as modify kernels based on the targeted cloud provider, install per-provider required drivers and agents, and also own other per-cloud tweaks that come for free when starting from an existing image optimized for the target cloud.

On the extremely specific front of drivers, I don't expect there to be any issues. With the exception of proprietary graphics drivers, everything's upstreamed in modern LTS kernels (e.g. Amazon ENA, Xen pvscsi fixes for EBS, Hyper-V paravirtualization, etc.). That doesn't mean there isn't an issue around agents and tweaks, though.

Are we actually planning on having a real support statement around these images now? Previously we said this would only be a tool for users to be able to build images.

Nope. On CIS benchmarks: the ones for OS compliance and Kubernetes compliance can be, and are, treated independently by CIS themselves. I did do some experiments to see if the most extreme aspects of a CIS OS layout could be done, and the answer was yes, but that was more about not excluding the possibility than actually doing it.

randomvariable avatar Sep 12 '19 08:09 randomvariable

Sure, but that also means that we may need to do things such as modify kernels based on the targeted cloud provider, install per-provider required drivers and agents, and also own other per-cloud tweaks that come for free when starting from an existing image optimized for the target cloud.

On the extremely specific front of drivers, I don't expect there to be any issues. With the exception of proprietary graphics drivers, everything's upstreamed in modern LTS kernels (e.g. Amazon ENA, Xen pvscsi fixes for EBS, Hyper-V paravirtualization, etc.). That doesn't mean there isn't an issue around agents and tweaks, though.

This seems to counter the Amazon docs for enabling ENA (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking-ena.html#enable-enhanced-networking-ena-ubuntu), which state that installing the linux-aws package is required (at least on Ubuntu).

To support GPU instances, we'd need to install the NVIDIA driver as well: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html

All of this also doesn't take into account that we are setting the proper kernel parameters for the built images, ensuring that we are running with kernels that have all the appropriate flags set during build time - each of which we could generally get for free from the OS vendors themselves.

I'm not necessarily saying that we shouldn't build the images from scratch; we just need to realize that it also comes with the ownership and maintenance of more than just a single OS installation and the bits we are layering on top of it.
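On the kernel-flags point specifically, one cheap check that could live in this repo (independent of which vendor kernel ends up in the image) is to assert that the kernel config enables the options Kubernetes depends on. The option list in this sketch is a small illustrative subset in the spirit of the classic check-config scripts, not an authoritative requirement list:

```python
#!/usr/bin/env python3
"""Check that the image's kernel was built with options Kubernetes needs.

The REQUIRED list below is a small illustrative subset, not an authoritative
requirement list; the config file locations are the usual distro defaults.
"""
import gzip
import os
import subprocess
import sys

REQUIRED = [
    "CONFIG_NAMESPACES",
    "CONFIG_NET_NS",
    "CONFIG_PID_NS",
    "CONFIG_CGROUPS",
    "CONFIG_CGROUP_CPUACCT",
    "CONFIG_CGROUP_DEVICE",
    "CONFIG_CGROUP_FREEZER",
    "CONFIG_CPUSETS",
    "CONFIG_MEMCG",
    "CONFIG_OVERLAY_FS",
]


def load_kernel_config():
    release = subprocess.run(["uname", "-r"], capture_output=True,
                             text=True, check=True).stdout.strip()
    boot_config = f"/boot/config-{release}"
    if os.path.exists(boot_config):
        with open(boot_config) as f:
            return f.read()
    # Some kernels expose their config at /proc/config.gz instead.
    with gzip.open("/proc/config.gz", "rt") as f:
        return f.read()


def main():
    config = load_kernel_config()
    # Count options set to y or m; "# CONFIG_FOO is not set" lines are skipped.
    enabled = {line.split("=")[0] for line in config.splitlines()
               if "=" in line and not line.startswith("#")}
    missing = [opt for opt in REQUIRED if opt not in enabled]
    for opt in missing:
        print(f"missing or disabled: {opt}", file=sys.stderr)
    return 1 if missing else 0


if __name__ == "__main__":
    sys.exit(main())
```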

detiber avatar Sep 12 '19 13:09 detiber

we are setting the proper kernel parameters for the built images, ensuring that we are running with kernels that have all the appropriate flags set during build time.

I don't think compiling kernels is something anyone wants to do. The OpenStack cloud images released by the distributions include the compiled kernel and bootloaders; the only thing left is to install any drivers and maybe configure cloud-init.

moshloop avatar Sep 12 '19 14:09 moshloop

we are setting the proper kernel parameters for the built images, ensuring that we are running with kernels that have all the appropriate flags set during build time.

I don't think compiling kernels is something anyone wants to do. The OpenStack cloud images released by the distributions include the compiled kernel and bootloaders; the only thing left is to install any drivers and maybe configure cloud-init.

Right, that is exactly my point. The OpenStack cloud images released by distributions handle this, but most distributions do not offer a similar image for other cloud providers outside of the ones published to those cloud providers.

detiber avatar Sep 12 '19 15:09 detiber

which state that installing the linux-aws package is required

I hadn't looked at Ubuntu, but ENA is definitely in the mainline kernel, so I think Canonical are carving up the modules granularly. Fedora, for example, has ENA included in kernel-modules.

We likely can't include the NVIDIA driver in an actual image because of licensing anyway.

All of that is beside the point that we don't want to do anything other than use the recommended distro kernel set for the cloud provider, however that is distributed.

randomvariable avatar Sep 12 '19 17:09 randomvariable

/cc @dims

timothysc avatar Sep 30 '19 19:09 timothysc

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Dec 29 '19 19:12 fejta-bot

/lifecycle frozen

detiber avatar Dec 30 '19 15:12 detiber

This effort is finally underway, and I have a Google doc (edit perms given to SCL group) here: https://docs.google.com/document/d/10okAxlg_Z2sAQ_ocO2_dI-kYSO35oI_KtWpBEzwp5NQ/edit?usp=sharing

The basic idea is to start with "build only", not worrying about publishing images: for PRs that modify relevant code, make sure that we can still build the images. We would start with AWS, Azure, and vSphere.
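As a rough illustration of the "for PRs that modify relevant code" part: the mapping from changed paths to providers can be a handful of patterns, which is essentially what Prow's run_if_changed field gives per job. The directory patterns and make targets in this sketch are assumptions about how such build-only presubmits might be wired up, not the actual job definitions:

```python
#!/usr/bin/env python3
"""Decide which image-build checks a PR should trigger from its changed files.

The path patterns and make targets below are illustrative assumptions, not
actual presubmit definitions; in practice this mapping would normally live in
Prow job configs via run_if_changed rather than in a script.
"""
import re
import subprocess
import sys

# provider -> pattern over changed paths that should trigger its build check
TRIGGERS = {
    "ami": re.compile(r"^images/capi/(packer/ami/|ansible/|Makefile)"),
    "azure": re.compile(r"^images/capi/(packer/azure/|ansible/|Makefile)"),
    "ova": re.compile(r"^images/capi/(packer/ova/|ansible/|Makefile)"),
}


def changed_files(base_ref="origin/master"):
    out = subprocess.run(["git", "diff", "--name-only", base_ref + "...HEAD"],
                         capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line]


def main():
    files = changed_files()
    to_build = sorted(provider for provider, pattern in TRIGGERS.items()
                      if any(pattern.search(f) for f in files))
    if not to_build:
        print("no image-relevant changes; skipping build checks")
        return 0
    for provider in to_build:
        # Placeholder target names, not the repo's actual Makefile targets.
        print(f"would run: make build-{provider}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```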

codenrhoden avatar Jul 30 '20 14:07 codenrhoden

@codenrhoden we should probably add GCE to the list as well, since there is a release-informing job that uses image-builder for cluster-api-provider-gcp

detiber avatar Jul 30 '20 14:07 detiber