
[WIP] Booting/testing something in CI

ijc opened this issue 7 years ago • 12 comments

Very incomplete right now, will work on it as time allows.

ijc avatar Dec 05 '17 10:12 ijc

This is hitting issues with the linuxkit build getting randomly SIGKILL'd. I think it is being killed by the OOM killer, so I raised https://github.com/moby/tool/pull/191 to reduce the memory overheads. I'm also experimenting with the resource_class setting to see if that helps in the meantime (as a short-term band-aid, although it looks like it will become a paid-only feature before too long).
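
A quick way to confirm the OOM theory (a sketch, assuming we can re-run the build on the same executor or locally under the same memory limit, and that the kernel log is readable) is to check for OOM-killer entries after a killed build:

$ # Look for OOM-killer activity following a build that died with SIGKILL.
$ # The victim's process name (moby/linuxkit/containerd) depends on which step was killed.
$ dmesg | grep -iE 'out of memory|oom-killer|killed process'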

ijc avatar Dec 07 '17 11:12 ijc

Early indications are that medium+ (3 CPUs, 6GB RAM) is insufficient, while large (4 CPUs, 8GB RAM) is enough.

In https://github.com/moby/tool/pull/191 I observed that the initial RAM usage before my changes was 6.7GB, which seems consistent with getting SIGKILLed on a 6GB limit. I also noted that the tar output used more like 2GB, which explains why those builds mostly work, since the default medium has 4GB RAM, although they do still fail occasionally, so perhaps there can be spikes or differences due to other factors like content trust.
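
For reference, a rough way to reproduce the peak-memory figure locally (a sketch; the build command and YAML file name vary with the tool version and checkout, and this needs GNU time rather than the shell builtin):

$ # "Maximum resident set size" in the output is the peak RSS (in kB) of the build and its children.
$ /usr/bin/time -v linuxkit build kube-master.yml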

ijc avatar Dec 07 '17 11:12 ijc

Please sign your commits following these rules: https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work The easiest way to do this is to amend the last commit:

$ git clone -b "ci-boot-something" git@github.com:ijc/linuxkit-kubernetes.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842354248512
editor opens
change each 'pick' to 'edit'
save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

GordonTheTurtle avatar Dec 08 '17 16:12 GordonTheTurtle

Booting using qemu in CI (so no KVM) hits a hardcoded 30min timeout in kubeadm init. In any case 30min is far, far too long (that's on top of a minute or so to boot and another minute for sshd to actually start).
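
For anyone reproducing this locally, the difference between an accelerated and an unaccelerated boot is easy to see with raw qemu (a sketch; the kernel/initrd file names are placeholders for whatever the linuxkit build produces, and the command line is trimmed to the essentials):

$ # With KVM, as on a laptop or bare-metal worker:
$ qemu-system-x86_64 -enable-kvm -m 2048 -kernel kube-master-kernel -initrd kube-master-initrd.img -nographic
$ # Without KVM, as in a CircleCI container - the same boot, fully emulated by TCG, hence the huge slowdown:
$ qemu-system-x86_64 -m 2048 -kernel kube-master-kernel -initrd kube-master-initrd.img -nographic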

ijc avatar Dec 13 '17 14:12 ijc

After #35 gets merged, we will have a way of running e2e tests. Making all tests pass is a separate problem, but it'd be nice to incorporate that into a nightly/weekly job (currently I'm seeing that it takes about 30min on my laptop, but we skip a fair chunk of tests).
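
For context, the kind of filtered run mentioned above might look like this (a sketch; the focus/skip regexes and kubeconfig path are illustrative, not necessarily what #35 uses):

$ # Run only the Conformance subset against an existing cluster, skipping Serial and Slow tests.
$ e2e.test --kubeconfig=$HOME/.kube/config --ginkgo.focus='\[Conformance\]' --ginkgo.skip='\[Serial\]|\[Slow\]'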

errordeveloper avatar Jan 05 '18 15:01 errordeveloper

In any case 30min is far, far too long (that's on top of a minute or so to boot and another minute for sshd to actually start).

I'd expect CircleCI to also time out in a similar time frame, around 30min. Besides that, the timing you've quoted would be unbearable if we were to run even a subset of the e2e suite.

A large VM with nested virtualisation would mean that we don't need to upload anything, and it seems generally easier to start with; but if upload speed and image storage costs are negligible, then linuxkit push + linuxkit run is probably sufficient. The need for clustering would require extra configuration (e.g. a VPC), though, so one beefy box seems easier again and potentially a little easier to port (as KVM seems like the lowest common denominator).
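
The push-then-run flow being weighed up here would look roughly like this (a sketch; the image name, project and bucket are placeholders, and the exact output-format and flag names depend on the linuxkit/moby version in use):

$ # Build a GCP-bootable image, upload it to the project, then boot a VM from it.
$ linuxkit build -format gcp kube-master.yml
$ linuxkit push gcp -project my-ci-project -bucket my-ci-images kube-master.img.tar.gz
$ linuxkit run gcp -project my-ci-project kube-master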

errordeveloper avatar Jan 08 '18 08:01 errordeveloper

...so one beefy box seems easier again and potentially a little easier to port (as KVM seems like the lowest common denominator).

I am mostly thinking of someone building up LinuxKit CI for their own projects, not so much about us moving CI from one place to another.

errordeveloper avatar Jan 08 '18 08:01 errordeveloper

We have no macOS workers in Circle (it seems to be a paid-only feature) so the jobs just hang forever waiting to be assigned.

It's been suggested we could try using the same GCP account as the linuxkit/linuxkit CI and use linuxkit run gcp. We'd need to take care not to leak VMs, though, and to clean up thoroughly.
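
A cleanup pass along these lines could run at the end of every job, plus on a schedule to catch leaks from aborted jobs (a sketch, assuming CI-created instances and images are named with a ci- prefix and live in a single known zone):

$ # Delete any leftover CI instances and images; --quiet skips the confirmation prompts.
$ gcloud compute instances list --filter="name~'^ci-'" --format="value(name)" | xargs -r gcloud compute instances delete --zone europe-west1-d --quiet
$ gcloud compute images list --filter="name~'^ci-'" --format="value(name)" | xargs -r -n1 gcloud compute images delete --quiet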

ijc avatar Jan 15 '18 15:01 ijc

It's been suggested we could try using the same GCP account as the linuxkit/linuxkit CI and use linuxkit run gcp. We'd need to take care not to leak VMs, though, and to clean up thoroughly.

This makes sense. The only concern I have is how long it would take to upload a build from Circle to GCP... We would also have to clean up images, as I'm pretty sure they charge for storing them. The alternative would be to build and run on a VM that supports nested virt. IIRC there is a new instance type on GCP that supports nested virt.

errordeveloper avatar Jan 17 '18 16:01 errordeveloper

@rn is experimenting with the nested virt stuff on the main linuxkit repo now, will wait and see how he gets on.

It might indeed be too much data to upload for each PR; it may turn out better to deploy a VM to build the images, whether we then test them nested or in a new VM altogether.

ijc avatar Jan 17 '18 16:01 ijc

Also, if we use nested virt, it means someone can reproduce the tests locally very easily; we could even provide a generalised script for folks to re-use in private LinuxKit projects. Just reiterating what I've said earlier...

errordeveloper avatar Jan 17 '18 16:01 errordeveloper

https://github.com/linuxkit/linuxkit/pull/2871 has the right runes for enabling nested virt. The image needs to have a "special license" (some string in the Licenses field) and the instance needs to be Haswell or newer.
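
For reference, the "special license" part looks roughly like this with plain gcloud (a sketch; the disk, image and instance names are placeholders, while the enable-vmx licence URL and the --min-cpu-platform flag are GCP's documented mechanism for nested virtualisation):

$ # Create an image carrying the nested-virtualisation licence...
$ gcloud compute images create kube-ci-image --source-disk kube-ci-disk --source-disk-zone europe-west1-d --licenses "https://www.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx"
$ # ...and boot it on a Haswell-or-newer CPU so that /dev/kvm is available inside the guest.
$ gcloud compute instances create kube-ci-vm --image kube-ci-image --min-cpu-platform "Intel Haswell" --zone europe-west1-d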

rn avatar Jan 17 '18 18:01 rn