
Native Kubernetes Support

Open · anbraten opened this issue 3 years ago · 28 comments

Try to rebase #23

#9

TODO

  • [x] add docs
  • [x] update helm charts

TEST it NOW

anbraten avatar Nov 27 '21 13:11 anbraten

Is there a way to specify the storageClass, and maybe even the size, via the backend config?

Now this is very pedantic, but it might make sense to alias the import as corev1 instead of v1.
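
To illustrate the aliasing point (this snippet is not from the PR diff, just a sketch of the convention):

```go
package kubernetes

import (
	corev1 "k8s.io/api/core/v1"
)

// Aliasing the import as corev1 avoids the ambiguous name "v1", which could
// just as well refer to apps/v1, batch/v1, or meta/v1 elsewhere in the backend.
var defaultAccessMode = corev1.ReadWriteOnce
```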

dmolik avatar Feb 03 '22 20:02 dmolik

I think proper volume support is one of the biggest challenges we have to solve in this PR. We should allow specifying volume details. Do you know if there is some kind of Kubernetes storage, provided by most Kubernetes installations, that allows us to mount the same volume into multiple pods?

anbraten avatar Feb 03 '22 22:02 anbraten

@anbraten It depends: do you want a ReadOnly or a ReadWrite mount? There are ReadOnlyMany (ROX) and ReadWriteMany (RWX) access modes that can be set on a PersistentVolumeClaim; see https://stackoverflow.com/a/62545427
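
As a rough sketch of what creating such an RWX claim through client-go could look like (not code from this PR; the namespace, claim name, storage class, and size are placeholders, and the field types assume a client-go version contemporary with this discussion):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the agent runs inside the cluster.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	storageClass := "nfs-csi" // placeholder: any class that supports RWX
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "wp-build-1234", Namespace: "woodpecker"},
		Spec: corev1.PersistentVolumeClaimSpec{
			// ReadWriteMany lets every step pod of a pipeline mount the same workspace.
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteMany},
			StorageClassName: &storageClass,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("10Gi"),
				},
			},
		},
	}

	if _, err := client.CoreV1().PersistentVolumeClaims("woodpecker").Create(context.Background(), pvc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("created shared RWX workspace claim")
}
```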

616b2f avatar Feb 04 '22 14:02 616b2f

I think most deployments only support RWO (ReadWriteOnce); I'm thinking of AWS EBS volumes and Rook/Ceph. I don't think we should expect or require anything more complicated.

Pipelines can be mostly serial, so that should be enough.

Perhaps with parallel stages, jobs could be schlepped into a "mega" pod.

dmolik avatar Feb 04 '22 14:02 dmolik

@dmolik It may be simpler to start with RWX first, as there are some options out there:

  • AWS with the Amazon EFS CSI driver, see https://aws.amazon.com/de/premiumsupport/knowledge-center/eks-persistent-storage/
  • Google with the Filestore CSI driver, see https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/filestore-csi-driver
  • Azure with Azure Files, see https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv
  • For on-prem: Rook supports RWX via CephFS, see https://github.com/ceph/ceph-csi and https://github.com/rook/rook/issues/543#issuecomment-388060949
  • Or use the NFS CSI driver, see https://github.com/kubernetes-csi/csi-driver-nfs/blob/9811fe4c6fa00169b4f80832cd807c0203fa0059/deploy/example/pvc-nfs-csi-dynamic.yaml

However, I am not sure how performant they are, but in my opinion it is simpler to start with them than to use workarounds with RWO. The approach you described can still be explored afterwards if there are issues with the RWX approach.

What do you think?
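
For illustration, a minimal sketch of how each step pod could then reference one shared RWX claim; the claim name, volume name, and mount path are assumptions, not this PR's actual layout:

```go
package kubernetes

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// stepPod builds a pod for a single pipeline step that mounts the shared
// workspace claim. Every step pod of the same build references the same
// RWX-backed PersistentVolumeClaim, so they all see the same /woodpecker tree.
func stepPod(name, image, claimName string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  name,
				Image: image,
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "workspace",
					MountPath: "/woodpecker",
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "workspace",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
						ClaimName: claimName,
					},
				},
			}},
		},
	}
}
```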

616b2f avatar Feb 06 '22 19:02 616b2f

You can, in general, use a network drive. https://kubernetes.io/docs/concepts/storage/volumes/#nfs

But these are sometimes problematic (usually when they fail on unmount) and should be used with care.

LamaAni avatar Feb 07 '22 22:02 LamaAni

Personally I don't think that sticking to network drives is a good idea. We're mostly using Kubernetes with local storage, because it is much faster.

kvaster avatar Feb 11 '22 10:02 kvaster

So I think we have to make that configurable ...

6543 avatar Feb 11 '22 10:02 6543

Personally I don't think that sticking to network drives is a good idea. We're mostly using Kubernetes with local storage, because it is much faster.

@kvaster This does not work for data that has to be shared across pods (because they can be located on different nodes), and even if you host them all on one node, this approach does not scale well and is not really fault tolerant: if your node dies, all your workload and data die with it. But maybe I misunderstood you, so please correct me if I am wrong.

616b2f avatar Feb 11 '22 11:02 616b2f

Existing Kubernetes deployments are just that: dind agents that run parallel jobs all on the same node and don't have network RWX volumes either.

dmolik avatar Feb 11 '22 11:02 dmolik

Generally there can be two types of volumes: first, a single PVC shared by all pods created during the build, which can be RWX (and network-backed); second, a PVC per pod.
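
A small, hypothetical sketch just to spell out those two options (the type, constants, and helper are illustrative, not part of this PR):

```go
package kubernetes

import "fmt"

// VolumeScope selects how workspace claims are created for a build.
type VolumeScope int

const (
	// ScopeBuild: one RWX (possibly network-backed) claim shared by every
	// step pod of the build.
	ScopeBuild VolumeScope = iota
	// ScopePod: one claim per step pod, which also works with RWO-only
	// storage such as local disks or EBS.
	ScopePod
)

// claimName returns the PVC name a step pod should mount, depending on scope.
func claimName(scope VolumeScope, buildID, stepName string) string {
	if scope == ScopeBuild {
		return fmt.Sprintf("wp-build-%s", buildID)
	}
	return fmt.Sprintf("wp-build-%s-%s", buildID, stepName)
}
```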

kvaster avatar Feb 12 '22 16:02 kvaster

@kvaster This does not work for data that has to be shared across pods (because they can be located on different nodes), and even if you host them all on one node, this approach does not scale well and is not really fault tolerant: if your node dies, all your workload and data die with it. But maybe I misunderstood you, so please correct me if I am wrong.

I think this is always a question of cache, bandwidth, disk speed, etc. It is good to have flexibility. Sometimes it is much better to use local disks while building, with some kind of cache. This may also be combined with a network drive for sharing some of the data while building artifacts in parallel.

kvaster avatar Feb 13 '22 11:02 kvaster

Can I inquire about the progress on this?

xuecanlong avatar Mar 31 '22 10:03 xuecanlong

Focus is on #784 - after that one gets in, we can move forward to native by reusing code and refactoring more ... the mentioned pull is not merged yet as it touches code outside of pipeline/backend/*, so those changes have to be ported and merged separately first.

6543 avatar Mar 31 '22 10:03 6543

To be fair, I am not 100% sure we should take #784 over this one, as I don't see the point in using kubectl (it doesn't work for scratch images, rpm, deb packages, etc. out of the box). My suggestion would be to migrate all ideas/pieces from #784 into this one.

anbraten avatar Mar 31 '22 10:03 anbraten

To be fair, I am not 100% sure we should take #784 over this one, as I don't see the point in using kubectl (it doesn't work for scratch images, rpm, deb packages, etc. out of the box). My suggestion would be to migrate all ideas/pieces from #784 into this one.

I agree with you: if we want to support k8s, using the k8s SDK is good. kubectl could still be used for the local type.

xuecanlong avatar Mar 31 '22 10:03 xuecanlong

To be fair, I am not 100% sure we should take #784 over this one, as I don't see the point in using kubectl (it doesn't work for scratch images, rpm, deb packages, etc. out of the box). My suggestion would be to migrate all ideas/pieces from #784 into this one.

That would be even better - it just needs a lot more work ...

6543 avatar Mar 31 '22 10:03 6543

If you have a look at the code, it's not too different. We already have most of the general functions of the kubectl backend in this PR as well.

anbraten avatar Mar 31 '22 11:03 anbraten

If you have a look at the code, it's not too different. We already have most of the general functions of the kubectl backend in this PR as well.

I can't wait for it to be released.

xuecanlong avatar Mar 31 '22 11:03 xuecanlong

Hi, the reason I used kubectl is the simplicity of implementation (apply, delete). That said, by changing the functions in "client.go" you can use the Kubernetes Go API; all the interactions with Kubernetes are only there. Sadly, I do not have the time to do that.

We have been using the kubectl backend in development for about a month now with only small issues (not detecting event-based crashes is the worst one; it only detects BackOff for now). Once I have a fuller list I'll update these errors.
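
For what catching more of those event-based failures could look like via the Kubernetes Go API, here is a sketch (not the kubectl backend's actual code; the helper and the set of fatal reasons are assumptions for illustration):

```go
package kubernetes

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// fatalReasons are event reasons that should fail a step immediately instead
// of waiting for a BackOff; the exact set here is only an example.
var fatalReasons = map[string]bool{
	"Failed":                 true,
	"FailedScheduling":       true,
	"FailedMount":            true,
	"FailedCreatePodSandBox": true,
}

// checkPodEvents lists the events attached to a step pod and reports the
// first fatal one, if any.
func checkPodEvents(ctx context.Context, client kubernetes.Interface, namespace, pod string) error {
	events, err := client.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("involvedObject.name=%s", pod),
	})
	if err != nil {
		return err
	}
	for _, ev := range events.Items {
		if fatalReasons[ev.Reason] {
			return fmt.Errorf("step pod %s failed: %s: %s", pod, ev.Reason, ev.Message)
		}
	}
	return nil
}
```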

@xuecanlong I can safely say that we are now in beta with it, and creating an image that uses it is very simple.

LamaAni avatar Mar 31 '22 13:03 LamaAni

Agreed, using kubectl is faster for operating the cluster, but not convenient for complex operations.

xuecanlong avatar Mar 31 '22 23:03 xuecanlong

Codecov Report

Merging #552 (bf3a919) into master (dbbd369) will decrease coverage by 0.02%. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #552      +/-   ##
==========================================
- Coverage   49.61%   49.59%   -0.02%     
==========================================
  Files          86       86              
  Lines        6553     6555       +2     
==========================================
  Hits         3251     3251              
- Misses       3111     3113       +2     
  Partials      191      191              
Impacted Files       Coverage Δ
cmd/agent/agent.go   0.00% <0.00%> (ø)


codecov-commenter avatar Aug 14 '22 18:08 codecov-commenter

Docs are missing

6543 avatar Aug 14 '22 19:08 6543

Deployment of preview was successful: https://woodpecker-ci-woodpecker-pr-552.surge.sh

woodpecker-bot avatar Aug 16 '22 06:08 woodpecker-bot

This PR should be mostly done now. Some pipeline features are still missing, but I will add them from time to time. I will add the currently open points to https://github.com/woodpecker-ci/woodpecker/issues/9#issuecomment-483979755.

anbraten avatar Aug 16 '22 06:08 anbraten

Otherwise, as a starting point ... we can merge as soon as you call it "ready for review".

6543 avatar Aug 16 '22 07:08 6543

Not sure how to do this properly since this in and of itself is a PR, but I created a PR for this branch to implement a setting for using RWO access mode: https://github.com/anbraten/woodpecker/pull/3

Rynoxx avatar Aug 18 '22 21:08 Rynoxx

https://github.com/anbraten/woodpecker/pull/4 needs a merge

6543 avatar Aug 20 '22 14:08 6543

@anbraten Hope you are fine with how it's done as a ContextKey for now ...

I might refactor how backends get added and also add a new interface we can use for backend-specific config, but that can be done later.

6543 avatar Sep 04 '22 22:09 6543

Been testing this for several weeks; also just built and deployed locally after the last push to this PR, and things are still working. So, FWIW: Tested-by: Stijn Tintel [email protected]

stintel avatar Sep 04 '22 23:09 stintel