
Controlling max parallel jobs per pipeline

Open ysaakpr opened this issue 5 years ago • 38 comments

What is the way to control concurrency? The pipeline has 100 independent steps, but I don't want them all to run at once. I'd also like to adjust the concurrency for different pipeline runs.

ysaakpr avatar May 09 '20 18:05 ysaakpr

There isn't a configuration for this today, but it should be possible if there's demand and the use cases make sense.

In the meantime you can run pipelines in a namespace with a resource limit such that no more than X CPUs are available to tasks, and those over the cap will queue until others finish. If you're just trying to limit the resource footprint this is likely the best way to express the limitation.
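As a sketch of the namespace-level approach above (the namespace name and numbers are made up; note that pods exceeding a ResourceQuota are rejected at admission and retried by their controller, which in effect queues them):

```yaml
# Sketch: cap the total CPU requested by pods in a namespace dedicated
# to pipeline runs, so excess TaskRun pods wait until others finish.
# A LimitRange (see the linked doc) can supply default requests so that
# every pod counts against the quota.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pipeline-cpu-cap
  namespace: ci-pipelines   # assumed: a namespace used only for pipelines
spec:
  hard:
    requests.cpu: "8"       # at most 8 CPUs' worth of pods at once
```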

Can you give more details about why you want to limit concurrency of tasks?

https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/

imjasonh avatar May 09 '20 23:05 imjasonh

Using Kubernetes resource limits is one way, but it's pretty hard to achieve because the limits would have to be set dynamically, based on different aspects. In my pipeline, we load 100 DB dumps into a new database to create a new environment from a seed database instance. All the jobs can run in parallel, but if we do that, the DB instance we're reading from will be choked and hit connection-limit errors.

Controlling concurrency is much needed in a CI/CD system: even when everything could run in parallel, the user should have an option to limit it due to resource limitations/availability.

Note: Currently I have achieved this using our Consul server, using consul-cli to wrap the run script with a shared key and a concurrency limit, e.g. the snippet below:

```shell
consul lock -n ${concurrency} -child-exit-code ${jobkey} "bash $script $@"
```

There are a few problems with this approach:

  1. The pods have already started running, so the total time to completion is the wait time plus the script's runtime
  2. It's not easy to configure via PipelineRun args; I need to modify my script to achieve concurrency control

But the same could be added directly as a feature to Tekton: use a semaphore to cap concurrent jobs per PipelineRun, and create pods only after the locks are acquired.

ysaakpr avatar May 10 '20 07:05 ysaakpr

@ImJasonH I could contribute to this if someone can give me some hints on the code structure and the standards to follow, plus any other technical issues that might conflict with this behaviour.

ysaakpr avatar May 10 '20 14:05 ysaakpr

Thanks, I think this seems like a reasonable addition, and would be happy to help you with it.

First, what kind of API addition are you envisioning? What's the minimum addition we could make that we could extend later if we need to? Is there any precedent in other workflow tools we could borrow/steal from?

Depending on the size of the change, we'd probably ask for a design doc answering those questions, and describing the use case (which you've done above, thanks for that!)

imjasonh avatar May 10 '20 14:05 imjasonh

At first glance, I think a property named concurrency on the PipelineRun and/or Pipeline resource. If defined, it should be a value >= 0: zero would pause the pipeline between runs, and any positive value would set the max number of tasks triggered in parallel. If no value is set, everything that can run in parallel will.

An initial version of this feature could be just a concurrency field in the PipelineRun. I'm not sure of an exactly similar feature we could borrow from, but GitLab has a runner concurrency setting that a user can set per runner.
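A hypothetical sketch of the proposed field (nothing below exists in Tekton's API today; the `concurrency` field and names are illustrative only):

```yaml
# Hypothetical: `concurrency` is the proposed addition, not an
# existing Tekton field.
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: seed-database-run
spec:
  pipelineRef:
    name: seed-database
  concurrency: 10   # proposed: at most 10 TaskRuns in flight at once
```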

ysaakpr avatar May 10 '20 17:05 ysaakpr

One way of handling this is by using a Pod quota for the Namespace.

jlpettersson avatar May 10 '20 19:05 jlpettersson

> One way of handling this is by using a Pod quota for the Namespace.

Of course, Kubernetes resource limits are an option, as you mentioned, but they're not always practical. For example, my namespace isn't used only by Tekton. And from a configuration and usability point of view, a Pod quota is much more complex than a max-parallel value on the PipelineRun.

ysaakpr avatar May 11 '20 03:05 ysaakpr

@ysaakpr Configuring pipeline-wide concurrency limits definitely seems easiest, but I wonder if that's expressive/powerful enough to satisfy more complex use cases. We should explore other options, even if only to be able to dismiss them with well thought out reasons.

Consider a pipeline that does two highly parallelizable things (e.g., initializing a database, then later deploying code to multiple AZs), where each of those parallelizable things has a different concurrency cap -- it might make sense to kick off 1000 init-db tasks at once, max 100 at a time, then later in that same pipeline kick off 10 deploy tasks, max 3 at a time. Configuring concurrency: 100 at the pipeline level wouldn't help limit the second group of tasks. A user could manually configure their pipeline to perform 3 deploy tasks in parallel, then the next 3, etc., but that's exactly the kind of manual configuration we're trying to avoid -- they could already manually configure the pipeline to do 100 init-db tasks in parallel, then the next 100, etc., today, but that's toilsome.

(To be clear, this example isn't reason enough by itself to discount the pipeline-wide concurrency config, but it's worth considering and at least explicitly acknowledging this shortcoming.)

One way to express the different concurrency levels would be to group tasks together, then express concurrency limits per-group. Is that worth the additional config required to express/understand/visualize this grouping? I'm not sure. Would it be possible to support group-wise limits and pipeline-wide limits side-by-side? I truly have nothing to offer but open-ended questions. :)

imjasonh avatar May 11 '20 15:05 imjasonh

@ImJasonH that's a good thought. There are already two other tickets that discuss grouping tasks within a pipeline: https://github.com/tektoncd/pipeline/issues/2592 and https://github.com/tektoncd/pipeline/issues/2586#issuecomment-625941044.

As you mentioned, the idea of concurrency shouldn't be limited to just the pipeline level. I agree that for more complex pipelines, configuring this per group of tasks would be a great feature.

Pipeline-level concurrency would be the max/default; task-group-level concurrency could fine-tune it further.
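To illustrate the two levels being discussed, a purely hypothetical sketch (neither `concurrency` nor `taskGroups` exists in Tekton; the shape is invented for this example, borrowing the init-db/deploy scenario from above):

```yaml
# Hypothetical only: both fields below are proposals, not Tekton API.
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: init-and-deploy
spec:
  concurrency: 100        # proposed pipeline-wide default cap
  taskGroups:             # proposed grouping (see #2592, #2586)
  - name: init-db
    concurrency: 100      # up to 100 of the 1000 init tasks at once
  - name: deploy
    concurrency: 3        # at most 3 of the 10 deploy tasks at once
```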

ysaakpr avatar May 11 '20 16:05 ysaakpr

/kind feature /area api

dibyom avatar May 11 '20 18:05 dibyom

/priority important-longterm

vdemeester avatar May 18 '20 15:05 vdemeester

How can I contribute to this? Is there a discussion forum where I can be part of the design/implementation discussions?

ysaakpr avatar May 20 '20 06:05 ysaakpr

+1 for this feature. I am also looking for something similar and open to contributing to any discussions/design/code.

takirala avatar Jun 09 '20 21:06 takirala

See also https://github.com/tektoncd/pipeline/issues/2828.

holly-cummins avatar Jun 22 '20 20:06 holly-cummins

I think it would not be so difficult to add logic for this.

E.g., right before we create a TaskRun, we could check whether there are fewer than X uncompleted TaskRuns, and otherwise skip creating a new one.

Later, when a TaskRun completes, the PipelineRun reconciles again, and the creation of the TaskRun is re-evaluated.
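The check described above could be sketched like this (names and types are illustrative, not Tekton's actual reconciler code; it only assumes the reconciler knows each TaskRun's completion state and a limit from the spec):

```go
package main

import "fmt"

// taskRunState is a stand-in for the reconciler's view of a TaskRun.
type taskRunState struct {
	name      string
	completed bool
}

// canCreateTaskRun reports whether the reconciler may create another
// TaskRun: true when the number of uncompleted TaskRuns is below the
// limit. A limit of 0 (unset) means no cap, matching the proposal.
func canCreateTaskRun(existing []taskRunState, limit int) bool {
	if limit <= 0 {
		return true
	}
	running := 0
	for _, tr := range existing {
		if !tr.completed {
			running++
		}
	}
	return running < limit
}

func main() {
	states := []taskRunState{
		{"dump-1", true},
		{"dump-2", false},
		{"dump-3", false},
	}
	fmt.Println(canCreateTaskRun(states, 3)) // 2 running, under the cap
	fmt.Println(canCreateTaskRun(states, 2)) // 2 running, at the cap
}
```

When a TaskRun completes, the next reconcile re-runs this check, so queued tasks start without any extra bookkeeping.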

jlpettersson avatar Jun 22 '20 20:06 jlpettersson

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot avatar Aug 14 '20 23:08 tekton-robot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

tekton-robot avatar Aug 14 '20 23:08 tekton-robot

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Aug 14 '20 23:08 tekton-robot

/remove-lifecycle rotten /remove-lifecycle stale /reopen

vdemeester avatar Aug 17 '20 09:08 vdemeester

@vdemeester: Reopened this issue.

In response to this:

/remove-lifecycle rotten /remove-lifecycle stale /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Aug 17 '20 09:08 tekton-robot

I want to add the following use case. It may be covered above, but I want to make clear that it's an important part of our CI/CD.

Say we have a task that is used in many pipelines, and that task has a concurrency limit. It would be great to honor that limit across concurrent PipelineRuns.

ibotty avatar Oct 26 '20 11:10 ibotty

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Jan 24 '21 12:01 tekton-robot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot avatar Feb 23 '21 13:02 tekton-robot

/remove-lifecycle rotten

afrittoli avatar Mar 09 '21 17:03 afrittoli

Controlling max taskruns: https://github.com/tektoncd/pipeline/issues/3796

afrittoli avatar Apr 06 '21 16:04 afrittoli

Related issue in experimental: https://github.com/tektoncd/experimental/issues/699

afrittoli avatar Apr 06 '21 16:04 afrittoli

Related approval task issue: https://github.com/tektoncd/experimental/issues/728

afrittoli avatar Apr 06 '21 16:04 afrittoli

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Jul 05 '21 17:07 tekton-robot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot avatar Oct 15 '21 00:10 tekton-robot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot avatar Nov 14 '21 01:11 tekton-robot