
Hitting vCPU limit

Open · Finesim97 opened this issue on Feb 16 '20 · 4 comments

Hi, while running Tibanna with Snakemake I keep hitting my vCPU limit. Snakemake supports a job limit for cluster executions (-j), but it doesn't apply it to Tibanna. Might this be something that Tibanna should handle instead?

Finesim97 · Feb 16 '20 14:02

At what stage are you hitting the vCPU limit? Is it on the cloud or locally? Tibanna doesn't use local CPUs to run jobs; it uses them only to submit jobs to the cloud (i.e. 1 CPU would be enough locally). On the cloud, it should have enough CPUs if the right instance type was launched.

SooLee · Feb 16 '20 15:02

The first Lambda (RunTaskAwsem) in the step function fails:

{
  "errorMessage": "failed to launch instance for job J2sUQPneqq2f: An error occurred 
(VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU
 capacity than your current vCPU limit of 64 allows for the instance bucket that the specified 
instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request 
an adjustment to this limit.",
  "errorType": "Exception",
  "stackTrace": [
    [
      "/var/task/service.py",
      20,
      "handler",
      "return run_task(event)"
    ],
    [
      "/var/task/tibanna/run_task.py",
      64,
      "run_task",
      "execution.launch()"
    ],
    [
      "/var/task/tibanna/ec2_utils.py",
      338,
      "launch",
      "self.instance_id = self.launch_and_get_instance_id()"
    ],
    [
      "/var/task/tibanna/ec2_utils.py",
      480,
      "launch_and_get_instance_id",
      "res = self.ec2_exception_coordinator(self.run_instances)(ec2)"
    ],
    [
      "/var/task/tibanna/ec2_utils.py",
      525,
      "inner",
      "raise Exception(\"failed to launch instance for job %s: %s\" % (self.jobid, str(e)))"
    ]
  ]
}

AWS almost instantly increased my limit after I requested it, but for a Snakemake workflow with a large number of tasks runnable at the same time, this will become a problem again.
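
For reference, the relevant On-Demand vCPU quota can also be inspected (and an increase requested) programmatically through the AWS Service Quotas API. A minimal sketch, assuming the standard instance-family quota code L-1216C47A and a placeholder region; other instance families use different quota codes:

# Check the current On-Demand vCPU limit for standard instance families.
# L-1216C47A is the quota code for "Running On-Demand Standard
# (A, C, D, H, I, M, R, T, Z) instances"; whether this is the quota that
# applies to your instance type is an assumption.
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")
resp = quotas.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
print("Current vCPU limit:", resp["Quota"]["Value"])

# An increase can be requested the same way:
# quotas.request_service_quota_increase(
#     ServiceCode="ec2", QuotaCode="L-1216C47A", DesiredValue=256.0
# )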

Finesim97 · Feb 16 '20 15:02

Ah, I see. It was an AWS limit. I will try to add some kind of error handling (e.g. waiting) over the next few days. Thanks again for reporting.
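
A minimal sketch of what such wait-and-retry handling could look like around RunInstances (the wait interval, retry cap, and function name are illustrative placeholders, not what Tibanna will actually ship):

# Retry EC2 RunInstances when the vCPU quota is exhausted, waiting for
# running jobs to finish and free up capacity. All values are placeholders.
import time
import boto3
from botocore.exceptions import ClientError

def run_instances_with_retry(launch_args, max_retries=10, wait_seconds=60):
    ec2 = boto3.client("ec2")
    for _ in range(max_retries):
        try:
            return ec2.run_instances(**launch_args)
        except ClientError as e:
            if e.response["Error"]["Code"] != "VcpuLimitExceeded":
                raise  # only wait on the vCPU quota error
            time.sleep(wait_seconds)  # wait for vCPUs to free up, then retry
    raise Exception("still over the vCPU limit after %d retries" % max_retries)

Inside a Lambda, long sleeps would eat into the execution timeout, so in practice the waiting would more likely be driven by the step function re-invoking the handler rather than by sleeping in place.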

SooLee · Feb 16 '20 17:02

No problem, and again, thank you for your work.

Finesim97 · Feb 19 '20 08:02