
randomize initial zone selection

Open · joshk opened this pull request · 4 comments

What is the problem that this PR is trying to fix?

Workers, when deployed, are bound to a primary zone. Whenever a job (and VM) is started, the primary zone is used first, then an alternate zone is selected.

The core problem is when we start hitting zone exhaustion errors for a zone. Since there are pools of workers per zone, and each worker tries its primary zone first, this puts more pressure on the API and raises the risk of API rate limit issues.

What approach did you choose and why?

This is a first step towards a fix, but not yet the full fix.

This removes the concept of a primary zone and instead picks a random zone on each attempt. A zone can still be defined in the config, in which case that zone will be used every time.
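To make the selection behavior concrete, here is a minimal Go sketch of the logic described above. It is not the actual worker code: the config field, zone list, and function names are illustrative only.

```go
// Sketch of per-attempt random zone selection with an optional pinned zone.
package main

import (
	"fmt"
	"math/rand"
)

type Config struct {
	// Zone, if set, pins every start attempt to this zone.
	Zone string
}

// Illustrative zone list; the real set would come from config or the GCE API.
var defaultZones = []string{"us-central1-a", "us-central1-b", "us-central1-c", "us-central1-f"}

// pickZone returns the pinned zone when one is configured, otherwise a
// random zone for this attempt, so retries spread load across zones.
func pickZone(cfg Config) string {
	if cfg.Zone != "" {
		return cfg.Zone
	}
	return defaultZones[rand.Intn(len(defaultZones))]
}

func main() {
	fmt.Println(pickZone(Config{}))                      // random zone each attempt
	fmt.Println(pickZone(Config{Zone: "us-central1-b"})) // always the pinned zone
}
```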

The follow-up work to this PR is to block an exhausted zone for 10 minutes across all workers (possibly using Redis), which will help reduce API usage, along with a better self-healing worker setup.
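As a rough illustration of that follow-up idea, a Redis key with a TTL could act as the shared block list. This is only a sketch assuming the go-redis client; the key names are made up and this is not how worker currently behaves.

```go
// Sketch: when a zone returns ZONE_RESOURCE_POOL_EXHAUSTED, mark it as
// blocked in Redis with a 10-minute TTL so all workers skip it.
package main

import (
	"context"
	"time"

	"github.com/go-redis/redis/v8"
)

const zoneBlockTTL = 10 * time.Minute

// blockZone records that a zone is exhausted; the TTL makes the block
// expire automatically, so no cleanup job is needed.
func blockZone(ctx context.Context, rdb *redis.Client, zone string) error {
	return rdb.Set(ctx, "worker:blocked-zone:"+zone, "1", zoneBlockTTL).Err()
}

// zoneBlocked reports whether any worker has recently blocked the zone.
func zoneBlocked(ctx context.Context, rdb *redis.Client, zone string) (bool, error) {
	n, err := rdb.Exists(ctx, "worker:blocked-zone:"+zone).Result()
	return n > 0, err
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	_ = blockZone(ctx, rdb, "us-central1-c")
	blocked, _ := zoneBlocked(ctx, rdb, "us-central1-c")
	_ = blocked // a worker would skip this zone when picking
}
```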

How can you test this?

I've tested this locally by connecting it to staging, and it worked a treat.

What feedback would you like, if any?

This is for general discussion

I also need help understanding how to fix the test failures

joshk avatar Apr 01 '19 09:04 joshk

Note: this comment has been redacted; see followup below

> The core problem is when we start hitting zone exhaustion errors for a zone.

~~Which zone exhaustion errors are you referring to? IIUC, most resource quotas (as well as API quotas) are per project and/or per region rather than per zone. (See GCE Quotas page)~~

Example from recent outage:

Error
QUOTA_EXCEEDED: Quota 'SSD_TOTAL_GB' exceeded. Limit: 800000.0 in region us-central1. 

^ note the exhaustion is in the region us-central1, not a zone like us-central1-c.

(Note that we have occasionally hit ZONE_RESOURCE_POOL_EXHAUSTED errors in the past, but that has to do with the global resource usage for that zone rather than our projects specifically. ~~And it was not the case in the recent outage.)~~

> Since there are pools of workers per zone, and each worker tries its primary zone first, this puts more pressure on the API and raises the risk of API rate limit issues.

I think this is only the case when a zone is specified in the worker config. In our case it looks like instances are pretty well distributed across zones already, no?

[screenshot: instance counts per zone]

I wonder if it would make more sense to have each worker create job instances in its own zone, if possible, since those are already automatically distributed by the managed instance group.
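For reference, a worker could discover its own zone from the GCE metadata server. This sketch uses the plain metadata HTTP endpoint and is only meant to illustrate the idea; it is not how worker currently does it.

```go
// Sketch: ask the GCE metadata server which zone this worker VM is in,
// so job instances could be created in the same zone.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// ownZone returns the zone of the VM this code runs on, e.g. "us-central1-c".
func ownZone() (string, error) {
	req, err := http.NewRequest("GET",
		"http://metadata.google.internal/computeMetadata/v1/instance/zone", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Metadata-Flavor", "Google")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	// The metadata server returns "projects/<num>/zones/<zone>"; keep the last part.
	parts := strings.Split(strings.TrimSpace(string(body)), "/")
	return parts[len(parts)-1], nil
}

func main() {
	zone, err := ownZone()
	if err != nil {
		fmt.Println("not on GCE or metadata unavailable:", err)
		return
	}
	fmt.Println("creating job instances in", zone)
}
```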

Also, not all resource types (in particular GPUs) are available in all zones. So I'm not sure it makes sense to make zone pinning an all-or-nothing thing, because I believe that will cause problems when GPUs (and perhaps some specific CPUs) are specified.
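One way to square random selection with the GPU concern would be to restrict the random pick to zones that offer the requested accelerator. The sketch below is hypothetical; the availability map is a placeholder that would come from config or the GCE API, not real data.

```go
// Sketch: pick randomly only among zones that offer the requested accelerator.
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// pickZoneFor picks a random zone from those offering the requested
// accelerator; an empty accelerator means any zone is acceptable.
func pickZoneFor(accelerator string, zonesByAccel map[string][]string, allZones []string) (string, error) {
	candidates := allZones
	if accelerator != "" {
		candidates = zonesByAccel[accelerator]
	}
	if len(candidates) == 0 {
		return "", errors.New("no zone offers accelerator " + accelerator)
	}
	return candidates[rand.Intn(len(candidates))], nil
}

func main() {
	all := []string{"us-central1-a", "us-central1-b", "us-central1-c"}
	// Placeholder availability data for illustration only.
	byAccel := map[string][]string{"nvidia-tesla-k80": {"us-central1-a", "us-central1-c"}}
	zone, err := pickZoneFor("nvidia-tesla-k80", byAccel, all)
	fmt.Println(zone, err)
}
```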

~~Disclaimer: I still don't understand what actually triggered the recent outage, beyond the issue of not being able to delete instances because we had exceeded the API rate limits, thus causing the resource exhaustion. We still don't know what exactly caused us to exceed the API limits in the first place, do we?~~

soulshake avatar Apr 01 '19 10:04 soulshake

Edit: I take it all back, I see there were plenty of these errors during the last outage:

Mar 29 23:32:53 production-2-worker-org-gce-4mr9 travis-worker-wrapper: 
time="2019-03-30T04:32:53Z" level=error msg="couldn't start instance, attempting 
requeue" err="code=ZONE_RESOURCE_POOL_EXHAUSTED location= 
message=The zone 'projects/travis-ci-prod-2/zones/us-central1-c' does not have 
enough resources available to fulfill the request.  Try a different zone, or try again later." 
job_id=123456798 job_path=xyz/xyz/jobs/123456798 pid=1 
processor=ed8d48ea-5209-4f2b-b595-bfda4c06ce13@1.production-2-worker-org-gce-4mr9 
repository=xyz/xyz self=step_start_instance start_timeout=8m0s uuid=84c6f99e-f021-4b13-baf1-ba101c22e3ab

soulshake avatar Apr 01 '19 10:04 soulshake

Thanks for the feedback AJ.

I'm just about to hit the hay, but I thought I would also add that our Google reps recommended temporarily not using a zone when we hit these errors.

I'll write up a more detailed reply when I wake up.


joshk avatar Apr 01 '19 10:04 joshk

> Thanks for the feedback AJ. I'm just about to hit the hay, but I thought I would also add that our Google reps recommended temporarily not using a zone when we hit these errors. I'll write up a more detailed reply when I wake up.

This definitely makes more sense now that I know we had hit ZONE_RESOURCE_POOL_EXHAUSTED errors during the last incident. :+1:

soulshake avatar Apr 01 '19 10:04 soulshake