
~30s of request queuing when promoting canary to production

Open victorlin opened this issue 1 year ago • 3 comments

Recently, I've noticed that promoting canary to production prevents nextstrain.org from loading for a short but noticeable amount of time.

With the latest promotion of 24ba9ee742e14c30a09ea03d4db3f8d6acefdc40 (nextstrain-server v894 → v895), I paid extra attention to this. Here is a breakdown of the time it took to load https://nextstrain.org in a web browser in two scenarios. The requests were initiated about 10 seconds after the promotion completed successfully and took ~30 seconds, so the total downtime was around 40 seconds:

[screenshot: load times]

The issue title says "local" downtime because I'm not sure whether it's just my connection or something everyone can observe.
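To check whether this is reproducible outside my own connection, a throwaway probe like the one below could be run from a second machine during a promotion. It's just a sketch, not part of the repo, and assumes Node 18+ for the built-in fetch:

```js
// Poll nextstrain.org once a second and log how long each request takes.
// Start it before promoting and stop it once response times settle again.
const TARGET = "https://nextstrain.org";

async function probe() {
  const start = Date.now();
  try {
    const res = await fetch(TARGET);
    console.log(`${new Date(start).toISOString()}  ${res.status}  ${Date.now() - start} ms`);
  } catch (err) {
    console.log(`${new Date(start).toISOString()}  error after ${Date.now() - start} ms  (${err.message})`);
  }
}

setInterval(probe, 1000);
```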

victorlin · Dec 18 '23 22:12

The automated build of 24ba9ee742e14c30a09ea03d4db3f8d6acefdc40 on canary showed this warning (GitHub, Heroku), which may be related:

Warning: Your slug size (313 MB) exceeds our soft limit (300 MB) which may affect boot time.

victorlin · Dec 18 '23 22:12

I've noticed this and believe it's due to how Heroku's routing layer switches things over a bit early when cutting between the old dynos and new dynos. I wouldn't call it downtime, though. There's a short period of time when new requests will queue up waiting for the new dyno to be ready and take longer to get a response, but no requests should fail.

I haven't looked into minimizing that time; slug size might be implicated, as might our code's own startup time. I also wonder if we could have Heroku's routing layer hold off on directing requests to the new dyno until after an app-level health check passes (as opposed to the dyno-level check it seems to use now).
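To make the app-level idea concrete, here's a minimal sketch (not our actual server code) assuming an Express-style app. As far as I understand, the dyno-level check amounts to "did the process bind to $PORT", so the two ideas here are to do slow startup work before binding, and to expose a health endpoint that only answers 200 once that work is done. loadRoutingTables() and the /healthz path are placeholders:

```js
// Sketch only: defer the $PORT bind until startup work finishes, and expose
// an app-level health check that reflects actual readiness.
const express = require("express");

async function loadRoutingTables() {
  // Placeholder for whatever slow initialization the real server does
  // (loading source/dataset metadata, warming caches, etc.).
}

async function main() {
  const app = express();

  // Do the expensive startup work *before* listening, so the dyno doesn't
  // look ready while it's still initializing.
  await loadRoutingTables();

  // App-level health check: reachable only after the initialization above,
  // so a 200 here means the app can actually serve requests.
  app.get("/healthz", (req, res) => res.status(200).send("ok"));

  // Bind to $PORT last.
  const port = process.env.PORT || 5000;
  app.listen(port, () => console.log(`listening on ${port}`));
}

main();
```

If I remember right, Heroku's Preboot feature (`heroku features:enable preboot`) is the closest built-in knob: it keeps the old dynos serving while the new ones boot, though it switches the router over after a fixed delay rather than on a health check, so it would shrink the window without guaranteeing readiness.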

tsibley · Jan 10 '24 19:01

Noting that this happens on any restart of the app, such as after config changes.

victorlin · Sep 25 '24 19:09