nextstrain.org
~30s of request queuing when promoting canary to production
Recently, I've noticed that promoting canary to production prevents nextstrain.org from loading for a short but noticeable amount of time.
With the latest promotion of 24ba9ee742e14c30a09ea03d4db3f8d6acefdc40 (nextstrain-server v894 → v895), I paid extra attention to this. Here is a breakdown of the time it took to load https://nextstrain.org in a web browser in two scenarios: the requests took ~30 seconds and were initiated about 10 seconds after the promotion completed successfully, meaning the total downtime was around 40 seconds.
The issue title says "local" downtime because I'm not sure if it's just my connection or if this can be observed by everyone.
The automated build of 24ba9ee742e14c30a09ea03d4db3f8d6acefdc40 on canary showed this warning (GitHub, Heroku), which may be related:
Warning: Your slug size (313 MB) exceeds our soft limit (300 MB) which may affect boot time.
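For anyone wanting to reproduce the measurement, curl's built-in write-out timers give a rough breakdown of where the time goes on a single request. This is a sketch on my part, not part of the original report:

```shell
# Rough per-request timing, run shortly after a promotion completes.
#   time_connect       = TCP connect finished
#   time_starttransfer = first response byte (includes any router queuing)
#   time_total         = full response received
curl -s -o /dev/null \
  -w 'connect: %{time_connect}s  first byte: %{time_starttransfer}s  total: %{time_total}s\n' \
  https://nextstrain.org
```

If the gap is almost entirely between connect and first byte, that's consistent with requests queuing at the router while the new dyno boots, rather than a network-level problem.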
I've noticed this too, and believe it's due to Heroku's routing layer switching over to the new dynos slightly before they're ready to serve requests. I wouldn't call it downtime, though: there's a short window when new requests queue up waiting for the new dyno to be ready and take longer to get a response, but no requests should fail.
I haven't looked into minimizing that time; slug size might be implicated, or our code's own startup time. I also wonder if we could have Heroku's routing layer hold on directing requests to the new dyno until after an app-level health check passes (as opposed to the dyno-level health check it seems to use now).
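One possibly relevant knob (an assumption on my part, not something confirmed in this thread): Heroku's preboot feature, which boots the new dynos and waits a few minutes before switching the router over to them, instead of cutting over as soon as the new dyno binds its port:

```shell
# Hypothetical mitigation: enable preboot so new dynos get a fixed boot
# window before the router switches traffic to them. Caveats: preboot
# requires the app to run at least two web dynos, and it delays cutover
# by a fixed window rather than waiting on an app-level health check.
heroku features:enable preboot --app nextstrain-server
```

That still isn't the app-level readiness check described above, but it might hide the boot time well enough that the queuing window disappears in practice.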
Noting that this happens on any restart of the app, such as after config changes.