
[Bug]: Services stop responding after `dstack-gateway` reboot

Open jvstme opened this issue 1 year ago • 4 comments

Steps to reproduce

  1. Create a gateway
  2. Run one or several services behind the gateway
  3. Reboot the instance the gateway is running on, e.g. via its cloud console

Actual behaviour

Previously created services no longer respond.

curl https://gateway.mygateway.example/chat/completions -H 'Authorization: Bearer *****' -H 'Content-Type: application/json' -d '{"model":"llama3.1", "messages": [{"role":"user", "content":"Hi"}]}'
{"error":"GatewayError","message":"<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.18.0 (Ubuntu)</center>\r\n</body>\r\n</html>\r\n"}

Expected behaviour

Previously created services should continue responding, since rebooting the gateway's instance is a rare but possible circumstance.

dstack version

master

Server logs

No response

Additional information

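dstack-gateway creates SSH tunnels to services and stores their control sockets in the /tmp directory, which does not survive machine reboots. After a reboot the sockets are gone, so the tunnels are not restored and nginx has no upstream to forward to.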
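
A minimal sketch of how the missing sockets could be detected on startup. It assumes the gateway keeps a mapping from service name to the expected control socket path. The mapping, the example paths, and both function names are hypothetical and not taken from the dstack-gateway code.

```python
import os
import stat
from pathlib import Path


def control_socket_alive(path: Path) -> bool:
    """Return True if `path` exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except FileNotFoundError:
        return False


def tunnels_to_restore(expected_sockets: dict[str, Path]) -> list[str]:
    """Return the names of services whose control socket is missing.

    `expected_sockets` maps a (hypothetical) service name to the path
    where its SSH control socket is expected to live.
    """
    return sorted(
        name
        for name, path in expected_sockets.items()
        if not control_socket_alive(path)
    )
```

Any service returned by `tunnels_to_restore` would need its tunnel re-created before it can receive traffic again. A socket file that exists is not proof that the SSH connection behind it is still healthy, so this only covers the reboot case described here.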

jvstme avatar Aug 15 '24 18:08 jvstme

Yeah, same issue.

Stealthwriter avatar Sep 05 '24 17:09 Stealthwriter

A rebooted gateway instance leads to services not working. Marking it as major since this doesn't have a simple workaround.

r4victor avatar Sep 30 '24 10:09 r4victor

Fleet instances won't survive a reboot either, at least on some backends (tested with GCP): they don't have dstack public keys after reboot.

un-def avatar Oct 10 '24 06:10 un-def

Sometimes the dstack-gateway application won't start at all after an instance reboot. Recently, a planned reboot led to an empty ~/dstack/state.json file, so dstack-gateway failed to restart and the gateway state was lost.

Oct 14 14:56:48 ip-172-31-30-166 sh[614992]: INFO:     127.0.0.1:36098 - "GET /api/stats/collect HTTP/1.1" 200 OK
Oct 14 14:56:52 ip-172-31-30-166 sh[614992]: INFO:     127.0.0.1:33052 - "GET /api/stats/collect HTTP/1.1" 200 OK
-- Boot 2744f08ca7ef4101be2445a849b165b3 --
Oct 14 14:57:24 ip-172-31-30-166 systemd[1]: Started dstack gateway service.
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: INFO:     Started server process [395]
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: INFO:     Waiting for application startup.
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: 2024-10-14 14:57:28,321 - dstack.gateway.core.persistent - DEBUG - Loading state from /home/ubuntu/dstack/state.json
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: ERROR:    Traceback (most recent call last):
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/starlette/routing.py", line 734, in lifespan
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     async with self.lifespan_context(app) as maybe_state:
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     return await anext(self.gen)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/dstack/gateway/main.py", line 24, in lifespan
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     store = get_store()
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/dstack/gateway/core/store.py", line 346, in get_store
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     store = Store.model_validate(get_persistent_state().get(Store.persistent_key, {}))
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/home/ubuntu/dstack/blue/lib/python3.10/site-packages/dstack/gateway/core/persistent.py", line 27, in get_persistent_state
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     state = json.load(f)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/usr/lib/python3.10/json/__init__.py", line 293, in load
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     return loads(fp.read(),
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     return _default_decoder.decode(s)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:   File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
Oct 14 14:57:28 ip-172-31-30-166 sh[395]:     raise JSONDecodeError("Expecting value", s, err.value) from None
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Oct 14 14:57:28 ip-172-31-30-166 sh[395]: ERROR:    Application startup failed. Exiting.
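
The traceback above shows `json.load` failing on an empty file. A common way to avoid leaving a half-written or empty state file is to write to a temporary file in the same directory, flush it to disk, and rename it over the target. The following is a minimal sketch of that pattern, not the actual dstack-gateway code; `save_state` and `load_state` are hypothetical names, and returning empty state for an unreadable file is an assumption about the desired behaviour.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write `state` atomically: temp file in the same dir, fsync, rename."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the data reaches the disk
        os.replace(tmp_name, path)  # atomic rename on the same filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


def load_state(path: Path) -> dict:
    """Load state; treat a missing, empty, or corrupt file as empty state."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
```

With this approach a reboot in the middle of a write leaves either the old file or the new one in place. The tolerant `load_state` only lets the application start. It does not recover lost state, so logging the failure would still be worthwhile.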

jvstme avatar Oct 17 '24 10:10 jvstme

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Nov 17 '24 02:11 github-actions[bot]

@jvstme, is it fixed with the latest gateway refactoring?

r4victor avatar Dec 23 '24 12:12 r4victor

Yes, fixed in #1595

jvstme avatar Dec 26 '24 09:12 jvstme