Unclear error message `Gateway is not working` if DNS is misconfigured
Steps to reproduce
- Start `dstack server` with ZeroSSL configured as the CA for dstack-gateway. See this comment.
- Create a gateway

      dstack gateway create --domain $DOMAIN --region eu-central-1 --backend aws

- Set a DNS A record for `*.$DOMAIN`, but instead of pointing it to the gateway's IP address point it to an IP address of some other machine that is down. As if you redeployed the gateway, but forgot to change the DNS record.
- Try running any service with dstack

      > cat drope.yml
      type: service
      commands:
        - pip install drope
        - drope
      port: 8000
      > dstack run . -f drope.yml
      ... (redacted for brevity) ...
      Shown 3 of 761 offers, $49.159 max
      Continue? [y/n]: y
Expected behaviour
The CLI shows an error saying that dstack-gateway failed to issue a certificate for the service's domain and suggests that the user make sure the DNS A record for the domain points to the gateway's IP address.
Actual behaviour
After 30 seconds the CLI shows an unclear error message.
Gateway is not working:
The server logs don't have anything relevant.
dstack version
0.17.0
Server logs
No response
Additional information
What happens is:
- the server requests the gateway's `/api/registry/{project}/services/register`
- the gateway tries issuing a certificate via certbot
- certbot hangs indefinitely because of misconfigured DNS
- the server cancels its request after a timeout
This behavior depends on the CA. For example, with Let's Encrypt, certbot exits quickly and the error is passed to the dstack server and then to the CLI:
GatewayError: Certbot failed:
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Some challenges have failed.
I suggest we fix this by adding a timeout to certbot runs and passing a clear error message to the CLI if the timeout is reached.
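A minimal sketch of what the suggested fix could look like, assuming the gateway invokes certbot as a subprocess. The function name `run_certbot`, its parameters, the 60-second default, and the local `GatewayError` class are hypothetical and only illustrate the idea; they are not taken from the dstack codebase.

```python
import subprocess
import sys


class GatewayError(Exception):
    """Hypothetical error type, named after the GatewayError seen in the output above."""


def run_certbot(cmd, domain, timeout_seconds=60.0):
    """Run a certbot command line and fail with a clear message if it hangs.

    `cmd` is the full argument list; how the gateway actually builds it is
    not shown in this issue, so it is left to the caller.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired
        raise GatewayError(
            f"Certbot timed out after {timeout_seconds:g}s while issuing a "
            f"certificate for {domain}. Make sure the DNS A record for the "
            "domain points to the gateway's IP address."
        ) from None
    if proc.returncode != 0:
        # Same shape as the message already produced with Let's Encrypt
        raise GatewayError(f"Certbot failed:\n{proc.stderr.strip()}")
    return proc.stdout


if __name__ == "__main__":
    # Stand-in for a certbot process that hangs on misconfigured DNS
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    try:
        run_certbot(hanging, "service.example.com", timeout_seconds=1)
    except GatewayError as e:
        print(e)
```

The certbot timeout would need to be shorter than the server's own request timeout (about 30 seconds, judging by the behaviour above), otherwise the server would still cancel the request before the gateway can report the error.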
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Still relevant
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.