Intermittently fails to create proxy to service in docker-compose.yaml
Expected Behaviour
Services specified in `/var/lib/faasd/docker-compose.yaml` with a mapped port are started and exposed on the host system via a proxy.
Current Behaviour
Most of the time, the services are exposed successfully, but sometimes the proxy fails with the following error (seen in `journalctl -u faasd --no-pager`):
`unable to dial to 10.62.0.13:8080, error: dial tcp 10.62.0.13:8080: connect: connection refused`
In this case, the IP is for the correct service, and the port (8080) is the "internal" port. The proxied "external" port would have been 8082.
After seeing this message, I tried the following troubleshooting to ensure the service was started properly:
- View service logs with `journalctl -t openfaas:my-service --no-pager`, which successfully shows me the logs from `my-service`
- Run `curl 10.62.0.13:8080`, which successfully returns the expected results
- Run `curl 10.62.0.1:8082`, which fails to return the expected results
- Run `curl 10.62.0.1:8080` (the OpenFaaS gateway, to test another proxied service), which returns the expected results
List all Possible Solutions
Since I'm not familiar with how a lot of the networking (CNI) is handled, my assumption is that there's a "timing" problem at the point where the proxy attempts to "dial" the service here.
I'm making this assumption due to the following:
- the service is available later at the IP address and port.
- this doesn't always happen
- sometimes it's a different service (or more than one service)
- usually the last service(s) to start
Possible Solutions
- Add "wait" time before creating proxies
- Add "retries" here
- Implement "health checks" for services to determine when/if the proxy should be created.
List the one solution that you would recommend
I would recommend adding "retries" here, since that doesn't rely on an arbitrary wait time and will only kick in when necessary. Additionally, a fallback strategy could be implemented for services with longer startup times.
I also like the "health checks" solution, but that seems like it would be a bigger change and I wouldn't know where to begin.
Steps to Reproduce (for bugs)
- Add custom service to `/var/lib/faasd/docker-compose.yaml` with a mapped port
- Reload systemd: `sudo systemctl daemon-reload`
- Restart faasd: `sudo systemctl restart faasd`
Context
I'm attempting to add some custom services that start up when the rest of the faasd system starts up.
Your Environment
- OS and architecture: Running on `multipass` using the default LTS image (`20.04 focal`, `lts 20210315 Ubuntu 20.04 LTS`). The host system is macOS Catalina, Version 10.15.7.
- Versions:

```
$ go version

$ containerd -version
containerd github.com/containerd/containerd v1.3.5 9b6f3ec0307a825c38617b93ad55162b5bb94234

$ uname -a
Linux doowb-faasd-03 5.4.0-67-generic #75-Ubuntu SMP Fri Feb 19 18:03:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
```
```
$ faasd version
  __                     _
 / _| __ _  __ _ ___  __| |
| |_ / _` |/ _` / __|/ _` |
|  _| (_| | (_| \__ \ (_| |
|_|  \__,_|\__,_|___/\__,_|

faasd version: 0.11.0 commit: 0bf221b286ea07291a5e62e109e6c62103fb1669
```
Hi Brian,
Thanks for the report.
Can you give the YAML file or a diff for the "In this case" part where you mention port 8082?
> Add custom service to `/var/lib/faasd/docker-compose.yaml` with a mapped port
Can you share the specific services, because it may look generic, but be specific to timing/application.
> Add "retries" here
What do you do, in order to see the error? Is it when you attempt to access the "added" service on port 8082 that you see the error, or just when the faasd service restarts?
Can you repro with?

```yaml
figlet:
  image: ghcr.io/openfaas/figlet:96d4440125c18c6ffc75ecb72687b37a277d3489
  user: "1000"
  ports:
    - "8082:8080"
```
```
ubuntu@faasd:~$ journalctl -u faasd --no-pager --lines 25
-- Logs begin at Tue 2021-02-02 18:49:36 UTC, end at Thu 2021-03-18 20:36:33 UTC. --
Mar 18 20:35:43 faasd faasd[125827]: Starting: gateway
Mar 18 20:35:43 faasd faasd[125827]: 2021/03/18 20:35:43 Created container: gateway
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 gateway has IP: 10.62.0.43
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Task: gateway Container: gateway
Mar 18 20:35:44 faasd faasd[125827]: Starting: queue-worker
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Created container: queue-worker
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 queue-worker has IP: 10.62.0.44
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Task: queue-worker Container: queue-worker
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Supervisor init done in: 4.056249759s
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Looking up IP for: "gateway"
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Resolver rebuilding map
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Resolver: "localhost"="127.0.0.1"
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Resolver: "faasd-provider"="10.62.0.1"
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Resolver: "figlet"="10.62.0.39"
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Resolver: "basic-auth-plugin"="10.62.0.40"
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Resolver: "nats"="10.62.0.41"
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Resolver: "prometheus"="10.62.0.42"
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Resolver: "gateway"="10.62.0.43"
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Resolver: "queue-worker"="10.62.0.44"
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 faasd: waiting for SIGTERM or SIGINT
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Looking up IP for: "figlet"
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Proxy from: 0.0.0.0:8082, to: figlet:8080 (10.62.0.39)
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Looking up IP for: "prometheus"
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Proxy from: 0.0.0.0:9090, to: prometheus:9090 (10.62.0.42)
Mar 18 20:35:44 faasd faasd[125827]: 2021/03/18 20:35:44 Proxy from: 0.0.0.0:8080, to: gateway:8080 (10.62.0.43)
ubuntu@faasd:~$ curl localhost:8082 -d test
 _            _
| |_ ___  ___| |_
| __/ _ \/ __| __|
| ||  __/\__ \ |_
 \__\___||___/\__|
ubuntu@faasd:~$
```
Here's the second restart:
```
Mar 18 20:37:50 faasd faasd[126763]: 2021/03/18 20:37:50 Resolver: "figlet"="10.62.0.47"
Mar 18 20:37:50 faasd faasd[126763]: 2021/03/18 20:37:50 Resolver: "basic-auth-plugin"="10.62.0.48"
Mar 18 20:37:50 faasd faasd[126763]: 2021/03/18 20:37:50 Resolver: "prometheus"="10.62.0.49"
Mar 18 20:37:50 faasd faasd[126763]: 2021/03/18 20:37:50 Resolver: "gateway"="10.62.0.50"
Mar 18 20:37:50 faasd faasd[126763]: 2021/03/18 20:37:50 Looking up IP for: "figlet"
Mar 18 20:37:50 faasd faasd[126763]: 2021/03/18 20:37:50 Proxy from: 0.0.0.0:8082, to: figlet:8080 (10.62.0.47)
Mar 18 20:37:50 faasd faasd[126763]: 2021/03/18 20:37:50 faasd: waiting for SIGTERM or SIGINT
Mar 18 20:37:51 faasd faasd[126763]: 2021/03/18 20:37:51 Proxy from: 0.0.0.0:9090, to: prometheus:9090 (10.62.0.49)
Mar 18 20:37:51 faasd faasd[126763]: 2021/03/18 20:37:51 Proxy from: 0.0.0.0:8080, to: gateway:8080 (10.62.0.50)
ubuntu@faasd:~$ curl localhost:8082 -d test
 _            _
| |_ ___  ___| |_
| __/ _ \/ __| __|
| ||  __/\__ \ |_
 \__\___||___/\__|
ubuntu@faasd:~$
```
I've run this workflow 5+ times and got the same result each time, so having an exact repro would help.
I'll try with figlet and maybe some other services, since it's not always happening.
> Can you share the specific services, because it may look generic, but be specific to timing/application.
It's happened for different services; the most recent was Grafana, exposed on port `3000:3000`.
> What do you do, in order to see the error?
In the Grafana case, I had already added it and had a browser window open. The error occurred after restarting faasd, which I assume was because the Grafana client made a request while faasd was still starting up.
The other times, the error occurred when an external health check pinged the service before faasd had finished starting.
I made the assumption, from looking at the Go code, that the error occurred just after the proxy was started. For instance, just after this line is logged: `Proxy from: 0.0.0.0:8080, to: gateway:8080 (10.62.0.50)`.
If the error could only occur when a request is made to the proxy, then I can see how the timing of the health checks could cause an issue. If this is the case, is there a way for the proxy to succeed on the next request?
What happens when the service has started properly and you try accessing Grafana again?
Sorry for the delayed response, I've been attempting to consistently recreate the issue.
> What happens when the service has started properly and you try accessing Grafana again?
I believe the services are eventually starting properly, every time, since I can access them through their direct IP and port. I see the error when attempting to access the services through the port on the "localhost" IP and the proxy is unable to make the connection.
I attempted reducing the custom services to just multiple instances of figlet and had an external script running a loop to ping all of the figlet instances on their "localhost" IP and port, but I couldn't recreate the error.
I think this is a timing issue, since some of the services with issues depend on other services, so I ensured that all of my `depends_on` sections were fully and properly configured. This has reduced the frequency of the issue; when it does occur, restarting the faasd system service usually means the service resolves correctly the next time.
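For reference, this is the kind of `depends_on` wiring I mean; the service name and image tag here are just examples:

```yaml
grafana:
  image: grafana/grafana:7.5.2   # tag chosen for illustration
  ports:
    - "3000:3000"
  depends_on:
    - gateway   # don't start until the gateway service has started
```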
Unless I can find a way to consistently cause the error, I'm not sure how else to approach this, or if there is anything to be done in faasd itself. If I have any new information, I'll post back on this issue.
I had a very similar problem, and I was able to reproduce it if all of these conditions are met:
- The service is mentioned in `docker-compose.yaml`.
- The service has an external port X mapped.
- The service takes some time to initialise.
- The service is called on the external port X before the service finishes initialising.

This always seems to cause this error.