grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

Why does connection to `localhost` fail on local ?

Open manuraj17 opened this issue 1 year ago • 20 comments

I faced this issue when I was testing out a simple application; Not able to put a finger on what exactly am missing.

My client code is returning with this error

could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded

I have my server running as

const (
	port = ":50051"
)

// server is used to implement helloworld.GreeterServer.
type server struct {
	helloworld.UnimplementedGreeterServer
}

// SayHello implements helloworld.GreeterServer
func (s *server) SayHello(ctx context.Context, in *helloworld.HelloRequest) (*helloworld.HelloReply, error) {
	log.Printf("Received: %v\n", in.GetName())
	return &helloworld.HelloReply{Message: "Hello " + in.Name}, nil
}

func main() {
	lis, err := net.Listen("tcp", port)
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	s := grpc.NewServer()
	helloworld.RegisterGreeterServer(s, &server{})
	log.Printf("server starting at %v\n", lis.Addr())
	if err := s.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}

Client connecting to localhost as

const (
	address = "localhost:50051"
	// address     = "[::1]:50051"

	// address     = "127.0.0.1:50051"
	defaultName = "world"
)

func main() {
	// Set up a connection to the server.
	conn, err := grpc.NewClient(address, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	defer conn.Close()
	c := helloworld.NewGreeterClient(conn)

	// Contact the server and print out its response.
	name := defaultName
	if len(os.Args) > 1 {
		name = os.Args[1]
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	r, err := c.SayHello(ctx, &helloworld.HelloRequest{Name: name})
	if err != nil {
		log.Fatalf("could not greet: %v", err)
	}
	log.Printf("Greeting: %s", r.GetMessage())
}

This code fails.

But when I connect with either of

	address     = "[::1]:50051"
	address     = "127.0.0.1:50051"

It works. I am trying to understand what am missing here. Can anyone help me understand? TIA!

Full application code available here

UPDATE (More Info):

Some more information Version

google.golang.org/grpc v1.65.0

I am able to connect if I am using 127.0.0.0 and [::1] The issue specifically seems to occur when localhost is used.

etc/hosts is below

##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1       localhost
127.0.0.1       mail.local
255.255.255.255 broadcasthost
::1             localhost

I am running the server at add = :50051 and initialising like

	lis, err := net.Listen("tcp", port)

It works with Dial

When I use Dial it works

	conn, err := grpc.DialContext(ctx, addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)

I have tried all variations to connect without using Dial, and everything except localhost works. The resolution of localhost also seems to be working correctly. I get the following logs

 localhost resolves to: [::1 127.0.0.1]

NOTE: IPV6 gets resolved first.

Though I am not able to figure what failed here. I will have end up not using localhost for now. Though, curious what is happening here.

manuraj17 avatar Jul 21 '24 06:07 manuraj17

@manuraj17 I was also able to repro on my local with localhost. It looks like when using localhost with grpc.NewClient address resolution takes more than 5s which is why we get Deadline exceeded in the example. If you try with 10s timeout, rpc will succeed.

Will discuss with team and update on what is the best course of action here. Thanks once again for catching this.

purnesh42H avatar Jul 22 '24 17:07 purnesh42H

@purnesh42H Thanks for checking on this. Will wait for the update 👍🏽

manuraj17 avatar Jul 22 '24 17:07 manuraj17

Client logs with 1 second timeout when using localhost with grpc.NewClient

2024/07/23 21:08:08 INFO: [core] original dial target is: "localhost:50051"
2024/07/23 21:08:08 INFO: [core] [Channel #1]Channel created
2024/07/23 21:08:08 INFO: [core] [Channel #1]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"dns", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/localhost:50051", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
2024/07/23 21:08:08 INFO: [core] [Channel #1]Channel authority set to "localhost:50051"
2024/07/23 21:08:08 INFO: [core] [Channel #1]Channel exiting idle mode
2024/07/23 21:08:13 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded
exit status 1

purnesh42H avatar Jul 23 '24 15:07 purnesh42H

@purnesh42H Could share what changed between from being able to reproduce to now? Just curious on what happened earlier and what is happening now.

manuraj17 avatar Jul 23 '24 17:07 manuraj17

@manuraj17 the problem seems to be when using grpc.NewClient. The address resolution is taking longer when using localhost and that's why you experience timeout. If you try the same example with 10s timeout, rpc will succeed. We are still debugging the root cause.

purnesh42H avatar Jul 25 '24 03:07 purnesh42H

this call is blocked for longer when using localhost https://github.com/grpc/grpc-go/blob/2bcbcab9fbeec8a475631e8f06bd0cb45eb92dc8/stream.go#L212

purnesh42H avatar Jul 25 '24 04:07 purnesh42H

@manuraj17 I am not able to repro this anymore and localhost seems to work with existing example. For now, we will keep this open but won't investigate further since its not a widespread issue. Feel free to add any new information here and we can re-evaluate the priority

purnesh42H avatar Jul 29 '24 12:07 purnesh42H

Are you still able to reproduce this? Can you run with full debug logging enabled and include the logs?

https://github.com/grpc/grpc-go#how-to-turn-on-logging

dfawley avatar Jul 29 '24 15:07 dfawley

@dfawley Will check and update; Though any idea on what is happening in this scenario? @purnesh42H was able to identify and figure it's a resolution issue. Curious as to what information you folks have regarding this.

manuraj17 avatar Jul 30 '24 11:07 manuraj17

@dfawley Will check and update; Though any idea on what is happening in this scenario?

No idea. If your machine can't resolve localhost, or properly connect to an address it returns, then that sounds like a configuration issue to me.

dfawley avatar Jul 30 '24 15:07 dfawley

No idea. If your machine can't resolve localhost, or properly connect to an address it returns, then that sounds like a configuration issue to me.

If it's my machine issue wouldn't it fail for

  • Dial method
  • every other app that is running in localhost

Not sure what what you are on to here coz it still doesn't clariy how @purnesh42H was able to reproduce it for a while; did he have some configuration issue?

manuraj17 avatar Jul 31 '24 04:07 manuraj17

@manuraj17 can you pull the latest and then try running with grpc.NewClient and localhost with the logging turned on https://github.com/grpc/grpc-go#how-to-turn-on-logging? Since, this is not always reproducible, we need more information to verify if its a library or configuration issue.

To clarify, even for me the resolution was just taking a bit longer with localhost but that's not happening anymore

purnesh42H avatar Jul 31 '24 12:07 purnesh42H

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar Aug 06 '24 12:08 github-actions[bot]

@purnesh42H

can you pull the latest

Are you suggesting the latest from the grpc main branch?

manuraj17 avatar Aug 10 '24 02:08 manuraj17

Are you suggesting the latest from the grpc main branch?

Yes

purnesh42H avatar Aug 10 '24 18:08 purnesh42H

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar Aug 16 '24 18:08 github-actions[bot]

We ran into this issue running code using the latest grpc-go release as of 8 August 2024. While I unfortunately cannot provide any logs or code, I can say we only started seeing the issue when switching from using Dial to NewClient, as previously commented, and only when all nameservers listed in /etc/resolv.conf were unreachable. (Figuring this part out took a while!)

Similar to the reporter, changing from localhost to 127.0.0.1 for the server and client also avoided the error, even with all invalid/unreachable nameservers in /etc/resolv.conf.

(Note: while the nameservers in /etc/resolv.conf were invalid, we could still e.g. ping localhost and get a response; it was only grpc-go that seemed to suffer any issues (beyond what you'd naturally expect when your DNS config is awry, anyway).)

(Also, localhost was listed in /etc/hosts similar to the reporter.)

ciaphas01 avatar Aug 21 '24 00:08 ciaphas01

Apologies for the double post, forgot the @purnesh42H and I'm not sure editing mentions into comments actually works lol

ciaphas01 avatar Aug 21 '24 00:08 ciaphas01

Update: curiously, the problem seems to stop manifesting immediately when the machine is physically disconnected from all networks. Perhaps somewhere there's code that's causing this DNS record lookup to get skipped if all network interfaces are down, allowing the localhost entry in /etc/hosts to take over and the health check to proceed?

ciaphas01 avatar Aug 21 '24 18:08 ciaphas01

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar Aug 27 '24 18:08 github-actions[bot]

In new grpc.NewClient the default name resolution is dns as opposed to passthrough in deprecated grpc.Dial. Try NewClient with passthrough as the resolver and it should behave the same.

This type of issue with dns resolution is more likely related to system configuration rather than a problem with the gRPC library itself as it relies on the underlying system for DNS resolution. If the system takes time to resolve localhost, this delay will be reflected in gRPC client behavior as well.

purnesh42H avatar Sep 03 '24 05:09 purnesh42H

We are seeing the same issue, and indeed using passthrough works. But even with passthrough net.Dial will eventually perform the DNS resolution, and this seemingly works just fine, so I think there is an open question about what the grpc-go DNS resolver does differently to provoke this issue (both use net.Resolver eventually). I'm not sure it should just be written off to system misconfiguration. Even if that was a factor, the people encountering this seem to have no other issues with DNS resolution (we don't) and are slowed down while trying to switch to NewClient with the recommended default resolution scheme. It'd be great to at least find a more satisfying explanation.

ClaudioAlbertin avatar Sep 19 '24 22:09 ClaudioAlbertin

@ClaudioAlbertin can you provide the gRPC logs when the DNS resolver fails?

arjan-bal avatar Sep 20 '24 08:09 arjan-bal

@ClaudioAlbertin can you provide the gRPC logs when the DNS resolver fails?

@ClaudioAlbertin ping on this

purnesh42H avatar Sep 24 '24 17:09 purnesh42H

Unfortunately I can only reliably reproduce the issue on our CI server in some tests with very noisy output, but I'll see what logs I can get out of it.

It may be worth noting for @manuraj17 that the grpc-go DNS resolver also performs a lookup for a TXT record (in the hope of finding a Service Config), which does not happen if an IP address or the passthrough resolver is used. This lookup contacts an actual DNS server, whereas resolving localhost otherwise just involves reading local files. This could contribute to the RPC taking much longer with grpc.NewClient vs grpc.Dial. You can test whether this is your issue by disabling service configs and by extension the TXT lookup using the grpc.WithDisableServiceConfig() dial option. This did not fix the issue for us, however.

ClaudioAlbertin avatar Sep 24 '24 23:09 ClaudioAlbertin

@manuraj17 as suggested by @ClaudioAlbertin, can you try disabling service config using the grpc.WithDisableServiceConfig() dial option and see if localhost resolution works with grpc.NewClient? If it does, it means that all the nameservers present in the /etc/resolv.conf are un-reachable.

arjan-bal avatar Sep 27 '24 17:09 arjan-bal

@ClaudioAlbertin to be sure, are you seeing the issue with localhost or another hostname?

Also, do you happen to be using an http proxy? If yes, there is a known issue (https://github.com/grpc/grpc-go/issues/7556) which causes DNS resolution to happen on the client instead of the proxy which could cause timeouts if the client can't reach the DNS itself.

arjan-bal avatar Sep 27 '24 17:09 arjan-bal

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar Oct 03 '24 18:10 github-actions[bot]

@ClaudioAlbertin wrote:

We are seeing the same issue, and indeed using passthrough works.

How do I use passthrough through the API then? Looking at the code of Dial it uses a private option called withDefaultScheme("passthrough") but since that is private, I cannot use it. It would be helpful if the deprecated function was implemented in such a way that we could fully reproduce it in the code that uses it to get rid of the deprecated code!

cakoolen avatar Dec 16 '24 10:12 cakoolen

@ClaudioAlbertin wrote:

We are seeing the same issue, and indeed using passthrough works.

How do I use passthrough through the API then? Looking at the code of Dial it uses a private option called withDefaultScheme("passthrough") but since that is private, I cannot use it. It would be helpful if the deprecated function was implemented in such a way that we could fully reproduce it in the code that uses it to get rid of the deprecated code!

grpc.Dial by default uses passthrough scheme so you don't have to do anything but if you are using grpc.NewClient then you can prefix your url with passthrough:/// to skip the dns resolution on the client.

purnesh42H avatar Dec 23 '24 10:12 purnesh42H