grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

NewClient functions behaviour is incompatible with secure forward-proxies

Open puneet-traceable opened this issue 1 year ago • 27 comments

What version of gRPC are you using?

1.64.0 and v1.67.0-dev

What version of Go are you using (go version)?

1.22

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

If possible, provide a recipe for reproducing the error.

  1. setup a squid proxy
  2. setup grpc client(examples/helloworld/greeter_client) and grpc server(examples/helloworld/greeter_server) to use tls
  3. run grpc server
  4. on client shell set env var for using proxy(export https_proxy="http://<proxy_host:port")
  5. start tcpdump

What did you expect to see?

the target should be hostname while it's sent to proxy and dns resolution for target should happen on proxy

What did you see instead?

dns is resolved on the client and only ip is sent. Attaching tcpdump screenshot with difference Screenshot 2024-08-23 at 7 50 34 PM

tcpdump for curl Screenshot 2024-08-23 at 9 48 22 PM

puneet-traceable avatar Aug 23 '24 16:08 puneet-traceable

If you need a short-term workaround to keep things working as they were before, you should be able to use the passthrough resolver:

client, err := grpc.NewClient("passthrough:///<hostname>:<port>", ...)

That probably should not be required, though.

dfawley avatar Aug 23 '24 17:08 dfawley

@dfawley I get that this would work in most of the cases. But there are more issues with this change. For example: if dns lookups are disabled on app instances and it's only the proxy that can resolve the dns, current grpc client built with NewClient does not work. curl for http2 does not face this problem.

puneet-traceable avatar Aug 27 '24 13:08 puneet-traceable

But there are more issues with this change.

Are you saying the workaround of using passthrough:///<hostname>:<port> as the target string isn't working for you in these situations? If so, can you provide some more information and debugging logs of that?

dfawley avatar Aug 27 '24 15:08 dfawley

target

No, I didn't mean passthrough won't work. But I think NewClient api should have greater flexibility to be able to configure dns versus passthrough. Since we use opentelemetry, the real grpc-go integration is way down the stack and it's hard to configure the url this way as the same url gets used at multiple places.

puneet-traceable avatar Aug 28 '24 06:08 puneet-traceable

Thanks for confirming. Yes, this should be treated as a bug and the workaround was not suggested to avoid fixing it.

dfawley avatar Aug 28 '24 16:08 dfawley

For reference, this gRFC comes into play here: https://github.com/grpc/proposal/blob/master/A1-http-connect-proxy-support.md

But note that Java did not implement this gRFC, and we may or may not want to do things this way.

dfawley avatar Aug 30 '24 21:08 dfawley

Java added support for "Use Case 1". But it did not use the gRFC's design. https://github.com/grpc/grpc-java/issues/10022 tracks implementing Use Case 2 in Java.

ejona86 avatar Aug 31 '24 00:08 ejona86

Keeping this issue open to track the fix.

arjan-bal avatar Sep 19 '24 09:09 arjan-bal

We also ran into this issue. It's very easy to reproduce. Just use https://github.com/grpc/grpc-go/tree/master/examples/helloworld/greeter_client and run nc to mock the proxy:

  1. in one terminal, run "nc -l 8888";
  2. in another terminal, run "HTTPS_PROXY=http://localhost:8888 go run main.go -addr=example.com:50051"

If using 1.58.x, you will see the nc outputs: % nc -l 8888 CONNECT example.com:50051 HTTP/1.1 Host: example.com:50051 User-Agent: grpc-go/1.58.4-dev

If using 1.66.x, you will see: % nc -l 8888 CONNECT 93.184.215.14:50051 HTTP/1.1 Host: 93.184.215.14:50051 User-Agent: grpc-go/1.66.4-dev

guyni avatar Oct 17 '24 16:10 guyni

https://github.com/grpc/grpc-go/blob/master/internal/transport/http2_client.go#L181 Shouldn't it call proxyDial(ctx, addr.ServerName, grpcUA) instead of proxyDial(ctx, address, grpcUA) here?

guyni avatar Oct 17 '24 17:10 guyni

master/internal/transport/http2_client.go#L181 Shouldn't it call proxyDial(ctx, addr.ServerName, grpcUA) instead of proxyDial(ctx, address, grpcUA) here?

Hey @guyni , I think the using address is correct because as metioned in gRFC A1 case 2 , we need to send the server address (specifically the resolved server address) in the HTTP_CONNECT request. The property ServerName is used for TLS certificate validation and might be empty most of the time unless specifically set while using custom CA.

eshitachandwani avatar Oct 18 '24 05:10 eshitachandwani

@eshitachandwani If you look at the usecase no. 1 here, it mentions that the external address should be resolved at the proxy. That means the hostname should be supplied in CONNECT call.

puneet-traceable avatar Oct 18 '24 07:10 puneet-traceable

Hi @puneet-traceable, currently, gRPC-Go supports only use case 2, and I’m actively working on adding support for use case 1. Apologies for not mentioning this in my previous comment.

eshitachandwani avatar Oct 18 '24 07:10 eshitachandwani

We also ran into this issue. It's very easy to reproduce. Just use master/examples/helloworld/greeter_client and run nc to mock the proxy:

  1. in one terminal, run "nc -l 8888";
  2. in another terminal, run "HTTPS_PROXY=http://localhost:8888 go run main.go -addr=example.com:50051"

If using 1.58.x, you will see the nc outputs: % nc -l 8888 CONNECT example.com:50051 HTTP/1.1 Host: example.com:50051 User-Agent: grpc-go/1.58.4-dev

If using 1.66.x, you will see: % nc -l 8888 CONNECT 93.184.215.14:50051 HTTP/1.1 Host: 93.184.215.14:50051 User-Agent: grpc-go/1.66.4-dev

This happens because in v1.58.x we were using grpc.Dial and we used the passthrough scheme as default name resolving scheme, and so if there is no scheme, it did not get resolved on client and proxy received the unresolved name. But now, we have changed to use grpc.NewClient() with default name resolving scheme as dns and so the proxy gets the resolved name by default if no scheme is specified. As mentioned earlier, we are actively trying to resolve this.

eshitachandwani avatar Oct 18 '24 09:10 eshitachandwani

As mentioned earlier, we are actively trying to resolve this.

Thanks @eshitachandwani for handing this issue. Didn't find if it was already mentioned but this issue impacts consumers of the GCP Go SDK (https://github.com/googleapis/google-cloud-go), can we make sure any solution applies for those consumers as well (e.g. if the solution is by adding a new option, that consumers of the GCP Go SDK can enable the option)?

erezrokah avatar Nov 05 '24 09:11 erezrokah

@erezrokah : Could you please elaborate on your last comment?

How does the GCP Go SDK currently use the proxy feature?

Are they currently affected by using grpc.NewClient? Or do you think they will be affected if they make the switch?

Thanks.

easwars avatar Nov 05 '24 22:11 easwars

Are they currently affected by using grpc.NewClient?

They are currently affected due to this change https://github.com/googleapis/google-cloud-go/commit/be2d56dd036cde7c55ed60e7df852fdd5571a6cc#diff-215847f913454f2311866edbdbf41d7a3cd3879e0ed0e7aa26f4ad771dd6a1dcR296.

We're experiencing the same issue as this one only when using the Google Cloud Go SDK.

I think https://github.com/grpc/grpc-go/blob/master/Documentation/anti-patterns.md#the-wrong-way-grpcdial should say both Dial and DialContext use passthrough and switching to NewClient as is should be considered a breaking change, WDYT?

Edit

See https://github.com/googleapis/google-cloud-go/issues/11089

erezrokah avatar Nov 06 '24 09:11 erezrokah

I can think of a workaround which doesn't involve a code change in the googleapis client library:

import (
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc"
        "google.golang.org/api/option"
)

// A resolver that registers the passthrough resolver under the "dns" scheme
// to indirectly set the default resolver used by NewClient as the passthrough
// resolver.
// Note: This will shadow the dns resolver.
type customPassthroughResolver struct{}

func (r *customPassthroughResolver) Build(target resolver.Target, cc resolver.ClientConn, opts resolver.BuildOptions) (resolver.Resolver, error) {
	passthrough := resolver.Get("passthrough")
	if passthrough == nil {
		panic("Passthrough resolver not registered!")
	}
	return passthrough.Build(target, cc, opts)
}

func (r *customPassthroughResolver) Scheme() string {
	return "dns"
}

// Use a grpc client option to avoid modifying the global grpc resolver registry.
clientOption := option.WithGRPCDialOption(grpc.WithResolvers(&customPassthroughResolver{}))
// Use the client option while creating the googleapis client.

arjan-bal avatar Nov 06 '24 10:11 arjan-bal

@arjan-bal

Regarding your workaround above, @erezrokah asked (on https://github.com/googleapis/google-cloud-go/issues/11089):

Won't this workaround disable DNS resolving altogether even when explicitly set by the scheme?

quartzmo avatar Nov 08 '24 21:11 quartzmo

@quartzmo, I'm not suggesting that the Go client library uses this workaround. This workaround is hacky. The Go client library should probably revert back to using grpc.Dial instead of grpc.NewClient to fix the regression. grpc-go is working on fixing the handling of forward proxy use case in NewClient, but that could take a couple of months to land.

Won't this workaround disable DNS resolving altogether even when explicitly set by the scheme?

Yes, it will replace the DNS resolver with the passthrough resolver, even if the target URI uses dns as the scheme. Since the workaround uses a call option instead of registering a balancer globally, a user should add the call option only when they don't want to use the dns scheme. A less hacky fix would be to use the passthrough scheme in the target endpoint explicitly so that the default dns scheme is not used.

DNS resolution happens when using both passthrough and dns resolvers:

  • When using the DNS resolver, the LB policy sees multiple endpoints, one for each IP addresses returned by DNS.
  • When using passthrough, the LB policy sees a single endpoint with the unresolved hostname. The Dialer resolves the hostname into an IP address using DNS just before creating the transport.

The passthrough resolver exists for backward compatibility. grpc-go wants to move all users to the dns scheme, so the GCP client library should revert back to using NewClient once NewClient handles forward proxies correctly.

arjan-bal avatar Nov 11 '24 07:11 arjan-bal

@arjan-bal considering the recommendation to revert back to grpc.Dial is there anything to be done to prevent other consumers of grpc-go from running into the same issue?

I think the following pattern could be repeated for other consumers:

  1. Someone updates to a new version of grpc-go
  2. Linter checks fail due to the deprecation notice
  3. Someone updates to use NewClient per the linter warning without fully understanding the impact (some of it described here)
  4. After some time the breaking changes are observed

Related PRs from consumers to have created issues:

  • https://github.com/googleapis/google-cloud-go/pull/10780
  • https://github.com/open-telemetry/opentelemetry-collector/pull/11575
  • https://github.com/hypertrace/opentelemetry-collector/pull/783
  • https://github.com/gravitational/teleport/pull/46006

Other PRs I found from a quick search that might experience similar issues

  • https://github.com/DataDog/datadog-agent/pull/29624
  • https://github.com/LightBitsLabs/los-csi/pull/51
  • https://github.com/DataDog/dd-trace-go/pull/2855
  • https://github.com/omec-project/upf/pull/858
  • https://github.com/pulumi/pulumi/pull/17701
  • https://github.com/go-kratos/kratos/pull/3323
  • https://github.com/open-feature/go-sdk-contrib/pull/593
  • https://github.com/datatrails/go-datatrails-common/pull/83
  • https://github.com/vdaas/vald/pull/2639

Going over the PRs above I don't see any acknowledgement of the potential breakage.

Would it make sense to update the deprecation notice to make the possible issues more clear? There's some context in the DialContext comment https://github.com/grpc/grpc-go/blob/e2b98f96c919c49d1f9dd731c1a03fb49a6281a7/clientconn.go#L218 but that could be easily missed I think

erezrokah avatar Nov 12 '24 11:11 erezrokah

@erezrokah the fact that the dns resolver resolves the backend hostname on the client is a bug, it wasn't intentional. grpc-go doesn't use godoc comments to inform users about bugs presently. We expect users to find this GitHub issue if they're effected. Once the fix is merged in master, we will decide if a patch release is required for older grpc-go releases.

arjan-bal avatar Nov 12 '24 18:11 arjan-bal

We plan to update the release notes for recent releases.

arjan-bal avatar Nov 12 '24 18:11 arjan-bal

@erezrokah the fact that the dns resolver resolves the backend hostname on the client is a bug, it wasn't intentional

Thanks for the additional context I thought this was intended behavior. Updating the release notes and having a fixed merged would be great as currently the deprecation of grpc.Dial pushes users to grpc.NewClient with the bug.

erezrokah avatar Nov 12 '24 20:11 erezrokah

Our understanding of the problem : Before grpc.NewClient was introduced, grpc.Dial was the primary way to establish gRPC connections. With grpc.Dial, the default "passthrough" resolver simply passed the target address unchanged to the transport layer. When grpc.NewClient was introduced, it switched to the "dns" resolver by default, which resolves the target address to IP addresses before passing them to the transport. This change inadvertently altered how proxies are handled when configured via environment variables. This behavior is inconsistent with what we expect i.e resolution to happen on proxy.

The proposal to resolve the issue:

  • Remove environment variable detection from the transport layer. Instead, introduce a per-address attribute to specify the proxy CONNECT string. This change aligns with the xDS design and provides more control over proxy behavior.
  • Create a new resolver that handles proxy configuration and delegates to child resolvers for target and proxy address resolution. This resolver does the following:
    • Use the "dns" resolver for proxy address resolution if environment variables are set.
    • Use the target's "path" portion directly if the scheme is "dns" and a proxy is configured, mimicking the old "passthrough" behavior.
    • Combine the resolved target and proxy addresses, setting the proxy CONNECT string attribute.
  • Introduce a new DialOption to preserve the current behavior of grpc.Dial, where the resolved target address is used in the proxy CONNECT request. This option ensures backward compatibility for users relying on the existing behavior.

It is based on the gRFC A1

eshitachandwani avatar Nov 22 '24 06:11 eshitachandwani

@eshitachandwani I'm concerned about this: "Remove environment variable detection from the transport layer. Instead, introduce a per-address attribute to specify the proxy CONNECT string. This change aligns with the xDS design and provides more control over proxy behavior."

Some users have extensive use of the https_proxy and expect it to work. Forcing them to do code changes will be difficult, especially if it's just a thing that happens during some library upgrade. It's a pretty substantial change to how it works. In particular, users of GRPC often are unaware of the addresses they are connecting to. (For example, it is common to use a high-level package, which may be many layers away from the actual call to NewClient or Dial; adding options is not possible; knowing which addresses are being used is not possible.)

Adding a new, more flexible, feature is surely harmless, but if the rug is pulled out under https_proxy (which is part of the Go 1.0 API promise) that's pretty severe.

bushnell-deshaw avatar Nov 25 '24 14:11 bushnell-deshaw

Hey @bushnell-deshaw , sorry for the confusion. We are not pulling the plug on use of https_proxy environment for specifying proxy. I meant to say that detecting the use of proxy from the environment will be moved from the transport layer (it will be moved to the new delegating resolver logic, which will add the attribute to the address if proxy is detected). The transport layer will connect to proxy if the attribute is set with address. The users can still specify proxy using https_proxy environment, its the internal logic of detection and connecting to proxy that will change. Let me know if there is still any concern.

eshitachandwani avatar Nov 25 '24 15:11 eshitachandwani