grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

xds: improve error message when matched route on client is not of type RouteActionRoute

Open Aditya-Sood opened this issue 1 year ago • 13 comments

Fixes #5921

RELEASE NOTES: none

Aditya-Sood avatar May 03 '23 05:05 Aditya-Sood

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: Aditya-Sood / name: Aditya Sood (2ee1fdd849505165744186e0cbe6ec07d6709514, 5aca643df96199349b41e5be7943e563887a45bf, 2223c7fee4979bd08369f405ca8f4eb577867584, cd24581cdf52d773b8d07950ff79f5b3391a0822, 5f6f4160fdb2c703833527a1725eea684f2d55f7, 212c198d31fcd079298202d8431def67cba501e9, c243e0ea0874bcc2609c9748f90843d4cf88132d)

I've submitted the CLA document

Aditya-Sood avatar May 03 '23 11:05 Aditya-Sood

@Aditya-Sood

Could you give me some pointers on how I can test this code? I've run the go test tests but the vet script was throwing a command not found error which I'm working on debugging

go test google.golang.org/grpc/... from the grpc-go directory should run all the tests. Also, you should add a test for RPCs matching routes with action not equal to RouteActionRoute and ensure that RPCs fail with the appropriate error. There are bunch of tests in https://github.com/grpc/grpc-go/blob/master/xds/internal/resolver/xds_resolver_test.go which verify different scenarios, and you should be able to a new one for this scenario. Please let me know if you have more questions on the test.

easwars avatar May 04 '23 06:05 easwars

thanks for the review @easwars!

I've locally implemented all the changes that have been suggested

they are breaking a couple of tests in xds_resolver_test.go, so I'll work on resolving those + adding the routes with action not equal to RouteActionRoute test case before creating a commit and pushing it to the PR

Aditya-Sood avatar May 04 '23 12:05 Aditya-Sood

hi @easwars and @tobotg it seems like there are testcases that use RouteActionUnsupported routes on xds, which are now failing because of this additional check on route type:

1. Test/XDSResolverHTTPFilters/no_router_filter
-> Expects "no router filter present" from newInterceptor()
-> Receiving "action type of matched route is RouteActionUnsupported"

2. Test/XDSResolverHTTPFilters/ignored_after_router_filter
-> Expects err to be nil (throwing "Unexpected error from cs.SelectConfig(_)")
-> Receiving "action type of matched route is RouteActionUnsupported"

3. Test/XDSResolverHTTPFilters/NewStream_error
-> Expects err to be nil (throwing "Unexpected error from cs.SelectConfig(_)")
-> Receiving "action type of matched route is RouteActionUnsupported"

4. Test/XDSResolverHTTPFilters/all_overrides
-> Expects err to be nil (throwing "Unexpected error from cs.SelectConfig(_)")
-> Receiving "action type of matched route is RouteActionUnsupported"

I also tried flipping the order and putting the if rt.clusters == nil {...} statement before the if rt.actionType != xdsresource.RouteActionRoute {...} statement, and the tests are still failing

(Output of go test google.golang.org/grpc/... - test4.txt)

does this contradict our working assumption that route with action type which is not RouteActionRoute, the clusters field will be nil? or are these valid tests which now need to be modified to expect an error if rt.actionType != RouteActionRoute?

Aditya-Sood avatar May 05 '23 10:05 Aditya-Sood

@Aditya-Sood Thank you so much for bringing up the issue of the failing tests. We had a discussion about this, and the decision is to ensure that an incoming LDS response is NACK'ed at the xDS client if the list of HTTP filters does not contain the router filter and/or the router filter is not the last one in the list.

Would you be interested in making that change? Once we do that, we can stop checking that the router filter is the last one in the list here.

I can provide you with more information if you are interested to make that change. Please let me know.

easwars avatar May 05 '23 18:05 easwars

hi @easwars, yes I can work on that change too, please share the details with me

Aditya-Sood avatar May 05 '23 18:05 Aditya-Sood

hi @easwars, yes I can work on that change too, please share the details with me

Here you go: https://github.com/grpc/grpc-go/issues/6259.

easwars avatar May 05 '23 21:05 easwars

I'm setting the Blocked label on this one since we need to fix #6259 before getting back to this. Thanks for your help @Aditya-Sood

easwars avatar May 05 '23 21:05 easwars

So newInterceptor() shouldn't be checking for Router Filter and ignoring HTTP filters after, this was the legacy behavior that I should've deleted (the language iterated on here and deleted: https://github.com/grpc/proposal/pull/250/files) when I added the NACK check in the client for IsTerminal() here: https://github.com/grpc/grpc-go/pull/4676. newInterceptor() should simply just run the RPCs through all the filters.

zasweq avatar May 24 '23 17:05 zasweq

hi @zasweq do you mean we should delete this section of code in newInterceptor()?

if router.IsRouterFilter(filter.Filter) {
	// Ignore any filters after the router filter.  The router itself
	// is currently a nop.
	return &interceptorList{interceptors: interceptors}, nil
}

(link to serviceconfig.go file)

Aditya-Sood avatar May 30 '23 18:05 Aditya-Sood

Yeah, and also any check for the router at all. Once it gets there, the scope is just run the RPC through all the filters.

zasweq avatar Jun 02 '23 17:06 zasweq

hi @zasweq so the newInterceptor() method in serviceconfig.go is the only location where the router.IsRouterFilter() check is currently being used

the method iterates over the ConfigSelector's filters, building a client interceptor for each of them and appending them to a slice

the loop stops either when:

  1. it encounters a router filter, returning the slice of client-interceptors; or
  2. all filters have been covered, at which point it returns an error stating "no router filter was found"

considering we are removing the router-filter check, should that loop now simply return the interceptors slice after it finishes iterating? something like so:

for _, filter := range cs.httpFilterConfig {
    override := cluster.httpFilterConfigOverride[filter.Name] // cluster is highest priority
    if override == nil {
        override = rt.httpFilterConfigOverride[filter.Name] // route is second priority
    }
    if override == nil {
        override = cs.virtualHost.httpFilterConfigOverride[filter.Name] // VH is third & lowest priority
    }
    ib, ok := filter.Filter.(httpfilter.ClientInterceptorBuilder)
    if !ok {
        // Should not happen if it passed xdsClient validation.
        return nil, fmt.Errorf("filter does not support use in client")
    }
    i, err := ib.BuildClientInterceptor(filter.Config, override)
    if err != nil {
        return nil, fmt.Errorf("error constructing filter: %v", err)
    }
    if i != nil {
        interceptors = append(interceptors, i)
    }
}
return &interceptorList{interceptors: interceptors}, nil

Aditya-Sood avatar Jun 08 '23 09:06 Aditya-Sood

hi @dfawley @zasweq @easwars so introducing the additional rt.actionType == xdsresource.RouteActionRoute check is causing some existing tests to fail, because they previously didn't expect to be errored-out for failing to meet this condition

do they need to be rewritten to:

  1. be of xdsresource.RouteActionRoute type, OR
  2. expect the "matched route does not have a supported route type" error?

awaiting your inputs

Aditya-Sood avatar Jun 16 '23 07:06 Aditya-Sood

I finally caught up on this set of issues/PRs. I also spoke with @zasweq. To summarize:

  • The existing validation in the xdsClient to check that the last filter is a terminal filter will stay. We will not have an explicit check there to validate that the last filter is a router filter. Since the only supported terminal filter is the router filter, the validation in the xdsClient is good enough and does not require any changes.
  • Update newInterceptor() method to not check for terminal filter or router filter, or ignore filters after a router filter. The change suggested in https://github.com/grpc/grpc-go/pull/6248#issuecomment-1582237029 seems about right.
  • For failing tests, they should be rewritten to use xdsresource.RouteActionRoute
  • We also need a test for the unsupported route type error

@zasweq : Please let me know if I have missed anything.

@Aditya-Sood : Please let us know if you have more questions. Thanks.

easwars avatar Jun 20 '23 20:06 easwars

That all sounds good to me :).

zasweq avatar Jun 20 '23 21:06 zasweq

thanks for the clarification @easwars

  • I'm working on fixing the tests failing because of the rt.actionType != xdsresource.RouteActionRoute check

  • also regarding the changes to newInterceptor(), it seems like specifically the router type filters need to be ignored and not appended to the interceptors slice, otherwise they throw a "no filter implementation found for <router type filter>" eventually (just ignoring terminal type filters doesn't work, the check specifically needs to be for router filters) will look into it and update

Aditya-Sood avatar Jun 22 '23 19:06 Aditya-Sood

also regarding the changes to newInterceptor(), it seems like specifically the router type filters need to be ignored and not appended to the interceptors slice

I think what we are saying is that since the xdsClient ensures that the last filter in the list is a terminal filter (and the only supported terminal filter is the router filter), the config selector can simply trust that the xdsClient has done its job and run through the list of filters without any validation.

easwars avatar Jun 22 '23 22:06 easwars

hi @easwars and @zasweq

TL;DR - I need help with understanding the design/logical flow of the codebase (especially the xds resolver and router relevant parts), where can I find the low-level documentation?

I've spent more than a week trying to understand the control flow by following the error stacktraces and object definitions With that I've been able to resolve the issue of existing test cases failing (tests have been adapted in the above commit)

But I think I've hit a wall in terms of how much I can get done without gaining a deeper understanding of the logical flow 😬

For instance, if I remove the following router filter check in newInterceptor():

if router.IsRouterFilter(filter.Filter) {
    continue
}

then the FaultInjection_MaxActiveFaults test panics with a nil pointer dereference error

I spent the entire weekend trying to make sense of the error, but I'm unable to reconcile how the fault_test.go (source of the nil pointer dereference error) is related to the unmarshal_lds.go (whose "no filter implementation found for..." error is also mentioned in the stacktrace)

(PFA full error log - TestFaultInjection_MaxActiveFaults.txt)

I have tried the documentation files in the repo but they didn't help with understanding this

Could you please redirect me to the relevant low-level documentation or comms channel where I can understand the logical flow? I'm currently out of ideas on how to resolve this test failure otherwise

Kindly let me know, thank you!

Aditya-Sood avatar Jun 27 '23 17:06 Aditya-Sood

The following are some good docs to read to understand what we are trying to do with xDS and how we implement things in gRPC:

  • Overall xDS design: https://github.com/grpc/proposal/blob/master/A27-xds-global-load-balancing.md
  • Config selector design: https://github.com/grpc/proposal/blob/master/A31-xds-timeout-support-and-config-selector.md
  • Fault injection design: https://github.com/grpc/proposal/blob/master/A33-Fault-Injection.md
  • HTTP filters design: https://github.com/grpc/proposal/blob/master/A39-xds-http-filters.md

Some of the above docs are long reads, but they can given you a very good picture.

In the log file that you attached, I do see the following log line:

        fault_test.go:617: RPC error: rpc error: code = Unavailable desc = name resolver error: no filter implementation found for "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"

This indicates that the router package has not been imported. You need a line like this to import the router package which registers the router filter.

_ "google.golang.org/grpc/xds/internal/httpfilter/router" // Register the router filter.

easwars avatar Jun 28 '23 23:06 easwars

Apart from what is mentioned in the docs listed above, a very high level flow would look like this:

  • User application creates a grpc.ClientConn and uses the xds:/// scheme in the dial target
  • This results in gRPC instantiating an xDS resolver
  • The xDS resolver creates an xDS client to the management server (or xDS server) to receive configuration
  • The xDS resolver registers a watch for two types of resources from the xDS client, the Listener resource and the RouteConfiguration resource
    • The xDS client in turn converts these watch requests into xDS requests to the management server
    • When the management server responds with xDS resources, the xDS client takes care of unmarshaling the response, and converting it to an internal representation which is then pushed to the xDS resolver
  • The xDS resolver upon receiving configuration from the xDS client creates a service config and pushes this to gRPC
    • This service config contains the LB policy to use
    • The xDS resolver also builds a config selector (which is used at RPC time to determine RPC method specific configuration) and sends it to gRPC
  • gRPC instantiates the LB policy specified in the service config
    • For the xDS case, the top-level LB policy will always be the cluster manager LB policy
    • There will be a tree of LB policies under this, and each of them perform a specific role
    • Put together though, they determine the configuration for the xDS cluster and the list of endpoints and also balance load across these endpoints
  • When the user application makes an RPC, gRPC uses the config selector built by the xDS resolver to determine the method specific configuration

I know this description is very very high level, but I hope this and the above mentioned docs help you, and if you have subsequent questions, please let us know.

easwars avatar Jun 29 '23 00:06 easwars

thanks a lot @easwars! I'll try to finish these over the weekend and get back

Aditya-Sood avatar Jun 30 '23 04:06 Aditya-Sood

This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar Jul 06 '23 04:07 github-actions[bot]

sorry for the delay in resolution, have been caught up with my day job I've been going through the shared docs, will try to close the PR over the weekend

Aditya-Sood avatar Jul 14 '23 04:07 Aditya-Sood

hi @easwars and @zasweq apologies for the delay, I had written the test case last week but wasn't fully sure that my logic for causing the error was in tune with the actual code flow, so was still going through the docs this week

  1. thanks to @easwars's bug fix (blank identifier import) and shared documentation, I was able to understand the high-level flow and write the new test case by changing the route config update for the new test case with a RouteActionUnsupported ActionType - link to commit

  2. can this PR be re-opened? if not I will create a new one

  3. one query - in the resolver.SelectConfig() code, would it be semantically better to replace the rt.actionType != xdsresource.RouteActionRoute check with rt.actionType == xdsresource.RouteActionUnsupported, since the latter is already used to represent routing types not currently supported by gRPC?

Aditya-Sood avatar Jul 22 '23 16:07 Aditya-Sood

  1. one query - in the resolver.SelectConfig() code, would it be semantically better to replace the rt.actionType != xdsresource.RouteActionRoute check with rt.actionType == xdsresource.RouteActionUnsupported, since the latter is already used to represent routing types not currently supported by gRPC?

resolver.SelectConfig() happens on the client side and on that that the only supporting action type is xdsresource.RouteActionRoute.

easwars avatar Aug 01 '23 16:08 easwars

hi @easwars, thanks for the review! I have included the above changes in the latest commit

  1. I've changed the variable name to errUnsupportedClientRouteAction instead of the suggested errUnsupportedRouteAction - I felt the latter could be confused as an error specifically for RouteActionUnsupported action type However if you feel the clarification is unnecessary kindly let me know, I'll change it accordingly

  2. For the "NewStream error" testcase I've removed the second route and RPC response - not sure why it was added in the first place, but the testcase passes after the change as well

  3. One query: As clarified in the previous comment, the client requests will always be assigned the RouteActionRoute action type by the xds client (source code snippet) Since that is the case, is the additional check introduced by this PR and the testcases for RouteActionUnsupported/NonForwardingAction primarily from a completeness perspective? Or is it possible to have client requests which are not of RouteActionRoute type?

Aditya-Sood avatar Aug 02 '23 04:08 Aditya-Sood

  1. One query: As clarified in the previous comment, the client requests will always be assigned the RouteActionRoute action type by the xds client (source code snippet) Since that is the case, is the additional check introduced by this PR and the testcases for RouteActionUnsupported/NonForwardingAction primarily from a completeness perspective? Or is it possible to have client requests which are not of RouteActionRoute type?

Currently, the only supported route action type on the client is RouteActionRoute and the only supported route action type on the server is NonForwardingAction. The gRPC client and server share the same xdsClient which is the one which receives the resource from the management server and parses it. When parsing, the xdsClient does not have enough information to know whether the route configuration will be used by the client or the server (because the same application can have a server and a client). So, it will accept routes with either of these two values. And therefore we need extra validation at the client and server sides.

Yes, including test cases for RouteActionUnsupported and NonForwardingAction is for completeness.

easwars avatar Aug 02 '23 17:08 easwars

thank you for the clarification! I've moved the RouteActionNonForwardingAction testcase as well

Aditya-Sood avatar Aug 02 '23 20:08 Aditya-Sood

hi @easwars @zasweq are there any further changes required for this PR?

Aditya-Sood avatar Aug 07 '23 16:08 Aditya-Sood