xds-relay fetches incorrect EDS resources
Snapshots of VersionInfo on the upstream envoymanager and on xds-relay. We can observe that cache keys in the xds-relay cache do not map correctly to the responses they contain.
➜ ~ em exec --stdin --tty envoymanager-main-754f8c6b74-ck2pr -n envoymanager-staging --container envoymanager-service-gojson -- curl -s "0:6070/entry_dump?key=pyexample2_staging_eds" | head
[
{
"key": "pyexample2_staging_eds",
"version": "93efeb417a01422a7f856a1d14d70f2cbeefde0d",
"resource":
{
"clusterName": "pyexample2",
"endpoints": [
{
"locality": {
➜ ~ em exec --stdin --tty envoymanager-main-754f8c6b74-ck2pr -n envoymanager-staging --container envoymanager-service-gojson -- curl -s "0:6070/entry_dump?key=kitchensink_staging_eds" | head
[
{
"key": "kitchensink_staging_eds",
"version": "23878f15191ae03cdc018a012f2e5ddce2c2db40",
"resource":
{
"clusterName": "kitchensink",
"endpoints": [
{
"locality": {
➜ ~ em exec --stdin --tty envoymanager-main-754f8c6b74-ck2pr -n envoymanager-staging --container envoymanager-service-gojson -- curl -s "0:6070/entry_dump?key=pyexample2workers_staging_eds" | head
[
{
"key": "pyexample2workers_staging_eds",
"version": "d507920d74ddae6b002c24a899313aa656a78756",
"resource":
{
"clusterName": "pyexample2workers",
"endpoints": [
{
"locality": {
➜ ~ em exec --stdin --tty xdsrelay-main-7bfc54dd8f-xjbqt -n xdsrelay-staging --container xdsrelay-service-gojson -- curl -s 0:6070/cache/v3-pyexample2workers-staging-iad_eds | head
{
"Cache": [
{
"Key": "v3-pyexample2workers-staging-iad_eds",
"Resp": {
"VersionInfo": "23878f15191ae03cdc018a012f2e5ddce2c2db40",
"Resources": {
"Endpoints": [
{
"cluster_name": "kitchensink",
➜ ~ em exec --stdin --tty xdsrelay-main-7bfc54dd8f-xjbqt -n xdsrelay-staging --container xdsrelay-service-gojson -- curl -s 0:6070/cache/v3-pyexample2-staging-iad_eds | head
{
"Cache": [
{
"Key": "v3-pyexample2-staging-iad_eds",
"Resp": {
"VersionInfo": "d507920d74ddae6b002c24a899313aa656a78756",
"Resources": {
"Endpoints": [
{
"cluster_name": "pyexample2workers",
➜ ~ em exec --stdin --tty xdsrelay-main-7bfc54dd8f-xjbqt -n xdsrelay-staging --container xdsrelay-service-gojson -- curl -s 0:6070/cache/v3-kitchensink-staging-iad_eds | head
{
"Cache": [
{
"Key": "v3-kitchensink-staging-iad_eds",
"Resp": {
"VersionInfo": "93efeb417a01422a7f856a1d14d70f2cbeefde0d",
"Resources": {
"Endpoints": [
{
"cluster_name": "pyexample2",
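Correlating the version hashes above, every relay key holds another service's EDS:
- v3-pyexample2workers-staging-iad_eds serves version 23878f15…, kitchensink's endpoints (upstream key kitchensink_staging_eds)
- v3-pyexample2-staging-iad_eds serves version d507920d…, pyexample2workers' endpoints (upstream key pyexample2workers_staging_eds)
- v3-kitchensink-staging-iad_eds serves version 93efeb41…, pyexample2's endpoints (upstream key pyexample2_staging_eds)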
Found the reason for this. The aggregator rule in our Lyft-specific private repo had a bug that caused EDS requests to be cached keyed on the service name alone, without the requested resource name. So when svcA asked for EDS, the last EDS response won and overwrote the entries for all previously requested services.
After adding rules that fold the requested resource name into the cache key, the cache is happy now:
- rules:
  - match:
      request_type_match:
        types:
        - "type.googleapis.com/envoy.api.v2.RouteConfiguration"
        - "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"
        - "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"
        - "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"
    result:
      resource_names_fragment:
        element: 0
        action: { exact: true }
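With this fragment, the first requested resource name (element: 0), i.e. the route configuration name for RDS or the cluster name for EDS, becomes part of the aggregated cache key, so ClusterLoadAssignments for different clusters can no longer collide on one entry. The resulting key would look something like v3-pyexample2-staging-iad_pyexample2_eds; that key is illustrative, as the exact layout depends on the other configured fragments.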
An important aspect here is that rules like these will likely be needed by all users of the project, since any keyer configuration that omits the resource name for EDS is susceptible to the same collision.
In the Envoy Slack we mentioned two alternatives to fix this:
- define an aggregation rules checker, similar to Envoy's router check tool.
- implicitly add the resource name to the cache key when it is not already present (a sketch of this follows below).
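A minimal sketch of the second alternative, assuming a hypothetical post-processing step in the keyer. ensureResourceName and the surrounding wiring are illustrative, not xds-relay's actual API:

package keyer

import (
	"strings"

	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

// ensureResourceName appends the first requested resource name to an
// aggregated cache key when the configured rules did not already fold it
// in. Names here are illustrative; xds-relay's real keyer API may differ.
func ensureResourceName(key string, req *discoveryv3.DiscoveryRequest) string {
	names := req.GetResourceNames()
	// Wildcard requests (e.g. CDS/LDS) carry no resource names; leave
	// the key untouched.
	if len(names) == 0 {
		return key
	}
	// Crude containment check: if some rule already emitted the resource
	// name into the key, do nothing.
	if strings.Contains(key, names[0]) {
		return key
	}
	return key + "_" + names[0]
}

This keeps explicitly configured rules authoritative while guaranteeing that two EDS requests for different clusters can never map to the same cache entry.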