xds relay fetches incorrect eds resources

Open jyotimahapatra opened this issue 5 years ago • 4 comments

Snapshot of VersionInfo on the upstream envoymanager and on xds-relay. The cache keys in the xds-relay cache do not map to the responses they should contain: for example, the key for pyexample2workers holds the kitchensink response and version.

➜  ~ em exec --stdin --tty envoymanager-main-754f8c6b74-ck2pr -n envoymanager-staging  --container envoymanager-service-gojson -- curl -s "0:6070/entry_dump?key=pyexample2_staging_eds" | head
[
{
  "key": "pyexample2_staging_eds",
  "version": "93efeb417a01422a7f856a1d14d70f2cbeefde0d",
  "resource":
{
  "clusterName": "pyexample2",
  "endpoints": [
    {
      "locality": {
➜  ~ em exec --stdin --tty envoymanager-main-754f8c6b74-ck2pr -n envoymanager-staging  --container envoymanager-service-gojson -- curl -s "0:6070/entry_dump?key=kitchensink_staging_eds" | head
[
{
  "key": "kitchensink_staging_eds",
  "version": "23878f15191ae03cdc018a012f2e5ddce2c2db40",
  "resource":
{
  "clusterName": "kitchensink",
  "endpoints": [
    {
      "locality": {
➜  ~ em exec --stdin --tty envoymanager-main-754f8c6b74-ck2pr -n envoymanager-staging  --container envoymanager-service-gojson -- curl -s "0:6070/entry_dump?key=pyexample2workers_staging_eds" | head
[
{
  "key": "pyexample2workers_staging_eds",
  "version": "d507920d74ddae6b002c24a899313aa656a78756",
  "resource":
{
  "clusterName": "pyexample2workers",
  "endpoints": [
    {
      "locality": {
➜  ~ em exec --stdin --tty xdsrelay-main-7bfc54dd8f-xjbqt -n xdsrelay-staging  --container xdsrelay-service-gojson -- curl -s 0:6070/cache/v3-pyexample2workers-staging-iad_eds | head
{
  "Cache": [
    {
      "Key": "v3-pyexample2workers-staging-iad_eds",
      "Resp": {
        "VersionInfo": "23878f15191ae03cdc018a012f2e5ddce2c2db40",
        "Resources": {
          "Endpoints": [
            {
              "cluster_name": "kitchensink",
➜  ~ em exec --stdin --tty xdsrelay-main-7bfc54dd8f-xjbqt -n xdsrelay-staging  --container xdsrelay-service-gojson -- curl -s 0:6070/cache/v3-pyexample2-staging-iad_eds | head
{
  "Cache": [
    {
      "Key": "v3-pyexample2-staging-iad_eds",
      "Resp": {
        "VersionInfo": "d507920d74ddae6b002c24a899313aa656a78756",
        "Resources": {
          "Endpoints": [
            {
              "cluster_name": "pyexample2workers",
➜  ~ em exec --stdin --tty xdsrelay-main-7bfc54dd8f-xjbqt -n xdsrelay-staging  --container xdsrelay-service-gojson -- curl -s 0:6070/cache/v3-kitchensink-staging-iad_eds | head
{
  "Cache": [
    {
      "Key": "v3-kitchensink-staging-iad_eds",
      "Resp": {
        "VersionInfo": "93efeb417a01422a7f856a1d14d70f2cbeefde0d",
        "Resources": {
          "Endpoints": [
            {
              "cluster_name": "pyexample2",
➜  ~

jyotimahapatra • Sep 17 '20 18:09

Found the reason for this. The aggregator rule in our private, Lyft-specific repo had a bug: EDS requests were cached on the service name only, without the resource name. So when svcA asked for EDS, every response landed under the same key, and the last EDS response won, overwriting the EDS responses cached earlier for all other services.
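The overwrite can be reproduced with a small sketch. This is an illustration only, not xds-relay code: the `DiscoveryRequest` class, the key formats, and the helper names are all hypothetical, and the upstream response is faked by echoing the requested cluster name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscoveryRequest:
    node_cluster: str      # service that sent the request, e.g. "svcA"
    resource_names: tuple  # EDS clusters being requested


def key_by_service(req):
    # Buggy rule (hypothetical key format): the key ignores
    # which resource was requested.
    return f"{req.node_cluster}_eds"


def key_with_resource(req):
    # Fixed rule (hypothetical key format): the first resource
    # name becomes part of the key.
    return f"{req.node_cluster}_eds_{req.resource_names[0]}"


def fill_cache(requests, key_fn):
    cache = {}
    for req in requests:
        # Stand-in for the upstream response: echo the requested cluster.
        cache[key_fn(req)] = {"cluster_name": req.resource_names[0]}
    return cache


requests = [
    DiscoveryRequest("svcA", ("pyexample2",)),
    DiscoveryRequest("svcA", ("kitchensink",)),
]

buggy = fill_cache(requests, key_by_service)     # one entry, last write wins
fixed = fill_cache(requests, key_with_resource)  # one entry per resource
```

With the service-only key, both requests share a single cache entry and the second response replaces the first; once the resource name is part of the key, each response gets its own entry.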

jyotimahapatra • Sep 17 '20 21:09

After adding rules that include the resource name in the cache key for EDS, the cache keys map to the correct responses.

  - rules:
    - match:
        request_type_match:
          types:
            - "type.googleapis.com/envoy.api.v2.RouteConfiguration"
            - "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"
            - "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"
            - "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"
      result:
        resource_names_fragment:
          element: 0
          action: { exact: true }

jyotimahapatra • Sep 17 '20 22:09

Note that these rules will likely apply to all users of the project, not only to our setup.

jyotimahapatra • Sep 17 '20 22:09

In the Envoy Slack we discussed two alternatives to fix this:

  • define an aggregation rules checker, similar to Envoy's router check tool.
  • implicitly add the resource name to the cache key when it is not already present.
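Both alternatives can be sketched in a few lines. This is a rough illustration under stated assumptions, not a proposal for the actual implementation: the key format, the `(node_cluster, resource_names)` sample shape, and every function name here are hypothetical.

```python
from collections import defaultdict


def with_resource_name(key, resource_names):
    # Second alternative: append the first resource name
    # unless the key already ends with it.
    if not resource_names:
        return key
    name = resource_names[0]
    return key if key.endswith(f"_{name}") else f"{key}_{name}"


def find_collisions(key_fn, samples):
    # First alternative: run sample requests through the key function
    # and report keys shared by requests for different resources.
    seen = defaultdict(set)
    for node_cluster, resource_names in samples:
        seen[key_fn(node_cluster, resource_names)].add(tuple(resource_names))
    return {key: names for key, names in seen.items() if len(names) > 1}


def buggy_key(node_cluster, resource_names):
    # Hypothetical rule that forgets the resource name.
    return f"{node_cluster}_eds"


def patched_key(node_cluster, resource_names):
    # The same rule with the resource name added implicitly.
    return with_resource_name(buggy_key(node_cluster, resource_names), resource_names)


samples = [("svcA", ["pyexample2"]), ("svcA", ["kitchensink"])]
collisions_before = find_collisions(buggy_key, samples)
collisions_after = find_collisions(patched_key, samples)
```

A checker of this kind would only catch collisions that the sample requests exercise, whereas the implicit addition changes the key for every request, including configurations that were already correct.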

eapolinario • Sep 18 '20 05:09