Add fqdn perf test

Open marseel opened this issue 1 year ago • 3 comments

Start measuring FQDN performance. This test issues DNS requests to S3 at 30 qps; each request returns a different set of IPs. We track client-side latency, the latencies reported by cilium-agent metrics, and CPU/memory usage, and we also gather profiling information.
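For readers who want a feel for the client side of such a test, here is a minimal sketch of a fixed-rate lookup loop that records per-request latency and errors. It is not the code added by this PR: `run_lookups`, `percentile`, and `default_resolve` are hypothetical names, the result keys only mirror the client-side metric names reported in this thread, and the nearest-rank percentile over raw samples is an assumption (the real test may derive percentiles differently, for example from histograms).

```python
import itertools
import math
import socket
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact in floating point.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def default_resolve(name):
    # System resolver; raises socket.gaierror (a subclass of OSError) on failure.
    return socket.getaddrinfo(name, None)


def run_lookups(names, qps, total, resolve=default_resolve):
    """Issue `total` lookups at a fixed rate of `qps`, cycling through `names`."""
    if total < 1:
        raise ValueError("total must be at least 1")
    latencies, errors = [], 0
    start = time.monotonic()
    for i, name in zip(range(total), itertools.cycle(names)):
        # Sleep until this request's scheduled slot to hold the target rate.
        delay = start + i / qps - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        t0 = time.monotonic()
        try:
            resolve(name)
        except OSError:
            errors += 1
        # Failed lookups are timed too, so timeouts show up in the latency tail.
        latencies.append(time.monotonic() - t0)
    return {
        "DNS Lookup Count": total,
        "DNS Error Count": errors,
        "DNS Error Percentage": 100 * errors / total,
        "DNS Lookup Latency - Perc50": percentile(latencies, 50),
        "DNS Lookup Latency - Perc99": percentile(latencies, 99),
    }
```

Passing the resolver in as a parameter keeps the loop testable without network access; a call such as `run_lookups(names, qps=30, total=300)` would approximate ten seconds of the load described above, with `names` supplied by the caller.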

marseel avatar May 13 '24 14:05 marseel

Example results, based on Cilium metrics:

{
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "DNS Proxy dataplane time - Perc50": 0.0025064447433310916,
        "DNS Proxy dataplane time - Perc99": 0.0049627605917955606,
        "DNS Proxy policy check time - Perc50": 0.0025,
        "DNS Proxy policy check time - Perc99": 0.00495,
        "DNS Proxy policy generation time - Perc50": 0.0028914533229893974,
        "DNS Proxy policy generation time - Perc99": 0.09597826086956514,
        "DNS Proxy policy semaphore time - Perc50": 0.0025,
        "DNS Proxy policy semaphore time - Perc99": 0.00495,
        "DNS Proxy processing time - Perc50": 0.0028933238452581184,
        "DNS Proxy processing time - Perc99": 0.09811848958333333,
        "DNS Proxy total time - Perc50": 0.0029030897053096195,
        "DNS Proxy total time - Perc99": 0.11715346534653429,
        "DNS Proxy upstream time - Perc50": 0.002575138185168125,
        "DNS Proxy upstream time - Perc99": 0.021066249999999915
      },
      "unit": "s"
    }
  ]
}

Based on client-side metrics:

{
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "DNS Error Count": 0,
        "DNS Error Percentage": 0,
        "DNS Lookup Count": 10734,
        "DNS Lookup Latency - Perc50": 0.00993109151047409,
        "DNS Lookup Latency - Perc99": 0.17374999999999893,
        "DNS Timeout Count": 0
      },
      "unit": "s"
    }
  ]
}

CPU/mem usage:

50th percentile
    {
      "Name": "cilium-pgg9c/cilium-agent",
      "CPU": 0.483284872,
      "Mem": 236933120
    },
    {
      "Name": "cilium-v5bq6/cilium-agent",
      "CPU": 0.634017295,
      "Mem": 228868096
    },
99th percentile
    {
      "Name": "cilium-pgg9c/cilium-agent",
      "CPU": 0.518770507,
      "Mem": 238755840
    },
    {
      "Name": "cilium-v5bq6/cilium-agent",
      "CPU": 0.716143043,
      "Mem": 243933184
    },

CPU pprof: (image not included)

marseel avatar May 15 '24 11:05 marseel

One interesting observation: when I increased the number of distinct DNS names from 10 to 100 without changing the qps, most of the requests started to fail, timing out during policy generation:

{
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "DNS Proxy dataplane time - Perc50": 0.0027479629109300363,
        "DNS Proxy dataplane time - Perc99": 0.6157446808510627,
        "DNS Proxy policy check time - Perc50": 0.0025,
        "DNS Proxy policy check time - Perc99": 0.00495,
        "DNS Proxy policy generation time - Perc50": 10,
        "DNS Proxy policy generation time - Perc99": 10,
        "DNS Proxy policy semaphore time - Perc50": 0.0025,
        "DNS Proxy policy semaphore time - Perc99": 0.00495,
        "DNS Proxy processing time - Perc50": 10,
        "DNS Proxy processing time - Perc99": 10,
        "DNS Proxy total time - Perc50": 10,
        "DNS Proxy total time - Perc99": 10,
        "DNS Proxy upstream time - Perc50": 0.002936055238667067,
        "DNS Proxy upstream time - Perc99": 0.04322471910112352
      },
      "unit": "s"
    }
  ]
}
{
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "DNS Error Count": 8127.043253333333,
        "DNS Error Percentage": 80.77929442324005,
        "DNS Lookup Count": 10060.8,
        "DNS Lookup Latency - Perc50": 10,
        "DNS Lookup Latency - Perc99": 10,
        "DNS Timeout Count": 8127.043253333333
      },
      "unit": "s"
    }
  ]
}
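As a quick consistency check on the client-side numbers above, the reported error percentage can be reproduced from the two counts. The snippet only redoes the arithmetic on the values shown; the variable names are illustrative.

```python
# Values copied from the client-side results above.
error_count = 8127.043253333333
timeout_count = 8127.043253333333
lookup_count = 10060.8

# Share of lookups that failed, as a percentage.
error_pct = 100 * error_count / lookup_count
print(round(error_pct, 2))

# Every reported error is a timeout in this run.
print(error_count == timeout_count)
```

The result agrees with the reported `DNS Error Percentage` of roughly 80.78, and the error count equals the timeout count, so all failures in this run were timeouts.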

marseel avatar May 16 '24 14:05 marseel

/test

marseel avatar May 21 '24 11:05 marseel

Rebased on main to pull in the change that makes EKS clusters not use preemptibles.

marseel avatar May 28 '24 08:05 marseel

/test

marseel avatar May 28 '24 08:05 marseel

Not sure why it doesn't get the ready-to-merge label: all required tests passed, reviews are in, and there are no pending comments or blocking labels. Marking as ready-to-merge.

marseel avatar May 28 '24 13:05 marseel