Add fqdn perf test
Start measuring fqdn performance. This test runs 30 qps of DNS requests to S3, which returns a different set of IPs with each request. We keep track of client-side latency, latencies reported by cilium-agent metrics, and CPU/memory usage, while also gathering profiling information.
Example results.
Based on Cilium metrics:
{
"version": "v1",
"dataItems": [
{
"data": {
"DNS Proxy dataplane time - Perc50": 0.0025064447433310916,
"DNS Proxy dataplane time - Perc99": 0.0049627605917955606,
"DNS Proxy policy check time - Perc50": 0.0025,
"DNS Proxy policy check time - Perc99": 0.00495,
"DNS Proxy policy generation time - Perc50": 0.0028914533229893974,
"DNS Proxy policy generation time - Perc99": 0.09597826086956514,
"DNS Proxy policy semaphore time - Perc50": 0.0025,
"DNS Proxy policy semaphore time - Perc99": 0.00495,
"DNS Proxy processing time - Perc50": 0.0028933238452581184,
"DNS Proxy processing time - Perc99": 0.09811848958333333,
"DNS Proxy total time - Perc50": 0.0029030897053096195,
"DNS Proxy total time - Perc99": 0.11715346534653429,
"DNS Proxy upstream time - Perc50": 0.002575138185168125,
"DNS Proxy upstream time - Perc99": 0.021066249999999915
},
"unit": "s"
}
]
}
Based on client-side metrics:
{
"version": "v1",
"dataItems": [
{
"data": {
"DNS Error Count": 0,
"DNS Error Percentage": 0,
"DNS Lookup Count": 10734,
"DNS Lookup Latency - Perc50": 0.00993109151047409,
"DNS Lookup Latency - Perc99": 0.17374999999999893,
"DNS Timeout Count": 0
},
"unit": "s"
}
]
}
CPU/mem usage:
50th percentile
{
"Name": "cilium-pgg9c/cilium-agent",
"CPU": 0.483284872,
"Mem": 236933120
},
{
"Name": "cilium-v5bq6/cilium-agent",
"CPU": 0.634017295,
"Mem": 228868096
},
99th percentile
{
"Name": "cilium-pgg9c/cilium-agent",
"CPU": 0.518770507,
"Mem": 238755840
},
{
"Name": "cilium-v5bq6/cilium-agent",
"CPU": 0.716143043,
"Mem": 243933184
},
CPU pprof:
One interesting observation: when I increased the number of distinct DNS names from 10 to 100, without changing the qps, most of the requests started to fail, timing out on policy generation:
{
"version": "v1",
"dataItems": [
{
"data": {
"DNS Proxy dataplane time - Perc50": 0.0027479629109300363,
"DNS Proxy dataplane time - Perc99": 0.6157446808510627,
"DNS Proxy policy check time - Perc50": 0.0025,
"DNS Proxy policy check time - Perc99": 0.00495,
"DNS Proxy policy generation time - Perc50": 10,
"DNS Proxy policy generation time - Perc99": 10,
"DNS Proxy policy semaphore time - Perc50": 0.0025,
"DNS Proxy policy semaphore time - Perc99": 0.00495,
"DNS Proxy processing time - Perc50": 10,
"DNS Proxy processing time - Perc99": 10,
"DNS Proxy total time - Perc50": 10,
"DNS Proxy total time - Perc99": 10,
"DNS Proxy upstream time - Perc50": 0.002936055238667067,
"DNS Proxy upstream time - Perc99": 0.04322471910112352
},
"unit": "s"
}
]
}
{
"version": "v1",
"dataItems": [
{
"data": {
"DNS Error Count": 8127.043253333333,
"DNS Error Percentage": 80.77929442324005,
"DNS Lookup Count": 10060.8,
"DNS Lookup Latency - Perc50": 10,
"DNS Lookup Latency - Perc99": 10,
"DNS Timeout Count": 8127.043253333333
},
"unit": "s"
}
]
}
/test
Rebased on main to pull in the change that stops EKS clusters from using preemptibles.
/test
Not sure why it doesn't get the ready-to-merge label: all required tests passed, reviews are in, and there are no pending comments or blocking labels. Marking as ready-to-merge.