[BUG] Few issues on prometheus_traffic test case
Describe the bug A couple of issues observed during running the prometheus_traffic test.
Issue 1
The match process function sometimes matches the wrong process, for example, when my cluster don't have Prometheus installed, the function will still match with other process which has the string prometheus in the container running args, see below matched with opa which is installed by the cnf-testsuite:
I, [2022-09-06 16:23:42 +02:00 #287532] INFO -- cnf-testsuite: Match Pod: {node: {"apiVersion" => "v1", "kind" => "Node"
...
"pod":{
"apiVersion""=>""v1",
"kind""=>""Pod",
...
"name""=>""gatekeeper-audit-66c5dd65b7-k9hnm",
"namespace""=>""cnf-testsuite",
...
I, [2022-09-06 16:23:43 +02:00 #287532] INFO -- cnf-testsuite: service_by_pod pod_name: gatekeeper-audit-66c5dd65b7-k9hnm
I think it's because the gatekeeper is running with --prometheus-port=8888 container args. Similar has happened even when I installed the Prometheus (along with Alertmanager), the match function matches for the alertmanager pod instead of prometheus pod.
Issue 2
When there is a matching pod, the test is trying to find the service associated with the Prometheus pod, however the function KubectlClient::Get.service_by_pod(match[:pod]) is only looking for the service in default namespace while looking for the pods in all namespaces, so if I have installed prometheus in another namespace, there will never be a match:
#utils/kubectl_client/kubectl_client.cr
def self.service_by_pod(pod : (JSON::Any | Hash(String | Nil, String)))
Log.info { "service_by_pod pod: #{pod}" }
services = KubectlClient::Get.services
# The function self.services has default all_namespace set to False
def self.services(all_namespaces : Bool = false) : JSON::Any
cmd = "kubectl get services -o json"
if all_namespaces
cmd = "#{cmd} -A"
end
result = ShellCmd.run(cmd, "KubectlClient::Get.services")
response = result[:output]
Issue3 The service_url generation also has some problems:
- It assumes the service will be run only in default namespace
- It assumes the service will be run on port 80.
If not matching the above assumptions, the curl will timeout and the test will be skipped.
service = KubectlClient::Get.service_by_pod(match[:pod])
service_url = service.dig("metadata", "name")
# The service namespace is hardcoded to default
service_namespace = "default"
if service.dig?("metadata", "namespace")
service_namespace = service.dig("metadata", "namespace")
end
Log.info { "service_url: #{service_url}"}
ClusterTools.install
# There is no code to get the service port and the curl will goes to port 80 by default
prom_api_resp = ClusterTools.exec("curl http://#{service_url}.#{service_namespace}.svc.cluster.local/api/v1/targets?state=active")
To Reproduce
- Run the test without Prometheus installed and gatekeeper installed:
./cnf-testsuite prometheus_traffic -l info - Install Prometheus in another namespace other than default then run the test
./cnf-testsuite prometheus_traffic -l info
Expected behavior A clear and concise description of what you expected to happen.
- The test should match the correct Pormetheus pod.
- The test should not restrict to check the service in default namespace
- The service_url generation should not limiting to port 80 and default namespace.
Device (please complete the following information):
- OS [e.g. Linux, iOS, Windows, Android]: Linux
- Distro [e.g. Ubuntu]: Ubuntu
- Version [e.g. 18.04]: 18.04
- Architecture [e.g. x86, arm]: x86
- Browser [e.g. chrome, safari]: N/A
Another side effect observed is that the open_metrics test case will be always skipped as the prometheus_traffic is a dependency task of open_metrics.
desc "Does the CNF emit prometheus open metric compatible traffic"
task "open_metrics", ["prometheus_traffic"] do |_, args|
Log.info { "Running: open_metrics" }
Thanks for reporting this @iElephant. I'll look into it and share updates.
@iElephant The improvements in the PR linked above are in the main branch and should resolve the issues 1, 2 & 3 mentioned on the ticket description. The other issue about open_metrics mentioned in the comment has been moved to a new issue as #1644.
Looking into #1644.
AC in #1644