fluentd-kubernetes-sumologic
Missing collector for scheduled (success|failure) events
Primary Concern
I'd like some help to understand whether I've missed something when following the README and the guides on help.sumologic.com ... kubernetes.
I seem to have most dashboards working, with the exception of scheduler-related panels like Kubernetes - Overview -> Pods Scheduled By Namespace, which is driven by the following query:
```
_sourceCategory = *kube-scheduler*
| timeslice 1h
| parse "Successfully assigned * to *\"" as name2,node
| parse "reason: '*'" as reason
| parse "type: '*'" as normal
| parse "Name:\\\"*\\\"" as name
| parse "Namespace:\\\"*\\\"" as namespace
| parse "Kind:\\\"*\\\"" as kind
| count by _timeslice, namespace
| transpose row _timeslice column namespace
| fillmissing timeslice(1h)
```
The problem is that the line this query is driven by is not logged by the scheduler but emitted as a Kubernetes event. The only piece of the documentation I can see which would be able to push this to Sumo is the sumologic-k8s-api script, which is noticeably lacking any calls to /api/v1/events, as well as the RBAC role required to make them.
I've tested a fix which adds these log lines and can submit it as a PR against sumologic-k8s-api, but I feel like I've missed something obvious.
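To illustrate, here is a minimal sketch (assuming kubectl proxy is listening on 127.0.0.1:8001, as in the CronJob further down): the Scheduled events carry exactly the message the dashboard query parses, but nothing in the current setup ships them to Sumo.

```python
# Illustrative sketch, not part of the repo: list cluster events via the
# API and print the scheduler's "Successfully assigned <pod> to <node>"
# messages, which are what the dashboard query above tries to parse.
import requests

events = requests.get("http://127.0.0.1:8001/api/v1/events").json()
for event in events["items"]:
    if event.get("reason") == "Scheduled":
        print(event["message"])
```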
Secondary Concern
I see some of the panels are driven by queries whose field extractions don't fill me with confidence that I've got things configured correctly, e.g. Kubernetes - Controller Manager -> Event Severity Trend, which uses the following query:
```
_sourceCategory = *kube-controller-manager*
| parse "\"message\":\"*\"" as message
| parse "\"source\":\"*.*:*\"" as resource,resource_action,resource_code
| parse "\"severity\":\"*\"" as severity
| fields - resource_action, resource_code
| timeslice 1h
| count by _timeslice, severity
| transpose row _timeslice column severity
| fillmissing timeslice(1h)
```
That query matches this log line:
```json
{
  "timestamp": 1528785188171,
  "severity": "I",
  "pid": "1",
  "source": "round_trippers.go:439",
  "message": "Response Status: 200 OK in 2 milliseconds"
}
```
Here `resource_action` and `resource_code` would match `go` and `439` respectively. Is this correct?
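As a sanity check on my reading of that parse, here is an illustrative Python approximation (the sample log string is made up, and Sumo's * wildcard is approximated with non-greedy regex captures):

```python
# Rough equivalent of:
#   parse "\"source\":\"*.*:*\"" as resource,resource_action,resource_code
import re

log = '{"severity":"I","source":"round_trippers.go:439","message":"Response Status: 200 OK in 2 milliseconds"}'
m = re.search(r'"source":"(.*?)\.(.*?):(.*?)"', log)
resource, resource_action, resource_code = m.groups()
print(resource, resource_action, resource_code)  # -> round_trippers go 439
```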
@keir-rex can you provide the following information?
- What version of k8s?
- Where is it running?
- Managed service (GKE/EKS), or do you manage the cluster yourself (kops/kubeadm)?
- Can you share your YAML?
These logs did exist at some point; it's very possible they have been tweaked in a new release, or that the underlying logging of the scheduler has changed, so this will help me figure out what is going on.
@frankreno
- v1.9.6 (kubectl version output below)
- AWS
- kops
- Provided below

sumologic-k8s-api

I rebuilt your image to also hit /api/v1/events; you can see the diff here:
log.info("getting data for events")
events = requests.get(url="{}/api/v1/events".format(self.k8s_api_url)).json()
for event in events["items"]:
log.info("pushing to sumo")
requests.post(url=self.collector_url,
data=json.dumps(event),
headers=self.headers)
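For what it's worth, `self.headers` here presumably carries the X-Sumo-Category and X-Sumo-Name values set via the env vars in the CronJob below; a hypothetical reconstruction (the actual script may assemble them differently):

```python
# Hypothetical reconstruction, not the actual script: the Sumo HTTP source
# honours these headers, and the CronJob env vars suggest this mapping.
import os

headers = {
    "X-Sumo-Category": os.environ["X-Sumo-Category"],  # e.g. k8s/api
    "X-Sumo-Name": os.environ["X-Sumo-Name"],          # e.g. sumologic-k8s-api
}
```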
```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 10
  concurrencyPolicy: Replace
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccount: sumologic-k8s-api
          restartPolicy: OnFailure
          containers:
          - name: sumologic-k8s-api
            imagePullPolicy: Always
            image: frankreno/sumologic-k8s-api:events
            env:
            - name: SUMO_HTTP_URL
              value: <INSERT_URL_HERE>
            - name: K8S_API_URL
              value: http://127.0.0.1:8001
            - name: X-Sumo-Category
              value: k8s/api
            - name: X-Sumo-Name
              value: sumologic-k8s-api
          - name: kubectl
            image: gcr.io/google_containers/kubectl:v1.0.7
            command: ["/kubectl"]
            args: ["proxy", "-p", "8001"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "events"]
  verbs: ["get", "list"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
subjects:
- kind: ServiceAccount
  name: sumologic-k8s-api
  namespace: default
roleRef:
  kind: ClusterRole
  name: sumologic-k8s-api
  apiGroup: rbac.authorization.k8s.io
```
The fluentd-kubernetes-sumologic setup is basically vanilla:
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  - pods
  verbs:
  - get
  - list
  - watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: fluentd
  # This namespace setting will limit fluentd to watching/listing/getting pods in the default namespace. If you want it to be able to log your kube-system namespace as well, comment the line out.
  namespace: default
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluentd-sumologic
  labels:
    app: fluentd-sumologic
    version: v1
spec:
  template:
    metadata:
      labels:
        name: fluentd-sumologic
    spec:
      serviceAccountName: fluentd
      volumes:
      - name: pos-files
        emptyDir: {}
      - name: host-logs
        hostPath:
          path: /var/log/
      - name: docker-logs
        hostPath:
          path: /var/lib/docker
      containers:
      - image: sumologic/fluentd-kubernetes-sumologic:latest
        name: fluentd
        imagePullPolicy: Always
        volumeMounts:
        - name: host-logs
          mountPath: /mnt/log/
          readOnly: true
        - name: host-logs
          mountPath: /var/log/
          readOnly: true
        - name: docker-logs
          mountPath: /var/lib/docker/
          readOnly: true
        - name: pos-files
          mountPath: /mnt/pos/
        env:
        - name: COLLECTOR_URL
          valueFrom:
            secretKeyRef:
              name: sumologic
              key: collector-url
      tolerations:
      #- operator: "Exists"
      - effect: "NoSchedule"
        key: "node-role.kubernetes.io/master"
```
kubectl version:
```
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-05-12T04:12:12Z", GoVersion:"go1.9.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
```
@keir-rex thanks for the info. So this appears to be a change in 1.9.x. I have a 1.8 cluster and a 1.9 cluster, and the scheduler is not producing the same logs. Will try to track this down to the source and work on remediation.
Cheers @frankreno, let me know if there's anything I can help with.
@keir-rex still no response from the folks on the scheduling team for k8s, so I do not have a good answer yet as to why this changed or how to remedy it. I found the code where the log used to be generated and see no changes that would account for this, which means the change is not coming from the scheduler itself but from somewhere else. Will keep you updated. Longer term, we are working on a new metrics collection strategy for Kubernetes that does not use heapster, which will allow us to collect from many more data sources and provide insight into this. Let's keep this issue open until we solve it one of those ways...
Sounds good @frankreno. I'll throw together something which does de-duping of events, since we need that anyway; roughly along these lines:
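A rough sketch of what I have in mind (illustrative only: the `seen` map would need to live somewhere persistent between cron runs, and the function and parameter names are hypothetical):

```python
# Hypothetical de-duplication sketch: skip events already pushed by tracking
# each event's UID and its aggregated count. The caller is responsible for
# persisting `seen` between runs (e.g. on a mounted volume).
import json

import requests

def push_new_events(k8s_api_url, collector_url, headers, seen):
    """seen maps event UID -> last pushed count; returns the updated map."""
    events = requests.get(url="{}/api/v1/events".format(k8s_api_url)).json()
    for event in events["items"]:
        uid = event["metadata"]["uid"]
        count = event.get("count", 1)
        if seen.get(uid) == count:
            continue  # already pushed this occurrence
        requests.post(url=collector_url,
                      data=json.dumps(event),
                      headers=headers)
        seen[uid] = count
    return seen
```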
Could you also comment on the second query in my initial post?
Cheers
@keir-rex that's right. I see [218, 42, 205, 363 and 374] as codes, 'event' as a resource, and 'go' as resource_action. Although, I do have to revisit these to make sure they are proper naming conventions.