Simpler observability for Knative eventing system components
Problem: there is no easy way to observe what is going on in Knative eventing system components, and users may not have access to the system namespaces where Knative eventing is installed.
Persona: Event consumer (developer), System Integrator, Contributors
Exit Criteria: User can observe Knative eventing system components.
Time Estimate (optional): 1-∞ developer-days
Additional context (optional): This came up during a Source WG discussion.
@lionelvillard @n3wscott @cr22rc @nachocano @lberk @grantr @bryeung
We should explore producing CloudEvents in response to milestones in progress on our control and data planes.
These could optionally be sunk to Kubernetes Events, or to a Broker and then directed to a namespace for a user to observe errors in their multi-tenant control planes without leaking secret info.
It is an interesting thing to explore.
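To make the idea concrete, a milestone event could look roughly like the sketch below. It is shown as YAML only for readability (on the wire it would use the CloudEvents HTTP/JSON formats), and the type, source, and data fields are purely hypothetical, not an agreed schema:

# hypothetical control-plane diagnostic CloudEvent (all attribute values illustrative)
specversion: "1.0"
type: dev.knative.eventing.diagnostic.trigger.status
source: /apis/eventing.knative.dev/v1/namespaces/my-app/triggers/my-trigger
id: 9a4f0e6c-0001
time: "2021-06-01T12:00:00Z"
datacontenttype: application/json
data:
  condition: Ready
  status: "False"
  reason: BrokerDoesNotExist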
More on the problem: when Knative is installed, some of its components (such as mt-broker or source adapters) may run in system namespaces. How do we get logs (and other observability signals) to the user namespace?
For getting logs: https://github.com/knative/eventing/issues/3299
One idea was to turn system logs and events into CloudEvents that are then routed to the user namespace @n3wscott?
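For example, if such diagnostic events were delivered to a Broker, an ordinary Trigger in the user namespace could route them to any sink the user owns. In this sketch the event type and the subscriber are hypothetical; only the Trigger API itself is real:

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: eventing-diagnostics
  namespace: my-app                # the user's namespace
spec:
  broker: default
  filter:
    attributes:
      type: dev.knative.eventing.diagnostic.trigger.status   # hypothetical diagnostic type
  subscriber:
    ref:
      apiVersion: v1
      kind: Service
      name: diagnostics-viewer     # any addressable the user controls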
I was thinking that may work great if users can create a special Knative Eventing Observability CR in their own namespace for diagnostics, and do not need to run multiple logs or describe commands, install additional tools, or have permissions on system namespaces:
apiVersion: eventing.knative.dev/v1alpha1
kind: Observability
metadata:
  name: diagnostic
# additional options for what to observe, filtering, etc. - has reasonable defaults
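As a sketch of the optional knobs hinted at in that comment (every field name here is made up, not an existing API), the spec could carry something like:

spec:
  components: [mt-broker, sources]   # which system components to observe
  signals: [events, logs]            # which signal types to collect
  minSeverity: warning               # reasonable default: only surface problems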
After applying the CR in the user namespace:
kubectl apply -f knative-eventing-observability.yaml
the user can then describe the created object and see the status of Knative eventing in their namespace, Kubernetes events, etc.:
kubectl describe observabilities.eventing.knative.dev diagnostic
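For illustration, the controller behind such a CR could populate a status block roughly like the following. The field names are hypothetical; mt-broker-ingress and pingsource-mt-adapter are real deployments that live in the knative-eventing system namespace:

# hypothetical status on the diagnostic object
status:
  conditions:
    - type: Ready
      status: "True"
  components:
    - name: mt-broker-ingress
      healthy: true
    - name: pingsource-mt-adapter
      healthy: false
      lastError: "event dispatch failed; see gathered logs"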
For a good user experience, a special pod could also be created that gathers logs from the system namespaces and makes them available in the user namespace:
kubectl logs diagnostic--XYZ-123
And when done, simply clean up (and avoid the overhead of diagnostic observability):
kubectl delete -f knative-eventing-observability.yaml
Note that we also want consistency with Serving, and I'm not 100% sure a hard dependency on CloudEvents is the right direction, but it is worth a try.
The key point here is to agree on the need to produce diagnosis events in our data planes, other than metrics. How it's implemented is a different question.
@dprotaso @mattmoor @mdemirhan thoughts?
How do you see this being related to #3299?
As an end user, I think the most important thing to see is errors. Ideally, those should be associated with the entity the error is about from the user's POV. For example, an error with the GitHub source/adapter should probably be related to (seen through) the GitHub CR. A more generic error-reporting tool (e.g. LogDNA) might be useful, but I tend to think of those as deeper analysis tools that, if people really want, they can set up themselves. And I kind of view the idea of generating CEs with different sinks in that category... more advanced. But for the simple use cases I think most people would prefer to look at the CR, so I'd prefer to solve that one first.
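To illustrate surfacing errors through the user-facing CR, a source object could carry the failure directly in its status conditions, roughly like this (the API version shown is just an example, and the reason and message are made up):

apiVersion: sources.knative.dev/v1alpha1
kind: GitHubSource
metadata:
  name: my-github-source
  namespace: my-app
status:
  conditions:
    - type: Ready
      status: "False"
      reason: AdapterUnavailable     # hypothetical reason
      message: "receive adapter failed to start; see diagnostic events for details"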
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/reopen Being discussed for GSoC
@csantanapr: Reopened this issue.
/remove-lifecycle stale
This issue or pull request is stale because it has been open for 90 days with no activity.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
/lifecycle stale