Simpler observability for Knative eventing system components
Problem: there is no easy way to observe what is going on in Knative eventing system components, and users may not have access to the system namespaces where Knative eventing is installed.
Persona: Event consumer (developer), System Integrator, Contributors
Exit Criteria: User can observe Knative eventing system components.
Time Estimate (optional): 1-∞ developer-days
Additional context (optional): This came up during a Source WG discussion.
@lionelvillard @n3wscott @cr22rc @nachocano @lberk @grantr @bryeung
We should explore producing CloudEvents in response to milestones in progress on our control and data planes.
These could optionally be sunk to Kubernetes Events, or to a Broker and then directed to a namespace for a user to observe errors in their multi-tenant control planes without leaking secret info.
It is an interesting thing to explore.
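To make the idea concrete, a milestone event could look roughly like the sketch below. It is shown as YAML only for readability (on the wire it would use the CloudEvents HTTP/JSON formats), and the type, source, and data fields are purely hypothetical, not an agreed schema:

# hypothetical control-plane diagnostic CloudEvent (all attribute values illustrative)
specversion: "1.0"
type: dev.knative.eventing.diagnostic.trigger.status
source: /apis/eventing.knative.dev/v1/namespaces/my-app/triggers/my-trigger
id: 9a4f0e6c-0001
time: "2021-06-01T12:00:00Z"
datacontenttype: application/json
data:
  condition: Ready
  status: "False"
  reason: BrokerDoesNotExist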
More on the problem: when Knative is installed, some of its components (such as mt-broker or source adapters) may run in system namespaces. How do we get logs (and other observability signals) to the user namespace?
For getting logs: https://github.com/knative/eventing/issues/3299
One idea was to turn system logs and events into CloudEvents that are then routed to the user namespace @n3wscott?
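For example, if such diagnostic events were delivered to a Broker, an ordinary Trigger in the user namespace could route them to any sink the user owns. In this sketch the event type and the subscriber are hypothetical; only the Trigger API itself is real:

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: eventing-diagnostics
  namespace: my-app                # the user's namespace
spec:
  broker: default
  filter:
    attributes:
      type: dev.knative.eventing.diagnostic.trigger.status   # hypothetical diagnostic type
  subscriber:
    ref:
      apiVersion: v1
      kind: Service
      name: diagnostics-viewer     # any addressable the user controls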
I was thinking that may work great if users can create a special Knative Eventing Observability CR in their own namespace for diagnostics, and do not need to run multiple logs or describe commands, install additional tools, or have permissions on system namespaces:
apiVersion: eventing.knative.dev/v1alpha1
kind: Observability
metadata:
  name: diagnostic
# additional options for what to observe, filtering, etc. - has reasonable defaults
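As a sketch of the optional knobs hinted at in that comment (every field name here is made up, not an existing API), the spec could carry something like:

spec:
  components: [mt-broker, sources]   # which system components to observe
  signals: [events, logs]            # which signal types to collect
  minSeverity: warning               # reasonable default: only surface problems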
After applying the CR in the user namespace:
kubectl apply -f knative-eventing-observability.yaml
the user can then describe the created object and see the status of Knative eventing in their namespace, Kubernetes events, etc.:
kubectl describe observabilities.eventing.knative.dev diagnostic
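For illustration, the controller behind such a CR could populate a status block roughly like the following. The field names are hypothetical; mt-broker-ingress and pingsource-mt-adapter are real deployments that live in the knative-eventing system namespace:

# hypothetical status on the diagnostic object
status:
  conditions:
    - type: Ready
      status: "True"
  components:
    - name: mt-broker-ingress
      healthy: true
    - name: pingsource-mt-adapter
      healthy: false
      lastError: "event dispatch failed; see gathered logs"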
For a good user experience, a special pod could also be created that gathers logs from the system namespaces and makes them available in the user namespace:
kubectl logs diagnostic--XYZ-123
And when done, simply clean up (and avoid the overhead of diagnostic observability):
kubectl delete -f knative-eventing-observability.yaml
Note that we also want consistency with Serving, and I'm not 100% sure a hard dependency on CloudEvents is the right direction, but it is worth a try.
The key point here is to agree on the need to produce diagnosis events in our data planes, other than metrics. How it's implemented is a different question.
@dprotaso @mattmoor @mdemirhan thoughts?
How do you see this being related to #3299?
As an end user, I think the most important thing to see is errors. Ideally, those should be associated with the entity the error is about from the user's POV. For example, an error with the GitHub source/adapter should probably be related to (seen through) the GitHub CR. A more generic error-reporting tool (e.g. LogDNA) might be useful, but I tend to think of those as deeper analysis tools that, if people really want, they can set up themselves. And I kind of view the idea of generating CEs with different sinks in that category... more advanced. But for the simple use cases I think most people would prefer to look at the CR, so I'd prefer to solve that one first.
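To illustrate surfacing errors through the user-facing CR, a source object could carry the failure directly in its status conditions, roughly like this (the API version shown is just an example, and the reason and message are made up):

apiVersion: sources.knative.dev/v1alpha1
kind: GitHubSource
metadata:
  name: my-github-source
  namespace: my-app
status:
  conditions:
    - type: Ready
      status: "False"
      reason: AdapterUnavailable     # hypothetical reason
      message: "receive adapter failed to start; see diagnostic events for details"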
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/reopen Being discussed for GSoC
@csantanapr: Reopened this issue.
/remove-lifecycle stale
This issue or pull request is stale because it has been open for 90 days with no activity.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
/lifecycle stale