
Observability and Auditability

Open bdlb77 opened this issue 3 years ago • 3 comments

Describe the solution you'd like

Being alerted or having access to audits for success / failure of Image Removal

Below is a Suggestion for Auditability:

As an Operator, I should be able to see a log trace output whenever I try to delete an unwanted image.

Log Trace Example

  1. Collector - Image List is Generated (ImageList Logs)
  2. Eraser Manager - Eraser Controller Manager Logs of Creating eraser-kind-workers
  3. Eraser Manager - EraserJob logs on manipulation of images
  4. Eraser - Pods deleting and spinning-down

A proposed solution would be to add a CorrelationID to the CRDs of ImageList and ImageJob.

  • Collector would generate the CorrelationID, and pass this CorrelationID to ImageList CRD on creation
  • Eraser Controller Manager would pull the CorrelationID from the ImageList and ImageJob CRDs
  • Creating ImageJob from Eraser ImageList Controller
    • https://github.com/Azure/eraser/blob/main/pkg/eraser/eraser.go#L220
  • EraserJob
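The CorrelationID flow proposed above could be sketched roughly as follows. This is a minimal illustration, not Eraser's implementation: the `Object` type stands in for the ImageList and ImageJob resources, and the annotation key `example.com/correlation-id`, along with the helper names `newCorrelationID`, `propagate`, and `logLine`, are hypothetical and chosen only for this example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// correlationKey is a hypothetical annotation key, not one defined by Eraser.
const correlationKey = "example.com/correlation-id"

// Object is a simplified stand-in for an ImageList or ImageJob resource.
type Object struct {
	Kind        string
	Name        string
	Annotations map[string]string
}

// newCorrelationID returns 16 random bytes encoded as 32 hex characters.
func newCorrelationID() (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// propagate copies the correlation ID from parent to child, if the parent has one.
func propagate(parent, child *Object) {
	id, ok := parent.Annotations[correlationKey]
	if !ok {
		return
	}
	if child.Annotations == nil {
		child.Annotations = make(map[string]string)
	}
	child.Annotations[correlationKey] = id
}

// logLine formats a log message that carries the object's correlation ID.
func logLine(o *Object, msg string) string {
	return fmt.Sprintf("correlationID=%s kind=%s name=%s msg=%q",
		o.Annotations[correlationKey], o.Kind, o.Name, msg)
}

func main() {
	// The collector would generate the ID when it creates the ImageList.
	id, err := newCorrelationID()
	if err != nil {
		panic(err)
	}
	list := &Object{
		Kind:        "ImageList",
		Name:        "imagelist",
		Annotations: map[string]string{correlationKey: id},
	}

	// The controller would copy the ID onto the ImageJob it creates.
	job := &Object{Kind: "ImageJob", Name: "imagejob-example"}
	propagate(list, job)

	fmt.Println(logLine(list, "image list created"))
	fmt.Println(logLine(job, "image job created"))
}
```

With the same ID on both objects, every component that logs it produces lines that can be joined in an external logging platform with a single filter.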

Anything else you would like to add:

This will probably require documentation on how to connect to an external logging platform (such as Log Analytics, Datadog, etc.).

Open Questions:

  • If an Image is not being successfully removed by Eraser, how is the Platform Operator alerted about which image is causing the problem? Should they be alerted about a specific image?
  • There's a high probability that traceability will be faulty as well. In Kind, I've had only about 0.5 to 1 second to capture any logs from the Eraser worker pod. After the pod is deleted, kubectl logs -p doesn't surface the logs of the terminated Eraser container from eraser-kind-worker.

I ran kubectl logs --previous ... immediately after my terminal returned from following the logs while the container was Running.

(E.g., {"level":"info","ts":1652827640.9949317,"logger":"eraser","msg":"Removed","image":"docker.io/library/nginx:latest"})
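For auditing, a log line in this shape can be parsed to recover which image was removed and when. The sketch below is only an illustration: it assumes every removal entry has the same fields as the single example above (`msg` equal to `"Removed"`, an `image` field, and `ts` as Unix seconds), and the names `removalEntry` and `parseRemoval` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// removalEntry mirrors the fields seen in the example log line above.
type removalEntry struct {
	Level  string  `json:"level"`
	TS     float64 `json:"ts"` // assumed to be Unix seconds with a fractional part
	Logger string  `json:"logger"`
	Msg    string  `json:"msg"`
	Image  string  `json:"image"`
}

// parseRemoval reports the image and timestamp of a removal log line.
// ok is false when the line is not valid JSON or is not a removal entry.
func parseRemoval(line string) (image string, when time.Time, ok bool) {
	var e removalEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return "", time.Time{}, false
	}
	if e.Msg != "Removed" || e.Image == "" {
		return "", time.Time{}, false
	}
	sec := int64(e.TS)
	nsec := int64((e.TS - float64(sec)) * 1e9)
	return e.Image, time.Unix(sec, nsec).UTC(), true
}

func main() {
	line := `{"level":"info","ts":1652827640.9949317,"logger":"eraser","msg":"Removed","image":"docker.io/library/nginx:latest"}`
	if image, when, ok := parseRemoval(line); ok {
		fmt.Println(image, "removed at", when.Format(time.RFC3339))
	}
}
```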

The issue that may arise is that a logging agent may not capture this log in time, which would leave an incomplete record of the scenario during an audit.
Also, during development of Eraser, this may make it very difficult to validate that a specific image was deleted, and it may require a logging agent in the development workflow if a developer wants to capture this.

Have y'all run into issues with capturing logs directly from the Eraser Worker when wanting to view the actual log for the deletion of a certain image? Do you think this may surface a problem with auditability and tracing through the Eraser Worker pod?

Environment:

  • Eraser version: main Branch
  • Kubernetes version: (use kubectl version): 1.22

bdlb77 avatar May 17 '22 23:05 bdlb77

@paulbouwer @sozercan @hewatson-msft

What do y'all think about the following questions? Any thoughts?

"Have y'all run into issues with capturing logs directly from the Eraser Worker when wanting to view the actual log for the deletion of a certain image? Do you think this may surface a problem with auditability and tracing through the Eraser Worker pod?"

bdlb77 avatar May 18 '22 17:05 bdlb77

As an advocate for the customer or business user of this feature, and in line with best practices around production system changes, I would expect messages to be logged in a way that lets them be reported on after the fact for at least a month (or in accordance with customer record-keeping policies). I think the customer would want to know exactly which non-running images were removed, from where, and when. I hypothesize that they want to know how often this happens over time (via regular reports) and to take preventative measures if and where possible.

hewatson-msft avatar May 18 '22 18:05 hewatson-msft

@hewatson-msft Thanks for this. Eraser v1.0.0 will record metrics using OpenTelemetry, which will include

  • images discovered on cluster
  • results of image scans (if any)
  • images that were actually removed
  • timestamps on all of the above

pmengelbert avatar Oct 17 '22 13:10 pmengelbert

Closing this one because of the now available telemetry. If what's now available doesn't satisfy the use case, please feel free to open a new issue :)

salaxander avatar Aug 28 '23 18:08 salaxander

@hewatson-msft Thanks for this. Eraser v1.0.0 will record metrics using OpenTelemetry, which will include

  • images discovered on cluster
  • results of image scans (if any)
  • images that were actually removed
  • timestamps on all of the above

Just to clarify, the metrics available today are total counts of these, not the actual images removed, as described in https://eraser-dev.github.io/eraser/docs/metrics and https://github.com/eraser-dev/eraser/blob/afb831bcf61d665e1d766453c9b7d22d29297d78/pkg/metrics/metrics.go#L65 @pmengelbert @sozercan
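
To illustrate the distinction, here is a small sketch contrasting what a total count can answer with what a per-image record can answer. It does not use Eraser's metrics code or the OpenTelemetry API; the `removal` type and both helper functions are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// removal is a hypothetical audit record for one removed image.
type removal struct {
	Node  string
	Image string
}

// totalRemoved answers only "how many", which is all a counter-style metric can tell you.
func totalRemoved(events []removal) int {
	return len(events)
}

// removedByNode answers "which images were removed from where".
func removedByNode(events []removal) map[string][]string {
	out := make(map[string][]string)
	for _, e := range events {
		out[e.Node] = append(out[e.Node], e.Image)
	}
	for _, images := range out {
		sort.Strings(images)
	}
	return out
}

func main() {
	events := []removal{
		{Node: "kind-worker", Image: "docker.io/library/nginx:latest"},
		{Node: "kind-worker2", Image: "docker.io/library/nginx:latest"},
		{Node: "kind-worker", Image: "docker.io/library/alpine:3.7"},
	}
	fmt.Println("total removed:", totalRemoved(events))
	for node, images := range removedByNode(events) {
		fmt.Println(node, images)
	}
}
```

An audit trail of the kind requested in this issue needs the second form, or logs that carry the same detail.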

ritazh avatar Dec 19 '23 18:12 ritazh