eraser
Observability and Auditability
Describe the solution you'd like
Being alerted or having access to audits for success / failure of Image Removal
Below is a Suggestion for Auditability:
As an Operator, I should be able to see a Log Trace output whenever I try to delete an unwanted Image
Log Trace Example
- Collector - Image List is Generated (ImageList Logs)
- Eraser Manager - Eraser Controller Manager Logs of Creating eraser-kind-workers
- Eraser Manager - EraserJob logs on manipulation of images
- Eraser - Pods deleting and spinning-down
A proposed solution would be to add a CorrelationID to the CRDs of ImageList and ImageJob.
- Collector would generate the CorrelationID, and pass this CorrelationID to the ImageList CRD on creation
- Eraser Controller Manager would pull the CorrelationID from the ImageList and ImageJob CRDs
- Creating ImageJob from Eraser ImageList Controller
- https://github.com/Azure/eraser/blob/main/pkg/eraser/eraser.go#L220
- EraserJob
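To make the proposal concrete, here is a minimal sketch of how a CorrelationID could be generated once and carried from the ImageList to the ImageJob and into log lines. Everything in it is an assumption for illustration: the `ImageList` and `ImageJob` structs are simplified stand-ins (not the real Eraser CRD types), and `newCorrelationID`, `jobFromList`, and `logLine` are hypothetical helpers that do not exist in the Eraser codebase.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"os"
)

// ImageList is a hypothetical, simplified stand-in for the ImageList CRD.
type ImageList struct {
	CorrelationID string
	Images        []string
}

// ImageJob is a hypothetical, simplified stand-in for the ImageJob CRD.
type ImageJob struct {
	CorrelationID string
	Images        []string
}

// newCorrelationID returns 16 random bytes encoded as 32 hex characters.
func newCorrelationID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// jobFromList builds a job from a list and carries the CorrelationID over,
// so both objects can be tied to the same removal run.
func jobFromList(l ImageList) ImageJob {
	return ImageJob{
		CorrelationID: l.CorrelationID,
		Images:        append([]string(nil), l.Images...),
	}
}

// logLine renders a JSON log entry that includes the CorrelationID.
func logLine(component, msg, correlationID string) (string, error) {
	b, err := json.Marshal(map[string]string{
		"component":     component,
		"msg":           msg,
		"correlationID": correlationID,
	})
	return string(b), err
}

func main() {
	id, err := newCorrelationID()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// The collector creates the list with a fresh ID.
	list := ImageList{CorrelationID: id, Images: []string{"docker.io/library/nginx:latest"}}
	// The controller manager reads the ID back when it creates the job.
	job := jobFromList(list)

	steps := []struct{ component, msg, id string }{
		{"collector", "ImageList created", list.CorrelationID},
		{"controller-manager", "ImageJob created", job.CorrelationID},
	}
	for _, s := range steps {
		line, err := logLine(s.component, s.msg, s.id)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Println(line)
	}
}
```

With every component logging the same ID, an Operator could filter a logging platform on that one value to reconstruct the whole trace for a removal run.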
Anything else you would like to add:
This will probably require documentation on how to connect to an external Logging Platform (such as Log Analytics, DataDog, etc.)
Open Questions:
- If an Image is not being successfully removed by Eraser, how is the Platform Operator alerted about which image is causing the problem? Should they be alerted about a specific image?
- There's a high probability that Traceability will be faulty as well. In Kind, I've had about 0.5 - 1 second to capture any logs from the Eraser Worker pod. After the pod is deleted, kubectl logs -p doesn't surface the logs of the terminated Eraser from eraser-kind-worker
I ran the kubectl logs --previous ... immediately after my terminal returned from following the logs while the container was Running
(E.g., {"level":"info","ts":1652827640.9949317,"logger":"eraser","msg":"Removed","image":"docker.io/library/nginx:latest"})
The issue that may arise is that a Logger Agent may not be able to capture this log in time to fully capture a complete scenario during an Audit.
Also in Development of Eraser, this may make it very difficult to validate a specific image being deleted, and also may require the use of a logging agent in the development workflow if this is something that a developer may want to capture.
Have y'all run into issues with capturing logs directly from the Eraser Worker if wanting to view the actual log for deletion of a certain Image? Do you think this may surface a problem with Auditability and Tracing through the Eraser Worker Pod?
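For anyone who does manage to capture the worker logs in time, here is a rough sketch of pulling the removed images out of them for an audit. It relies only on the shape of the log line quoted above (the "msg":"Removed", "image", and "ts" fields); the `removal` type and the `removedImages` helper are hypothetical names, not part of Eraser.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// logEntry mirrors the fields seen in the example log line.
type logEntry struct {
	Level  string  `json:"level"`
	TS     float64 `json:"ts"`
	Logger string  `json:"logger"`
	Msg    string  `json:"msg"`
	Image  string  `json:"image"`
}

// removal is a hypothetical audit record for one removed image.
type removal struct {
	Image string
	At    time.Time
}

// removedImages scans newline-delimited logs and returns one record per
// "Removed" entry. Lines that are not valid JSON are skipped.
func removedImages(logs string) []removal {
	var out []removal
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue
		}
		if e.Msg != "Removed" || e.Image == "" {
			continue
		}
		// ts is Unix seconds with a fractional part.
		sec := int64(e.TS)
		nsec := int64((e.TS - float64(sec)) * 1e9)
		out = append(out, removal{Image: e.Image, At: time.Unix(sec, nsec).UTC()})
	}
	return out
}

func main() {
	logs := `{"level":"info","ts":1652827640.9949317,"logger":"eraser","msg":"Removed","image":"docker.io/library/nginx:latest"}
this line is not JSON and is ignored`

	for _, r := range removedImages(logs) {
		fmt.Printf("%s removed %s\n", r.At.Format(time.RFC3339), r.Image)
	}
}
```

This only helps once the log text has been captured; it does not solve the short window before the pod is deleted.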
Environment:
- Eraser version: main branch
- Kubernetes version: (use kubectl version): 1.22
@paulbouwer @sozercan @hewatson-msft
What do y'all think about the following questions, or thoughts?
"Have y'all run into issues with capturing logs directly from the Eraser Worker if wanting to view the actual log for deletion of a certain Image? Do you think this may surface a problem with Auditability and Tracing through the Eraser Worker Pod?"
As an advocate for the customer or business user of this feature, and in line with best practices around Production system changes, I would expect messages to be logged in a way that they could be reported on after the fact for at least a month (or in accordance with customer record-keeping policies). I think the customer would want to know exactly what non-running images were removed from where and when. I hypothesize that they want to know how often this is happening in general over time (via regular reports) and take preventative measures if and where possible.
@hewatson-msft Thanks for this. Eraser v1.0.0 will record metrics using OpenTelemetry, which will include
- images discovered on cluster
- results of image scans (if any)
- images that were actually removed
- timestamps on all of the above
Closing this one because of the now available telemetry. If what's now available doesn't satisfy the use case, please feel free to open a new issue :)
@hewatson-msft Thanks for this. Eraser v1.0.0 will record metrics using OpenTelemetry, which will include
- images discovered on cluster
- results of image scans (if any)
- images that were actually removed
- timestamps on all of the above
Just to clarify, the metrics available today are the total counts of these, not the actual images removed, as described in https://eraser-dev.github.io/eraser/docs/metrics and https://github.com/eraser-dev/eraser/blob/afb831bcf61d665e1d766453c9b7d22d29297d78/pkg/metrics/metrics.go#L65 @pmengelbert @sozercan