log storage feature needs to be rewritten
Expected Behavior
Log storage succeeds at a high rate with reasonable performance.
Actual Behavior
In a sufficiently sized production system, simultaneously addressing
- the watcher channel/thread leak caused by not cancelling the context properly,
- while still allowing log storage to complete before cancelling the context (to free up resources),
- and avoiding the HTTP/2 and gRPC throttling that can force log storage to take many MINUTES,
is untenable with the current design.
Two months of performance analysis of gRPC and golang's HTTP/2, along with repeatedly reported issues on the internet, point to HTTP/2 request body access not scaling (becoming incredibly slow), as well as gRPC-level throttling that the current tuning options in the golang gRPC code have yet to resolve.
We already have a prototype that showed improvement with respect to the HTTP/2 body scaling issue: it did pod log retrieval in the API server and then stored the logs to S3 there. However, it was still slowed by gRPC throttling on the background threads, leading to long delays before logs were actually stored.
The WG has discussed this on calls and in Slack, and we are close to high-level agreement on doing log storage in the watcher, updating the Log Records only on success/failure.
@khrm @enarha @sayan-biswas @avinal FYI