log storage feature needs to be rewritten

Open gabemontero opened this issue 8 months ago • 4 comments

Expected Behavior

Storage of logs succeeds at a high rate with reasonable performance

Actual Behavior

In a sufficiently sized production system, addressing

the watcher channel/thread leak caused by not cancelling the context properly
but allowing log storage to complete before properly cancelling the context to free up storage
yet avoiding the HTTP2 and GRPC throttling which can force log storage to take many MINUTES

is untenable with the current design.

2 months of performance analysis of GRPC and golang's HTTP2 have uncovered repeatedly reported issues on the internet about HTTP2 Request Body access not scaling / becoming incredibly slow, as well as GRPC-level throttling which the current tuning options in the golang GRPC code have yet to resolve.

We have a prototype already that showed improvement wrt the HTTP2 body scaling issue, where we did pod log retrieval in the API server and then stored to S3 there, but it was still slowed by GRPC throttling on the background threads, leading to long delays before logs were actually stored.

The WG has discussed this on calls and in Slack, and we are close to high-level agreement on doing log storage in the watcher and only updating the Log Records on success / failure.

@khrm @enarha @sayan-biswas @avinal FYI

Jun 23 '24 13:06 gabemontero
