gwr
Output from high-throughput data sources stalls
Use case: 10s-100s of thousands of items per second
Repro Steps:
log a LOT of items, from a lot of goroutines
curl -X WATCH localhost:4040/tap/
Symptom: The first several hundred items come through, then all output abruptly stops
Findings so far:
- The 'timeout' branch of HandleItems is hit due to high load
- the datasource is marked active = false
- processItemChan breaks out of its loop
What should happen: We should drop items (silently?) but keep trying to pump them out.
https://github.com/uber-go/gwr/blob/dev/internal/marshaled/source.go#L411
The original design intent was to default to dropping items; silently at first / by default, but with an optional affordance for watchers that want to know when/if/how-many drops are happening. E.g. the resp protocol could very easily pass along such side channel data with little chance of it being confused with the actual watched items.
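A minimal sketch of that "drop by default, count on the side" idea, assuming each watcher is fed through a buffered channel. The names `dropSender`, `send`, and `takeDropped` are hypothetical and are not part of gwr; the point is only that a full queue costs one item and a counter bump, not the whole data source.

```go
package main

import "sync/atomic"

// dropSender is a hypothetical stand-in for one watcher's output queue.
// It is not a gwr type.
type dropSender struct {
	out     chan string  // buffered queue drained by the watcher's writer
	dropped atomic.Int64 // items discarded because out was full
}

// newDropSender creates a sender whose queue holds up to buf items.
func newDropSender(buf int) *dropSender {
	return &dropSender{out: make(chan string, buf)}
}

// send enqueues item without blocking. If the queue is full, the item is
// dropped and counted, and send returns false. Nothing is deactivated.
func (s *dropSender) send(item string) bool {
	select {
	case s.out <- item:
		return true
	default:
		s.dropped.Add(1)
		return false
	}
}

// takeDropped returns the number of drops since the previous call and
// resets the counter, so a protocol could report it as side-channel data.
func (s *dropSender) takeDropped() int64 {
	return s.dropped.Swap(0)
}
```

A watcher that wants to know about drops could poll `takeDropped` periodically and emit the count out of band; a watcher that doesn't care never calls it.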
Currently it seems that the following happens:
- marshaled.DataSource.HandleItem deactivates on first timeout
- the shutdown phase in the tail of marshaled.DataSource.processItem isn't successfully closing all active watchers
Part one is just our current naive, perhaps overly aggressive, design choice; it could be that something more like a circuit breaker's "only deactivate if more than X% get dropped within T time" would be better.
Part two is a flat-out bug: something's causing that HTTP connection to zombie on.