prometheus
Reduce the impact of remote write resharding
Right now resharding, especially sharding up, is very disruptive to throughput. The resharding process drains all queues, which takes a significant amount of time if the remote endpoint is having issues. This blocks new samples from being appended while the queues clear, and a single slow shard can cause throughput to drop significantly.
Instead of waiting for all shards to flush to remote storage, we could move their pending samples into the new shards that are being created, taking care to rebalance each sample into the appropriate shard.
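To make the proposal concrete, here is a minimal sketch of rebalancing pending samples into a new set of shards without flushing first. It is not Prometheus's actual queue manager code: the `sample` type, `shardFor`, and `reshard` are hypothetical names, and the sketch assumes that samples are assigned to shards by hashing a series identity, so that all samples for one series live in a single queue.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sample is a hypothetical stand-in for a queued remote write sample.
type sample struct {
	series string // series identity (assumed to determine the shard)
	value  float64
}

// shardFor maps a series to a shard index by hashing its identity.
// This hashing scheme is an assumption made for illustration.
func shardFor(series string, numShards int) int {
	h := fnv.New64a()
	h.Write([]byte(series))
	return int(h.Sum64() % uint64(numShards))
}

// reshard moves every pending sample from the old queues into newShards
// queues instead of waiting for the old queues to flush. Each old queue is
// walked in order, so samples of one series keep their relative order as
// long as that series lived in a single old queue.
func reshard(old [][]sample, newShards int) [][]sample {
	out := make([][]sample, newShards)
	for _, queue := range old {
		for _, s := range queue {
			i := shardFor(s.series, newShards)
			out[i] = append(out[i], s)
		}
	}
	return out
}

func main() {
	// Two old shards with pending samples that have not been sent yet.
	old := [][]sample{
		{{"a", 1}, {"b", 2}, {"a", 3}},
		{{"c", 4}},
	}
	// Shard up to four queues without draining the old ones first.
	for i, queue := range reshard(old, 4) {
		fmt.Printf("shard %d: %d pending samples\n", i, len(queue))
	}
}
```

The point of the sketch is that no send to the remote endpoint happens during resharding, so a slow endpoint no longer gates the switch to the new shard count. A real implementation would also need to handle batches that are already in flight and concurrent appends, which this sketch ignores.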
Resolved without Zephyr changes; the workaround was to use the SO_BINDTODEVICE
socket option to lock a socket connection to a specified network interface.