[new feature] shuffle sharding: support Ingesters write path.
What this PR does / why we need it: [new feature] shuffle sharding: support Ingesters write path.
https://aws.amazon.com/cn/blogs/architecture/shuffle-sharding-massive-and-magical-fault-isolation/ https://grafana.com/docs/mimir/v2.3.x/operators-guide/configure/configuring-shuffle-sharding/
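For intuition on why shuffle sharding isolates tenants (the idea behind the AWS post above): when the per-tenant shard size is much smaller than the ring, two tenants almost never land on exactly the same set of ingesters, so a noisy tenant only affects a small, mostly unshared subset. A small self-contained calculation of those odds, using example numbers of 90 ingesters and a shard size of 2 (the cluster size discussed later in this thread):

package main

import "fmt"

// binomial returns C(n, k): how many distinct k-ingester shards can be drawn
// from a ring of n ingesters. The intermediate products stay exact because
// result*(n-i) is always divisible by i+1.
func binomial(n, k int) int {
    result := 1
    for i := 0; i < k; i++ {
        result = result * (n - i) / (i + 1)
    }
    return result
}

func main() {
    const ingesters = 90 // full ring size (example value)
    const shardSize = 2  // per-tenant shuffle shard size (example value)

    shards := binomial(ingesters, shardSize)
    fmt.Printf("distinct possible shards: %d\n", shards) // 4005
    // Chance that two tenants end up on exactly the same pair of ingesters,
    // assuming shards are picked independently and uniformly.
    fmt.Printf("two tenants share a full shard: %.4f%%\n", 100/float64(shards))
}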
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer: The code is adapted from Mimir's distributor.go https://github.com/grafana/mimir/blob/main/pkg/distributor/distributor.go
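For reviewers who haven't read Mimir's distributor: the core of the write-path change is to look up a per-tenant shard size and, when it is set, replace the full ingester ring with a deterministic per-tenant subring before picking replicas. Below is a minimal sketch of that pattern, assuming dskit's ring API; the limit name IngestionTenantShardSize and the trimmed-down types are illustrative, not necessarily the names used in this PR.

package distributor

import (
    "github.com/grafana/dskit/ring"
)

// Limits stands in for Loki's per-tenant limits; the method name follows
// Mimir's ingestion_tenant_shard_size limit and is illustrative only.
type Limits interface {
    IngestionTenantShardSize(userID string) int
}

// Distributor keeps just the fields this sketch needs.
type Distributor struct {
    ingestersRing *ring.Ring
    limits        Limits
}

// shardForUser returns the ring the write path should use for a tenant: the
// full ring when shuffle sharding is disabled (shard size <= 0), or a
// deterministic per-tenant subring otherwise.
func (d *Distributor) shardForUser(userID string) ring.ReadRing {
    shardSize := d.limits.IngestionTenantShardSize(userID)
    if shardSize <= 0 {
        return d.ingestersRing
    }
    // dskit's ShuffleShard(identifier, size) deterministically picks `size`
    // ingesters for this tenant, so its writes only ever touch that subset.
    return d.ingestersRing.ShuffleShard(userID, shardSize)
}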
Checklist
- [ ] Documentation added
- [x] Tests updated
- [ ] Is this an important fix or new feature? Add an entry in the CHANGELOG.md.
- [ ] Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0.1%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
@liguozhong thanks for the PR.
May I know the main rationale for needing shuffle sharding on the write path? Curious how you run Loki with multi-tenancy and whether the noisy neighbor problem is really impacting the /push
endpoint user experience.
The main reasons we have shuffle sharding only on the query path and not on the write path are:
- We are not yet facing any noisy neighbor problems on the write path as far as Loki SaaS is concerned.
- The operational complexity it may introduce in the Loki write path (scaling it down won't be as straightforward as before), particularly when running ingesters in Kubernetes as a StatefulSet (learning from Mimir).
We (the Loki squad) may not be able to maintain it even if it is added.
count( sum(rate(loki_distributor_bytes_received_total{}[5m])) by (tenant)>0)
topk(10,sum(rate(loki_distributor_bytes_received_total{}[5m])) by (tenant)>0)
There are 600 tenants, and the top tenant accounts for a large share of the data, about 85%.
I can use this feature in my branch.
My ultimate aim is to speed up the metric query feature. I operate multiple Loki clusters; one of the larger ones has 90 ingesters in total, so each recording rule's SelectSamples call has to query all 90 ingesters. For a tenant with a small amount of data I plan to configure a shuffle shard size of 2, so that its metric queries no longer go through all 90 ingesters.
If I can't speed up metric queries, I will have to deploy a ClickHouse-like OLAP database, and I don't want to learn another piece of software written in C++.
func (q *query) Eval(ctx context.Context) (promql_parser.Value, error) {
    queryTimeout := time.Minute * 5
    userID, err := tenant.TenantID(ctx)
    if err != nil {
        level.Warn(q.logger).Log("msg", fmt.Sprintf("couldn't fetch tenantID to evaluate query timeout, using default value of %s", queryTimeout), "err", err)
    } else {
        queryTimeout = q.limits.QueryTimeout(userID)
    }
    ctx, cancel := context.WithTimeout(ctx, queryTimeout)
    defer cancel()

    expr, err := q.parse(ctx, q.params.Query())
    if err != nil {
        return nil, err
    }

    switch e := expr.(type) {
    case syntax.SampleExpr:
        // Metric query path (the one recording rules hit via SelectSamples).
        value, err := q.evalSample(ctx, e)
        return value, err
    case syntax.LogSelectorExpr:
        // Log query path.
        iter, err := q.evaluator.Iterator(ctx, e, q.params)
        if err != nil {
            return nil, err
        }
        defer util.LogErrorWithContext(ctx, "closing iterator", iter.Close)
        streams, err := readStreams(iter, q.params.Limit(), q.params.Direction(), q.params.Interval())
        return streams, err
    default:
        return nil, errors.New("Unexpected type (%T): cannot evaluate")
    }
}
    case syntax.SampleExpr:
        value, err := q.evalSample(ctx, e)
        return value, err
At present, after migrating our log cluster from ELK to Loki, only this part of the code executes very slowly (10m+).
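This PR only covers the write path, but the read-path half of my plan is the same pattern: resolve the tenant's subring before evalSample fans SelectSamples out to the ingesters, so a tenant with a shard size of 2 queries 2 ingesters instead of all 90. A rough sketch, assuming dskit's ring API and an illustrative limit name (not taken from this PR); the query side needs the lookback variant so that data written shortly before a shard change stays reachable:

package querier

import (
    "time"

    "github.com/grafana/dskit/ring"
)

// Limits stands in for per-tenant limits; the method name is illustrative.
type Limits interface {
    IngestionTenantShardSize(userID string) int
}

// ringForTenantQuery returns the subring a tenant's query should fan out to.
// With a shard size of 2, SelectSamples only has to hit 2 ingesters instead
// of all 90 in the full ring.
func ringForTenantQuery(full *ring.Ring, limits Limits, userID string, lookback time.Duration) ring.ReadRing {
    shardSize := limits.IngestionTenantShardSize(userID)
    if shardSize <= 0 {
        return full
    }
    // ShuffleShardWithLookback also includes ingesters that left the tenant's
    // shard within the lookback window, so samples written shortly before a
    // shard change remain queryable.
    return full.ShuffleShardWithLookback(userID, shardSize, lookback, time.Now())
}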
https://grafana.com/docs/mimir/latest/operators-guide/run-production-environment/production-tips/
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0%
+ distributor 0.1%
+ querier 0%
+ querier/queryrange 0.1%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
I don't think we're going to pull this in right now, and there hasn't been any activity on it in quite some time, so I'm going to close it. Please feel free to bring it up as a LID if you still think it's valuable.