
Ideas about a better query split strategy

Open honganan opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe. Loki can split a large query by interval, but if split_by_interval is low, such as 1 min, the sub-queries process a lot of duplicate data, which reduces query speed. If the interval is set higher, there is a risk of causing Querier OOM.

For example, some of our applications or services produce 10 TiB of logs per day, while others produce only a few GiB. For the big apps we need to set the interval below 1 min and the parallelism low (like 1); otherwise queries cause OOM. Our pod size is currently 2C8G (2 CPU cores, 8 GiB of memory).
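
To make the scale concrete, here is a rough back-of-the-envelope sketch. It assumes the 10 TiB/day figure is uncompressed log volume spread evenly over the day, and that 2C8G means 8 GiB of pod memory; neither assumption is confirmed above, and the helper name is made up for illustration.

```python
# Rough estimate of how much log data one split interval covers.
# Assumptions (not from Loki): even ingestion rate, uncompressed bytes.
TIB = 1024 ** 4
GIB = 1024 ** 3
SECONDS_PER_DAY = 86_400


def bytes_per_interval(daily_bytes: float, interval_seconds: float) -> float:
    """Bytes that fall inside one split interval at a constant ingest rate."""
    return daily_bytes * interval_seconds / SECONDS_PER_DAY


big_app_gib = bytes_per_interval(10 * TIB, 60) / GIB    # 1-minute split, big app
small_app_gib = bytes_per_interval(5 * GIB, 60) / GIB   # hypothetical 5 GiB/day app

print(f"big app:   {big_app_gib:.2f} GiB per 1-minute sub-query")
print(f"small app: {small_app_gib:.4f} GiB per 1-minute sub-query")
```

Under these assumptions a single 1-minute sub-query for the big app already covers roughly 7 GiB of raw logs, which is close to the whole pod's memory, while the same interval for a small app covers only a few MiB.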

Even though we can set a different split interval for each tenant, parallelism cannot differ per tenant, which limits throughput when querying small apps.

Describe the solution you'd like

Is there a better split strategy that can adapt to both big and small apps? I can imagine two ways it could perform better:

  1. Split the query by the number of chunks or the amount of data to be processed: if we query the index first and calculate how many chunks need to be processed, we can split by chunk count. Since chunk sizes are almost the same, the split would be more balanced. We could create a new component that serves index queries and index management, so that the Frontend can separate index queries from chunk queries. The Index-gateway could also be merged into the new component.
  2. If splitting a query did not cause duplicate data to be processed, we could just use the runtime limits config to set a different split_by_interval for each tenant (app). (I am not sure whether we can do this.)
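
A minimal sketch of the first idea, compared with interval splitting. It assumes the duplicate work comes from chunks that straddle split boundaries and are therefore loaded by more than one sub-query. `ChunkRef` and all function names here are hypothetical and are not Loki APIs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical index entry: the time range one chunk covers (unix seconds)."""
    start: int
    end: int


def split_by_interval(start: int, end: int, interval: int) -> List[Tuple[int, int]]:
    """Cut [start, end) into fixed-width half-open windows."""
    return [(s, min(s + interval, end)) for s in range(start, end, interval)]


def chunks_touched(chunks: List[ChunkRef], window: Tuple[int, int]) -> int:
    """Count chunks overlapping a window; each one must be loaded by that sub-query."""
    lo, hi = window
    return sum(1 for c in chunks if c.start < hi and c.end > lo)


def split_by_chunk_count(chunks: List[ChunkRef], max_chunks: int) -> List[List[ChunkRef]]:
    """Group chunks (sorted by time) into batches of at most max_chunks each."""
    ordered = sorted(chunks, key=lambda c: (c.start, c.end))
    return [ordered[i:i + max_chunks] for i in range(0, len(ordered), max_chunks)]


# Six chunks of 120 s each, starting 60 s apart, so every chunk spans two 60 s windows.
chunks = [ChunkRef(start=60 * i, end=60 * i + 120) for i in range(6)]

windows = split_by_interval(0, 420, 60)
loads_by_interval = sum(chunks_touched(chunks, w) for w in windows)

batches = split_by_chunk_count(chunks, max_chunks=2)
loads_by_chunk_count = sum(len(b) for b in batches)

print(loads_by_interval, loads_by_chunk_count)
```

In this toy data set the interval split loads each chunk twice, while the chunk-count split loads each chunk once and bounds the work per sub-query by `max_chunks`. A real implementation would still need to filter entries by the query's time range and would depend on how cheaply the index can be queried up front.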

Describe alternatives you've considered A better query split strategy, like the first idea above.

Additional context I know that realizing this idea requires a lot of changes, so we can just discuss it first. What's more, Loki is an amazing product; it has greatly reduced costs in some of our datacenters.

honganan avatar Mar 22 '22 10:03 honganan

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Apr 25 '22 07:04 stale[bot]

I'd like to bump this, as we are seeing a similar issue in production. Has anyone found a solution to this problem? We typically see either an OOM or a response timeout, but when using the Loki CLI we can query a larger range without running into these issues.

jlynch93 avatar Aug 09 '22 19:08 jlynch93

I think this relates to @owen-d's work on TSDB that allows better query planning.

jeschkies avatar Aug 12 '22 16:08 jeschkies

> I think this relates to @owen-d's work on TSDB that allows better query planning.

Sounds exciting! I noticed that new TSDB code has been committed. Do you know if there is any documentation about it, such as a plan or design doc?

honganan avatar Aug 29 '22 06:08 honganan