[Feature Request] Chunk Compactor

Open splitice opened this issue 3 years ago • 25 comments

Is your feature request related to a problem? Please describe.

Too many small chunks in S3; this cannot be solved by continuing to increase the idle timeout, due to the huge memory increase that setting results in.

With some queries needing to fetch 90,000 chunks (50-100 big chunks and 89,900+ smaller ones), these smaller chunks can be the bottleneck for many queries. Quite often these smaller chunks exist because their source has infrequent bursts of activity. It would be far more ideal if <1,000 good-sized chunks (still enough to parallelize over multiple cores) were queried instead (closer to the number of streams).

Describe the solution you'd like

A utility similar to the compactor (or built into it?) that is able to create new chunks by merging small chunks (i.e. <10KB, which is 95%+ of our dataset) that were flushed due to the idle period but later had matching data.

Fetching these chunks is particularly expensive, and most of the query time is spent downloading chunks. It might also improve compression ratios (if blocks are rebuilt).

Placing this in compactor might be a good idea since the index is already being updated at this time.

This compactor should get a setting like sync_period to bound the combine search. For most people this should be the same value as the indexer's sync_period. The chunk max size would still need to be honoured, of course, so the result may be several larger chunks rather than just one chunk.

Something like:

if chunk size < min_threshold:
   for each other chunk in the index that matches the same labels:
      merge it into a new chunk
      if new chunk size > max_merge_amount or the next chunk falls in a new sync_period:
         replace the merged chunks with the new chunk

New chunks should be entirely new (new ID), and the old chunks should be removed index_cache_validity after the index containing only the new chunks is updated (to prevent cached indexes from accessing the now non-existent chunks).
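
To make the shape of this concrete, below is a minimal Go sketch of such a merge pass. Every type and helper in it (Chunk, Store, WriteMerged, DeleteAfter, the parameter names) is a hypothetical stand-in rather than anything from Loki's codebase; it only illustrates the grouping bounded by size and sync_period, and the write-new-ID-then-delete-later flow described above.

package chunkcompactor

import "time"

// Chunk is a hypothetical, simplified view of a chunk as a compactor would
// see it; it is not Loki's real chunk type.
type Chunk struct {
	ID       string
	Size     int64 // bytes
	From, To time.Time
}

// Store is a hypothetical chunk/index store abstraction.
type Store interface {
	// ChunksForStream returns the stream's chunks in time order.
	ChunksForStream(labels string) []Chunk
	// WriteMerged writes one brand-new chunk (new ID) built from parts and
	// updates the index to reference it.
	WriteMerged(labels string, parts []Chunk) error
	// DeleteAfter removes the old chunks once cached indexes have expired.
	DeleteAfter(ids []string, delay time.Duration)
}

// Compact merges runs of small chunks (below minSize) for one stream, bounded
// by maxSize (the usual chunk size limit), syncPeriod (the combine search
// window) and cacheValidity (index_cache_validity).
func Compact(s Store, labels string, minSize, maxSize int64, syncPeriod, cacheValidity time.Duration) error {
	var (
		batch      []Chunk
		batchBytes int64
		batchStart time.Time
	)

	flush := func() error {
		if len(batch) > 1 { // merging a single chunk gains nothing
			if err := s.WriteMerged(labels, batch); err != nil {
				return err
			}
			ids := make([]string, 0, len(batch))
			for _, c := range batch {
				ids = append(ids, c.ID)
			}
			s.DeleteAfter(ids, cacheValidity)
		}
		batch, batchBytes = nil, 0
		return nil
	}

	for _, c := range s.ChunksForStream(labels) {
		bigEnough := c.Size >= minSize
		newPeriod := len(batch) > 0 && c.From.Sub(batchStart) > syncPeriod
		overflow := batchBytes+c.Size > maxSize
		if bigEnough || newPeriod || overflow {
			if err := flush(); err != nil {
				return err
			}
			if bigEnough {
				continue // well-sized chunks are left untouched
			}
		}
		if len(batch) == 0 {
			batchStart = c.From
		}
		batch = append(batch, c)
		batchBytes += c.Size
	}
	return flush()
}

A real implementation would also have to rebuild blocks (which is where the compression-ratio improvement would come from) and update the index atomically.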

If the chunk compactor exits uncleanly (or has any similar issue), unreferenced chunks may end up in the chunk store. AFAIK this is possible currently regardless and is probably a separate matter.

Describe alternatives you've considered

Increasing chunk_idle_period (currently 6m) further. 10m was tested; however, it resulted in too much memory being consumed.

Screenshot showing the issue (1 week retention): [screenshot]

May resolve #1258, #4296

splitice avatar Mar 11 '22 06:03 splitice

@liguozhong since you are on the performance drive, perhaps this would be of interest to you. It's due to your great work that chunk fetching latency is now the weakest link for us. :)

Maybe your workload is similar

splitice avatar Mar 11 '22 07:03 splitice

Second this. What we have found time and time again is that, no matter how overpowered your storage is (up to and including fancy all-flash storage attached via InfiniBand), tiny files always give bad performance.

jfolz avatar Mar 14 '22 08:03 jfolz

I haven't noticed too many small files in S3 before. If this feature can be done by someone, we will definitely use it. I am looking forward to the Loki team implementing a chunk compactor.

liguozhong avatar Mar 30 '22 09:03 liguozhong

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar May 01 '22 03:05 stale[bot]

Go away stalebot

splitice avatar May 01 '22 04:05 splitice

I was observing Mimir's bucket, hoping that Loki would have the same functionality as Mimir. 🥺

Mimir has successfully solved a similar problem: bucket_compactor.go reorganizes the bucket objects every 12 or 24 hours, so the number of objects and the bucket bytes stay very compact.

miiton avatar Jul 04 '22 00:07 miiton

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.


stale[bot] avatar Aug 13 '22 12:08 stale[bot]

Get lost stale bot

splitice avatar Aug 13 '22 13:08 splitice

+1

hhkbble avatar Jan 10 '23 09:01 hhkbble

+1

capacman avatar Apr 10 '23 09:04 capacman

Hi! Are there any plans to provide chunk compaction?

timansky avatar Apr 22 '23 06:04 timansky

This feature is much needed.

xiaobeiyang avatar May 24 '23 08:05 xiaobeiyang

+1

iskyneon avatar May 25 '23 21:05 iskyneon

Hello,

my apps can't produce enough logs to create a good chunk before pod/node rotation. E.g. I've set max_chunk_age and chunk_idle_period to 12 hours, but my app rotates after 2 hours (autoscaling in Kubernetes).

This is why, imo, this feature is a really good idea.

tabramczyk avatar Jul 13 '23 07:07 tabramczyk

has this been implemented?

xh63 avatar Sep 11 '23 01:09 xh63

No this does not exist

splitice avatar Sep 11 '23 02:09 splitice

ll chunks/fake/ | wc -l
1323523

We have an insane number of subdirectories in the chunks directory.

I think Loki could use a database-like file structure, keeping the index and chunks in one single file (or a few files) rather than one file per stream.

xpader avatar Dec 18 '23 14:12 xpader

any update here?

pingping95 avatar Feb 03 '24 13:02 pingping95

Solved this problem on my side. If you face this problem, first of all check your labels and streams and read carefully https://grafana.com/docs/loki/latest/get-started/labels/bp-labels/ (the "Use dynamic labels sparingly" section).

You can check current labels using logcli series --analyze-labels '{}'; the Prometheus metric rate(loki_ingester_chunks_flushed_total[1m]) is also a good way to understand what's happening.

Here is my before:

Total Streams:  2240
Unique Labels:  17

Label Name     Unique Values  Found In Streams
thread         103            2087
task_name      80             2170
job_name       68             2170
filename       50             97
host           38             99
namespace      7              2170
service        6              72
project        6              72
dc             6              2170
logger         5              2143
severity       3              2137
allocation_id  2              3
source         2              29
environment    2              41
role           2              41
app            1              2137
agent          1              68

and after moving the thread label into the message instead:

Total Streams:  255
Unique Labels:  16

Label Name     Unique Values  Found In Streams
task_name      80             188
job_name       68             188
filename       49             94
host           38             96
namespace      7              188
project        6              69
dc             6              188
service        6              69
logger         5              161
severity       3              154
source         2              29
role           2              38
allocation_id  2              4
environment    2              38
agent          1              65
app            1              154

The amount of logs has not changed.

Also some tuning of the Loki config (changes only):

compactor:
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 20m

ingester:
  chunk_idle_period: 60m
  chunk_retain_period: 60m
  max_chunk_age: 4h  # I think this is too big, but it is OK for me for now
  chunk_target_size: 54857600

And the number of files in MinIO storage before and after all these tunes: [screenshots]

kadaj666 avatar Feb 07 '24 09:02 kadaj666

+1

The absence of this makes S3 less suitable as the Loki store.

iamjvn avatar Mar 20 '24 17:03 iamjvn

Solved this problem on my side. […]

Good for you, man. But this problem is present in correctly configured (at least I think they are) Loki instances as well. For example, when you have hundreds or thousands of short-lived pods daily (easy when you have batch jobs), you can rack up files on S3 quickly. Why? Because the pod name itself is a great example of a Loki anti-pattern! Most people want at least namespace, pod and container in Loki's index, mostly because you have no guarantee of any other metadata being present.

boniek83 avatar Mar 21 '24 15:03 boniek83

+1. We have a lot of small batch jobs that generate very few logs and then go idle, but they're logged in different tenants so can't be combined easily. And even if we did solve this, the 100+ million existing small chunk files would still be there forever.

precisionconage avatar May 24 '24 16:05 precisionconage

I've worked around this issue by using structured metadata. Structured metadata + bloom filters works really nice.

boniek83 avatar May 24 '24 18:05 boniek83

I've worked around this issue by using structured metadata. Structured metadata + bloom filters works really nice.

Can you describe your method in more detail?

DANic-git avatar May 28 '24 08:05 DANic-git

You need to define labels that are not indexed but instead moved to structured metadata in your Alloy/agent config file. Below is a fragment of my config file that does that and also removes all the labels that I don't really need.

      loki.process "pods" {
        forward_to = [loki.write.default.receiver]

        stage.structured_metadata {
            values = {
              pod         = "",
              container   = "",
            }
        }

        stage.label_drop {
          values = [ "filename", "stream", "pod", "container" ]
        }

As I mentioned above, I have a lot of small pods created daily that don't write that much log data. With this change, all my data is indexed only behind the namespace label, but I can still query individual pods and containers like this: {namespace="$namespace"} | pod="$pod". Thanks to this, the number of streams basically collapsed. Here's how many files per second were created before and after this change (ignore 22/04/2024 - I was playing around and Loki was down; the change took effect after 24/04/2024): [graph]

This alone will work well for most small clusters, but if you have a lot of data you may want to enable the experimental bloom filter queries described in the documentation. They make these queries fly by reading only the lines containing the pods you queried for, instead of reading everything from the queried namespace and only then filtering by pod.

boniek83 avatar May 28 '24 15:05 boniek83