
Old chunks not getting deleted after retention period

Open wzjjack opened this issue 3 years ago • 45 comments

Describe the bug
I've configured 168h retention for my logs, but I can see chunks 5 years old filling my disk.

To Reproduce

This is my config:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  grpc_server_max_recv_msg_size: 8388608
  grpc_server_max_send_msg_size: 8388608
querier:
  engine:
    max_look_back_period: 168h      

ingester:
  wal:
    enabled: true
    dir: /tmp/wal
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 24h       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 24h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to this size (1MB here), flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 5m    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    cache_location: /tmp/loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /tmp/loki/chunks

compactor:
  working_directory: /tmp/loki/boltdb-shipper-compactor
  shared_store: filesystem

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_streams_per_user: 1000000
  max_entries_limit_per_query: 5000000
  ingestion_rate_mb: 100
  ingestion_burst_size_mb: 20 
chunk_store_config:
  max_look_back_period: 168h

table_manager:
  retention_deletes_enabled: false
  retention_period: 168h

Expected behavior
Chunks older than 168h should be deleted.

Environment:

  • Infrastructure: [e.g., Kubernetes, bare-metal, laptop]
  • Deployment tool: [e.g., helm, jsonnet]

Screenshots, Promtail config, or terminal output
We can see 49 days of logs although I've configured 168h (screenshot attached).

wzjjack avatar Jun 03 '22 06:06 wzjjack

I have the same issue. Logs older than 7 days are deleted from the index and are no longer visible in Grafana, but the chunk files themselves are not deleted from the filesystem.

loki, version 2.5.0 (branch: HEAD, revision: 2d9d0ee23)
  build user:       root@4779f4b48f3a
  build date:       2022-04-07T21:50:00Z
  go version:       go1.17.6
  platform:         linux/amd64
server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

chunk_store_config:
  max_look_back_period: 168h

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h

DeBuXer avatar Jun 03 '22 13:06 DeBuXer

Use the compactor, not the table_manager, if you aren't using AWS S3.
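
For example, a minimal sketch of compactor-based retention with filesystem storage (the path is a placeholder and the values are illustrative, tune them to your needs):

compactor:
  working_directory: /loki/retention   # placeholder path
  shared_store: filesystem
  retention_enabled: true              # required, otherwise the compactor never deletes anything
  compaction_interval: 10m
limits_config:
  retention_period: 168h               # how long to keep data

With this in place, the table_manager retention settings are not what deletes your chunks, so they can be left disabled.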

splitice avatar Jun 04 '22 07:06 splitice

Use the compactor, not the table_manager, if you aren't using AWS S3.

Thanks, that did the trick :)

DeBuXer avatar Jun 07 '22 08:06 DeBuXer

Hey @DeBuXer, could you post what you added to your config file in order to get deletion on S3 working? I am running into the same issue and have not found a solution.

Mastedont avatar Jun 14 '22 09:06 Mastedont

@Mastedont, I don't use S3, I store my chunks directly on disk. My current configuration:

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

chunk_store_config:
  max_look_back_period: 168h

compactor:
  working_directory: /var/lib/loki/retention
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

limits_config:
  retention_period: 168h

ruler:
  alertmanager_url: http://127.0.0.1:9093

DeBuXer avatar Jun 14 '22 10:06 DeBuXer

Thank you, @DeBuXer

Mastedont avatar Jun 14 '22 10:06 Mastedont

One last question @DeBuXer.

Do you know how I can tell whether log retention is working or not? What output in the logs indicates that it is working?

Mastedont avatar Jun 14 '22 11:06 Mastedont

@Mastedont, Not 100% sure, but I guess:

Jun 14 15:15:44 loki loki[277929]: level=info ts=2022-06-14T13:15:44.57537489Z caller=index_set.go:280 table-name=index_19150 msg="removing source db files from storage" count=1
Jun 14 15:15:44 loki loki[277929]: level=info ts=2022-06-14T13:15:44.576099223Z caller=compactor.go:495 msg="finished compacting table" table-name=index_19150

DeBuXer avatar Jun 14 '22 13:06 DeBuXer

Is that log output from the ingester?

I can only see output like this, despite having the compactor enabled:

level=info ts=2022-06-14T14:05:17.574831148Z caller=table.go:358 msg="uploading table loki_pre_19157"
level=info ts=2022-06-14T14:05:17.574847901Z caller=table.go:385 msg="finished uploading table loki_pre_19157"
level=info ts=2022-06-14T14:05:17.57485537Z caller=table.go:443 msg="cleaning up unwanted dbs from table loki_pre_19157"

Mastedont avatar Jun 14 '22 14:06 Mastedont

Is that log output from the ingester?

From /var/log/syslog, but it should contain the same information. When the compactor is enabled, you should see something like:

level=info ts=2022-06-14T14:24:56.072803949Z caller=compactor.go:324 msg="this instance has been chosen to run the compactor, starting compactor"

DeBuXer avatar Jun 14 '22 14:06 DeBuXer

@DeBuXer, thanks a lot for your support here. I don't see the chunk files getting rotated, and I also see pretty old index directories. I want my logs to be rotated every 7 days. I am not sure what I am doing wrong here. Could you please help me with it?

auth_enabled: false
chunk_store_config:
  max_look_back_period: 168h

compactor:
  shared_store: filesystem
  working_directory: /data/loki/boltdb-shipper-compactor

ingester:
  chunk_block_size: 262144
  chunk_idle_period: 3m
  chunk_retain_period: 1m
  wal:
    dir: /data/loki/wal 
    flush_on_shutdown: true
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  max_transfer_retries: 0
 
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 32
  ingestion_burst_size_mb: 36
  unordered_writes: true
  retention_period: 168h

schema_config:
  configs:
  - from: 2020-10-24
    index:
      period: 24h
      prefix: index_
    object_store: filesystem
    schema: v11
    store: boltdb-shipper

server:
  http_listen_port: 3100

storage_config:
  boltdb_shipper:
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: /data/loki/chunks

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h

rickydjohn avatar Jul 05 '22 07:07 rickydjohn

@rickydjohn, I think you need to enable retention_enabled in the compactor block, as sketched below. See also https://grafana.com/docs/loki/latest/operations/storage/retention/#retention-configuration
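
A minimal sketch based on your posted config; the interval and delay values are illustrative, not something you must copy:

compactor:
  shared_store: filesystem
  working_directory: /data/loki/boltdb-shipper-compactor
  retention_enabled: true             # without this the compactor only compacts the index, it never applies retention
  compaction_interval: 10m            # illustrative
  retention_delete_delay: 2h          # illustrative
  retention_delete_worker_count: 150  # illustrative

The retention_period: 168h you already have in limits_config is what sets the actual retention window once the compactor is allowed to delete.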

DeBuXer avatar Jul 05 '22 07:07 DeBuXer

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Aug 13 '22 12:08 stale[bot]

Hi @Mastedont, did you manage to get the chunks deleted from S3? I'm having the same problem: I cannot see any logs about the compactor, and the S3 store contains files older than my configured retention (>7 days). It seems only the index is cleared, because Grafana won't show older log entries.

atze234 avatar Aug 30 '22 08:08 atze234

Hi @Mastedont, did you manage to get the chunks deleted from S3? I'm having the same problem: I cannot see any logs about the compactor, and the S3 store contains files older than my configured retention (>7 days). It seems only the index is cleared, because Grafana won't show older log entries.

I have the same problem

ghost avatar Sep 05 '22 07:09 ghost

Hi, I have this relevant config:

compactor:
      compaction_interval: 10m
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
      retention_enabled: true
      shared_store: s3
      working_directory: /var/loki/retention
limits_config:
      enforce_metric_name: false
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      retention_period: 720h
      split_queries_by_interval: 30m

but log files are not deleted on S3, only index compacted.

webfrank avatar Nov 16 '22 17:11 webfrank

I'm also seeing this on Loki 2.4.0, using MinIO as storage. Even with retention_delete_delay: 5m, no chunks are being deleted.

jarrettprosser avatar Nov 18 '22 03:11 jarrettprosser

@Mastedont, I don't use S3, I store my chunks directly on disk. My current configuration:

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

chunk_store_config:
  max_look_back_period: 168h

compactor:
  working_directory: /var/lib/loki/retention
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

limits_config:
  retention_period: 168h

ruler:
  alertmanager_url: http://127.0.0.1:9093

Hi, will this configuration clean up expired files in the chunks directory?

Codecaver avatar Dec 19 '22 08:12 Codecaver

Any update?

patsevanton avatar Dec 30 '22 10:12 patsevanton

Judging from the discussion in https://github.com/grafana/loki/issues/7068, I don't think the compactor will delete the chunks in an S3 object store; you need a bucket lifecycle policy for that.

It would be nice to have a clear answer on this though.

seany89 avatar Jan 27 '23 13:01 seany89

For everyone wondering what's going on with retention: I've tested the feature a lot over the past few days, so here is what works.

Minimal Configuration Needed

First of all, you absolutely need the following settings:

limits_config:
        retention_period: 10d # Keep 10 days
compactor:
        delete_request_cancel_period: 10m # don't wait 24h before processing the delete_request
        retention_enabled: true # actually do the delete
        retention_delete_delay: 2h # wait 2 hours before actually deleting stuff

You can tweak these settings to delete faster or slower.

Check If It's Working

Once you have this config up and running, check that the logs actually report that retention is being applied: msg="applying retention with compaction". The "caller" for this log line is compactor.go.

Next, check that the retention manager is actually doing its job in the logs: msg="mark file created" and msg="no marks file found" from the caller marker.go.

The mark file created message means that Loki found some chunks to be deleted and has created a file to keep track of them. The no marks file found message means that while performing the chunk delete routine, no marker file matched its filters, the filter mainly being the delay.

Whenever you see the mark file created log, you can go into the working directory of the compactor and check for the marker files. The path should be something like /var/loki/compactor/retention/markers. These files are kept there for 2 hours, or whatever is set in retention_delete_delay. After retention_delete_delay has passed, Loki will delete the chunks.

Not seeing any of the logs mentioned above means that the retention process has not started.

Important Notes

Loki will only delete chunks that are indexed, and the index entries are actually purged before the chunks are deleted. This means that if you lose files from the compactor's working directory, whatever chunks were marked there will never be deleted, so it is still worth having a lifecycle policy to cover for this OR persistent storage for this particular folder (one way to do the latter is sketched below).
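
If you run the compactor in Kubernetes, one way to get persistent storage for that folder is to mount a PersistentVolumeClaim at the compactor's working directory. A rough sketch only, not taken from this thread; the name, size, and mount path are placeholders:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loki-compactor-retention   # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                # placeholder size
# Mount this claim in the compactor pod at the working_directory
# (e.g. /var/loki/compactor) so the marker files survive restarts.

On bare metal the equivalent is simply making sure working_directory points at a disk that survives restarts.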

nvanheuverzwijn avatar Feb 15 '23 19:02 nvanheuverzwijn

@nvanheuverzwijn if I were the CTO of Grafana Labs, I would give you a job offer immediately

amseager avatar Mar 05 '23 13:03 amseager

@nvanheuverzwijn Thank you a lot! Your explanation makes it clear. The Loki documentation confused me into thinking that the Table Manager also deletes chunks when using the filesystem chunk store.

adthonb avatar Apr 28 '23 00:04 adthonb

More info about 'Check If It's Working': with compaction_interval: 10m, and assuming the Loki instance started at 2023-07-11T12:30:25.060395045Z, the caller=compactor.go logs show up at ts=2023-07-11T12:40:25.047110295Z:

level=info ts=2023-07-11T12:30:25.060441045Z caller=compactor.go:440 msg="waiting 10m0s for ring to stay stable and previous compactions to finish before starting compactor"
level=info ts=2023-07-11T12:40:25.045542628Z caller=compactor.go:445 msg="compactor startup delay completed"
level=info ts=2023-07-11T12:40:25.045568295Z caller=compactor.go:497 msg="compactor started"
level=info ts=2023-07-11T12:40:25.04562367Z caller=compactor.go:454 msg="applying retention with compaction"
level=info ts=2023-07-11T12:40:25.047110295Z caller=compactor.go:609 msg="compacting table" table-name=loki_index_19549
level=info ts=2023-07-11T12:40:25.047208753Z caller=table_compactor.go:325 table-name=loki_index_19549 msg="using compactor-1689078092.gz as seed file"
level=info ts=2023-07-11T12:40:25.048495753Z caller=util.go:85 table-name=loki_index_19549 file-name=compactor-1689078092.gz msg="downloaded file" total_time=1.280041ms
level=info ts=2023-07-11T12:40:25.06665592Z caller=compactor.go:614 msg="finished compacting table" table-name=loki_index_19549
level=info ts=2023-07-11T12:40:25.066668503Z caller=compactor.go:609 msg="compacting table" table-name=loki_index_19548
level=info ts=2023-07-11T12:40:25.067591628Z caller=util.go:85 table-name=loki_index_19548 file-name=compactor-1689041382.gz msg="downloaded file" total_time=863.125µs
level=info ts=2023-07-11T12:40:25.078401878Z caller=compactor.go:614 msg="finished compacting table" table-name=loki_index_19548

yangmeilly avatar Jul 11 '23 12:07 yangmeilly

@yangmeilly

Can you please send your full loki.yaml config?

It's not working for me.

Nurlan199206 avatar Jul 12 '23 07:07 Nurlan199206

@yangmeilly

Can you please send your full loki.yaml config?

It's not working for me.

In my scenario I'm using boltdb-shipper for indexes and filesystem for chunks. The relevant parts of my full Loki config are as follows; the retention-related settings are the ones to pay attention to.

compactor:
  compaction_interval: 10m
  delete_request_cancel_period: 2h
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  retention_enabled: true
  shared_store: filesystem
  working_directory: /var/loki/retention

limits_config:
  enforce_metric_name: false
  max_cache_freshness_per_query: 10m
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  split_queries_by_interval: 15m
  retention_period: 72h
  max_query_lookback: 72h

table_manager:   # this makes no sense for filesystem, so retention is disabled here
  retention_deletes_enabled: false
  retention_period: 0

yangmeilly avatar Jul 13 '23 08:07 yangmeilly

Whenever you see the mark file created log, you can go into the working directory of the compactor and check for the marker files. The path should be something like /var/loki/compactor/retention/markers. These files are kept there for 2 hours, or whatever is set in retention_delete_delay. After retention_delete_delay has passed, Loki will delete the chunks.

Not seeing any of the logs mentioned above means that the retention process has not started.

@nvanheuverzwijn Thanks for the info. Regarding your statement that Loki will delete the chunks, are you talking about a filesystem backend only, or also an S3/Azure backend? I can't find a definitive answer stating that Loki is able to delete chunks from external storage.

HammerNL89 avatar Jul 25 '23 12:07 HammerNL89

It will also delete on S3/Azure. I did this with Google Cloud Storage, but it should be the same for the other backends.


nvanheuverzwijn avatar Jul 25 '23 13:07 nvanheuverzwijn

@nvanheuverzwijn The compactor did not delete the chunks. Why?

compactor log:

level=info ts=2023-08-03T06:50:12.634846248Z caller=compactor.go:497 msg="compactor started"
level=info ts=2023-08-03T06:50:12.634865722Z caller=compactor.go:454 msg="applying retention with compaction"
level=info ts=2023-08-03T06:50:12.634865349Z caller=marker.go:177 msg="mark processor started" workers=150 delay=2h0m0s
level=info ts=2023-08-03T06:50:12.634955656Z caller=expiration.go:78 msg="overall smallest retention period 1690440612.634, default smallest retention period 1690440612.634"
ts=2023-08-03T06:50:12.635021334Z caller=spanlogger.go:85 level=info msg="building index list cache"
level=info ts=2023-08-03T06:50:12.635046761Z caller=marker.go:202 msg="no marks file found"

config:

storage_config:
  aws:
    access_key_id: xxxxxx
    bucketnames: loki
    endpoint: https://s3.xxxx.com
    s3forcepathstyle: true
    secret_access_key: xxxxx
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 24h
    index_gateway_client:
      server_address: dns:///loki-distributed-index-gateway:9095
    shared_store: s3

compactor:
  retention_enabled: true
  shared_store: s3
  working_directory: /var/loki/compactor
  retention_delete_delay: 2h
  delete_request_cancel_period: 10m

limits_config:
  enforce_metric_name: false
  ingestion_burst_size_mb: 1024
  ingestion_rate_mb: 1024
  max_cache_freshness_per_query: 10m
  max_global_streams_per_user: 0
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 1h
  split_queries_by_interval: 15m

Update: it was caused by my incorrect configuration.

The storage configuration needs to be placed under the common block.

common:
  compactor_address: http://loki-distributed-compactor:3100
  storage:
    s3:
      access_key_id: xxxxxx
      bucketnames: loki
      endpoint: https://s3.xxxx.com
      s3forcepathstyle: true
      secret_access_key: xxxxxx

stringang avatar Aug 03 '23 10:08 stringang

@nvanheuverzwijn so beautiful

yangfan-witcher avatar Aug 15 '23 09:08 yangfan-witcher