
Log deletion does not work

Open · mac133k opened this issue 1 year ago · 6 comments

Describe the bug
Two deletion requests were submitted to the Loki delete API. The requests were received and GET requests return their state as 'received', but that has not changed for 5 days now and the logs in question remain searchable. There has been no mention of the deletion requests in the Loki compactor logs since the submission confirmation.

To Reproduce
Submit a deletion request to the Loki API using curl. Check the Loki logs and the API for the status. Wait... Search for the logs.
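
For reference, a minimal sketch of the submission and status check, assuming the delete API is served by the compactor on port 3100; the host, label selector, time range, and tenant ID below are placeholders:

# Create a delete request (POST, per the Loki log-deletion API):
curl -g -X POST \
  'http://compactor:3100/loki/api/v1/delete?query={app="example"}&start=1706140800&end=1706227200' \
  -H 'X-Scope-OrgID: example-tenant'

# List existing delete requests and their status for the same tenant:
curl -s 'http://compactor:3100/loki/api/v1/delete' -H 'X-Scope-OrgID: example-tenant'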

Expected behavior
The logs were expected to at least be removed from search results, and ideally from S3 storage too.

Environment:

  • Linux VMs, micro-services mode
  • Deployment tool: Ansible
  • Loki version: 2.9.1
  • Loki compactor and limits config blocks:
compactor:
  compaction_interval: 10m
  retention_delete_delay: 24h
  retention_enabled: true
  working_directory: /data1/loki/compactor
limits_config:
  cardinality_limit: 1000
  deletion_mode: filter-and-delete
  retention_period: 5y
  • Storage backend: on-prem S3 (Ceph)
  • Storage schema: v12 TSDB
  • IndexGWs are active in the cluster.

Please let me know if you need more information.

mac133k commented Feb 13 '24

Using TSDB, we were able to delete logs by performing a curl to localhost, directly from the compactor pod:

curl -g -X PUT 'http://localhost:3100/loki/api/v1/delete?query={namespace="foo"}&start=1706140800&end=1707916080' -H 'X-Scope-OrgID:global'
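
In a Kubernetes setup this presumably amounts to something like the following; the namespace and workload name are placeholders, and curl is assumed to be available in the compactor image:

kubectl -n loki exec deploy/loki-compactor -- \
  curl -g -X PUT 'http://localhost:3100/loki/api/v1/delete?query={namespace="foo"}&start=1706140800&end=1707916080' \
  -H 'X-Scope-OrgID:global'

# The state of submitted requests can then be checked from the same pod:
kubectl -n loki exec deploy/loki-compactor -- \
  curl -s 'http://localhost:3100/loki/api/v1/delete' -H 'X-Scope-OrgID:global'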

mhulscher commented Feb 21 '24

We also submit delete requests with curl and they show up as 'received', but none has ever actually worked in our PROD cluster. I tried deleting logs in the DEV cluster, which is much smaller in terms of the number of hosts and the volume of logs ingested, and there the deletion worked only partially: when I requested deletion of logs over a 24h period and reran the query a few minutes later, I could see 8 gaps, each 1-2h wide. There were no further changes to the delete request state or the target data over the following days.
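
For what it's worth, one way to make such gaps visible is an hourly line count over the same selector and window via the query_range API; the host, tenant, selector, and times below are placeholders:

# Count log lines per hour over the deletion window:
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  -H 'X-Scope-OrgID: example-tenant' \
  --data-urlencode 'query=sum(count_over_time({app="example"}[1h]))' \
  --data-urlencode 'start=2024-01-25T00:00:00Z' \
  --data-urlencode 'end=2024-01-26T00:00:00Z' \
  --data-urlencode 'step=1h'
# Hours with a near-zero count correspond to the deleted (or partially deleted) ranges.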

If anyone has ideas on how to investigate this, please suggest them.

mac133k commented Feb 21 '24

One thing that stands out is that a delete request can sit in the 'received' state for days or weeks without any further action. If there was a problem with the delete query, or no logs could be found for deletion, there should be an update or a change of state.

mac133k commented Feb 27 '24

Seeing a similar issue in our cluster as well: requests are received, but no logs are being removed.

Starefossen commented Apr 17 '24

Using TSDB, we were able to delete logs by performing a curl to localhost, directly from the compactor pod:

curl -g -X PUT 'http://localhost:3100/loki/api/v1/delete?query={namespace="foo"}&start=1706140800&end=1707916080' -H 'X-Scope-OrgID:global'

@mhulscher The logs you successfully deleted: were they already saved in the chunk store, or still in the ingesters' RAM? Also, in your Loki cluster, was the chunk store set up on a local FS or on an external S3 service?

mac133k commented May 17 '24

I have the same issue with an S3 storage backend.

hieunguyen847 commented May 22 '24

We also have a need to delete log entries. Please consider prioritizing this bug resolution.

iamjvn commented Jun 14 '24

Hi @sandeepsukhani and @MichelHollands,

I looked into the compactor/deletion code and found that you two are the major contributors. I'm tagging both of you in the hope that you could have a quick look at this issue and provide some insights if possible.

I also noticed that most of the pull requests related to the boltdb-shipper store were made about three years ago. Has this deletion feature been updated to work with TSDB since then?

Thank you!

billmoling commented Jul 21 '24

We are also observing the same: the log deletion request stays pending and is never completed, and no logs are deleted. Any update on this?

jakubsikorski commented Aug 13 '24

Here is an interesting case: I recently discovered in one of our Loki clusters that a request to delete logs dated Jan 29 through Feb 6, submitted to the compactor API on Mar 28, was suddenly processed on Jun 22. Looking into the logs, there is no clear indication of why the delete request was triggered on that particular day, 86 days after submission and 145 days from the start of the deletion time range:

  • compactor startup delay completed
  • applying retention with compaction
  • compactor started
  • overall smallest retention period 1716033289.014, default smallest retention period 1716033289.014
  • followed by a series of "caller=marker.go:177 msg="mark processor started" workers=150 delay=24h0m0s"
  • then these started appearing: "caller=delete_requests_manager.go:136 msg="Started processing delete request for user" delete_request_id=3dc1912b user=***"
  • concluded by: "caller=delete_requests_manager.go:214 msg="Processing 70 of 197 delete requests. More requests will be processed in subsequent compactions""
  • then there were a few batches of: "caller=delete_requests_manager.go:328 msg="delete request for user marked as processed" delete_request_id=3dc1912b sequence_num=NNN user=*** deleted_lines=XXX" on that same day (Jun 22), over a period of about 8 hours.

That day (Jun 22) the compactors were processing tables dated back to the end of January and the beginning of February; however, the first reference to those index tables appeared in the logs about 1 hour after the first "Started processing delete request for user" message.
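
If it helps anyone trying to trace a similar case, one way to reconstruct such a timeline is to pull every compactor log line for the request ID in question; the systemd unit name and log path below are placeholders for this kind of VM deployment:

# Via journald, if Loki runs as a systemd service:
journalctl -u loki --since "2024-03-28" | grep 'delete_request_id=3dc1912b'
# Or directly from a log file, if the compactor logs to disk:
grep 'delete_request_id=3dc1912b' /var/log/loki/compactor.log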

I am confused by my findings so far, but I can dig into it more if someone gives me hints.

mac133k commented Aug 15 '24