Compactor may fail to delete chunks in SSD mode, marker files are only stored locally
Describe the bug
The retention (and delete) process is two-phased: first the compactor scans the index for chunks that should be deleted, either because retention has expired or because of delete requests.
It stores references to these chunks in local files in a markers/ directory for the second phase, which reads those files and deletes the chunks from storage.
Because SSD mode does leader election to choose a single Read or Backend node to run the compactor, it is possible for the compactor node to change while there are unprocessed chunk deletes in the marker files. Most likely those chunks will then never be deleted from object storage (unless that node becomes the leader-elected compactor again).
The second-phase process that deletes the chunks listed in marker files runs every minute; however, the setting retention_delete_delay determines when a marker file becomes eligible for processing.
The default for retention_delete_delay is 2h, which creates a 2-hour window between when a marker file is created and when its contents are processed for deletion.
If the active compactor node in SSD mode changes permanently to a new node within this window, the chunks in those marker files will never be deleted.
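For reference, the settings involved look roughly like this (a minimal sketch using option names from Loki's compactor configuration; values and the path are illustrative, not recommendations):

```yaml
compactor:
  # Marker files are written under this directory; if it is not on a
  # persistent volume, they are lost when the pod is rescheduled.
  working_directory: /var/loki/compactor
  retention_enabled: true
  # Default 2h: a marker file sits for at least this long after it is
  # written before its contents are processed, which is the window above.
  retention_delete_delay: 2h
```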
Leader election is done using the hash ring Loki uses for most of its distributed-system operations. A ring is created for the compactors and shared via memberlist; each compactor creates a single token and inserts it into the ring, so with one node it owns 100% of the ring, with two nodes roughly 50% each, and so on. Whichever node owns key 0 in the ring is elected leader.
Leader changes are therefore probabilistic: a new leader is only elected if a newly joined node randomly generates a token that makes it the owner of key 0 in the ring.
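In SSD mode this is just the compactor ring backed by memberlist; a minimal sketch of the relevant configuration (block names as in Loki's config reference, worth verifying against your version):

```yaml
compactor:
  compactor_ring:
    kvstore:
      # Each Read/Backend node registers a single token in this ring;
      # the instance that ends up owning key 0 acts as the compactor.
      store: memberlist
```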
Note: if you run a single compactor StatefulSet with persistent storage (as in microservices mode), or a single-binary Loki, you are not affected by this.
Workarounds:
- Enable a TTL on chunks in object storage that exceeds your longest configured retention setting in Loki, so that any chunks missed by the compactor are eventually cleaned up by the object storage's own retention settings. NOTE: if you go this route, figure out how you will remind yourself that this setting exists, so that you don't some day increase retention in Loki only to have the object storage delete your chunks anyway. Keep comments around the retention settings in your Loki config file as a reminder that the object storage is also enforcing retention for cleanup purposes, as sketched below.
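A sketch of what such a reminder could look like in the Loki config (retention_period under limits_config is standard Loki configuration; the 45-day bucket TTL is purely an example value):

```yaml
limits_config:
  # WARNING: the chunk bucket also has an object-storage lifecycle/TTL rule
  # that deletes objects after 45 days, as a safety net for chunks the
  # compactor misses. If retention_period is ever raised above that, raise
  # the bucket TTL first, or the object store will delete live chunks.
  retention_period: 744h  # 31 days
```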
Solutions:
- marker files should go to object storage.
Less ideal solutions:
- all compactors could run sweepers to make sure they clean up any marker files they have, even if they aren't the elected leader
Hi @slim-bean, is there any update on this issue? I am running Loki 3.1.0 in SSD mode (3 replicas). When I brought up the cluster, I noticed compactor-related logs as below:
backend-0:
backend-1:
backend-2:
It seems backend-1 has been chosen as the pod that runs the compactor, but the compactor service on backend-2 also starts after the "stop". May I know if this is normal?
This issue should be prioritized for the following reasons:
- The behavior is triggered with the default values of a Helm chart deployment using Simple Scalable Deployment (SSD) mode.
- If a marker file is lost, the chunks it tracked become headless and will never be deleted by Loki, so data accumulates over time.
- When users enable data retention for the first time, a chain of configuration changes can easily cause marker files to be lost, leaving a large bulk of chunk data in storage headless.
Potentially related: https://github.com/grafana/loki/issues/13072 https://github.com/grafana/loki/issues/13687 https://github.com/grafana/loki/issues/15479
Here is a possible scenario:
- Loki is installed using the default values of Helm charts. SSD Mode. loki-backend with 3 replicas. No retention.
- Loki collects logs for 3 months, amounting to ~300GiB of chunk data (~100GiB/month).
- Retention is introduced with a retention of 1 month.
- When retention is introduced, an event (see below) causes the loss of the marker file that tracked the deletion of ~200GiB of chunk data.
- Users are unable to trigger the deletion of those ~200GiB of chunks through Loki, as they have become headless (i.e. no longer in the index).
By default the marker files end up in the compactor.working_directory, which is not persistent.
The event in step 4 could be:
- Any Kubernetes event such as worker node pressure: the loki-backend pod holding the marker file for the historical data is recreated before the file can be processed.
- The default value of retention_delete_delay being 2 hours might lead users to not observe the expected deletion, try out another configuration, and thus recreate the loki-backend pods, causing the loss of the marker file.
- A leader-election swap, as described by @slim-bean, after the creation of the marker file.
- Any error causing the loki-backend pods to be recreated (e.g. OOMKilled).
The docs spread this knowledge across two different "Note" boxes:
First note:
Run the Compactor as a singleton (a single instance).
https://grafana.com/docs/loki/latest/operations/storage/retention/#compactor
Second note:
Marker files should be stored on a persistent disk to ensure that the chunks pending for deletion are processed even if the Compactor process restarts.
Grafana Labs recommends running Compactor as a stateful deployment (StatefulSet when using Kubernetes) with a persistent storage for storing marker files.
https://grafana.com/docs/loki/latest/operations/storage/retention/
With the current Helm chart, following the recommendation to run the compactor as a singleton is neither easy nor intuitive, mainly because the target=backend alias includes the compactor and the default SSD mode relies on it. It would require mixing SSD and Distributed mode: the target of the loki-backend pods would have to exclude the compactor, and a compactor StatefulSet with replicas=1 would have to be enabled, as sketched below.
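A rough sketch of what that mix could look like in Helm values (these key names are assumptions about the grafana/loki chart and vary between chart versions, so verify them against the chart's values.yaml):

```yaml
# Hypothetical values.yaml excerpt; key names must be verified against
# the chart version in use.
backend:
  replicas: 3
  targetModule: "backend"  # would need a variant that excludes the compactor
compactor:
  replicas: 1              # dedicated singleton compactor StatefulSet
  persistence:
    enabled: true          # keep the markers/ working directory on a PVC
```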
Ideas:
- Refactor the Helm SSD mode to deploy the compactor as a singleton by default, using a Persistent Volume Claim (PVC). Include an option to disable the PVC, with a warning if this is chosen. Adding another target alias for the backend without the compactor, or removing the compactor from the existing backend alias, could reduce the impact on the Helm chart.
- Implement support for storing marker files in object storage to enable synchronization among compactors, similar to the existing mechanism used for delete requests.