
Orphan segments in Ceph

Open poriniki opened this issue 1 month ago • 0 comments

Affected Version

Druid 33.0.0

Description

We have identified a potential issue with orphaned segments in our deployment, which utilizes Ceph as deep storage and Postgres as the metadata store.

Several anomalies have been observed:

  • A significant number of segments are present in Ceph but missing from Postgres metadata.
  • Some of these segments are very old, have exceeded their retention period, and were never cleaned up.
  • A subset of segments had never been loaded by the cluster because they did not exist in Postgres at all, implying they were unknown to the coordinator.
  • After manually deleting these segments from Ceph, there were no related errors or recovery attempts from the cluster, and Ceph disk usage dropped noticeably, confirming they were unused and orphaned.
  • The steps we took to remove them were:
    • List the segment keys in Ceph.
    • List the keys known to Postgres from the payload field of the druid_segments table (payload → loadSpec → key).
    • Compare the two lists and delete the keys that existed in Ceph but not in Postgres.
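
The comparison step above can be sketched roughly as follows. This is a minimal illustration, not the exact script we ran. It assumes the Ceph object keys and the druid_segments payloads have already been fetched and that each payload has been decoded to JSON text. The function names `keys_from_payloads` and `find_orphans` are hypothetical. Listing the bucket and querying Postgres are left out on purpose.

```python
import json
from typing import Iterable, List, Set


def keys_from_payloads(payloads: Iterable[str]) -> Set[str]:
    """Collect loadSpec.key from each druid_segments payload (JSON text).

    Assumption: the payload has a loadSpec object with a "key" field,
    as in the payload -> loadSpec -> key path described above.
    Payloads without such a key are skipped.
    """
    keys: Set[str] = set()
    for raw in payloads:
        load_spec = json.loads(raw).get("loadSpec") or {}
        key = load_spec.get("key")
        if key is not None:
            keys.add(key)
    return keys


def find_orphans(ceph_keys: Iterable[str], payloads: Iterable[str]) -> List[str]:
    """Return keys present in Ceph but absent from Postgres metadata, sorted."""
    known = keys_from_payloads(payloads)
    return sorted(k for k in set(ceph_keys) if k not in known)
```

Because the two listings are not taken atomically, a key written to Ceph between them would be reported as an orphan, so it seems safer to review the result (for example, by object age) before deleting anything.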

Additional context:

  • These segments appear to be completely unmanaged by Druid since their metadata entries never existed or were removed prematurely.
  • Manual deletion did not cause any segment load/unload events, coordinator log warnings, or missing segment alerts.

poriniki, Nov 29 '25 07:11