Daemons: reaper, avoid multiple reaper workers working on the same replicas; rucio#6512

Open labkode opened this issue 3 months ago • 6 comments

Fixes https://github.com/rucio/rucio/issues/6512

Overview

This branch introduces two complementary mechanisms to reduce the likelihood of multiple reaper workers working on the same replicas that are about to be deleted.

The two mechanisms are:

  • A) Immediate cleaning of replicas from the Rucio DB (configurable)
  • B) Refreshing replicas to be deleted (always enabled)

A) Immediate cleaning of replicas

Existing Mode (Default)

Configuration: enable_immediate_cleanup = false (default, can be omitted)

[reaper]
# Traditional mode - no additional configuration needed
# enable_immediate_cleanup = false  # Default value, can be omitted
delay_seconds = 600                  # Standard replica selection delay
chunk_size = 100                     # Number of replicas to process per batch

Behavior:

  • Database cleanup happens once after all physical deletions (hundreds to thousands) complete
  • Maintains the original behavior

Immediate Cleanup Mode (Opt-in)

Configuration: enable_immediate_cleanup = true

[reaper]
enable_immediate_cleanup = true      # Enable immediate cleanup optimization
db_batch_size = 50                   # Batch size for immediate database cleanup (default: 50)
refresh_trigger_ratio = 80           # Percentage of delay_seconds before refreshing (default: 80)
delay_seconds = 600                  # Standard replica selection delay
chunk_size = 100                     # Number of replicas to process per batch

Behavior:

  • Database cleanup happens in configurable batches during physical deletion
  • Increases database load (more statements executed)
  • Replicas are removed from the Rucio DB sooner, so deletions become visible to external scripts earlier.
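
The difference between the two modes can be sketched as follows. This is only an illustration of the batching idea, not the code in this branch: `delete_with_batched_cleanup`, `delete_physical` and `cleanup_db` are hypothetical names, and the replica dictionaries are placeholders.

```python
from typing import Callable, Iterable, List


def delete_with_batched_cleanup(
    replicas: Iterable[dict],
    delete_physical: Callable[[dict], bool],   # hypothetical: returns True on success
    cleanup_db: Callable[[List[dict]], None],  # hypothetical: removes rows from the DB
    enable_immediate_cleanup: bool = False,
    db_batch_size: int = 50,
) -> List[dict]:
    """Delete replicas physically and return those left for the main loop to clean."""
    pending: List[dict] = []
    for replica in replicas:
        if not delete_physical(replica):
            continue  # failed physical deletions are not cleaned from the DB here
        pending.append(replica)
        # Immediate mode: flush to the DB as soon as a full batch is available.
        if enable_immediate_cleanup and len(pending) >= db_batch_size:
            cleanup_db(pending)
            pending = []
    # Immediate mode: flush the final, possibly partial, batch.
    if enable_immediate_cleanup and pending:
        cleanup_db(pending)
        pending = []
    # Traditional mode: everything is still pending and is cleaned up once, later.
    return pending
```

With 150 replicas and `db_batch_size = 50`, this sketch issues three cleanup calls in immediate mode and none in traditional mode, which is the trade-off described above: more database statements in exchange for earlier removal.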

Configuration Parameters

| Parameter | Default | Description |
|---|---|---|
| enable_immediate_cleanup | false | Enable/disable immediate database cleanup optimization |
| db_batch_size | 50 | Number of replicas to clean from database in each immediate batch |
| refresh_trigger_ratio | 80 | Percentage of delay_seconds after which to refresh remaining replicas (applies to both traditional and immediate cleanup modes) |
| delay_seconds | 600 | Standard delay for replica selection (existing parameter) |
| chunk_size | 100 | Number of replicas to process per iteration (existing parameter) |
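
To make the defaults concrete, here is a minimal sketch that reads a `[reaper]` section with Python's standard `configparser` and falls back to the defaults listed above. It is an assumption-laden illustration: Rucio has its own configuration helpers, and `ReaperSettings` and `load_reaper_settings` are hypothetical names that do not appear in this branch.

```python
from configparser import ConfigParser
from dataclasses import dataclass


@dataclass(frozen=True)
class ReaperSettings:
    # Defaults taken from the parameter list above.
    enable_immediate_cleanup: bool = False
    db_batch_size: int = 50
    refresh_trigger_ratio: int = 80
    delay_seconds: int = 600
    chunk_size: int = 100


def load_reaper_settings(text: str) -> ReaperSettings:
    """Parse a config string, using the documented default for any missing option."""
    # inline_comment_prefixes lets values carry trailing "# ..." comments.
    parser = ConfigParser(inline_comment_prefixes=('#',))
    parser.read_string(text)
    defaults = ReaperSettings()
    if not parser.has_section('reaper'):
        return defaults
    section = parser['reaper']
    return ReaperSettings(
        enable_immediate_cleanup=section.getboolean(
            'enable_immediate_cleanup', fallback=defaults.enable_immediate_cleanup),
        db_batch_size=section.getint('db_batch_size', fallback=defaults.db_batch_size),
        refresh_trigger_ratio=section.getint(
            'refresh_trigger_ratio', fallback=defaults.refresh_trigger_ratio),
        delay_seconds=section.getint('delay_seconds', fallback=defaults.delay_seconds),
        chunk_size=section.getint('chunk_size', fallback=defaults.chunk_size),
    )
```

An empty configuration therefore yields the traditional mode, matching the statement that `enable_immediate_cleanup` can be omitted.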

B) Replica Refresh Control

Always enabled.

The reaper uses a delay_seconds mechanism to prevent multiple workers from processing the same replicas. When replicas are marked as BEING_DELETED, other workers will not select them until delay_seconds have passed since their last update.
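
The selection rule can be written as a small predicate. This is a sketch of the rule as described here, not the actual query: `is_selectable` is a hypothetical name, and treating the exact boundary as "not yet selectable" is an assumption.

```python
from datetime import datetime, timedelta


def is_selectable(updated_at: datetime, now: datetime, delay_seconds: int = 600) -> bool:
    """True if a BEING_DELETED replica may be picked up by another worker."""
    # Selectable only once more than delay_seconds have elapsed since the last update.
    return updated_at < now - timedelta(seconds=delay_seconds)
```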

To prevent race conditions when processing takes longer than expected, the reaper can refresh the updated_at timestamp of remaining replicas:

[reaper]
delay_seconds = 600                  # Replicas become selectable by other workers after 10 minutes
refresh_trigger_ratio = 80           # Refresh remaining replicas after 80% of delay_seconds (8 minutes)

How it works:

  1. Worker starts processing 100 replicas at time T=0
  2. At T=8 minutes (80% of 10 minutes), if replicas are still being processed:
    • Worker calls refresh_replicas() on remaining unprocessed replicas
    • This updates their updated_at timestamp to current time
    • Other workers will wait another 10 minutes before selecting these replicas
  3. Original worker continues processing without interference
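
The steps above can be sketched as a processing loop that checks the elapsed time before each replica. This is a simplified illustration under stated assumptions: `process_with_refresh`, `process_one` and `refresh` are hypothetical names (`refresh` stands in for `refresh_replicas()`), and restarting the timer after a successful refresh is an assumption about the behavior, not something confirmed by the description.

```python
import time
from typing import Callable, List, Sequence


def process_with_refresh(
    replicas: Sequence[dict],
    process_one: Callable[[dict], None],     # hypothetical per-replica deletion work
    refresh: Callable[[List[dict]], bool],   # stand-in for refresh_replicas()
    delay_seconds: int = 600,
    refresh_trigger_ratio: int = 80,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Process replicas in order and return the number of successful refreshes."""
    trigger = delay_seconds * refresh_trigger_ratio / 100.0
    last_mark = clock()
    refreshes = 0
    for index, replica in enumerate(replicas):
        if clock() - last_mark >= trigger:
            # Only the replicas not yet processed need their timestamp bumped.
            remaining = list(replicas[index:])
            if refresh(remaining):
                refreshes += 1
                last_mark = clock()  # assumed: a successful refresh restarts the window
        process_one(replica)
    return refreshes
```

The `clock` parameter is injectable so the timing logic can be exercised without waiting; in production-like use the default monotonic clock would apply.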

Refresh Configuration Examples

Conservative (longer processing time allowed):

[reaper]
delay_seconds = 900                  # 15 minutes before other workers can take over
refresh_trigger_ratio = 90           # Refresh after 13.5 minutes

Aggressive (faster worker coordination):

[reaper]
delay_seconds = 300                  # 5 minutes before other workers can take over  
refresh_trigger_ratio = 70           # Refresh after 3.5 minutes

Multi-worker environment (balanced):

[reaper]
delay_seconds = 600                  # Standard 10 minutes
refresh_trigger_ratio = 75           # Refresh after 7.5 minutes (leaves 2.5min buffer)
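
The arithmetic behind the comments in these three examples can be checked with a short helper. `refresh_window` is a hypothetical name used only for this illustration.

```python
from typing import Tuple


def refresh_window(delay_seconds: int, refresh_trigger_ratio: int) -> Tuple[float, float]:
    """Return (seconds until refresh, seconds of buffer left before takeover)."""
    trigger = delay_seconds * refresh_trigger_ratio / 100
    return trigger, delay_seconds - trigger


# The three profiles above as (delay_seconds, refresh_trigger_ratio).
for name, delay, ratio in [
    ("conservative", 900, 90),
    ("aggressive", 300, 70),
    ("multi-worker", 600, 75),
]:
    trigger, buffer = refresh_window(delay, ratio)
    print(f"{name}: refresh after {trigger / 60:.1f} min, buffer {buffer / 60:.1f} min")
```

Note that the conservative and aggressive profiles both leave a 1.5 minute buffer between the refresh trigger and the point where other workers may take over, while the multi-worker profile leaves 2.5 minutes. If the refresh check only happens between replicas, that buffer presumably needs to exceed the time a single deletion can take.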

Performance Tuning Examples

High-Throughput Environment

Optimize for maximum performance with frequent immediate cleanups:

[reaper]
enable_immediate_cleanup = true
db_batch_size = 25                   # Smaller batches for more frequent cleanup
refresh_trigger_ratio = 70           # Refresh remaining replicas earlier
delay_seconds = 300                  # Shorter delay for faster processing
chunk_size = 200                     # Larger chunks for higher throughput

Conservative Environment

Optimize for reliability with larger batches:

[reaper]
enable_immediate_cleanup = true
db_batch_size = 100                  # Larger batches, fewer database calls
refresh_trigger_ratio = 90           # Wait longer before refreshing
delay_seconds = 900                  # Longer delay for stability
chunk_size = 50                      # Smaller chunks for reliability

Multi-Worker Environment

Optimize for coordination between multiple reaper workers:

[reaper]
enable_immediate_cleanup = true
db_batch_size = 30                   # Moderate batch size
refresh_trigger_ratio = 75           # Refresh before other workers can interfere
delay_seconds = 600                  # Standard delay
chunk_size = 100                     # Standard chunk size

Monitoring and Debugging

Log Messages

Traditional Mode:

DEBUG: Deletion complete for RSE CERN-PROD - processed 150 replicas, all 150 will be cleaned up by main loop (traditional mode)
DEBUG: Main loop cleanup SUCCESS - deleted 150 remaining replicas in 2.34 seconds

Immediate Cleanup Mode:

DEBUG: Starting deletion for RSE CERN-PROD with 150 replicas, enable_immediate_cleanup=True, db_batch_size=50, delay_seconds=600
DEBUG: Immediate cleanup SUCCESS: deleted 50 replicas from database (batch #1)
DEBUG: Immediate cleanup SUCCESS: deleted 50 replicas from database (batch #2)
DEBUG: Final cleanup SUCCESS: deleted 50 remaining replicas from database
DEBUG: Deletion complete for RSE CERN-PROD - processed 150 replicas, performed 3 immediate cleanups, total immediate cleaned: 150, remaining for main loop: 0

Replica Refresh Messages:

DEBUG: Refresh trigger time set to 480.0 seconds (80% of delay_seconds=600)
DEBUG: Refresh triggered after 485.2 seconds - refreshing 45 remaining replicas (out of 100 total)
DEBUG: Successfully refreshed 45 remaining replicas after 485.2 seconds
WARNING: Failed to bump updated_at for remaining replicas BEING_DELETED
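
For monitoring, the refresh message can be parsed to track how often workers run past the trigger time. The sketch below assumes the log format is exactly as shown in the sample line above; `parse_refresh_line` is a hypothetical helper, not part of this branch.

```python
import re
from typing import Optional, Tuple

# Pattern derived from the sample "Refresh triggered after ..." line above.
REFRESH_RE = re.compile(
    r"Refresh triggered after (?P<elapsed>\d+(?:\.\d+)?) seconds - "
    r"refreshing (?P<remaining>\d+) remaining replicas \(out of (?P<total>\d+) total\)"
)


def parse_refresh_line(line: str) -> Optional[Tuple[float, int, int]]:
    """Return (elapsed_seconds, remaining, total), or None if the line does not match."""
    match = REFRESH_RE.search(line)
    if match is None:
        return None
    return float(match['elapsed']), int(match['remaining']), int(match['total'])
```

A high ratio of remaining to total replicas at refresh time would suggest that `chunk_size` is too large for the configured `delay_seconds`.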

Configuration Verification

Check active configuration at startup:

DEBUG: Optimization configuration - enable_immediate_cleanup=True, db_batch_size=50, refresh_trigger_ratio=80%, delay_seconds=600, chunk_size=100, total_workers=4

Replica Refresh Function

The refresh_replicas() function in rucio.core.replica provides the underlying mechanism:

from rucio.core.replica import refresh_replicas

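# Update the updated_at timestamp for replicas to prevent other workers from taking them.
# Note: Rucio core functions normally expect the RSE's internal ID here (e.g. from
# rucio.core.rse.get_rse_id), not its name; the value below is illustrative.
success = refresh_replicas(
    rse_id='CERN-PROD_DATADISK',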
    replicas=[
        {'scope': 'cms', 'name': 'file1.root'},
        {'scope': 'cms', 'name': 'file2.root'}
    ]
)
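
Conceptually, the refresh amounts to an UPDATE on the matching rows. The following self-contained sketch uses `sqlite3` and a deliberately simplified, hypothetical `replicas` table to illustrate the idea; it is not the implementation in `rucio.core.replica`, and restricting the update to rows in state BEING_DELETED is an assumption based on the warning message quoted earlier.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Iterable, Optional


def refresh_replicas_sketch(
    conn: sqlite3.Connection,
    rse_id: str,
    replicas: Iterable[dict],
    now: Optional[str] = None,
) -> bool:
    """Bump updated_at for the given replicas; return False if the update fails."""
    timestamp = now if now is not None else datetime.now(timezone.utc).isoformat()
    rows = [(timestamp, rse_id, r['scope'], r['name']) for r in replicas]
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "UPDATE replicas SET updated_at = ? "
                "WHERE rse_id = ? AND scope = ? AND name = ? "
                "AND state = 'BEING_DELETED'",  # assumed: only touch rows being deleted
                rows,
            )
    except sqlite3.Error:
        return False
    return True
```

Returning a boolean mirrors the `success = refresh_replicas(...)` usage above and lets the caller log a warning instead of aborting the deletion run.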

Troubleshooting

Common Issues

Workers taking over each other's work:

# Solution: Reduce refresh trigger ratio or increase delay
[reaper]
delay_seconds = 900              # Increase to 15 minutes
refresh_trigger_ratio = 70       # Refresh after 70% (10.5 minutes)

Database performance issues with immediate cleanup:

# Solution: Increase batch size to reduce DB calls
[reaper]
enable_immediate_cleanup = true
db_batch_size = 100             # Larger batches, fewer DB operations

Slow processing causing timeouts:

# Solution: Increase delay and refresh earlier
[reaper]
delay_seconds = 1200            # 20 minutes total
refresh_trigger_ratio = 60      # Refresh after 12 minutes

labkode avatar Sep 23 '25 11:09 labkode

Codecov Report

:x: Patch coverage is 0% with 144 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 7.05%. Comparing base (695df6e) to head (a112d52). :warning: Report is 50 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lib/rucio/daemons/reaper/reaper.py | 0.00% | 112 Missing :warning: |
| lib/rucio/core/replica.py | 0.00% | 32 Missing :warning: |

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #8048      +/-   ##
=========================================
- Coverage    7.13%   7.05%   -0.08%     
=========================================
  Files         272     272              
  Lines       45763   45874     +111     
=========================================
- Hits         3266    3238      -28     
- Misses      42497   42636     +139     

codecov[bot] avatar Sep 23 '25 11:09 codecov[bot]

@mgajek-cern thank you for the review. I considered splitting, but decided against it because both mechanisms touch mostly the same files and serve the same purpose: reducing the likelihood of overlapping work by multiple workers. The changes are made in a way that is backwards compatible with existing deployments.

Nonetheless, I'll follow up on the other suggestions and the ones from the previous attempt in #7199.

labkode avatar Sep 26 '25 06:09 labkode

@labkode please squash this now. It shouldn't be reviewed in a WIP state; that just makes the work harder for the reviewers, who are already spread very thin.

bari12 avatar Nov 13 '25 12:11 bari12

@bari12 done.

labkode avatar Nov 21 '25 10:11 labkode

Once we merge #8269, Rucio's ruff pre-commit checks will also be required to pass on the CI. Right now this is not enforced because we have misconfigured things, so the checks only take effect when someone has enabled pre-commit locally. I have flagged a couple of problems that will arise in the future, but to avoid mentioning all of them one by one in the review (as there are some more), maybe you can enable pre-commit locally to fix the problems related to the files you have modified (note: never mind the E501 line too long). But if you don't have time, don't bother (they are just stylistic). Just address the flagged ones here and I can fix everything once we merge #8269. No problem.

Geogouz avatar Dec 09 '25 16:12 Geogouz

@Geogouz thanks for the changes, all good from my side.

labkode avatar Dec 15 '25 15:12 labkode