[BUG] Unexpected Cache Behavior on ASTRA_MANAGER_REPLICA_LIFESPAN_MINS Update

Open autata opened this issue 1 year ago • 0 comments

Describe the bug

When replicaCreationServiceConfig.replicaLifespanMins (set, for example, through ASTRA_MANAGER_REPLICA_LIFESPAN_MINS) is updated, existing replicas keep the lifespan they were created with and do not reflect the new value. This is unexpected: I anticipated that the cache window would follow the updated configuration.

Requirements (place an x in each of the [ ])

  • [x] I've read and understood the Contributing guidelines and have done my best effort to follow them.
  • [x] I've read and agree to the Code of Conduct.
  • [x] I've searched for any related issues and avoided creating a duplicate issue.

To Reproduce

  1. Set ASTRA_MANAGER_REPLICA_LIFESPAN_MINS to a high value (e.g., 10080, which is 7 days).
  2. Keep the cluster running for 7 or more days.
  3. Reduce ASTRA_MANAGER_REPLICA_LIFESPAN_MINS to a lower value (e.g., 1440, which is 24 hours).
  4. Observe that query nodes still serve data from the original 7-day window and still need enough cache capacity to hold the older data.
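
Since the setting is expressed in minutes, the day and hour figures in the steps above have to be converted before they are applied. A minimal sketch of that arithmetic (the helper name `to_minutes` is only illustrative and is not part of Astra):

```python
from datetime import timedelta


def to_minutes(duration: timedelta) -> int:
    """Convert a duration to whole minutes, the unit the setting uses."""
    return duration // timedelta(minutes=1)


# Values for steps 1 and 3 of the reproduction.
initial_lifespan_mins = to_minutes(timedelta(days=7))    # 10080
reduced_lifespan_mins = to_minutes(timedelta(hours=24))  # 1440
```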

Observations

  • When a snapshot is created by the index node, the associated record in ZooKeeper reflects the value of ASTRA_MANAGER_REPLICA_LIFESPAN_MINS at the time of creation.
  • Subsequent updates to this configuration do not appear to impact existing replicas.
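
To make the observation concrete, here is a toy model of what I believe is happening. It is not Astra's actual code: the `ReplicaRecord` class, its field names, and both helper functions are assumptions made for illustration. The only thing it encodes is that the lifespan is read once, when the record is written, and never consulted again.

```python
from dataclasses import dataclass

MINUTE_MS = 60_000


@dataclass(frozen=True)
class ReplicaRecord:
    """Hypothetical stand-in for the replica record stored in ZooKeeper."""
    snapshot_id: str
    created_at_ms: int
    expire_after_ms: int  # fixed when the record is written


def create_replica(snapshot_id: str, now_ms: int, lifespan_mins: int) -> ReplicaRecord:
    # The lifespan is read once, at creation time, and baked into the record.
    return ReplicaRecord(snapshot_id, now_ms, now_ms + lifespan_mins * MINUTE_MS)


def is_served(replica: ReplicaRecord, now_ms: int) -> bool:
    # Only the stored expiry is checked; the current setting is never re-read.
    return now_ms < replica.expire_after_ms


# Replica created while the lifespan was 7 days (10080 minutes).
replica = create_replica("snapshot-a", now_ms=0, lifespan_mins=10080)

# Two days later the setting is lowered to 24 hours (1440 minutes), but
# nothing rewrites the existing record, so the replica is still served.
two_days_ms = 2 * 24 * 60 * MINUTE_MS
still_served = is_served(replica, two_days_ms)  # True
```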

Expected behavior

If the lifespan is increased, I would expect the system to pull additional data from S3. Conversely, if it is decreased, I would expect the system to limit the data served to align with the reduced window.
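
A rough sketch of the behavior I would expect, under the assumption that expiry could be derived from a replica's creation time and the current setting rather than from a value stored at creation. The function names are hypothetical, and the sketch covers only the serve/evict decision; pulling additional data from S3 after an increase would also require creating replicas for older snapshots, which is not modeled here.

```python
MINUTE_MS = 60_000


def effective_expiry_ms(created_at_ms: int, current_lifespan_mins: int) -> int:
    """Derive the expiry from the current setting instead of a stored value."""
    return created_at_ms + current_lifespan_mins * MINUTE_MS


def should_serve(created_at_ms: int, now_ms: int, current_lifespan_mins: int) -> bool:
    return now_ms < effective_expiry_ms(created_at_ms, current_lifespan_mins)


# A replica created at t=0, evaluated two days later.
two_days_ms = 2 * 24 * 60 * MINUTE_MS

served_with_7_days = should_serve(0, two_days_ms, 10080)   # True
served_with_24_hours = should_serve(0, two_days_ms, 1440)  # False
```

With this approach, lowering the setting takes effect the next time the check runs, and raising it back extends the window for replicas that still exist.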

Questions and Suggestions

I understand that the caching logic is undergoing changes. Will the new implementation allow the cache window to adapt promptly after a configuration update? This could be particularly useful for occasional scenarios where serving older data is necessary. For example:

  • Normally, you might only require 3-7 days of data, but you keep segments in S3 for longer.
  • By temporarily increasing ASTRA_MANAGER_REPLICA_LIFESPAN_MINS and scaling up cache capacity, you could serve older data as needed.
  • Afterward, scaling down the cache and resetting the configuration would return the system to its usual state.

Currently, this flexibility does not seem possible due to the described behavior. Let me know if I can provide any additional details or run further tests to assist in diagnosing this issue.

Thank you!

Reproducible in:

Astra version: We are running a slightly older version of Astra. Our build is based on https://github.com/airbnb/kaldb, but I don't see any PRs that change this behavior since then.

JVM version:

OS version(s):

autata, Jan 17 '25 22:01