
Historical Segment Cache Loading Strategy on Start-up

GWphua opened this issue 3 months ago • 6 comments

Description

  • Propose a configurable startup strategy that eagerly loads only recent (“hot”) segments, while leaving older (“cold”) segments to load lazily on first access.
  • Propose to deprecate druid.segmentCache.lazyLoadOnStart in favour of configs that give more flexibility over the Historical's segment cache loading during startup.

Motivation

  • Non-lazy segment loading takes a long time when a Historical holds many segments (observed ~22 minutes per Historical; ~39 hours cluster-wide).
  • Lazy-loading improves startup time but initial queries over hot data can be slow.
  • Many clusters primarily query the last N days/weeks; we can make that slice eager at startup to maintain query performance.

Proposal

Deprecate druid.segmentCache.lazyLoadOnStart in favor of a single strategy-driven config:

New: startupCacheLoadStrategy with options:

  1. loadLazily (all segments lazy)
  2. loadAllEagerly (all segments eager)
  3. loadEagerlyForPeriod (recent window eager, older lazy)

When loadEagerlyForPeriod is selected, require a loadPeriod config (ISO-8601 period, e.g., P7D, P30D).
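For concreteness, here's a minimal sketch of how the strategy and period might be bound and applied (all names are hypothetical, not an actual Druid API; Druid binds configs through Jackson, and ISO-8601 periods map to Joda-Time's Period):

```java
// Hypothetical sketch only; class, field, and method names are illustrative.
import com.fasterxml.jackson.annotation.JsonProperty;
import org.joda.time.DateTime;
import org.joda.time.Interval;
import org.joda.time.Period;

public class SegmentCacheStartupConfig
{
  public enum StartupCacheLoadStrategy
  {
    LOAD_LAZILY,             // all segments lazy
    LOAD_ALL_EAGERLY,        // all segments eager
    LOAD_EAGERLY_FOR_PERIOD  // recent window eager, older segments lazy
  }

  @JsonProperty
  private StartupCacheLoadStrategy startupCacheLoadStrategy = StartupCacheLoadStrategy.LOAD_ALL_EAGERLY;

  // Required only for LOAD_EAGERLY_FOR_PERIOD, e.g. Period.parse("P7D").
  @JsonProperty
  private Period loadPeriod = null;

  // A segment would be loaded eagerly at startup iff its interval ends inside the window.
  public boolean shouldLoadEagerly(Interval segmentInterval, DateTime now)
  {
    switch (startupCacheLoadStrategy) {
      case LOAD_ALL_EAGERLY:
        return true;
      case LOAD_LAZILY:
        return false;
      default: // LOAD_EAGERLY_FOR_PERIOD
        return segmentInterval.getEnd().isAfter(now.minus(loadPeriod));
    }
  }
}
```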

Backward compatibility and migration

Keep reading druid.segmentCache.lazyLoadOnStart for at least a few more releases, with a deprecation warning, mapping true -> loadLazily and false -> loadAllEagerly. If both settings are configured, the new startupCacheLoadStrategy takes precedence over lazyLoadOnStart (optionally logging a warning that both are set).
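Reusing the hypothetical enum from the sketch above, the resolution order could look like this (illustrative only):

```java
// Hypothetical back-compat resolution: the new setting wins; the old one maps
// true -> loadLazily and false -> loadAllEagerly, with deprecation warnings.
import javax.annotation.Nullable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class StartupStrategyResolver
{
  private static final Logger log = LoggerFactory.getLogger(StartupStrategyResolver.class);

  static SegmentCacheStartupConfig.StartupCacheLoadStrategy resolve(
      @Nullable SegmentCacheStartupConfig.StartupCacheLoadStrategy startupCacheLoadStrategy,
      @Nullable Boolean lazyLoadOnStart
  )
  {
    if (startupCacheLoadStrategy != null) {
      if (lazyLoadOnStart != null) {
        log.warn("Both startupCacheLoadStrategy and lazyLoadOnStart are set; ignoring lazyLoadOnStart.");
      }
      return startupCacheLoadStrategy;
    }
    if (lazyLoadOnStart != null) {
      log.warn("druid.segmentCache.lazyLoadOnStart is deprecated; use startupCacheLoadStrategy instead.");
      return lazyLoadOnStart
             ? SegmentCacheStartupConfig.StartupCacheLoadStrategy.LOAD_LAZILY
             : SegmentCacheStartupConfig.StartupCacheLoadStrategy.LOAD_ALL_EAGERLY;
    }
    return SegmentCacheStartupConfig.StartupCacheLoadStrategy.LOAD_ALL_EAGERLY;
  }
}
```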

Relying on the new config also leaves room to implement any additional load strategies we want later.

Config names are open for discussion, so do drop some suggestions!

GWphua avatar Aug 28 '25 04:08 GWphua

I think this is a good balance between the start-up performance of Historical nodes and initial query performance when they start.

Maybe we can use just one property (like loadEagerlyForPeriod) for different scenarios, like:

| example value | description |
| --- | --- |
| all | load all segments (current default behavior) |
| P7D | load segments of the latest 7 days |
| P0D (not sure if it's valid) | lazy-load all segments |
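If we went that route, the parsing could be as simple as the following sketch (hypothetical helper, not existing Druid code; note how P0D naturally degenerates to "nothing eager", i.e. fully lazy):

```java
// Hypothetical single-property interpretation (illustrative only):
//   "all" -> every segment eager; "P7D" -> last 7 days eager; "P0D" -> all lazy.
import javax.annotation.Nullable;
import org.joda.time.DateTime;
import org.joda.time.Period;

final class EagerLoadWindow
{
  @Nullable
  private final Period period; // null means "all": no cutoff, everything eager

  private EagerLoadWindow(@Nullable Period period)
  {
    this.period = period;
  }

  static EagerLoadWindow parse(String value)
  {
    return "all".equalsIgnoreCase(value) ? new EagerLoadWindow(null) : new EagerLoadWindow(Period.parse(value));
  }

  boolean isEager(DateTime segmentIntervalEnd, DateTime now)
  {
    // With P0D the cutoff is "now", so no existing segment qualifies: all lazy.
    return period == null || segmentIntervalEnd.isAfter(now.minus(period));
  }
}
```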

FrankChen021 avatar Aug 28 '25 09:08 FrankChen021

Rather than a Historical server-level configuration, this feels like something that should be handled as part of load rules, so that different strategies could be applied on a per-table basis; tables can often have quite different usage patterns in some clusters. Any reason I'm missing not to investigate extending load rules to also specify how segments are loaded?

This is what I am planning to look into to extend the 'virtual storage' introduced in #18176, that is, extending load rules to allow specifying 'weak' loads so that whether or not to eagerly download segments from deep storage can be configured in a much more flexible manner instead of the 'all' or 'nothing' approach that the initial PR supports.

clintropolis avatar Sep 09 '25 06:09 clintropolis

Hi @clintropolis, I have looked through #18176, and understand segment cache loading to operate as follows. Correct me if I am wrong:

  1. Coordinator assigns segments S1, S2, S3 to Historical H. When Historical H starts up, it goes through one of these paths:
    • Under normal mode: H would download S1, S2, S3 immediately, store on disk.
    • Under virtual storage mode: H does not fetch them yet.
  2. When a query arrives needing S2 and S3, the Historical checks its cache:
    • S2: not loaded → issue a mount as a weak entry
    • S3: not loaded → issue a mount. Once both are loaded, execute the query against them.
  3. Later, more queries arrive requiring a new segment S4. If disk space is nearing its limit, weak entries are evicted according to the SIEVE algorithm (sketched below) to make room, and the new request is mounted.
  4. If a later query again needs S2 and S2 was evicted, it will be loaded again.
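For anyone unfamiliar with the eviction policy mentioned in step 3, here's a minimal standalone SIEVE sketch (illustrative only, not the code from #18176). SIEVE keeps entries in a FIFO list with a visited bit; a hand scans from the oldest entry toward the newest, clearing visited bits, and evicts the first unvisited entry it finds:

```java
// Minimal SIEVE cache sketch. New entries enter at the head (newest); the hand
// scans from the tail (oldest) toward the head and evicts the first unvisited entry.
import java.util.HashMap;
import java.util.Map;

final class SieveCache<K>
{
  private static final class Node<K>
  {
    final K key;
    boolean visited;
    Node<K> newer; // toward the head
    Node<K> older; // toward the tail

    Node(K key)
    {
      this.key = key;
    }
  }

  private final Map<K, Node<K>> index = new HashMap<>();
  private final int capacity;
  private Node<K> head, tail, hand;

  SieveCache(int capacity)
  {
    this.capacity = capacity;
  }

  void access(K key)
  {
    Node<K> node = index.get(key);
    if (node != null) {
      node.visited = true; // hit: just mark the entry as visited
      return;
    }
    if (index.size() >= capacity) {
      evict();
    }
    node = new Node<>(key); // miss: insert at the head
    node.older = head;
    if (head != null) {
      head.newer = node;
    }
    head = node;
    if (tail == null) {
      tail = node;
    }
    index.put(key, node);
  }

  private void evict()
  {
    Node<K> victim = (hand != null) ? hand : tail;
    while (victim.visited) {
      victim.visited = false;
      victim = (victim.newer != null) ? victim.newer : tail; // wrap past the head
    }
    // Unlink the victim; the hand resumes at the next-newer entry (or the tail).
    if (victim.newer != null) {
      victim.newer.older = victim.older;
    } else {
      head = victim.older;
    }
    if (victim.older != null) {
      victim.older.newer = victim.newer;
    } else {
      tail = victim.newer;
    }
    hand = victim.newer;
    index.remove(victim.key);
  }
}
```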

GWphua avatar Oct 02 '25 03:10 GWphua

I also see that you have rendered isLazyLoadOnStart(), introduced in #6988, obsolete at SegmentLocalCacheManager by changing

final Segment segment = factory.factorize(dataSegment, segmentFiles, config.isLazyLoadOnStart(), loadFailed);

to

final Segment segment = factory.factorize(dataSegment, storageDir, false, lazyLoadCallback);

This PR seems like an enhanced version of #6988, where we skip loading ALL segments when virtual storage mode is enabled.

  1. I assume we can enable isVirtualStorage if isLazyLoadOnStart is enabled for backwards compatibility?
  2. If we want to do a lazyLoadOnStart while choosing not to maintain the old behaviour (the Historical only taking on segments that it can store), what will we do here? Set druid.server.maxSize equal to the sum of the druid.segmentCache.locations sizes?

Seeing that the virtual storage functionality is an experimental feature while the old lazyLoadOnStart isn't, would it be OK to retain the previous lazyLoadOnStart feature instead?

GWphua avatar Oct 02 '25 03:10 GWphua

I'm not really that familiar with load rules. Will they work for Historicals during start-up, and not just for Historicals that have been running for a while?

A picture of the problem we are trying to solve: Historical start-up takes a very long time (~20 min), and we also do not want queries over the past 7 days to be affected. So we want functionality that lazily loads segments more than 7 days old while eagerly loading the past 7 days of data... Will your proposed virtualStorage be able to help with this?

Had quite a few questions, sorry for the word bomb, but I would really appreciate some explanations. Thanks a lot! :)

GWphua avatar Oct 02 '25 04:10 GWphua

I also see that you have rendered isLazyLoadOnStart(), introduced in https://github.com/apache/druid/pull/6988, obsolete at SegmentLocalCacheManager

This was an accident, fixed in #18637.

Also, your understanding of how the virtual storage feature works looks correct to me.

I'm not really that familiar with load rules. Will they work for Historicals during start-up, and not just for Historicals that have been running for a while?

I think the solution I was thinking of should work for startup, since it involves the coordinator wrapping the load spec with another load spec to indicate things (like loading as a weak reference or, in this case, lazily deserializing the columns on load) and sending that to the Historicals instead of the direct load spec.
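Purely to illustrate the wrapping idea (the outer shape here is hypothetical, not anything from #18176; only the inner s3_zip spec follows a real Druid load spec format):

```json
{
  "type": "weak",
  "lazyDeserializeColumns": true,
  "delegate": {
    "type": "s3_zip",
    "bucket": "example-bucket",
    "key": "example/datasource/segment.zip"
  }
}
```

i.e., the Historical would read the outer spec's hints and delegate the actual fetch to the wrapped spec.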

A picture of the problem we are trying to solve: Historical start-up takes a very long time (~20 min), and we also do not want queries over the past 7 days to be affected. So we want functionality that lazily loads segments more than 7 days old while eagerly loading the past 7 days of data... Will your proposed virtualStorage be able to help with this?

Right now, virtual storage could only help with this if you had separate tiers of Historicals: a regular tier holding the past 7 days, and another tier in virtual storage mode holding the rest of the data. Later, the load-rules idea I'm thinking about would probably help more, since it would give much finer control over how things are loaded, but I haven't gotten to it yet.
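For that two-tier workaround, the retention rules could look roughly like this ("hot" and "cold_virtual" are hypothetical tier names, set via druid.server.tier on each group of Historicals; rules are evaluated top-down and the first match wins):

```json
[
  {
    "type": "loadByPeriod",
    "period": "P7D",
    "includeFuture": true,
    "tieredReplicants": { "hot": 1 }
  },
  {
    "type": "loadForever",
    "tieredReplicants": { "cold_virtual": 1 }
  }
]
```

Segments from the past 7 days land on the regular "hot" tier; everything older falls through to the "cold_virtual" tier, whose Historicals would run with virtual storage enabled.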

I guess since this lazy-loading stuff only applies to server startup, maybe load rules would be overkill... but I'm also slightly worried that without something like them, the use case for this feature is pretty limited, since it is only helpful when all datasources on the server have a similar usage pattern.

clintropolis avatar Oct 24 '25 18:10 clintropolis