Historical Segment Cache Loading Strategy on Start-up
Description
- Propose a configurable startup strategy that eagerly loads only recent (“hot”) segments, while leaving older (“cold”) segments to load lazily on first access.
- Propose deprecating druid.segmentCache.lazyLoadOnStart in favor of configs that give more flexibility over the Historical's segment cache loading during startup.
Motivation
- Non-lazy segment loading takes a long time when a Historical's segment count is high (observed ~22 minutes per Historical; ~39 hours cluster-wide).
- Lazy loading improves startup time, but initial queries over hot data can be slow.
- Many clusters primarily query the last N days/weeks; we can make that slice eager at startup to maintain query performance.
Proposal
Deprecate druid.segmentCache.lazyLoadOnStart in favor of a single strategy-driven config:
New: startupCacheLoadStrategy with options:
- loadLazily (all segments lazy)
- loadAllEagerly (all segments eager)
- loadEagerlyForPeriod (recent window eager, older lazy)
When loadEagerlyForPeriod is selected, require a loadPeriod config (ISO-8601 period, e.g., P7D, P30D).
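For illustration, a Historical's runtime.properties under this proposal might look like the following (the druid.segmentCache. prefix is my assumption; the key names are this proposal's working names, not an existing Druid API):

```properties
# Eagerly load the most recent 7 days of segments at startup;
# everything older loads lazily on first access.
druid.segmentCache.startupCacheLoadStrategy=loadEagerlyForPeriod
druid.segmentCache.loadPeriod=P7D
```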
Backward compatibility and migration
- Keep reading druid.segmentCache.lazyLoadOnStart for at least a few more releases, logging a deprecation warning.
- Map true -> loadLazily and false -> loadAllEagerly.
- Setting the new startupCacheLoadStrategy overrides lazyLoadOnStart; optionally, log a warning if both are configured.
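A rough sketch of that resolution order, assuming a hypothetical StartupCacheLoadStrategy enum and resolver (none of these names exist in Druid today):

```java
import org.apache.druid.java.util.common.logger.Logger;

// Hypothetical names for illustration only; none of these exist in Druid today.
enum StartupCacheLoadStrategy { LOAD_LAZILY, LOAD_ALL_EAGERLY, LOAD_EAGERLY_FOR_PERIOD }

class StartupStrategyResolver
{
  private static final Logger log = new Logger(StartupStrategyResolver.class);

  StartupCacheLoadStrategy resolve(
      StartupCacheLoadStrategy configuredStrategy, // new config; null if unset
      Boolean lazyLoadOnStart                      // deprecated flag; null if unset
  )
  {
    if (configuredStrategy != null) {
      if (lazyLoadOnStart != null) {
        log.warn("Both startupCacheLoadStrategy and lazyLoadOnStart are set; ignoring the deprecated flag");
      }
      return configuredStrategy;
    }
    if (lazyLoadOnStart != null) {
      log.warn("druid.segmentCache.lazyLoadOnStart is deprecated; use startupCacheLoadStrategy instead");
      return lazyLoadOnStart
             ? StartupCacheLoadStrategy.LOAD_LAZILY
             : StartupCacheLoadStrategy.LOAD_ALL_EAGERLY;
    }
    return StartupCacheLoadStrategy.LOAD_ALL_EAGERLY; // today's default (non-lazy)
  }
}
```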
Relying on the new config also lets us add further load strategies down the road.
Config names are open for discussion; do drop some suggestions!
I think this strikes a good balance between Historical startup time and initial query performance right after Historicals start.
Maybe we can use just one property (e.g., loadEagerlyForPeriod) to cover the different scenarios, like:
| example value | description |
|---|---|
| `all` | load all segments eagerly (current default behavior) |
| `P7D` | eagerly load segments covering the latest 7 days; load older segments lazily |
| `P0D` | lazy load for all segments (P0D is a valid zero-length ISO-8601 period) |
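A minimal sketch of how such a single property could be interpreted, assuming Joda-Time (which Druid already uses); the helper name and the `all` sentinel are illustrative only:

```java
import org.apache.druid.java.util.common.DateTimes;
import org.apache.druid.timeline.DataSegment;
import org.joda.time.DateTime;
import org.joda.time.Period;

// Hypothetical helper: decide at startup whether a segment is loaded eagerly.
// "all" -> always eager; an ISO-8601 period -> eager only within that window;
// "P0D" -> an empty window, i.e. lazy loading for (almost) all segments.
boolean shouldLoadEagerly(DataSegment segment, String loadEagerlyForPeriod)
{
  if ("all".equals(loadEagerlyForPeriod)) {
    return true;
  }
  final Period period = Period.parse(loadEagerlyForPeriod); // "P0D" parses fine
  final DateTime cutoff = DateTimes.nowUtc().minus(period);
  return segment.getInterval().getEnd().isAfter(cutoff);
}
```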
Rather than a Historical server-level configuration, this feels like something that should be handled as part of load rules, so that different strategies could be applied on a per-table basis; tables often have quite different usage patterns within a cluster. Is there any reason I'm missing not to investigate extending load rules to also specify how segments are loaded?
This is what I am planning to look into to extend the 'virtual storage' introduced in #18176: extending load rules to allow specifying 'weak' loads, so that whether or not to eagerly download segments from deep storage can be configured in a much more flexible manner than the all-or-nothing approach the initial PR supports.
Hi @clintropolis, I have looked through #18176 and understand segment cache loading to operate as follows. Correct me if I am wrong:
- The Coordinator assigns segments S1, S2, S3 to Historical H.
- When Historical H starts up, it takes one of two paths:
  - Normal mode: H downloads S1, S2, S3 immediately and stores them on disk.
  - Virtual storage mode: H does not fetch them yet.
- When a query arrives needing S2 and S3, the Historical checks its cache:
  - S2: not loaded → mount as a weak entry.
  - S3: not loaded → mount as a weak entry. Once both are loaded, execute the query against them.
- Later, more queries arrive requiring a new segment S4. If disk space is nearing its limit, weak entries are evicted according to the SIEVE algorithm to make room (see the sketch after this list), and the new requests are mounted.
- If a later query again needs S2 and S2 was evicted, it will be downloaded again.
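For reference, a minimal single-threaded sketch of SIEVE eviction (Zhang et al., NSDI '24) as I understand it; this is illustrative only, not Druid's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Toy SIEVE cache: a FIFO queue with a "visited" bit per entry and a moving
// eviction cursor (the "hand") that sweeps from the tail (oldest) toward the head.
class SieveCache<K, V>
{
  private final class Node
  {
    final K key;
    V value;
    boolean visited;
    Node prev; // toward head (newer)
    Node next; // toward tail (older)
    Node(K key, V value) { this.key = key; this.value = value; }
  }

  private final int capacity;
  private final Map<K, Node> index = new HashMap<>();
  private Node head; // newest entry
  private Node tail; // oldest entry
  private Node hand; // eviction cursor

  SieveCache(int capacity) { this.capacity = capacity; }

  V get(K key)
  {
    final Node n = index.get(key);
    if (n == null) {
      return null;
    }
    n.visited = true; // "lazy promotion": a hit only flips a bit, no reordering
    return n.value;
  }

  void put(K key, V value)
  {
    final Node existing = index.get(key);
    if (existing != null) {
      existing.value = value;
      return;
    }
    if (index.size() >= capacity) {
      evict();
    }
    final Node n = new Node(key, value); // insert at head
    n.next = head;
    if (head != null) {
      head.prev = n;
    }
    head = n;
    if (tail == null) {
      tail = n;
    }
    index.put(key, n);
  }

  private void evict()
  {
    Node cur = (hand != null) ? hand : tail;
    // Sweep toward the head, clearing visited bits, until an unvisited victim is found.
    while (cur.visited) {
      cur.visited = false;
      cur = (cur.prev != null) ? cur.prev : tail; // wrap from head back to tail
    }
    // Unlink the victim and park the hand at its newer neighbor.
    if (cur.prev != null) { cur.prev.next = cur.next; } else { head = cur.next; }
    if (cur.next != null) { cur.next.prev = cur.prev; } else { tail = cur.prev; }
    hand = (cur.prev != null) ? cur.prev : tail;
    index.remove(cur.key);
  }
}
```

The appeal of SIEVE here is that cache hits only set a flag, keeping the read path cheap; only evictions walk the queue.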
I also see that you have rendered isLazyLoadOnStart() (introduced in #6988) at SegmentLocalCacheManager obsolete by changing

```java
final Segment segment = factory.factorize(dataSegment, segmentFiles, config.isLazyLoadOnStart(), loadFailed);
```

to

```java
final Segment segment = factory.factorize(dataSegment, storageDir, false, lazyLoadCallback);
```
This PR seems like an enhanced version of #6988, where loading of ALL segments is skipped when virtual storage mode is enabled.
- I assume we can enable `isVirtualStorage` if `isLazyLoadOnStart` is enabled, for backwards compatibility?
- If we want to do a `lazyLoadOnStart` while choosing not to maintain the old behaviour (the Historical only taking on segments that it can store), what would we do here? Set `druid.server.maxSize` equal to the sum of the `druid.segmentCache.locations` sizes?
Seeing that the virtual storage functionality is experimental and the old `lazyLoadOnStart` is not, would it be OK to retain the previous `lazyLoadOnStart` feature instead?
I'm not really that familiar with load rules; will they work for Historicals during start-up, and not just for Historicals that have been running for a while?
A picture of the problem we are trying to solve: Historical start-up takes a very long time (~20 minutes), and we also do not want queries over the past 7 days to be affected. So we want functionality that lazily loads segments more than 7 days old while eagerly loading the past 7 days of data... Will your proposed virtualStorage be able to help with this?
I had quite a few questions, sorry for the word bomb, but I would really appreciate some explanations. Thanks a lot! :)
> I also see that you have rendered isLazyLoadOnStart() introduced in https://github.com/apache/druid/pull/6988 at SegmentLocalCacheManager obsolete
This was an accident, fixed in #18637.
Also, your understanding of how the virtual storage feature works looks correct to me.
> I'm not really that familiar with load rules; will they work for Historicals during start-up, and not just for Historicals that have been running for a while?
I think the solution I was thinking of should work for startup, since it involves the Coordinator wrapping the load spec with another load spec to indicate things (like loading as a weak reference, or in this case, lazily deserializing the columns on load) and sending that to the Historicals instead of the direct load spec.
> A picture of the problem we are trying to solve: Historical start-up takes a very long time (~20 minutes), and we also do not want queries over the past 7 days to be affected. So we want functionality that lazily loads segments more than 7 days old while eagerly loading the past 7 days of data... Will your proposed virtualStorage be able to help with this?
Right now, virtual storage could only help with this if you had separate tiers of Historicals: a regular tier with the past 7 days, and another, virtual-storage-mode tier that has the rest of the data. Later, the load-rules idea I'm thinking about would probably help more, since it would give much finer control over how things are loaded, but I haven't gotten to this yet.
I guess since this lazy-loading stuff only applies to server startup, maybe load rules would be overkill... but I'm also slightly worried that without something like them, the use case for this feature is pretty limited, since it is only helpful when all datasources on the server have a similar usage pattern.