Cube Refresh Worker - Running as a lambda / as a scheduled job - delineation of schedule
Problem
The Cube Refresh Worker's resource consumption is very bursty. When it runs its periodic refresh (roughly hourly), it requires a very high amount of compute resources. This "active" period tends to be brief.
The rest of the time, it requires far less compute resources, appearing "idle".
When provisioning the cube stack, the refresh worker requires a relatively massive amount of CPU + RAM. This is mainly required to support the brief "active" period of the duty cycle.
Intention
Ideally, I'd like to be able to spin up the expensive refresh operation as a lambda / on-demand job, following a cron schedule, so that I don't have to keep a large amount of compute provisioned during the "idle" period.
Assuming that nothing is happening during the "idle" period, this should be easy to do: run the expensive job and, when it terminates, spin down the pod.
Question
During the "idle" period, is the refresh worker still performing operations?
- The following passage from the documentation:
A Refresh Worker updates pre-aggregations and invalidates the in-memory cache in the background. They also keep the refresh keys up-to-date for all data models and pre-aggregations. Please note that the in-memory cache is just invalidated but not populated by Refresh Worker. In-memory cache is populated lazily during querying. On the other hand, pre-aggregations are eagerly populated and kept up-to-date by Refresh Worker.
delineates that the in-memory cache is NOT populated by the refresh worker, but is instead populated lazily at query time by some other service.
- The documentation states that the refresh worker invalidates the cache and "eagerly" populates pre-aggregations.
- It's not clear whether this "eager" operation is referring to the "active" period of the duty cycle.
Is there a way to detect when the "active" period has ended?
I.e., the refresh job has finished and the pod can spin down.
Hi @strazto 👋
The Cube Refresh Worker's resource consumption is very bursty. When it runs its periodic refresh (roughly hourly), it requires a very high amount of compute resources. This "active" period tends to be brief.
You have full control over the refresh keys and you're not constrained to having every refresh key defined as "every 1 hour". You can use whatever schedule works for you, including using cron-based schedules: https://cube.dev/docs/reference/data-model/cube#refresh_key
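To illustrate the point about cron-based schedules, here is a minimal sketch of a refresh key that fires once a day at 03:00 UTC instead of hourly. The cube name `orders`, the table name, and the schedule itself are hypothetical. The `cube()` function defined at the top is only a stand-in so that the snippet runs on its own; in a real data model file Cube supplies that function. It is worth confirming the exact `every` and `timezone` semantics in the reference linked above before relying on this.

```typescript
// Stand-in types and function, defined here only so the sketch is
// self-contained. In an actual Cube data model file, `cube()` is provided
// by Cube and these declarations are not needed.
interface RefreshKey {
  every: string;      // an interval such as `1 hour`, or a cron expression
  timezone?: string;  // time zone used to interpret the cron expression
}

interface CubeDefinition {
  sql_table: string;
  refresh_key?: RefreshKey;
}

const registeredCubes = new Map<string, CubeDefinition>();

function cube(name: string, definition: CubeDefinition): void {
  registeredCubes.set(name, definition);
}

// Hypothetical cube: refresh once a day at 03:00 UTC rather than every hour,
// so the expensive "active" period lands in a quiet window.
cube(`orders`, {
  sql_table: `orders`, // hypothetical table name
  refresh_key: {
    every: `0 3 * * *`,
    timezone: `UTC`,
  },
});
```

Moving the schedule changes when the burst happens, not how large it is, so the worker still needs enough CPU and RAM for that window.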
Run the expensive job, when it terminates, spin down the pod
As specified in the docs that you're quoting, the Refresh Worker is indeed active, invalidating cache entries, even when it's not building pre-aggregations. The current Cube architecture needs it to be active at all times.
I hope this helps.
@strazto Did my advice above help?
@igorlukanin your advice was helpful, thank you 🙂