Invalidate compiled script cache when it is updated
What type of enhancement is this?
User experience, API
What does the enhancement do?
In cluster mode, we have to invalidate the other frontends' compiled Python script caches when a script is updated.
We need an event broadcast mechanism to do this.
Implementation challenges
No response
It can be done by calling `Mailbox::broadcast` to submit a "script invalidation request" to all the frontend nodes. There's an example to start with: see `struct InvalidateCache`.
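For illustration, here is a minimal sketch of that broadcast. The `Instruction` variant, the channel name, and the `Mailbox` signature below are simplified assumptions for this sketch, not greptimedb's actual types:

```rust
/// Hypothetical instruction delivered to frontend nodes; the variant name
/// and fields are assumptions, not greptimedb's real `Instruction` enum.
#[derive(Debug, Clone)]
enum Instruction {
    InvalidateScriptCache { schema: String, name: String },
}

/// Simplified mailbox; the real `Mailbox::broadcast` is async and targets
/// a channel of registered nodes.
trait Mailbox {
    fn broadcast(&self, channel: &str, instruction: &Instruction);
}

struct LoggingMailbox;

impl Mailbox for LoggingMailbox {
    fn broadcast(&self, channel: &str, instruction: &Instruction) {
        println!("broadcast to {channel}: {instruction:?}");
    }
}

/// Called when a script is updated: fan the invalidation out to every
/// frontend. Delivery is fire-and-forget, so a frontend that misses the
/// message keeps its stale compiled script until the next update.
fn on_script_updated(mailbox: &dyn Mailbox, schema: &str, name: &str) {
    mailbox.broadcast(
        "frontend",
        &Instruction::InvalidateScriptCache {
            schema: schema.to_string(),
            name: name.to_string(),
        },
    );
}

fn main() {
    on_script_updated(&LoggingMailbox, "public", "my_script");
}
```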
Hi, I would like to try this one. Could you assign it to me?
Have fun🚀
Hi @WenyXu, I plan to communicate via `Frontend --notify--> Metasrv --broadcast--> Frontend`. When I implemented it, I found that `mailbox::send` in `Frontend` requires an `InstructionReply`, and it seems that `Instruction` is always sent by `Metasrv` as a heartbeat response. If I make `Frontend` notify `Metasrv`, it seems to break this semantics and design. Should I do it this way? Could you give me any suggestions?
PTAL @waynexia
It seems we need to design an RPC service to accept cache invalidation requests from the `Frontend`. Would you like to implement this RPC service? We need another issue to track this feature.
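As a rough illustration of what each frontend might handle, here is a hypothetical sketch; the request shape and handler names are assumptions, not greptimedb's actual proto or API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical request sent by the frontend that updated a script.
struct InvalidateScriptCacheRequest {
    schema: String,
    name: String,
}

/// Compiled-script cache held by each frontend.
struct ScriptCache {
    // (schema, name) -> compiled artifact (stand-in representation).
    compiled: Mutex<HashMap<(String, String), Vec<u8>>>,
}

impl ScriptCache {
    /// RPC handler body: drop the cached compilation so the next execution
    /// recompiles from the scripts table.
    fn handle_invalidate(&self, req: InvalidateScriptCacheRequest) {
        self.compiled
            .lock()
            .unwrap()
            .remove(&(req.schema, req.name));
    }
}

fn main() {
    let cache = ScriptCache {
        compiled: Mutex::new(HashMap::new()),
    };
    cache.handle_invalidate(InvalidateScriptCacheRequest {
        schema: "public".to_string(),
        name: "my_script".to_string(),
    });
}
```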
We still cannot ensure a correct result: the `broadcast -> other frontends` step is not synchronized. Let's implement this first to achieve a relaxed invalidation process.
In this context, what level of consistency are we expecting?
Expect read-after-write like other insertion operations.
My plan is to add a `version` field to the `scripts` table (or use `gmt_modified`, but I'm not sure if there will be clock drift).
When a `Frontend` tries to execute a script, it needs to read the `scripts` table and determine whether the local cache is outdated.
If the read and write operations follow read-after-write consistency, this ensures the latest data.
The cost is that we have to read the `scripts` table, which adds an extra RTT. However, I believe this is acceptable, because caching the script is not intended to save this particular RTT, but rather to reduce the CPU overhead of script compilation.
Moreover, we no longer need to implement broadcasting in the `Metasrv`.
Do you think this plan is feasible?
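To make the check concrete, here is a rough sketch of the per-execution version comparison. The table access, type names, and the `version` column are assumptions for illustration:

```rust
use std::collections::HashMap;

/// Stand-in for a compiled Python script.
type Compiled = String;

/// One row read back from the `scripts` table: the source plus its version
/// (either an explicit `version` column or `gmt_modified`).
struct ScriptRow {
    source: String,
    version: u64,
}

struct CachedScript {
    version: u64,
    compiled: Compiled,
}

struct ScriptManager {
    cache: HashMap<String, CachedScript>,
}

impl ScriptManager {
    /// Read the `scripts` table on every execution (the extra RTT) and only
    /// recompile when the stored version is newer than the cached one.
    fn get_or_compile(&mut self, name: &str, row: ScriptRow) -> &Compiled {
        let cached = self
            .cache
            .entry(name.to_string())
            .or_insert_with(|| CachedScript {
                version: row.version,
                compiled: compile(&row.source),
            });
        if cached.version < row.version {
            cached.compiled = compile(&row.source);
            cached.version = row.version;
        }
        &cached.compiled
    }
}

/// Placeholder for the expensive compilation step the cache exists to avoid.
fn compile(source: &str) -> Compiled {
    format!("compiled({source})")
}

fn main() {
    let mut manager = ScriptManager { cache: HashMap::new() };
    let row = ScriptRow { source: "print(1)".to_string(), version: 1 };
    println!("{}", manager.get_or_compile("my_script", row));
}
```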
`version` should be enough. The map from `script` to "compiled binary" must exist (and already does). We only need to leverage some cache invalidation mechanism over `version`. Having an extra read to the datanode also looks fine to me; the data fed into scripts also comes from the datanode, and there are ways to reduce this overhead.
This does remove the need for broadcasting if we choose to implement this "strict" mode now. Can you give a more detailed per-step plan of how you break this task down and plan to implement it?
cc @discord9
Sure.
I have made some changes to my plan because I found a new issue: after inserting a script, it cannot be executed through SQL on other `Frontend`s unless it is first executed through the HTTP API using `/run-script`. This is because scripts are only registered as UDFs in `ScriptManager::compile`. I previously overlooked the possibility of executing a script directly through a UDF instead of going through the `/run-script` API.
Taking the UDF part into consideration, here is my plan (a rough sketch of the fallback lookup follows the list):

- First, add a `providers` field and a `get_function_fallback` function to `FunctionRegistry`. The `get_function_fallback` function will attempt to dynamically retrieve the `Function` through the `providers`.
- Then, in the `QueryEngineState::udf_function` function, use `FUNCTION_REGISTRY.get_function_fallback` to retrieve the latest UDF.
- The `ScriptManager` will implement this provider. I plan to modify the `try_find_script_and_compile` function to read from the `scripts` table each time and compare it with the locally cached version to determine whether to recompile and update the cache (`ScriptManager::compiled`).
- Of course, there will be an additional field in the cache to indicate the version. But I'm not sure whether to add a new `version` field to the `scripts` table or simply use `gmt_modified`; I'm concerned that adding a new field might cause some complications during the upgrade.

Now, whether a script is called through the HTTP API or as a SQL UDF, the latest version can be executed.
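Here is the sketch mentioned above. The trait and method names mirror the plan (`FunctionRegistry`, `get_function_fallback`), but the shapes are assumptions rather than greptimedb's actual API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for a registered UDF.
trait Function: Send + Sync {}

/// A provider that can resolve functions missing from the static registry,
/// e.g. a `ScriptManager` compiling a script read from the `scripts` table.
trait FunctionProvider: Send + Sync {
    fn provide(&self, name: &str) -> Option<Arc<dyn Function>>;
}

#[derive(Default)]
struct FunctionRegistry {
    functions: RwLock<HashMap<String, Arc<dyn Function>>>,
    providers: RwLock<Vec<Arc<dyn FunctionProvider>>>,
}

impl FunctionRegistry {
    fn get_function(&self, name: &str) -> Option<Arc<dyn Function>> {
        self.functions.read().unwrap().get(name).cloned()
    }

    /// Try the static map first, then fall back to the providers so that a
    /// script inserted on another frontend still resolves as a UDF.
    fn get_function_fallback(&self, name: &str) -> Option<Arc<dyn Function>> {
        self.get_function(name).or_else(|| {
            self.providers
                .read()
                .unwrap()
                .iter()
                .find_map(|provider| provider.provide(name))
        })
    }
}

fn main() {
    let registry = FunctionRegistry::default();
    // A real `ScriptManager` would be registered as a provider here.
    assert!(registry.get_function_fallback("my_script").is_none());
}
```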
Looks good to me! Let's take a step.
A `version` field is great, but I wonder how to handle the atomic increment of this field: greptimedb doesn't support transactions and `update` currently.
This is indeed a problem, and I overlooked this point. Both of them have some issues:

- `version` may have multiple writes to the same version under concurrency, and may also lead to version rollback.
- Updating `gmt_modified` is not a 'read-modify-write' operation; it simply inserts the latest value, similar to 'last write wins'. However, it can still have similar issues under concurrency when there is clock drift between multiple frontends. If Node A has a faster clock than Node B, concurrent updates from A and B may occur, with B's write arriving later. The cache on Node A then cannot be updated, because its cached timestamp is greater than the one written by B.

I have browsed the documentation and found the `Metasrv` Distributed Lock. Perhaps it can be used to ensure atomicity when updating. The likelihood of concurrent updates to the same script is low, so lock contention would be minimal, and since users are aware of what they are doing, I believe it is acceptable.
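For illustration, here is a sketch of serializing the read-increment-write of the version behind such a lock. The `DistLock` below is a process-local stand-in; the real Metasrv distributed lock provides this guarantee across nodes:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Stand-in for the Metasrv distributed lock. A local mutex only models the
/// mutual exclusion; the real lock provides it across nodes via Metasrv.
struct DistLock {
    inner: Mutex<()>,
}

/// Stand-in for the `scripts` table: name -> (source, version).
struct ScriptsTable {
    rows: Mutex<HashMap<String, (String, u64)>>,
}

impl ScriptsTable {
    /// Update a script under the lock so the read-increment-write of the
    /// version cannot interleave with a concurrent update, avoiding
    /// duplicate versions and version rollback.
    fn update_script(&self, lock: &DistLock, name: &str, source: &str) {
        let _guard = lock.inner.lock().unwrap();
        let mut rows = self.rows.lock().unwrap();
        let next_version = rows.get(name).map_or(1, |(_, v)| v + 1);
        rows.insert(name.to_string(), (source.to_string(), next_version));
    }
}

fn main() {
    let lock = DistLock { inner: Mutex::new(()) };
    let table = ScriptsTable { rows: Mutex::new(HashMap::new()) };
    table.update_script(&lock, "my_script", "print('v2')");
    table.update_script(&lock, "my_script", "print('v3')");
}
```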
@xxxuuu Yes, I believe it's ok.
I don't have much time on weekdays due to work, but I'm still dedicated to this issue. In fact, I have already completed the logical implementation of the functionality, but testing has not been done yet. I will be able to finish it this week :)