elsa-core
getWorkflowDefinitionIdByName unreliable on multiple instances
We are trying to run the Elsa dashboard on multiple instances. We have a strange problem where getWorkflowDefinitionIdByName does not find a workflow that exists. We have hit this three separate times, always on Postgres with RabbitMQ and Rebus, but we have not found a reliable way to reproduce the error. Publishing a new version of either the referenced or the referencing workflow does not seem to help. Exporting, deleting and importing the workflows again seems to reliably fix the problem.
I'm wondering if this could be a problem with the Entity Framework cache not being shared between the instances. Do you have a suggestion as to what this problem could be?
When hosting Elsa on multiple server instances, make sure to go through the steps outlined here.
Each workflow definition will be stored in a local cache on the server. To invalidate this cache when a new workflow version is published, a signal needs to be distributed to all instances in the cluster. The article I linked to should help you to set this up.
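To illustrate the mechanism described above, here is a toy model of per-node caches that are invalidated by a broadcast signal. It is only a sketch under stated assumptions: `DefinitionStore`, `SignalBus` and `Node` are hypothetical names invented for this example and are not Elsa or Rebus types, and the real cache and signal implementation may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the shared database.
public class DefinitionStore
{
    private readonly Dictionary<string, int> _versions = new Dictionary<string, int>();
    public void Publish(string name, int version) => _versions[name] = version;
    public int GetVersion(string name) => _versions[name];
}

// Hypothetical stand-in for the message bus that distributes the signal.
public class SignalBus
{
    private readonly List<Action<string>> _subscribers = new List<Action<string>>();
    public void Subscribe(Action<string> handler) => _subscribers.Add(handler);
    public void Broadcast(string name)
    {
        foreach (var handler in _subscribers) handler(name);
    }
}

// One server instance with its own local cache.
public class Node
{
    private readonly DefinitionStore _store;
    private readonly Dictionary<string, int> _cache = new Dictionary<string, int>();

    // Passing no bus models a node that never receives the signal.
    public Node(DefinitionStore store, SignalBus bus = null)
    {
        _store = store;
        bus?.Subscribe(name => _cache.Remove(name)); // evict on signal
    }

    public int GetVersion(string name)
    {
        if (!_cache.TryGetValue(name, out var version))
        {
            version = _store.GetVersion(name); // cache miss: load from the store
            _cache[name] = version;
        }
        return version;
    }
}

public static class Demo
{
    public static (int Subscribed, int Unsubscribed) Run()
    {
        var store = new DefinitionStore();
        var bus = new SignalBus();
        var subscribed = new Node(store, bus);
        var unsubscribed = new Node(store);

        store.Publish("ChildWorkflow", 1);
        subscribed.GetVersion("ChildWorkflow");   // both nodes cache version 1
        unsubscribed.GetVersion("ChildWorkflow");

        store.Publish("ChildWorkflow", 2);        // new version published
        bus.Broadcast("ChildWorkflow");           // signal sent to subscribers

        return (subscribed.GetVersion("ChildWorkflow"),
                unsubscribed.GetVersion("ChildWorkflow"));
    }

    public static void Main()
    {
        var result = Run();
        Console.WriteLine(result.Subscribed);   // 2
        Console.WriteLine(result.Unsubscribed); // 1
    }
}
```

In this model, the node that handles the signal sees version 2 after the publish, while the node that never receives it keeps serving version 1 from its local cache until the entry is evicted some other way.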
Yes, this is the article we followed. We are using the Rebus cache signal, something like this:
services.AddElsa(elsa => elsa.UseRebusCacheSignal());
I'm thinking this takes care of what you mentioned. I would say things work correctly 95% of the time, that's why we are in trouble. I'm wondering how I could be checking if these signals are fired and received. I'm considering trying the UseRedisCacheSignal.
I'm wondering why there is no SubscribeToRebusCacheSignals. I was only able to find SubscribeToRedisCacheSignals. https://github.com/elsa-workflows/elsa-core/tree/master/src/providers/Elsa.Providers.Redis/StartupTasks
Hi sfmskywalker, Thank you for the answer. How can we make sure that our multinode setup is correct? We followed the article you mentioned, instances start, we don't see errors, still it would be great to have a checklist to go through which would ensure that our setup is all correct.
Actually, I can see the messages in the RabbitMQ console. For example, when I publish a new version of one of the workflows, I see the new messages in the chart on the RabbitMQ console. Shouldn't this mean that the multinode setup is OK?
I would say things work correctly 95% of the time, that's why we are in trouble. I'm wondering how I could be checking if these signals are fired and received.
That seems odd. And tricky to troubleshoot, because it sounds like it's not something that can be reproduced consistently. What you might try is to clone the repo and reference the projects directly, rather than the NuGet packages, add logging statements where signals are sent and where they are expected to be received, collect all of this information, and try to analyze it. Not a small task.
I'm wondering why there is no SubscribeToRebusCacheSignals I was only able to find SubscribeToRedisCacheSignals. https://github.com/elsa-workflows/elsa-core/tree/master/src/providers/Elsa.Providers.Redis/StartupTasks
SubscribeToRedisCacheSignals is a startup task that subscribes to a Redis bus.
For Rebus, it needs to register a consumer that processes the TriggerCacheSignal message. To register this consumer, simply call UseRebusCacheSignal on the elsa options builder, for example:
services.AddElsa(elsa => elsa.UseRebusCacheSignal());
@vargaendre
How can we make sure that our multinode setup is correct? We followed the article you mentioned, instances start, we don't see errors, still it would be great to have a checklist to go through which would ensure that our setup is all correct.
I agree a checklist would be helpful. Describing strategies to test that it is set up correctly, however, is a bit more involved and is not something I'll be able to do at the moment due to other pressing priorities.
But at the very least I can mention the 3 things to check for:
- [ ] In a multi-node environment, a given workflow instance is processed by only one node at a time (no two nodes can ever process the same workflow instance at the same time)
- [ ] When publishing changes to a workflow, the updated workflow takes effect on all other nodes.
- [ ] Workflows containing timer activities only execute on one node.
Thank you! I understand you have a lot on your plate regarding Elsa. We appreciate your help!
I am not 100% sure how to test the first two of your checkpoints, but I created a workflow with a StartAt activity (it took a little time to figure out the datetime format), and it ran only once (we currently have two nodes), so I know Quartz is good. So only one workflow instance was run.
Is it possible that our problem (not finding an existing workflow definition by name) is not related to multinode, rather something else?
One more data point. While testing the workflow, I saw this:
After a little time and a refresh, it changed to this:
So, whatever generated this UI did not find the workflow definition at first. Then it did. Since then I have refreshed the page many times and always got the workflow definition. Could it be that the workflow definition could not be found only while it was being run (by another node)?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.