getWorkflowDefinitionIdByName unreliable on multiple instances

Open GotBinGo opened this issue 3 years ago • 8 comments
We are trying to run the Elsa dashboard on multiple instances. We seem to have a strange problem where getWorkflowDefinitionIdByName does not find a workflow that exists. We have been able to reproduce this three separate times, always on Postgres with RabbitMQ and Rebus, but we have not found a reliable way to trigger the error. Publishing a new version of either the referenced or the referencing workflow does not seem to help. Exporting, deleting and importing the workflows again seems to reliably fix the problem.

I'm wondering if this could be a problem with the Entity Framework cache not being shared between the instances. Do you have a suggestion as to what this problem could be?

GotBinGo avatar Jun 24 '22 13:06 GotBinGo

When hosting Elsa on multiple server instances, make sure to go through the steps outlined here.

Each workflow definition will be stored in a local cache on the server. To invalidate this cache when a new workflow version is published, a signal needs to be distributed to all instances in the cluster. The article I linked to should help you to set this up.

sfmskywalker avatar Jun 24 '22 13:06 sfmskywalker
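
The caching behaviour described in the comment above can be pictured with a small simulation. This is only a sketch under stated assumptions: `Node`, `SignalBus` and `GetPublishedVersion` are hypothetical names invented for illustration and are not Elsa or Rebus APIs, and both "nodes" live in one process. It shows why a node that misses the invalidation signal keeps serving what it cached earlier.

```csharp
using System;
using System.Collections.Generic;

// Shared database (workflow name -> published version) and a shared signal bus.
var database = new Dictionary<string, int> { ["OrderFlow"] = 1 };
var bus = new SignalBus();

var nodeA = new Node(database, bus, subscribeToSignals: true);
// Node B simulates an instance that never receives the invalidation signal.
var nodeB = new Node(database, bus, subscribeToSignals: false);

// Both nodes read version 1 and keep it in their local caches.
var beforeA = nodeA.GetPublishedVersion("OrderFlow");
var beforeB = nodeB.GetPublishedVersion("OrderFlow");

// A new version is published, then the invalidation signal is broadcast.
database["OrderFlow"] = 2;
bus.Publish();

var afterA = nodeA.GetPublishedVersion("OrderFlow"); // cache was cleared, re-reads the database
var afterB = nodeB.GetPublishedVersion("OrderFlow"); // no signal arrived, cache is stale

Console.WriteLine($"Node A: {beforeA} -> {afterA}");
Console.WriteLine($"Node B: {beforeB} -> {afterB}");

// Hypothetical in-memory stand-in for a message bus such as Rebus or Redis pub/sub.
class SignalBus
{
    private readonly List<Action> _handlers = new();
    public void Subscribe(Action handler) => _handlers.Add(handler);
    public void Publish()
    {
        foreach (var handler in _handlers) handler();
    }
}

// Hypothetical server instance with a local, per-process cache.
class Node
{
    private readonly Dictionary<string, int> _localCache = new();
    private readonly Dictionary<string, int> _database;

    public Node(Dictionary<string, int> database, SignalBus bus, bool subscribeToSignals)
    {
        _database = database;
        // On receiving a signal, drop everything that was cached locally.
        if (subscribeToSignals) bus.Subscribe(() => _localCache.Clear());
    }

    public int GetPublishedVersion(string workflowName)
    {
        if (_localCache.TryGetValue(workflowName, out var cached)) return cached;
        var version = _database[workflowName];
        _localCache[workflowName] = version;
        return version;
    }
}
```

In this sketch node A moves from version 1 to 2 while node B stays on 1. Whether the real problem in this thread follows the same pattern is not established; the simulation only illustrates the general failure mode of a missed signal.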

Yes, this is the article we followed. We are using the Rebus cache signal, something like this:

services.AddElsa(elsa => elsa.UseRebusCacheSignal());

I think this takes care of what you mentioned. I would say things work correctly 95% of the time, which is why we are in trouble. I'm wondering how I could check whether these signals are fired and received. I'm considering trying UseRedisCacheSignal instead.

I'm also wondering why there is no SubscribeToRebusCacheSignals; I was only able to find SubscribeToRedisCacheSignals: https://github.com/elsa-workflows/elsa-core/tree/master/src/providers/Elsa.Providers.Redis/StartupTasks

GotBinGo avatar Jun 24 '22 14:06 GotBinGo

> When hosting Elsa on multiple server instances, make sure to go through the steps outlined here.
>
> Each workflow definition will be stored in a local cache on the server. To invalidate this cache when a new workflow version is published, a signal needs to be distributed to all instances in the cluster. The article I linked to should help you to set this up.

Hi sfmskywalker, thank you for the answer. How can we make sure that our multinode setup is correct? We followed the article you mentioned, the instances start, and we don't see errors. Still, it would be great to have a checklist to go through to confirm that our setup is correct.

vargaendre avatar Jun 27 '22 08:06 vargaendre

Actually, I can see the messages in the RabbitMQ console. For example, when I publish a new version of one of the workflows, I see the new messages in the chart on the RabbitMQ console. Shouldn't this mean that the multinode setup is OK?

vargaendre avatar Jun 27 '22 09:06 vargaendre

> I would say things work correctly 95% of the time, that's why we are in trouble. I'm wondering how I could be checking if these signals are fired and received.

That seems odd, and tricky to troubleshoot, because it sounds like it is not easily reproduced consistently. What you might try is to clone the repo and reference the projects directly, rather than the NuGet packages, add logging statements where signals are sent and where they are expected to be received, collect all of this information, and try to analyze it. Not a small task.

> I'm wondering why there is no SubscribeToRebusCacheSignals I was only able to find SubscribeToRedisCacheSignals. https://github.com/elsa-workflows/elsa-core/tree/master/src/providers/Elsa.Providers.Redis/StartupTasks

SubscribeToRedisCacheSignals is a startup task that subscribes to a Redis bus.

For Rebus, a consumer that processes the TriggerCacheSignal message needs to be registered instead. To register this consumer, simply call UseRebusCacheSignal on the Elsa options builder, for example:

services.AddElsa(elsa => elsa.UseRebusCacheSignal());

sfmskywalker avatar Jun 27 '22 09:06 sfmskywalker
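
Once log statements exist on the send and receive paths, as suggested in the comment above, the collected entries can be cross-checked to find signals that some node never received. The following is a minimal sketch of that analysis step only: `SignalLog`, the node names and the signal IDs are hypothetical, nothing here calls Elsa or Rebus, and in a real cluster the entries would come from the log files of separate processes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var log = new SignalLog();
var nodes = new[] { "node-1", "node-2" };

// node-1 publishes two signals; node-2 only receives the first one.
log.Sent("node-1", "sig-1");
log.Received("node-1", "sig-1");
log.Received("node-2", "sig-1");
log.Sent("node-1", "sig-2");
log.Received("node-1", "sig-2");

var missed = log.FindMissed(nodes).ToList();
foreach (var (signalId, missingNodes) in missed)
    Console.WriteLine($"{signalId} was not received by: {string.Join(", ", missingNodes)}");

// Hypothetical collector for "sent" and "received" log entries.
class SignalLog
{
    private readonly List<string> _sent = new();
    private readonly Dictionary<string, HashSet<string>> _receivedBy = new();

    public void Sent(string node, string signalId)
    {
        _sent.Add(signalId);
        Console.WriteLine($"[{node}] sent {signalId}");
    }

    public void Received(string node, string signalId)
    {
        if (!_receivedBy.TryGetValue(signalId, out var receivers))
            _receivedBy[signalId] = receivers = new HashSet<string>();
        receivers.Add(node);
        Console.WriteLine($"[{node}] received {signalId}");
    }

    // Returns every sent signal together with the nodes that never logged receiving it.
    public IEnumerable<(string SignalId, List<string> Missing)> FindMissed(IEnumerable<string> allNodes)
    {
        foreach (var id in _sent)
        {
            var got = _receivedBy.TryGetValue(id, out var receivers) ? receivers : new HashSet<string>();
            var missing = allNodes.Where(n => !got.Contains(n)).ToList();
            if (missing.Count > 0) yield return (id, missing);
        }
    }
}
```

Giving each signal a unique ID at the point where it is sent is the design choice that makes this comparison possible; without it, only message counts per node can be compared.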

@vargaendre

> How can we make sure that our multinode setup is correct? We followed the article you mentioned, instances start, we don't see errors, still it would be great to have a checklist to go through which would ensure that our setup is all correct.

I agree a checklist would be helpful. Describing strategies to test that it is set up correctly, however, is a bit more involved and is not something I'll be able to do at the moment due to other pressing priorities.

But at the very least I can mention the 3 things to check for:

  • [ ] In a multi-node environment, a given workflow instance is processed by only one node at a time (no two nodes can ever process the same workflow instance at the same time)
  • [ ] When publishing changes to a workflow, the updated workflow takes effect on all other nodes.
  • [ ] Workflows containing timer activities only execute on one node.

sfmskywalker avatar Jun 27 '22 10:06 sfmskywalker
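
The first checklist item above amounts to mutual exclusion per workflow instance. As a rough sketch of the idea, assuming a hypothetical `InMemoryLockProvider` that is not an Elsa type: a node may process an instance only while it holds the lock for that instance ID. A real cluster would need a lock backed by shared infrastructure (for example a database or Redis), since an in-process dictionary is only visible to one node. The `TryRemove` overload used here requires .NET 5 or later.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

var locks = new InMemoryLockProvider();

// node-1 takes the lock for the workflow instance, so node-2 is refused.
bool first = locks.TryAcquire("instance-42", "node-1");
bool second = locks.TryAcquire("instance-42", "node-2");

// After node-1 releases the lock, node-2 can acquire it.
locks.Release("instance-42", "node-1");
bool third = locks.TryAcquire("instance-42", "node-2");

Console.WriteLine($"node-1 acquired: {first}");
Console.WriteLine($"node-2 acquired while held: {second}");
Console.WriteLine($"node-2 acquired after release: {third}");

// Hypothetical lock provider keyed by workflow instance ID.
class InMemoryLockProvider
{
    private readonly ConcurrentDictionary<string, string> _owners = new();

    // Succeeds only if nobody currently holds the lock for this resource.
    public bool TryAcquire(string resource, string owner) => _owners.TryAdd(resource, owner);

    // Removes the lock only if it is held by the given owner.
    public bool Release(string resource, string owner) =>
        _owners.TryRemove(new KeyValuePair<string, string>(resource, owner));
}
```

A test of this checklist item would look for the opposite outcome: two nodes logging that they are processing the same instance ID at overlapping times.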

> But at the very least I can mention the 3 things to check for:
>
>   • [ ] In a multi-node environment, a given workflow instance is processed by only one node at a time (no two nodes can ever process the same workflow instance at the same time)
>   • [ ] When publishing changes to a workflow, the updated workflow takes effect on all other nodes.
>   • [ ] Workflows containing timer activities only execute on one node.

Thank you! I understand you have a lot on your plate regarding Elsa. We appreciate your help!

I am not 100% sure how to test the first two of your checkpoints, but I created a workflow with a StartAt activity (it took a little time to figure out the datetime format), and it ran only once (we currently have two nodes), so I know Quartz is good. In other words, only one workflow instance ran.

Is it possible that our problem (not finding an existing workflow definition by name) is not related to the multinode setup, but rather to something else?

vargaendre avatar Jun 27 '22 13:06 vargaendre

One more data point. While testing the workflow, I saw this: [image] After a little time and a refresh, it changed to this: [image: Untitled] So whatever generated this UI did not find the workflow definition at first, and then it did. Since then I have refreshed the page many times and always got the workflow definition. Could it be that the workflow definition could not be found only while it was being run (by another node)?

vargaendre avatar Jun 27 '22 13:06 vargaendre

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 31 '22 02:08 stale[bot]