
Issue with `Storage` table and grain re-activation

Open callixte opened this issue 4 years ago • 4 comments

Hello all,

Context:

  • Orleans 3.0.0
  • Stream using SMS
  • Grain storage using ADO.NET on MySQL, used only for SMS; we have no persistent grains.

Issue: Our grains get activated from external stimulus. In a test environment, we control that stimulus and know when activations occur. We have noticed that when we restart the silo, some grains get activated for no apparent reason, sometimes several times with the same primary key. If we clear the Storage table before starting the silo, the problem goes away. Our Storage table contains only items with the Orleans.Streams.PubSubRendezvousGrain GrainTypeString.

Questions:

  • Is SMS usable in production?
  • Is there a way to clean the Storage table properly upon shutdown of a silo?

callixte avatar Jan 30 '20 11:01 callixte

Could you please clarify a couple of things. The way I read the above, "Grain storage using ADO.NET on MySQL" seems to contradict "we have no persistent grains". Did you mean that your grains don't use persistence, and it is only configured for pubsub?

Our grains get activated from external stimulus.

Do you mean you make calls to those grains or something else activates them?

We have noticed that when we restart the silo, some grains get activated for no apparent reason,

They may get activated by the pubsub if they were producers or consumers of streams, when other grains publish or subscribe to the same streams.

If we clear the Storage table before starting the silo, the problem goes away. Our Storage table contains only items with the Orleans.Streams.PubSubRendezvousGrain GrainTypeString.

By doing that you delete pubsub state, and it makes sense that when you use the same streams later, you start from a clean slate.

Is SMS usable in production?

Yes. It's been used in production for many years. But, not being a true persistent queue provider, SMS has quirks in its behavior that are important to take into consideration.

Is there a way to clean the Storage table properly upon shutdown of a silo?

First, I'd like to understand what exactly you want to achieve.

The pubsub state persisted via PubSubRendezvousGrains is used by the whole cluster, so it doesn't make sense to wipe it when a single silo in the cluster shuts down. If you are shutting down the whole cluster, you can clean the table afterwards with a script, or you can simply use a different Service ID the next time you start the cluster, so that the data stored by the cluster you shut down won't be used by the new one.
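Changing the Service ID is just a matter of cluster configuration on the silo host. A minimal sketch, assuming Orleans 3.x and placeholder ID values:

    using Orleans.Configuration;
    using Orleans.Hosting;

    var builder = new SiloHostBuilder()
        .Configure<ClusterOptions>(options =>
        {
            options.ClusterId = "dev";           // placeholder
            // A fresh ServiceId means the new deployment ignores the
            // pubsub state persisted under the previous ServiceId.
            options.ServiceId = "my-service-v2"; // placeholder
        });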

If you don't want to persist pubsub state in storage, you could configure it to use the in-memory storage provider instead of the ADO.NET one. Such a configuration would be, of course, unreliable in case of a silo failure.
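A sketch of that configuration, assuming the "PubSubStore" provider name used for the ADO.NET registration:

    // Replaces the AddAdoNetGrainStorage("PubSubStore", ...) registration;
    // pubsub state is then kept in silo memory and lost on restart.
    siloBuilder.AddMemoryGrainStorage("PubSubStore");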

You can also consider whether implicit subscriptions would fit your needs instead. It's hard to tell without knowing more details about the application and its requirements.
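For reference, an implicit subscription is declared with an attribute on the consumer grain, and the runtime activates the grain automatically when events arrive on the matching stream namespace. A minimal sketch; the provider name "SMS" and the namespace "ticks" are illustrative:

    using System.Threading.Tasks;
    using Orleans;
    using Orleans.Streams;

    public interface ITickConsumerGrain : IGrainWithGuidKey { }

    [ImplicitStreamSubscription("ticks")]
    public class TickConsumerGrain : Grain, ITickConsumerGrain
    {
        public override async Task OnActivateAsync()
        {
            // Attach a handler to the implicitly subscribed stream.
            var stream = GetStreamProvider("SMS")
                .GetStream<string>(this.GetPrimaryKey(), "ticks");
            await stream.SubscribeAsync((item, token) =>
            {
                // Process the event here.
                return Task.CompletedTask;
            });
            await base.OnActivateAsync();
        }
    }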

sergeybykov avatar Feb 01 '20 23:02 sergeybykov

Thank you for your response. I am going to clarify a few things.

Could you please clarify a couple of things. The way I read the above, "Grain storage using ADO.NET on MySQL" seems to contradict "we have no persistent grains". Did you mean that your grains don't use persistence, and it is only configured for pubsub?

Yes. I meant that none of the grains we implemented uses persistence, and that storage is configured for pubsub like this:

    // Storage is registered only for the stream pubsub state.
    .AddAdoNetGrainStorage("PubSubStore", optionsBuilder =>
    {
        optionsBuilder.Invariant = "MySql.Data.MySqlClient";
        optionsBuilder.ConnectionString = connectionString;
    })

Our grains get activated from external stimulus.

Do you mean you make calls to those grains or something else activates them?

Yes. Another part of the stack calls them, and they get activated at that point. During activation they subscribe to streams; during deactivation they unsubscribe.
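Roughly like the following sketch. This is not our real code (the provider name "SMS", the namespace "events", and the grain names are illustrative; in our application the subscription is delegated to another object):

    using System.Threading.Tasks;
    using Orleans;
    using Orleans.Streams;

    public interface IWorkerGrain : IGrainWithGuidKey { }

    public class WorkerGrain : Grain, IWorkerGrain
    {
        private StreamSubscriptionHandle<string> _handle;

        public override async Task OnActivateAsync()
        {
            var stream = GetStreamProvider("SMS")
                .GetStream<string>(this.GetPrimaryKey(), "events");
            _handle = await stream.SubscribeAsync(
                (item, token) => Task.CompletedTask); // handling elided
            await base.OnActivateAsync();
        }

        public override async Task OnDeactivateAsync()
        {
            // Unsubscribing removes this grain from the persisted pubsub
            // state, so later publishes should not re-activate it.
            if (_handle != null)
                await _handle.UnsubscribeAsync();
            await base.OnDeactivateAsync();
        }
    }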

We have noticed that when we restart the silo, some grains get activated for no apparent reason,

They may get activated by the pubsub if they were producers or consumers of streams, when other grains publish or subscribe to the same streams.

Even if they unsubscribed? Here is what we experienced:

  • The cluster runs with two silos.
  • A grain gets called. It activates, subscribes to streams, and does its job.
  • After this, it deactivates, and during deactivation it unsubscribes. We have logs showing this.
  • The grain soon re-activates, maybe because something was pushed to one of the streams. It is curious though, because subscriptions are delegated to another object, not the grain itself.
  • If we shut down the cluster and start it again, the grain gets activated.

I am trying to put together a simple project that shows this behavior to share with you, it would be easier to explain.

callixte avatar Feb 03 '20 14:02 callixte

I managed to isolate the issue in a simple project. You can find the code here: https://github.com/callixte/OrleansPoC It derives from the HelloWorld sample project. To make it work, you will need to add a db-connection-string.txt file to both the server and the client, containing the connection information for a MySQL database used for clustering and PubSub persistence.

The server has two grains: one consumer and one producer of the same stream. In the client, you can call tick<message> to send a tick on the stream from the producer grain, and read to get the tick from the consumer grain.

Once you are sure the two grains communicate, let two minutes pass and the collection of idle grains will deactivate them. The logs show that they are deactivated, and almost immediately the consumer grain gets activated again. Sometimes it happens differently: the producer grain gets deactivated, the consumer grain fails to deactivate, and then a producer activates. In that case, it is impossible to call the consumer grain again from the client; the call hangs indefinitely, and the logs show the silo trying to activate it.

callixte avatar Feb 07 '20 13:02 callixte

We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.

ghost avatar Jul 28 '22 23:07 ghost