Silo has activated grains but these are not processing any messages
We are trying to migrate from Orleans 3.x to 7.2.2. We are running on Kubernetes (Linux containers) and using Consul for clustering. When performing a rolling deploy we notice that silos / pods are deployed successfully, but the last created pods are not processing any messages / requests (even though there are grain activations).
Just a side note: when killing / restarting a random pod, the newly created pod tries to join the cluster but we see the same result (the pod is not processing any messages even though there are activations).
Any idea what may be happening (this is working correctly on 3.x)?
Where are the requests for this system originating from? It's interesting that the grains are being activated at all if there are no requests being sent their way, since grains are only activated by requests. Do you have logs from those silos and other silos which would be making requests? Do you have custom placement?
Does this repro on 7.2.1? Do you have a custom directory?
Hi Reuben,
We are not using a custom placement strategy or a custom directory; the requests originate from a GrainService. What we find interesting is that even the ManagementGrain shows the same issue.
In terms of errors this is the one that seems most relevant:
Error processing cluster membership updates
exceptions.Message: Argument must have a previous version to the current instance. Expected <= -9223372036854775808, encountered 12 (Parameter 'previous')
exception.StackTraceString: at Orleans.Runtime.ClusterMembershipSnapshot.CreateUpdate(ClusterMembershipSnapshot previous) in /_/src/Orleans.Runtime/MembershipService/ClusterMembershipSnapshot.cs:line 76
at Orleans.Runtime.ActivationMigrationManager.ProcessMembershipUpdates() in /_/src/Orleans.Runtime/Catalog/ActivationMigrationManager.cs:line 164
Oh, are you attempting to perform a rolling upgrade from 3.x to 7.x? That is not supported - the two versions are not wire compatible
No, it is not a rolling upgrade in that sense - we cleaned out any old state / data. In the above case all silos are running Orleans 7.2.2, but when we redeploy, a silo must first die before the new one is made part of the cluster.
I removed all silos and redeployed (basically ensuring that everything is completely clean). The deployment succeeded and there were activations across all silos, and they were all partially processing (requests are not succeeding all the time); we are encountering the following error:
Forwarding failed: tried to forward message Request [S10.150.26.104:11111:56646254 sys.svc.user.9D282C6F/10.150.26.104:11111@56646254]->[S10.150.36.246:11111:56646254 consumer/odinKafkaConsumer/prod-kronos-user_email_notification-v1/21] Odin.Messaging.Kafka.Consuming.IConsumerGrain.ConsumeAndPublish() #198152[ForwardCount=2] for 2 times after "This grain is active on another host (S10.150.36.246:11111:56646254)." to invalid activation. Rejecting now.
Seems to be related to this: https://github.com/dotnet/orleans/issues/8658#issuecomment-1766080837
Does that error only occur during rolling deployment, @tiasmt, i.e., is the error resolved after deployment completes?
This is expected, to a degree, while the grain directory is in flux (i.e., during cluster membership changes, such as a rolling deployment). You can use an external directory (like Redis) to avoid this kind of issue.
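As a rough illustration, switching the default grain directory to Redis looks something like this, assuming the Microsoft.Orleans.GrainDirectory.Redis package; the endpoint is a placeholder and the exact option shape may vary by version:

```csharp
using Orleans.Hosting;
using StackExchange.Redis;

// Sketch only: makes Redis the default grain directory for all grains.
// Assumes the ConfigurationOptions-style options used by the 7.x Redis providers;
// "my-redis:6379" is a placeholder endpoint.
siloBuilder.UseRedisGrainDirectoryAsDefault(options =>
{
    options.ConfigurationOptions = ConfigurationOptions.Parse("my-redis:6379");
});
```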
@benjaminpetit we should consider further mitigations for these conditions until we can rectify it entirely (e.g., using a storage-backed directory), such as potentially inserting a delay, waiting for directory/membership stability, or increasing the allowed number of hops. Currently, it's 2, configured via SiloMessagingOptions.MaxForwardCount. You could try increasing that to 4 or 5 to alleviate the majority of these issues, @tiasmt.
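For reference, a minimal sketch of bumping that limit (the value 5 is just the suggestion above; siloBuilder is whatever ISiloBuilder you configure):

```csharp
using Orleans.Configuration;

// Sketch only: allows a message to be forwarded up to 5 times while the
// directory/membership is settling (the default is 2).
siloBuilder.Configure<SiloMessagingOptions>(options =>
{
    options.MaxForwardCount = 5;
});
```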
Does that error only occur during rolling deployment, @tiasmt, i.e., is the error resolved after deployment completes?
I observed this during a rolling deploy with DefaultVersionSelectorStrategy = "LatestVersion". The two errors (https://github.com/dotnet/orleans/issues/8658#issuecomment-1766080837) seemed to persist for at least one hour (before I shut it down). After changing to the Redis grain directory I haven't been able to reproduce this state, but I have some further testing to do before I would say it's 100% solid.
I am observing the same thing on my end (the error persisted for around 3 hours with the latest deployment).
I will try and change the SiloMessagingOptions.MaxForwardCount to see if it has any effect
I performed a rolling deploy (without removing all the silos beforehand) with the option SiloMessagingOptions.MaxForwardCount set to 5 and the original issue resurfaced - the last created pods are not processing any messages / requests (even though there are grain activations).
It seems that the invalid activation occurs when all silos can be deployed simultaneously (the cluster needs to be completely cleared beforehand, if that is of any help).
Are you using a custom serializer (e.g., JSON)? I'm looking for the source of the bug now - thank you for bearing with me. What seems to be happening is that cluster membership is being rewound to its initial version, which should not be allowed.
I wonder if there is a bug in the Consul cluster membership provider where it's losing the version.
EDIT: I opened a PR to prevent the membership version from ever rewinding, since that is not allowed and is the direct cause of the error you posted above: #8673. It's possible that there was a race somewhere, but I otherwise don't see why membership is rewinding, unless you have a custom serializer which is not correctly serializing the membership version. This should prevent that from causing as much harm, but it won't fix the root of the issue. We should block the Newtonsoft.Json, System.Text.Json, etc, serializers from serializing Orleans' own types by including a pre-filter which prevents our assemblies from being included.
Yes, we are using a custom serializer - JSON (but only for specific types, as per https://learn.microsoft.com/en-us/dotnet/orleans/host/configuration-guide/serialization-configuration?pivots=orleans-7-0).
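For context, the type-scoped registration from that doc page looks roughly like this; the namespace check is a placeholder for our own types:

```csharp
using Orleans.Serialization;

siloBuilder.Services.AddSerializer(serializerBuilder =>
{
    // Only types from our own namespace go through the JSON serializer;
    // everything else (including Orleans' internal types) keeps the default
    // Orleans serializer. "MyCompany.MyApp" is a placeholder.
    serializerBuilder.AddJsonSerializer(
        isSupported: type => type.Namespace?.StartsWith("MyCompany.MyApp") == true);
});
```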
Could this be impacting the membership in some way? I will try and remove the custom serializer later today (where possible) and see if it fixes the issue
Update: removed the custom JSON serializer but the issue related to silos not processing messages is still occurring
Do you still have the same exception message as above? i.e.:
Error processing cluster membership updates
exceptions.Message: Argument must have a previous version to the current instance. Expected <= -9223372036854775808, encountered 12 (Parameter 'previous')
exception.StackTraceString: at Orleans.Runtime.ClusterMembershipSnapshot.CreateUpdate(ClusterMembershipSnapshot previous) in /_/src/Orleans.Runtime/MembershipService/ClusterMembershipSnapshot.cs:line 76
at Orleans.Runtime.ActivationMigrationManager.ProcessMembershipUpdates() in /_/src/Orleans.Runtime/Catalog/ActivationMigrationManager.cs:line 164
If so, when you removed the JSON serializer, did you perform a clean deployment?
Hi, the error is not present anymore but the issue is still occurring. What we have noticed is that when the silos come up within a couple of seconds, everything seems to work fine, but when some silos take longer to initialize, those silos do not process any messages.
We also managed to replicate the issue by performing a deployment and then killing some random silos (when they restart, they do not process any messages).
This was replicated on a dummy application (with bare-minimum code from our side), and we observed the problem on the ManagementGrain.
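For anyone trying to reproduce this, a minimal sketch of the kind of ManagementGrain probe meant here, assuming an IGrainFactory is available inside an async method:

```csharp
// Sketch only: asks the management grain which silos the cluster considers active.
// If this call hangs or fails when routed to a particular silo, that silo is the
// one not processing messages.
var managementGrain = grainFactory.GetGrain<IManagementGrain>(0);
var hosts = await managementGrain.GetHosts(onlyActive: true);
foreach (var (silo, status) in hosts)
{
    Console.WriteLine($"{silo} -> {status}");
}
```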
Are you able to share logs? If you prefer not to upload them here, we could share them via email instead
Trying to gather logs from my end and will share them once available
We are seeing "This grain is active on another host" a lot too, I havent managed to dig into it yet but we are using the Redis grain directory and what I noticed at first glance was:
- Grain was active in some silo that crashed
- New silo came up, tried to activate the grain, got a RedisTimeoutException
After this, no grain calls to this grain succeed; it's always the same error as above. Max forward count is set to 5.
Does this feel like the same issue at all @ReubenBond? If so I can collect some logs and the current registration in Redis and add here too for some additional context.
@tanordheim this looks like a different issue to me - the Redis timeout is the likely cause. It may be worth opening a new issue with details.
Are you able to share logs? If you prefer not to upload them here, we could share them via email instead
I've given you access to a repository with the logs.
The logs are from the dummy application I mentioned previously (without the JSON serializer). I started 6 silos and they were functioning as expected; I then killed 3 random silos, and when they restarted they never processed any messages.
Let me know if I can provide any more details / information
It looks like your cluster is still being affected by this issue: https://github.com/dotnet/orleans/issues/8667, since cluster membership is failing to update. Can you show me how you're configuring your system? Also, your log file doesn't seem to include the Exception object which is passed to the Log call, so details are missing.
Updated the logs to show the Exception
Configuration is as follows:
```csharp
.UseOrleans((ctx, builder) =>
{
    var siloOptions = new SiloConfigBuilder()
        .FromConfig(ctx.Configuration);

    microSvcBuilder.ConfigureOrleansDelegate?.Invoke(context, siloOptions, builder);

    if (siloOptions.SiloPort == 0)
        siloOptions.SiloPort = hostBuilder.GetAvailablePort(11111, 12000);

    builder
        .Configure<ClusterOptions>(opts =>
        {
            opts.ClusterId = context.AppInfo.ClusterId;
            opts.ServiceId = context.AppInfo.ServiceId;
        })
        .AddMemoryGrainStorage(OrleansStoreNames.GrainMemory)
        .AddIncomingGrainCallFilter<UnknownErrorIncomingCallFilter>()
        .AddOutgoingGrainCallFilter<UnknownErrorOutgoingCallFilter>()
        .AddIncomingGrainCallFilter<LoggingIncomingCallFilter>()
        .AddOutgoingGrainCallFilter<LoggingOutgoingCallFilter>();

    builder.Configure<ClusterMembershipOptions>(options =>
    {
        options.ExtendProbeTimeoutDuringDegradation = siloOptions.ClusterConfig.ExtendProbeTimeout;
        options.EnableIndirectProbes = siloOptions.ClusterConfig.EnableIndirectProbes;
        options.LocalHealthDegradationMonitoringPeriod =
            TimeSpan.FromSeconds(siloOptions.ClusterConfig.LocalHealthDegradationMonitoringPeriod);
    });

    if (siloOptions.OrleansDashboardConfig.IsEnabled)
        builder.UseDashboard(options =>
        {
            options.Username = siloOptions.OrleansDashboardConfig.Username;
            options.Password = siloOptions.OrleansDashboardConfig.Password;
            options.Port = siloOptions.OrleansDashboardConfig.Port;
            options.CounterUpdateIntervalMs = siloOptions.OrleansDashboardConfig.CounterUpdateIntervalMs;
        });

    // Consul
    builder
        .ConfigureEndpoints(
            siloOptions.SiloPort,
            siloOptions.GatewayPort,
            listenOnAnyHostAddress: true)
        .UseConsulSiloClustering(opts =>
        {
            opts.KvRootFolder = siloOptions.ConsulConfig.KvRootFolder;
            var address = new Uri(siloOptions.ConsulConfig.Uri);
            opts.ConfigureConsulClient(address);
        });
```
Just an update from my end:
- Switched clustering from Consul to AdoNet - no difference, the issue was still present (silos not processing messages).
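For completeness, a rough sketch of the AdoNet clustering registration that replaced UseConsulSiloClustering above; the invariant and connection string are placeholders:

```csharp
using Orleans.Hosting;

// Sketch only: invariant and connection string are placeholders, and the
// ADO.NET clustering scripts for the chosen database must already be applied.
builder.UseAdoNetClustering(options =>
{
    options.Invariant = "MySql.Data.MySqlClient"; // e.g. for MySQL
    options.ConnectionString = "<connection string>";
});
```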
Just an additional note: if we have 6 silos in total and currently 2 of them are processing messages, killing those 2 causes the other 4 to take over and start processing the messages themselves (but when the original 2 recover, they do not process messages again).
EDIT: Managed to split the issue further, which should provide a bit of clarity:
Consul
- When using the JSON serializer we get the Error processing cluster membership updates error; removing the serializer solves this error.
AdoNet
- Works correctly both with the JSON serializer and without it.
Dashboard
- The original issue related to grains not processing any messages seems to be an issue related to the dashboard (the grains being marked with 0 throughput and 0 latency are actually processing messages).
@ReubenBond shall I close this issue and open a separate one for the Error processing cluster membership updates error? I believe https://github.com/dotnet/orleans/pull/8673 will solve it.
@tiasmt am I understanding correctly that this issue does not occur when using the AdoNet clustering provider, only when using both Consul & JSON.NET?
Yes, the Error processing cluster membership updates only occurs when using Consul and the JSON serializer (I was using System.Text.Json).
When using AdoNet this issue does not occur.
Hi @ReubenBond,
I recently upgraded my Orleans project from 3.6 to 8.2. I'm also facing this issue: when my CI/CD tool performs a rolling upgrade of the code with a big change, several grains in the new silos stop processing messages. I tried both the memory grain directory (the default) and the Redis grain directory, and the issue persists.
My question is: will #9103 and hosting DistributedGrainDirectory fix this issue?
Thank you!
@shykln it's difficult to say without knowing more. Are you also using Consul for clustering?
I use ADO.NET clustering with MySQL.