orleans
Orleans Silos freezing and crashing
Context: Since we migrated to Orleans 7, we have experienced a number of performance issues.

We are running Orleans in Kubernetes with Kubernetes hosting and clustering. Since the migration we have had issues with silos not responding — few actual exceptions, mostly just timeouts. We have added call-chain reentrancy where needed, but the cluster is still not reliable. Health-probing the silos and killing them helps, but the system is down until the failing-probe threshold is hit, so that is not a good fix. We probably have issues somewhere in our own code, but it is hard to know where to start looking when the silos just die.
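For context, this is roughly how we opt into reentrancy (a simplified sketch — grain and method names here are illustrative, not our actual code; the call-chain API is from Orleans 7.1+):

```csharp
// 1. Interleave every request to a grain:
[Reentrant]
public class ProjectGrain : Grain, IProjectGrain { /* ... */ }

// 2. Interleave a single interface method regardless of grain state:
public interface IProjectGrain : IGrainWithStringKey
{
    [AlwaysInterleave]
    Task<ProjectView> GetView();
}

// 3. Call-chain reentrancy (Orleans 7.1+): allow re-entrant calls only
//    within the current call chain, e.g. A -> B -> A, scoped to this call.
public async Task DoWork()
{
    using var scope = RequestContext.AllowCallChainReentrancy();
    await SomeOtherGrainCallThatMayCallBack();
}
```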
2024-06-04 12:41:10.476 | {"Message":"Exception publishing client routing table to silo \"S10.244.6.9:11111:76427299\"","MessageTemplate":"Exception publishing client routing table to silo {SiloAddress}","Exception":{"Type":"Orleans.Runtime.OrleansMessageRejectionException","Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.9:11111:76427299, will retry after 895.3856ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","StackTrace":" at Orleans.Serialization.Invocation.ResponseCompletionSource.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 98\n at System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Runtime.GrainDirectory.ClientDirectory.PublishUpdates() in /_/src/Orleans.Runtime/GrainDirectory/ClientDirectory.cs:line 499"},"SiloAddress":"S10.244.6.9:11111:76427299","ExceptionDetail":{"HResult":-2146233088,"Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.9:11111:76427299, will retry after 895.3856ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","Source":"System.Private.CoreLib","TargetSite":"Void 
Throw()","Type":"Orleans.Runtime.OrleansMessageRejectionException"},"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-58646859d8-5htvm"} |
| | 2024-06-04 12:41:10.385 | {"Message":"Indirect probe request #60 to silo \"S10.244.6.9:11111:76427299\" via silo \"S10.244.5.16:11111:76502162\" failed after 00:00:01.5819712 with a direct probe response time of 00:00:01.5604363. Failure message: \"Encountered exception \nExc level 0: Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.244.6.9:11111:76427299. See InnerException\n ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.244.6.9:11111. Error: HostUnreachable\n at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 65\n at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61\n at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 193\n --- End of inner exception stack trace ---\n at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 221\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226\n at Orleans.Serialization.Invocation.ResponseCompletionSource.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 98\n at 
System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Internal.OrleansTaskExtentions.WithTimeout(Task taskToComplete, TimeSpan timeout, String exceptionMessage) in /_/src/Orleans.Core/Async/TaskExtensions.cs:line 87\n at Orleans.Runtime.MembershipService.MembershipSystemTarget.ProbeIndirectly(SiloAddress target, TimeSpan probeTimeout, Int32 probeNumber) in /_/src/Orleans.Runtime/MembershipService/MembershipSystemTarget.cs:line 86\". Intermediary health score: 0","MessageTemplate":"Indirect probe request #{Id} to silo {SiloAddress} via silo {IntermediarySiloAddress} failed after {RoundTripTime} with a direct probe response time of {ProbeResponseTime}. Failure message: {FailureMessage}. Intermediary health score: {IntermediaryHealthScore}","Id":60,"SiloAddress":"S10.244.6.9:11111:76427299","IntermediarySiloAddress":"S10.244.5.16:11111:76502162","RoundTripTime":"00:00:01.5819712","ProbeResponseTime":"00:00:01.5604363","FailureMessage":"Encountered exception \nExc level 0: Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.244.6.9:11111:76427299. See InnerException\n ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.244.6.9:11111. 
Error: HostUnreachable\n at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 65\n at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61\n at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 193\n --- End of inner exception stack trace ---\n at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 221\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226\n at Orleans.Serialization.Invocation.ResponseCompletionSource.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 98\n at System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Internal.OrleansTaskExtentions.WithTimeout(Task taskToComplete, TimeSpan timeout, String exceptionMessage) in /_/src/Orleans.Core/Async/TaskExtensions.cs:line 87\n at Orleans.Runtime.MembershipService.MembershipSystemTarget.ProbeIndirectly(SiloAddress target, TimeSpan probeTimeout, Int32 probeNumber) in /_/src/Orleans.Runtime/MembershipService/MembershipSystemTarget.cs:line 
86","IntermediaryHealthScore":0,"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-58646859d8-96gfk"} |
| | 2024-06-04 12:41:10.372 | {"Message":"Connection id \"\"0HN44G8KKFI9R\"\", Request id \"\"0HN44G8KKFI9R:000000E9\"\": An unhandled exception was thrown by the application.","MessageTemplate":"Connection id \"{ConnectionId}\", Request id \"{TraceIdentifier}\": An unhandled exception was thrown by the application.","Exception":{"Type":"Orleans.Runtime.OrleansMessageRejectionException","Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.244.6.9:11111:76427299. See InnerException\n ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.244.6.9:11111. Error: HostUnreachable\n at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 65\n at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61\n at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 193\n --- End of inner exception stack trace ---\n at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 221\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","StackTrace":" at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 
230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at OrleansDashboard.DashboardClient.ClusterStats()\n at OrleansDashboard.DashboardMiddleware.Invoke(HttpContext context)\n at Microsoft.AspNetCore.Builder.Extensions.MapMiddleware.InvokeCore(HttpContext context, PathString matchedPath, PathString remainingPath)\n at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)"},"ConnectionId":"0HN44G8KKFI9R","TraceIdentifier":"0HN44G8KKFI9R:000000E9","EventId":{"Id":13,"Name":"ApplicationError"},"RequestId":"0HN44G8KKFI9R:000000E9","RequestPath":"/dashboard/ClusterStats","ExceptionDetail":{"HResult":-2146233088,"Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.244.6.9:11111:76427299. See InnerException\n ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.244.6.9:11111. 
Error: HostUnreachable\n at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 65\n at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61\n at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 193\n --- End of inner exception stack trace ---\n at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 221\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","Source":"System.Private.CoreLib","TargetSite":"Void Throw()","Type":"Orleans.Runtime.OrleansMessageRejectionException"},"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-58646859d8-96gfk"} |
| | 2024-06-04 12:41:10.371 | {"Message":"Connection attempt to endpoint \"S10.244.6.9:11111:76427299\" failed","MessageTemplate":"Connection attempt to endpoint {EndPoint} failed","Exception":{"Type":"Orleans.Networking.Shared.SocketConnectionException","Message":"Unable to connect to 10.244.6.9:11111. Error: HostUnreachable","StackTrace":" at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 65\n at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61\n at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 193"},"EndPoint":"S10.244.6.9:11111:76427299","ExceptionDetail":{"HResult":-2146233088,"Message":"Unable to connect to 10.244.6.9:11111. Error: HostUnreachable","Source":"Orleans.Core","TargetSite":"Void MoveNext()","Type":"Orleans.Networking.Shared.SocketConnectionException"},"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-58646859d8-5htvm"} |
I don't have many proposals for a next step here, as we have already scoured the solution for reentrancy issues.
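One mitigation we are considering for the "down until the failing-probe threshold is hit" window is tightening Orleans's own failure detection so a dead silo is voted out faster. A sketch (the option values below are illustrative, not recommendations — tuned too aggressively they can evict healthy silos under load):

```csharp
// Sketch: shorten the time a dead silo stays in the membership table.
siloBuilder.Configure<ClusterMembershipOptions>(options =>
{
    options.ProbeTimeout = TimeSpan.FromSeconds(5); // per-probe timeout
    options.NumMissedProbesLimit = 2;               // missed probes before voting a silo suspect
    options.NumVotesForDeathDeclaration = 2;        // votes needed to declare a silo dead
    options.EnableIndirectProbes = true;            // the indirect probes visible in the logs above
});
```

This only reduces the detection window, though; it does not explain why the silos freeze in the first place.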
2024-06-05 13:58:58.922 | {"Message":"Created \"ActivationTaskScheduler-3:Queued=0\" with GrainContext=\"[SystemTarget: S10.244.6.40:11111:76593537/sys.svc.user.36F5F3BF/10.244.6.40:11111@76593537@b81578bf00000000877f5b7300000000]\"","MessageTemplate":"Created {TaskScheduler} with GrainContext={GrainContext}","TaskScheduler":"ActivationTaskScheduler-3:Queued=0","GrainContext":"[SystemTarget: S10.244.6.40:11111:76593537/sys.svc.user.36F5F3BF/10.244.6.40:11111@76593537@b81578bf00000000877f5b7300000000]","app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-mkd8v"} |
| | 2024-06-05 13:58:58.916 | {"Message":"Monitoring cluster membership updates","MessageTemplate":"Monitoring cluster membership updates","app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-mkd8v"} |
| | 2024-06-05 13:58:58.911 | {"Message":"Created \"ActivationTaskScheduler-2:Queued=0\" with GrainContext=\"[SystemTarget: S10.244.6.40:11111:76593537/sys.svc.migrator/10.244.6.40:11111@76593537@e8f4d5ad00000000877f5b7300000000]\"","MessageTemplate":"Created {TaskScheduler} with GrainContext={GrainContext}","TaskScheduler":"ActivationTaskScheduler-2:Queued=0","GrainContext":"[SystemTarget: S10.244.6.40:11111:76593537/sys.svc.migrator/10.244.6.40:11111@76593537@e8f4d5ad00000000877f5b7300000000]","app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-mkd8v"} |
| | 2024-06-05 13:58:58.824 | {"Message":"Starting \"VirtualBucketsRingProvider\" on silo \"S10.244.6.40:11111:76593537/x9F156319\".","MessageTemplate":"Starting {Name} on silo {SiloAddress}.","Name":"VirtualBucketsRingProvider","SiloAddress":"S10.244.6.40:11111:76593537/x9F156319","app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-mkd8v"} |
| | 2024-06-05 13:58:58.808 | {"Message":"Created \"ActivationTaskScheduler-1:Queued=0\" with GrainContext=\"[SystemTarget: S10.244.6.40:11111:76593537/sys.svc.catalog/10.244.6.40:11111@76593537@035e1f7900000000877f5b7300000000]\"","MessageTemplate":"Created {TaskScheduler} with GrainContext={GrainContext}","TaskScheduler":"ActivationTaskScheduler-1:Queued=0","GrainContext":"[SystemTarget: S10.244.6.40:11111:76593537/sys.svc.catalog/10.244.6.40:11111@76593537@035e1f7900000000877f5b7300000000]","app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-mkd8v"} |
| | 2024-06-05 13:58:58.765 | {"Message":"Exception during Grain method call of message \"ReadOnly IsAlwaysInterleave Request [S10.244.6.41:11111:76586547 sys.client/313d93c3d4af4ddeb4e59342800160c2]->[S10.244.8.130:11111:76586541 grantproject/4a02897701b94d04acb30a37a0a4dccd] Innbyggertjenester.GrainInterfaces.Donatio.IGrantProjectGrain[(Innbyggertjenester.GrainInterfaces.Donatio.IGrantProjectGrain)Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain].GetApplicationView() #6904\": ","MessageTemplate":"Exception during Grain method call of message {Message}: ","Exception":{"Type":"Orleans.Runtime.OrleansMessageRejectionException","Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","StackTrace":" at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 732\n at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29\n at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in 
/_/src/Orleans.Runtime/Placement/PlacementService.cs:line 364\n at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync\|39_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 471\n at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain.GetApplication(Grant grant) in /app/src/Innbyggertjenester.Grains/Donatio/StatefulGrains/GrantProjectGrain.cs:line 59\n at Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain.GetApplicationView() in /app/src/Innbyggertjenester.Grains/Donatio/StatefulGrains/GrantProjectGrain.cs:line 296\n at Orleans.Runtime.TaskRequest`1.CompleteInvokeAsync(Task`1 resultTask)\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 132\n at OrleansDashboard.Implementation.GrainProfilerFilter.Invoke(IIncomingGrainCallContext context)\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 94\n at Innbyggertjenester.Silo.ExceptionConversionFilter.Invoke(IIncomingGrainCallContext context) in /app/src/Innbyggertjenester.Silo/ExceptionConversionFilter.cs:line 51\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 94\n at Orleans.Runtime.InsideRuntimeClient.Invoke(IGrainContext target, Message message) in /_/src/Orleans.Runtime/Core/InsideRuntimeClient.cs:line 263"},"_Message":"ReadOnly IsAlwaysInterleave Request [S10.244.6.41:11111:76586547 sys.client/313d93c3d4af4ddeb4e59342800160c2]->[S10.244.8.130:11111:76586541 grantproject/4a02897701b94d04acb30a37a0a4dccd] 
Innbyggertjenester.GrainInterfaces.Donatio.IGrantProjectGrain[(Innbyggertjenester.GrainInterfaces.Donatio.IGrantProjectGrain)Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain].GetApplicationView() #6904","EventId":{"Id":100322},"ExceptionDetail":{"HResult":-2146233088,"Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","Source":"System.Private.CoreLib","TargetSite":"Void Throw()","Type":"Orleans.Runtime.OrleansMessageRejectionException"},"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-k9tmk"} |
| | 2024-06-05 13:58:58.764 | {"Message":"HandleMessage \"ReadOnly IsAlwaysInterleave Unrecoverable Rejection (info: Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226\n at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 732\n at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29\n at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 364\n at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync\|39_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 471) Response [S10.244.8.130:11111:76586541 grantapplication/5d9cbe2600254c048a31e683585864f9]->[S10.244.8.130:11111:76586541 grantproject/4a02897701b94d04acb30a37a0a4dccd] Orleans.Runtime.RejectionResponse #37288\"","MessageTemplate":"HandleMessage {Message}","_Message":"ReadOnly IsAlwaysInterleave 
Unrecoverable Rejection (info: Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226\n at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 732\n at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29\n at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 364\n at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync\|39_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 471) Response [S10.244.8.130:11111:76586541 grantapplication/5d9cbe2600254c048a31e683585864f9]->[S10.244.8.130:11111:76586541 grantproject/4a02897701b94d04acb30a37a0a4dccd] Orleans.Runtime.RejectionResponse 
#37288","EventId":{"Id":101523},"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-k9tmk"} |
| | 2024-06-05 13:58:58.764 | {"Message":"Creating Unrecoverable rejection with info '\" Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226\n at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 732\n at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29\n at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 364\n at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync\|39_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 471\"' at:\n\" at Orleans.Runtime.MessageFactory.CreateRejectionResponse(Message request, RejectionTypes type, String info, Exception ex)\n at Orleans.Runtime.Messaging.MessageCenter.RejectMessage(Message message, RejectionTypes rejectionType, Exception exc, String rejectInfo)\n at 
Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__OnAddressingFailure\|39_1(Message m, Exception ex)\n at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync\|39_0(Task addressMessageTask, Message m)\n at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)\n at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)\n at System.Threading.Tasks.TaskSchedulerAwaitTaskContinuation.<>c.<Run>b__2_0(Object state)\n at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)\n at Orleans.Runtime.Scheduler.WorkItemGroup.Execute()\n at System.Threading.ThreadPoolWorkQueue.Dispatch()\n at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()\n\"","MessageTemplate":"Creating {RejectionType} rejection with info '{Info}' at:\n{StackTrace}","Exception":{"Type":"Orleans.Runtime.OrleansMessageRejectionException","Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","StackTrace":" at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 
hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 732\n at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29\n at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 364\n at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync\|39_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 471\n at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain.GetApplication(Grant grant) in /app/src/Innbyggertjenester.Grains/Donatio/StatefulGrains/GrantProjectGrain.cs:line 59\n at Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain.GetApplicationView() in /app/src/Innbyggertjenester.Grains/Donatio/StatefulGrains/GrantProjectGrain.cs:line 296\n at Orleans.Runtime.TaskRequest`1.CompleteInvokeAsync(Task`1 resultTask)\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 132\n at OrleansDashboard.Implementation.GrainProfilerFilter.Invoke(IIncomingGrainCallContext context)\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 94\n at Innbyggertjenester.Silo.ExceptionConversionFilter.Invoke(IIncomingGrainCallContext context) in /app/src/Innbyggertjenester.Silo/ExceptionConversionFilter.cs:line 51\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 94\n at 
Orleans.Runtime.InsideRuntimeClient.Invoke(IGrainContext target, Message message) in /_/src/Orleans.Runtime/Core/InsideRuntimeClient.cs:line 263"},"RejectionType":"Unrecoverable","Info":" Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226\n at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 732\n at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29\n at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 364\n at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync\|39_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 471","StackTrace":" at Orleans.Runtime.MessageFactory.CreateRejectionResponse(Message request, RejectionTypes type, String info, Exception ex)\n at Orleans.Runtime.Messaging.MessageCenter.RejectMessage(Message message, RejectionTypes 
rejectionType, Exception exc, String rejectInfo)\n at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__OnAddressingFailure\|39_1(Message m, Exception ex)\n at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync\|39_0(Task addressMessageTask, Message m)\n at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)\n at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)\n at System.Threading.Tasks.TaskSchedulerAwaitTaskContinuation.<>c.<Run>b__2_0(Object state)\n at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)\n at Orleans.Runtime.Scheduler.WorkItemGroup.Execute()\n at System.Threading.ThreadPoolWorkQueue.Dispatch()\n at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()\n","ExceptionDetail":{"HResult":-2146233088,"Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","Source":"System.Private.CoreLib","TargetSite":"Void Throw()","Type":"Orleans.Runtime.OrleansMessageRejectionException"},"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-k9tmk"} |
| | 2024-06-05 13:58:58.764 | {"Message":"Rejected message \"ReadOnly IsAlwaysInterleave Request [S10.244.8.130:11111:76586541 grantproject/4a02897701b94d04acb30a37a0a4dccd]->[ grantapplication/5d9cbe2600254c048a31e683585864f9] Innbyggertjenester.GrainInterfaces.Donatio.IGrantApplicationGrain.Get(Innbyggertjenester.GrainInterfaces.Donatio.Models.Grant.Grant) #37288\" with reason 'null' (Unrecoverable)","MessageTemplate":"Rejected message {Message} with reason '{Reason}' ({RejectionType})","Exception":{"Type":"Orleans.Runtime.OrleansMessageRejectionException","Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","StackTrace":" at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 732\n at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29\n at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 364\n at 
Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync\|39_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 471\n at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain.GetApplication(Grant grant) in /app/src/Innbyggertjenester.Grains/Donatio/StatefulGrains/GrantProjectGrain.cs:line 59\n at Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain.GetApplicationView() in /app/src/Innbyggertjenester.Grains/Donatio/StatefulGrains/GrantProjectGrain.cs:line 296\n at Orleans.Runtime.TaskRequest`1.CompleteInvokeAsync(Task`1 resultTask)\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 132\n at OrleansDashboard.Implementation.GrainProfilerFilter.Invoke(IIncomingGrainCallContext context)\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 94\n at Innbyggertjenester.Silo.ExceptionConversionFilter.Invoke(IIncomingGrainCallContext context) in /app/src/Innbyggertjenester.Silo/ExceptionConversionFilter.cs:line 51\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 94\n at Orleans.Runtime.InsideRuntimeClient.Invoke(IGrainContext target, Message message) in /_/src/Orleans.Runtime/Core/InsideRuntimeClient.cs:line 263"},"_Message":"ReadOnly IsAlwaysInterleave Request [S10.244.8.130:11111:76586541 grantproject/4a02897701b94d04acb30a37a0a4dccd]->[ grantapplication/5d9cbe2600254c048a31e683585864f9] 
Innbyggertjenester.GrainInterfaces.Donatio.IGrantApplicationGrain.Get(Innbyggertjenester.GrainInterfaces.Donatio.Models.Grant.Grant) #37288","Reason":null,"RejectionType":"Unrecoverable","EventId":{"Id":101042,"Name":"Orleans.Messaging.Dispatcher.Rejected"},"ExceptionDetail":{"HResult":-2146233088,"Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","Source":"System.Private.CoreLib","TargetSite":"Void Throw()","Type":"Orleans.Runtime.OrleansMessageRejectionException"},"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-k9tmk"} |
| | 2024-06-05 13:58:58.764 | {"Message":"Failed to address message \"ReadOnly IsAlwaysInterleave Request [S10.244.8.130:11111:76586541 grantproject/4a02897701b94d04acb30a37a0a4dccd]->[ grantapplication/5d9cbe2600254c048a31e683585864f9] Innbyggertjenester.GrainInterfaces.Donatio.IGrantApplicationGrain.Get(Innbyggertjenester.GrainInterfaces.Donatio.Models.Grant.Grant) #37288\"","MessageTemplate":"Failed to address message {Message}","Exception":{"Type":"Orleans.Runtime.OrleansMessageRejectionException","Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","StackTrace":" at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 732\n at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29\n at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 364\n at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync\|39_0(Task addressMessageTask, Message m) in 
/_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 471\n at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain.GetApplication(Grant grant) in /app/src/Innbyggertjenester.Grains/Donatio/StatefulGrains/GrantProjectGrain.cs:line 59\n at Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain.GetApplicationView() in /app/src/Innbyggertjenester.Grains/Donatio/StatefulGrains/GrantProjectGrain.cs:line 296\n at Orleans.Runtime.TaskRequest`1.CompleteInvokeAsync(Task`1 resultTask)\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 132\n at OrleansDashboard.Implementation.GrainProfilerFilter.Invoke(IIncomingGrainCallContext context)\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 94\n at Innbyggertjenester.Silo.ExceptionConversionFilter.Invoke(IIncomingGrainCallContext context) in /app/src/Innbyggertjenester.Silo/ExceptionConversionFilter.cs:line 51\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 94\n at Orleans.Runtime.InsideRuntimeClient.Invoke(IGrainContext target, Message message) in /_/src/Orleans.Runtime/Core/InsideRuntimeClient.cs:line 263"},"_Message":"ReadOnly IsAlwaysInterleave Request [S10.244.8.130:11111:76586541 grantproject/4a02897701b94d04acb30a37a0a4dccd]->[ grantapplication/5d9cbe2600254c048a31e683585864f9] Innbyggertjenester.GrainInterfaces.Donatio.IGrantApplicationGrain.Get(Innbyggertjenester.GrainInterfaces.Donatio.Models.Grant.Grant) 
#37288","EventId":{"Id":100071,"Name":"Orleans.Messaging.Dispatcher.SelectTargetFailed"},"ExceptionDetail":{"HResult":-2146233088,"Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.7159ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","Source":"System.Private.CoreLib","TargetSite":"Void Throw()","Type":"Orleans.Runtime.OrleansMessageRejectionException"},"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-k9tmk"} |
| | 2024-06-05 13:58:58.763 | {"Message":"Exception during Grain method call of message \"ReadOnly IsAlwaysInterleave Request [S10.244.6.41:11111:76586547 sys.client/313d93c3d4af4ddeb4e59342800160c2]->[S10.244.8.130:11111:76586541 grantproject/d83b72eaf9e241548bfa48a8e49951f7] Innbyggertjenester.GrainInterfaces.Donatio.IGrantProjectGrain[(Innbyggertjenester.GrainInterfaces.Donatio.IGrantProjectGrain)Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain].GetApplicationView() #6896\": ","MessageTemplate":"Exception during Grain method call of message {Message}: ","Exception":{"Type":"Orleans.Runtime.OrleansMessageRejectionException","Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.3011ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","StackTrace":" at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 732\n at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29\n at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in 
/_/src/Orleans.Runtime/Placement/PlacementService.cs:line 364\n at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync\|39_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 471\n at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230\n at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)\n--- End of stack trace from previous location ---\n at Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain.GetApplication(Grant grant) in /app/src/Innbyggertjenester.Grains/Donatio/StatefulGrains/GrantProjectGrain.cs:line 59\n at Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain.GetApplicationView() in /app/src/Innbyggertjenester.Grains/Donatio/StatefulGrains/GrantProjectGrain.cs:line 296\n at Orleans.Runtime.TaskRequest`1.CompleteInvokeAsync(Task`1 resultTask)\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 132\n at OrleansDashboard.Implementation.GrainProfilerFilter.Invoke(IIncomingGrainCallContext context)\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 94\n at Innbyggertjenester.Silo.ExceptionConversionFilter.Invoke(IIncomingGrainCallContext context) in /app/src/Innbyggertjenester.Silo/ExceptionConversionFilter.cs:line 51\n at Orleans.Runtime.GrainMethodInvoker.Invoke() in /_/src/Orleans.Runtime/Core/GrainMethodInvoker.cs:line 94\n at Orleans.Runtime.InsideRuntimeClient.Invoke(IGrainContext target, Message message) in /_/src/Orleans.Runtime/Core/InsideRuntimeClient.cs:line 263"},"_Message":"ReadOnly IsAlwaysInterleave Request [S10.244.6.41:11111:76586547 sys.client/313d93c3d4af4ddeb4e59342800160c2]->[S10.244.8.130:11111:76586541 grantproject/d83b72eaf9e241548bfa48a8e49951f7] 
Innbyggertjenester.GrainInterfaces.Donatio.IGrantProjectGrain[(Innbyggertjenester.GrainInterfaces.Donatio.IGrantProjectGrain)Innbyggertjenester.Grains.Donatio.StatefulGrains.GrantProjectGrain].GetApplicationView() #6896","EventId":{"Id":100322},"ExceptionDetail":{"HResult":-2146233088,"Message":"Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.244.6.40:11111:76586541, will retry after 916.3011ms\n at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99\n at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync\|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226","Source":"System.Private.CoreLib","TargetSite":"Void Throw()","Type":"Orleans.Runtime.OrleansMessageRejectionException"},"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-k9tmk"} |
| | 2024-06-05 13:58:58.761 | {"Message":"Validating budget details","MessageTemplate":"Validating budget details","StackTrace":"GrantApplicationGrain.Get","GrainType":"GrantApplicationGrain","Method":"Get","GrainId":"65662b2454d0413d8a21548378595900","app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-5c949dd856-k9tmk"} |
```
Any news about this problem? Did you find a workaround that solves it?
Most of the silo freezes are related to reentrancy issues, but we have a couple of timeouts we can't explain. It seems the threads are being exhausted somehow.
We are using Kubernetes clustering, so it seems something is not caught in the membership table when a silo simply stops answering. By increasing the number of nodes and using probes it at least restarts fairly quickly, but the source of the problem has not been found. The symptoms are those of memory or processing starvation, but without any particularly high usage on any of the nodes.
The latency graph looks like: 700 ms, 760 ms, 800 ms, 1000 ms, then a 30 s timeout —
so it does not make much sense other than some type of deadlock.
Do you have CPU limits in Kubernetes? How many cores per node & per pod do you have? CPU limits can induce starvation. Which exact version of Orleans are you using?
We have tested with and without limits in our four deployment environments over the last couple of months, but it just had a hitch in production:
Orleans version: 8.1.0
Node allocation:
- 1900 mCPU
- 5.2 GB memory
- 111 GB storage
Node pool settings:
- Minimum node count 7
- Maximum node count 25
- Autoscaling on
It does not add any more nodes, though.
It seems memory might be starved, although it does not really show in the Kubernetes nodes overview in AKS. We are trying to limit the cache size in Orleans now, but when the grain count hits a couple of thousand, one silo freezes and the client times out.
Production is running 2/3 silos stably now, and the dashboard running on the silo shows no grain activity on the fetch request. So the client seems to be hitting the frozen silo; the clustering does not notice that the silo is frozen.
The solution for increasing memory in k8s is just adding more nodes, so there might be an issue with memory balancing between nodes or pods?
Setting the cache size for the silo seems to have helped with balancing. It has now spun up a couple more silos instead of the original 3. @ReubenBond
It seems you have to limit memory manually to avoid running out of pod memory space. In general, autoscaling does not trigger before the silos are down. We are experiencing huge grain latency while the k8s cluster's autoscaling is scaling down.
I ran some tests now and managed to get OOMKilled pods pretty quickly with 15,000 grains. They are still spread across only 3 silos, even with cache settings set much lower.
Any news about this problem? Did you find a workaround that solves it?
We have managed to optimize our way around the freezes. As mentioned above, it seems to be some kind of starvation, combined with Kubernetes not scaling before the silos crash. We reduced the number of cached grains in each silo to 2,500, which seems to avoid the memory-starvation issues. We also set the memory request in Kubernetes to a minimum of 2 GB, with a target of 4 GB, to make sure the cluster scales. We compensated by optimizing the database pooling to get a bit more performance when loading most of the grains at the same time (we use ADO.NET persistence with PostgreSQL).
We are now running 10-15,000 persisted grains in the test environment and it does not freeze any silos. The cluster is still a bit trigger-happy on downscaling, but it is way better.
1900 mCPU will likely cause starvation - the way that CPU limits work in k8s is that once the limit is reached, all threads are paused until the next scheduling quantum (each quantum is 100 ms). If your node has 8 cores and you have a 1900 mCPU limit, then you can exceed that limit in ~25 ms and spend the remaining 75 ms suspended. I recommend you do not set CPU limits on any request-serving workload. Here's a blog post discussing the matter: https://home.robusta.dev/blog/stop-using-cpu-limits. I also recommend a higher core count, say 4+ cores per pod, before scaling out significantly.
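As a sketch only, the advice above (CPU request but no CPU limit, memory bounded so the GC knows its budget) might look like this in a deployment spec. The numbers here are illustrative assumptions, not recommendations for this cluster:

```yaml
# Hypothetical pod resources fragment: no cpu limit to avoid CFS throttling;
# memory request and limit set so the .NET GC can adhere to them.
resources:
  requests:
    cpu: "4"          # assumption: 4+ cores per silo pod, as suggested above
    memory: "2Gi"
  limits:
    memory: "4Gi"     # memory limit only; deliberately no cpu limit
```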
It seems you manually have to limit the memory for it to not run out of pod memory space.
.NET respects the memory limits set in the container, so if you set limits on your deployment, the .NET GC will try to adhere to them.
I recommend you do not set CPU limits on any request-serving workload.
1900 mCPU is the node allocation size, not a limit; we hit these freezes without resource limits set.
So the conclusion is insufficient CPU power in the cluster? Seems plausible, but it is still weird, because monitoring shows the CPUs for the nodes in the pool never exceeding 30%. Memory usage is sometimes high, but never critical (we never went OOM without load-testing the cluster). Still, without limiting the number of grains in memory per silo, the silos freeze up. With cache limits and a minimum request (only), we were able to push the cluster to create more pods with silos under heavy load, which helped avoid the instant freezing. If we remove the cache size from the silo, the requests do not ramp all the way up to 30 s depending on the number of grains; they jump from a couple of seconds straight to a timeout.
I am aware that calling an admin grain that fetches from most other grains through a relatively long call chain defeats the purpose of Orleans and puts physical limits on how fast that data set can be fetched. I just think it is weird that the cluster starves without any AKS resource monitoring catching it. It also seemed like the membership calls to the silos were frozen, so the cluster did not recuperate.
The node pool is running on Standard_D2ds_v4 in our e2e env.
If no one else encounters this, there is no point in holding the issue open. And if you do:
- Ensure (max) resource limits for pods are not set; a resource request is recommended
- Lower the CacheSize of the silo if you have a lot of stateful grains: `siloBuilder.Configure<GrainDirectoryOptions>(options => { options.CacheSize = 2500; });`
- Increase cluster compute
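For readability, the cache-size workaround from the list above as a configuration snippet. Note that later comments in this thread advise leaving this value at its default:

```csharp
// Grain directory cache size, as used in the workaround above.
// Caution: maintainers later in this thread recommend keeping the default.
siloBuilder.Configure<GrainDirectoryOptions>(options =>
{
    options.CacheSize = 2500;
});
```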
If CPU limits are not set, then it shouldn't be the issue. Average CPU usage doesn't represent spikes well, so it can be a bit deceptive, but there is a good chance it isn't the issue. If you want to check if you are seeing CPU throttling, you can SSH into one of the pods and look at the contents of the cpu.stat file somewhere under /sys/fs/cgroup/cpu/. The nr_throttled & throttled_usec fields in that file have the relevant info.
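A small sketch of the check described above (the function name is mine; the cgroup v1 path is shown, while on cgroup v2 nodes the file is `/sys/fs/cgroup/cpu.stat` instead):

```shell
# Print the throttling counters from a cpu.stat-format file.
check_throttle() {
  awk '/^(nr_throttled|throttled_usec) /{ print $1 "=" $2 }' "$1"
}

# Inside a pod (cgroup v1), for example:
#   check_throttle /sys/fs/cgroup/cpu/cpu.stat
```

Non-zero `nr_throttled`/`throttled_usec` values would indicate CFS throttling is happening.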
Do these grains do anything while active, e.g. high-frequency grain timers or lots of logging? You could be seeing thread pool starvation (distinct from CPU starvation). Do you have log messages from LocalSiloHealthMonitor?
I am aware that calling a admin grain fetching from most other grains with a relatively long call chain is battling the purpose of orleans, and sets physical limits to the speed of fetching said data set.
Are you able to share more about the workload/calling pattern?
No timers; maybe a couple hundred reminders in total, but they're outside of these grains.
Are you able to share more about the workload/calling pattern?
In the request that freezes the silo, we are calling: Grain A -> (Grain B) -> (Grain C) -> (Grain D) -> (Grain C) -> (Grain D) -> (Grain B) -> (Grain C) -> (Grain D) -> (Grain C) -> (Grain D)
All grains are stateful with ADO.NET binary state. The data model fetches properties from B, C, and D to create a list of overview models to return to the API. For testing purposes, the data set was 1 Grain B with 5,000 Grain Cs, each with its own Grain D.
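A chain like that often serializes thousands of awaits. As a sketch only (the interface, method, and model names here are hypothetical, not from this codebase), the per-child calls could be fanned out with `Task.WhenAll` instead of being awaited one at a time:

```csharp
// Hypothetical sketch, not the actual grain code: fan out calls to the
// ~5000 Grain C children instead of awaiting them sequentially.
public async Task<List<OverviewModel>> GetOverviewAsync(IReadOnlyList<Guid> childIds)
{
    // Issue all calls up front; Orleans queues them per target grain.
    var tasks = childIds
        .Select(id => GrainFactory.GetGrain<IGrainC>(id).GetOverviewModelAsync())
        .ToList();

    var results = await Task.WhenAll(tasks);
    return results.ToList();
}
```

An unbounded fan-out can itself overload storage when every child loads state at once, so batching into chunks of a few hundred may be safer — which would fit the observation above that tuning database pooling helped.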
do you have log messages from LocalSiloHealthMonitor?
I can't find any of those in the current log stream. I can try to reproduce it again in the test env to get some logging.
It reoccurred:
First Probe that failed now.{"Message":"Did not get response for probe #28660 to silo \"S10.244.2.48:11111:90430940\" after 00:00:04.9991334. Current number of consecutive failed probes is 1","MessageTemplate":"Did not get response for probe #{Id} to silo {Silo} after {Elapsed}. Current number of consecutive failed probes is {FailedProbeCount}","Exception":{"Type":"System.OperationCanceledException","Message":"The ping attempt was cancelled after 00:00:04.9991112. Ping #28660","InnerException":{"Type":"TaskCanceledException","Message":"A task was canceled.","StackTrace":" at Orleans.Runtime.MembershipService.SiloHealthMonitor.ProbeDirectly(CancellationToken cancellation) in /_/src/Orleans.Runtime/MembershipService/SiloHealthMonitor.cs:line 255"}},"Id":28660,"Silo":"S10.244.2.48:11111:90430940","Elapsed":"00:00:04.9991334","FailedProbeCount":1,"EventId":{"Id":100613},"ExceptionDetail":{"Type":"System.OperationCanceledException","HResult":-2146233029,"Message":"The ping attempt was cancelled after 00:00:04.9991112. Ping #28660","Source":null,"InnerException":{"Type":"System.Threading.Tasks.TaskCanceledException","HResult":-2146233029,"Message":"A task was canceled.","Source":"System.Private.CoreLib","TargetSite":"Void ThrowForNonSuccess(System.Threading.Tasks.Task)","CancellationToken":"CancellationRequested: true","Task":{"Id":1,"Status":"Canceled","CreationOptions":"None"}},"CancellationToken":"CancellationRequested: false"},"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-6d4f76f6b4-pdkgb"}
No throttling applied:
nr_periods 0 nr_throttled 0 throttled_usec 0
From those messages, we can see that the silo was evicted - but we don't know what happened on the silo itself. Do you have logs from the evicted pod? You can get logs for a previous instance of a container via kubectl.
Do you still have CacheSize set to that low number? It should not be set that low - I'd leave it on the default value of 1M, at least for now
We were only logging to Grafana; I've turned on logging to the pods' console as well for now. Anything in particular I should look for in the debug logs from the time of the freeze?
As in the screenshot I posted above, there is a sudden spike in canceled and dropped requests. The Orleans dashboard is running on the silo host, so networking should not be an issue.
LocalSiloHealthMonitor is the main canary to look out for. It runs self-health checks and sounds the alarm, usually before catastrophe strikes. I strongly recommend not changing that CacheSize value - can you try setting it back to the default? The directory cache has a pretty severe issue when the limit is hit currently. We are working to get a replacement LRU merged into dotnet/extensions. In the meantime, if you cannot increase the cache size, you could try using a custom cache: https://gist.github.com/ReubenBond/c438867e9660407c0b71f5af2272aaf5
EDIT: more details in this thread: https://github.com/dotnet/orleans/issues/8736#issuecomment-2040705016
Explore-logs-2024-11-15 10_54_01.json — nothing from LocalSiloHealthMonitor, but I've attached the logs from SiloHealthMonitor. The failure seems to come without warning.
EDIT: also set the cache size back to default. But the issues started before reducing cache size.
With the debug level set to the lowest, it still does not show any signs of why it suddenly cuts off the silos.
This bug has now also occurred in production. @ReubenBond
I'm just throwing stuff out at this point: could it be related to reminders? It has about 500 reminders running. Still does not make much sense.
When testing, I am pretty sure reducing the grain cache size helped the scaling. The error occurred after setting it back to default. But it does not show any signs of exhaustion; it just times out.
2024-11-20T12:41:44.899537648Z stdout F [12:41:44 DBG] Going to DeclareDead silo S10.244.10.17:11111:90930690 in the table. About to write entry SiloAddress=S10.244.10.17:11111:90930690 SiloName=innbyggertjenester-silo-e2e-675999dcc6-4bctr Status=ShuttingDown.
2024-11-20T12:41:44.916347762Z stdout F [12:41:44 DBG] Successfully updated S10.244.10.17:11111:90930690 status to Dead in the membership table.
The direct probes are starting to fail but succeed via indirect probes. That does not tell me much, but it could be an angle:
{"Message":"Did not get response for probe #35862 to silo \"S10.244.10.17:11111:90930690\" after 00:00:04.9983328. Current number of consecutive failed probes is 2","MessageTemplate":"Did not get response for probe #{Id} to silo {Silo} after {Elapsed}. Current number of consecutive failed probes is {FailedProbeCount}","Exception":{"Type":"System.OperationCanceledException","Message":"The ping attempt was cancelled after 00:00:04.9983177. Ping #35862","InnerException":{"Type":"TaskCanceledException","Message":"A task was canceled.","StackTrace":" at Orleans.Runtime.MembershipService.SiloHealthMonitor.ProbeDirectly(CancellationToken cancellation) in /_/src/Orleans.Runtime/MembershipService/SiloHealthMonitor.cs:line 255"}},"Id":35862,"Silo":"S10.244.10.17:11111:90930690","Elapsed":"00:00:04.9983328","FailedProbeCount":2,"EventId":{"Id":100613},"ExceptionDetail":{"Type":"System.OperationCanceledException","HResult":-2146233029,"Message":"The ping attempt was cancelled after 00:00:04.9983177. Ping #35862","Source":null,"InnerException":{"Type":"System.Threading.Tasks.TaskCanceledException","HResult":-2146233029,"Message":"A task was canceled.","Source":"System.Private.CoreLib","TargetSite":"Void ThrowForNonSuccess(System.Threading.Tasks.Task)","CancellationToken":"CancellationRequested: true","Task":{"Id":22370,"Status":"Canceled","CreationOptions":"None"}},"CancellationToken":"CancellationRequested: false"},"app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-675999dcc6-zk24p"}
{"Message":"Indirect probe request #35863 to silo \"S10.244.10.17:11111:90930690\" via silo \"S10.244.9.8:11111:90930676\" succeeded after 00:00:00.0025095 with a direct probe response time of 00:00:00.0010152.","MessageTemplate":"Indirect probe request #{Id} to silo {SiloAddress} via silo {IntermediarySiloAddress} succeeded after {RoundTripTime} with a direct probe response time of {ProbeResponseTime}.","Id":35863,"SiloAddress":"S10.244.10.17:11111:90930690","IntermediarySiloAddress":"S10.244.9.8:11111:90930676","RoundTripTime":"00:00:00.0025095","ProbeResponseTime":"00:00:00.0010152","app":"Innbyggertjenester.Silo","ENV":"Production","APP_NAME":"innbyggertjenester-silo-e2e","POD_NAMESPACE":"e2e","POD_NAME":"innbyggertjenester-silo-e2e-675999dcc6-zk24p"}
Decreasing the cache size won't help scaling. Is your pod being evicted by k8s? 500 reminders is not much, so that should not be an issue. Setting logging to Debug can also cause issues if the log rate is faster than the console can keep up with. Capturing a memory dump from the evicted process is likely to help us quickly identify the exact cause.
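For reference, a dump of a live-but-hung silo can be captured from inside the container with the `dotnet-dump` global tool (assumes the `dotnet` CLI is available in the image; the PID and output path below are illustrative):

```shell
# One-time install of the diagnostics tool
dotnet tool install --global dotnet-dump

# List candidate processes, then capture a full dump of the silo while it is hung
dotnet-dump ps
dotnet-dump collect --process-id 1 --type Full --output /tmp/silo-hang.dmp
```

The dump can then be copied out of the pod (e.g. `kubectl cp`) and opened with `dotnet-dump analyze` or Visual Studio.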
The pod is not being evicted. The silo is killed, but the pods don't seem to be affected.
My original thought process was that a smaller cache limit per silo would affect scaling of silos under load, creating new silos (and thereby new pods) earlier, and it seemed like it did. I will try to find some dumps.
EDIT: latest response from client:
"message": "Exception was thrown by handler. OrleansException: Current directory at S10.244.10.19:11111:91111329 is not stable to perform the lookup for grainId administrativeunit/65c9378c-c687-409e-8b01-1d15b4aa24db (it maps to S10.244.6.20:11111:91111410, which is not a valid silo). Retry later.",
The client also logs "All gateways have previously been marked as dead". The pods are not killed and the silos are shown as active in the dashboard, but they are dead.
I can't seem to find any memory dumps from the processes, and the dotnet crash dump doesn't register anything either.
Are you running your containers with the crash dump environment variables set (make sure it's a full heap dump)? https://learn.microsoft.com/en-us/dotnet/core/diagnostics/collect-dumps-crash
You may need to mount a volume so you can persist the dump across container restarts
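As a sketch, the crash-dump environment variables from that doc can be wired into the pod spec roughly like this (container name, paths, and the PVC name are illustrative; a persistent volume is what lets the dump survive the container restart):

```yaml
containers:
  - name: silo
    env:
      - name: DOTNET_DbgEnableMiniDump
        value: "1"
      - name: DOTNET_DbgMiniDumpType
        value: "4"                      # 4 = Full dump (2 = Heap)
      - name: DOTNET_DbgMiniDumpName
        value: /dumps/coredump.%p.dmp   # %p expands to the process ID
    volumeMounts:
      - name: dumps
        mountPath: /dumps
volumes:
  - name: dumps
    persistentVolumeClaim:
      claimName: silo-dumps             # hypothetical PVC name
```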
I had it set to Heap dump; trying Full now. Currently no data.
The direct probes are starting to fail, but succeeds by indirect probes. Does not tell me much.. but could be an angle?:
Most likely, this indicates that the silo which initiated the indirect probe request was itself running very slowly, while the buddy silo which sent the subsequent probe and the target silo which received it were fine at that time. That is the most likely cause, and it's why indirect probes exist. It is also possible that the target silo was operating very slowly, e.g. a long GC was running when the first probe occurred but had completed before the indirect probe was made. It could also indicate a network issue causing a partial network partition. Still, the first option (the original accuser was slow) is the most likely reason.
I am currently out on paternity leave, so my response times may be spotty.
Do you collect any metrics for the processes, by the way? If so, what are the thread pool sizes & queue lengths, lock contention rates, exception rates, etc.?
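For context, those counters can be sampled without any code changes using the `dotnet-counters` tool against the silo process (the PID below is illustrative):

```shell
dotnet tool install --global dotnet-counters

# Thread pool thread count, thread pool queue length, lock contention rate,
# and exception rate are all part of the built-in System.Runtime counter set
dotnet-counters monitor --process-id 1 --counters System.Runtime
```

For longer-term visibility, the same data can be exported as OpenTelemetry metrics and scraped into whatever metrics backend the cluster already uses.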
Do you use Hangfire in your project?
Do you use Hangfire in your project?
We do not use Hangfire, no. At the moment I don't think anything in the silo application runs on a different thread pool than Orleans.
Do you collect any metrics for the processes by the way? If so, what are the thread pool sizes & queue lengths, lock contention rates, exception rates, etc?
Nope, but I guess our hand is being forced. I'll set it up as soon as I get the time. Thanks for taking the time to reply, really appreciated!