CommunicationInstance: System.IO.IOException: Remote message handler throws an exception.

Open TaviTruman opened this issue 5 years ago • 8 comments

The GraphEngine client crashes when attempting to connect to a Graph Engine server running inside a Service Fabric cluster.

image

  • I'm working with Microsoft support and the Service Fabric product team to resolve Load Balancer and Reverse Proxy configuration issues.

Here is the offending source code:

image

At line 28 in the source, we are just trying to connect to a TCP endpoint in the cluster; we can reach the GE service listening on the exposed SF listener. It looks like it can connect, but the custom IMessagePassingEndpoint seems to fall down when trying to send or receive data.

TaviTruman • Jul 07 '20 22:07

More data and information for research

System.IO.IOException
HResult=0x80131620
Message=Remote message handler throws an exception.
Source=Trinity.Core
StackTrace:
   at Trinity.Storage.RemoteStorage._error_check(TrinityErrorCode err)
   at Trinity.Storage.RemoteStorage._use_synclient(Func`2 func)
   at Trinity.Storage.RemoteStorage.SendMessage(Byte* message, Int32 size, TrinityResponse& response)
   at Trinity.Network.CommunicationModule.SendMessage(IMessagePassingEndpoint endpoint, Byte* buffer, Int32 size, TrinityResponse& response)
   at Trinity.Storage.MessagePassingExtensionMethods.SendMessage[T](IMessagePassingEndpoint storage, Byte* message, Int32 size, TrinityResponse& response)
   at Trinity.Client.TrinityClientModule.MessagePassingExtension.RegisterClient(IMessagePassingEndpoint storage, RegisterClientRequestWriter msg) in G:\IKW-GraphEngine\src\Modules\GraphEngine.Client\Trinity.Client\obj\Debug\netstandard2.0\GeneratedCode\Lib\Protocols.cs:line 279
   at Trinity.Client.ClientMemoryCloud.RegisterClient() in G:\IKW-GraphEngine\src\Modules\GraphEngine.Client\Trinity.Client\ClientSide\ClientMemoryCloud.cs:line 48
   at Trinity.Client.TrinityClient.RegisterClient() in G:\IKW-GraphEngine\src\Modules\GraphEngine.Client\Trinity.Client\ClientSide\TrinityClient.cs:line 93
   at Trinity.Client.TrinityClient.StartPolling() in G:\IKW-GraphEngine\src\Modules\GraphEngine.Client\Trinity.Client\ClientSide\TrinityClient.cs:line 81
   at Trinity.Network.CommunicationProtocolGroup._RaiseStartedEvent()
   at Trinity.Network.CommunicationInstance._RaiseStartedEvents()
   at Trinity.Network.CommunicationInstance.Start()

TaviTruman • Jul 09 '20 02:07

I'm uploading the trinity.log file

trinity-[07_08_2020_08_40_03_PM].log

TaviTruman • Jul 09 '20 17:07

So I am able to reproduce the problem on both my local SF Cluster and the Azure SF Cluster; it looks like we aren't getting a connection to the remote IClientRegistry (memory cloud). We do get a connection and can send data, but we are unable to receive data.
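One way to tell "connects and sends, but never receives" apart from an outright connection failure is a small round-trip probe. The sketch below is not part of Graph Engine; `TcpRoundTripProbe`, `ProbeResult`, and the payload are hypothetical names and placeholders. It uses only `System.Net.Sockets.TcpClient`, and it assumes the remote side replies to whatever bytes it is sent. A real Trinity endpoint expects its own wire format, which this sketch does not implement.

```csharp
using System;
using System.IO;
using System.Net.Sockets;

// Hypothetical result type: which stage of the round trip failed.
public enum ProbeResult { ConnectFailed, SendFailed, NoResponse, RoundTripOk }

public static class TcpRoundTripProbe
{
    // Connects to host:port, writes the payload, then waits up to timeoutMs
    // for at least one reply byte.
    public static ProbeResult Probe(string host, int port, byte[] payload, int timeoutMs)
    {
        using (var client = new TcpClient())
        {
            try
            {
                // Task.Wait returns false if the connect did not finish in time.
                if (!client.ConnectAsync(host, port).Wait(timeoutMs))
                    return ProbeResult.ConnectFailed;
            }
            catch (AggregateException)
            {
                // The connect task faulted (refused, unreachable, DNS failure, ...).
                return ProbeResult.ConnectFailed;
            }

            client.ReceiveTimeout = timeoutMs;
            NetworkStream stream = client.GetStream();

            try
            {
                stream.Write(payload, 0, payload.Length);
            }
            catch (IOException)
            {
                return ProbeResult.SendFailed;
            }

            try
            {
                var buffer = new byte[256];
                int read = stream.Read(buffer, 0, buffer.Length);
                // Zero bytes means the remote side closed without replying.
                return read > 0 ? ProbeResult.RoundTripOk : ProbeResult.NoResponse;
            }
            catch (IOException)
            {
                // A synchronous read that exceeds ReceiveTimeout surfaces as IOException.
                return ProbeResult.NoResponse;
            }
        }
    }
}
```

A `NoResponse` result with a successful connect and send matches the symptom described here: traffic gets in, but nothing comes back.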

TaviTruman • Jul 09 '20 20:07

With the SF Load Balancer configured to let TCP traffic flow through to the SF Cluster, traffic seems to flow into the cluster but can't flow out.

image

image

This is incoming traffic from the Azure LB - via Health Probe

image

image

TaviTruman • Jul 18 '20 00:07

Okay, I've traveled way down the GE rabbit hole now and have an open ticket with Microsoft Service Fabric Support. We are making progress, folks. Connecting a GE Client to a GE Server running in an Azure Service Fabric cluster is a matter of configuring the Azure LB at Layer 4 to let TCP traffic pass through; once that is done properly, the GE GraphEngineClient API set is able to partially connect to the GE Service instance running in the SF Cluster. I have come to understand and appreciate the brilliance of the Graph Engine networking stack: connecting to the server is a multi-step process, and the GE is dogged about keeping that connection in place.

Here's all you need to do to point the GE Client at your SF Cluster:

image

FYI: I've been documenting the ins-n-outs of developing with the GE in the SF and will publish them at my GitHub GraphEngine repository soon.

This is what I found most recently. Processing on the GE client side makes this call into the GE Server:

image

The GE Client sets up client response registration with the server and gets ready to start polling the server before each RPC call, as well as before lower-level Graph Engine network infrastructure RPCs into the server; the GE is truly type-safe and distributed across memorycloud instances, even in the SF Cluster. The call, however, fails on the GE Server side when running in an SF Cluster; otherwise, the stuff just works.

image

This is bad, and as a result a true or complete connection is never made; in the meantime the GE Client keeps retrying the polling, and of course that fails too.
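The retry policy GE itself uses is not shown in this thread. As a rough illustration of bounding such retries on the client side, here is a minimal sketch. `RegistrationRetry` and `RegisterWithRetry` are hypothetical names, and the `register` delegate stands in for whatever call performs the client registration. It catches `System.IO.IOException` because that is the exception type in the stack trace above. Retrying cannot fix the server-side failure; it only keeps the client from spinning forever.

```csharp
using System;
using System.IO;
using System.Threading;

public static class RegistrationRetry
{
    // Calls register() until it succeeds or maxAttempts is reached.
    // Returns the value from register() on success; the IOException from the
    // final attempt propagates to the caller.
    public static int RegisterWithRetry(Func<int> register, int maxAttempts, TimeSpan initialDelay)
    {
        if (register == null) throw new ArgumentNullException(nameof(register));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return register();
            }
            catch (IOException) when (attempt < maxAttempts)
            {
                // Wait, then double the delay before the next attempt.
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

With `maxAttempts` set to 1 the helper behaves like a plain call, which makes it easy to switch retries off while debugging the server side.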

I've got another remote debugging session scheduled with Mike Wong from MS SF Support; this guy is great! I think we are getting down to the root cause of this thing, and then a fix can be applied.

TaviTruman • Jul 24 '20 17:07

Okay, so when the GE Client connects (TrinityClient.Start()) to my GE Server outside of the SF Cluster, RegisterClientHandler is called before CheckInstanceCookie on the server side. This call sequence is very important because RegisterClientHandler is the only method that adds the client cookie to m_client_storages:

```csharp
public override void RegisterClientHandler(RegisterClientRequestReader request, RegisterClientResponseWriter response)
{
    if (m_memorycloud != null)
    {
        response.PartitionCount = m_memorycloud.PartitionCount;

        if (ClientRegistry is null)
            ClientRegistry = m_memorycloud as IClientRegistry;

        // New cookie: create and register a storage. Known cookie: keep the existing storage.
        var cstg = m_client_storages.AddOrUpdate(request.Cookie, _ =>
        {
            ClientIStorage stg = new ClientIStorage(m_memorycloud) {Pulse = DateTime.Now};

            if (ClientRegistry != null)
            {
                int new_id = ClientRegistry.RegisterClient(stg);
                stg.InstanceId = new_id;
            }

            return stg;
        }, (_, stg) => stg);
        response.InstanceId = cstg.InstanceId;
    }
}
```

When the GE server is running in the SF Cluster, RegisterClientHandler is not called first; instead, CheckInstanceCookie is called first:

```csharp
private ClientIStorage CheckInstanceCookie(int cookie, int instanceId)
{
    if (m_client_storages.TryGetValue(cookie, out var storage) && instanceId == storage.InstanceId) return storage;
    throw new ClientInstanceNotFoundException();
}
```

and m_client_storages is empty. So even though the GE Client has made the initial connection to the server, the order of method calls on the server side seems to be out of order; this sounds like something might be off in the lower-level code, such as the MessageDispatcher and DispatcherProc code.
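The ordering dependency can be reproduced in isolation. The following is a minimal, self-contained sketch, not the actual Trinity code: `ClientRegistrySketch` and `ClientStorageStub` are hypothetical stand-ins, and only the dictionary lookup logic mirrors the two methods quoted above. It shows why a cookie check that arrives before registration has to fail.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Stand-in for the exception type named in the thread.
public sealed class ClientInstanceNotFoundException : Exception { }

// Hypothetical stand-in for ClientIStorage; only the instance id matters here.
public sealed class ClientStorageStub
{
    public int InstanceId { get; set; }
}

public sealed class ClientRegistrySketch
{
    private readonly ConcurrentDictionary<int, ClientStorageStub> m_client_storages =
        new ConcurrentDictionary<int, ClientStorageStub>();
    private int m_next_id;

    // Mirrors the role of RegisterClientHandler: the only path that adds a cookie.
    // Note: under a race the factory may run more than once and skip an id.
    public int Register(int cookie)
    {
        ClientStorageStub stg = m_client_storages.GetOrAdd(
            cookie,
            _ => new ClientStorageStub { InstanceId = Interlocked.Increment(ref m_next_id) });
        return stg.InstanceId;
    }

    // Mirrors the role of CheckInstanceCookie: a lookup that never adds anything.
    public ClientStorageStub Check(int cookie, int instanceId)
    {
        if (m_client_storages.TryGetValue(cookie, out var storage) && instanceId == storage.InstanceId)
            return storage;
        throw new ClientInstanceNotFoundException();
    }
}
```

Calling `Check(cookie, id)` on a fresh `ClientRegistrySketch` throws `ClientInstanceNotFoundException`; calling `Register(cookie)` first and passing the returned id to `Check` succeeds. That is the same asymmetry as between the working setup and the SF Cluster setup described above.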

TaviTruman • Jul 25 '20 16:07

When this code is called from a GE client over Trinity Native (Default) Sockets and the GE Client is directed to connect to a GE Server running in an SF Cluster:

image

On the server side, this code is called via DispatchMessage:

image

TaviTruman • Aug 06 '20 02:08

After working with the SF team for a few weeks, there is a certain deficiency in the Graph Engine TCP/IP stack; I've been able to narrow it down to code in the TCP layer. I'll come back here with updates on our continued development and testing as we perfect the GE/SF integration, specifically external GE Client connections to an SF/GE Cluster.

TaviTruman • Sep 25 '20 04:09