pulsar-dotpulsar ConsumerFaultedException: Timeout while inspecting metadata; this may indicate a deadlock

Description

I am getting a ConsumerFaultedException when my application starts up and tries to create a consumer. The full message and stack trace can be seen in the attached screenshot. This happens when calling GetLastMessageIds on the consumer.

I have seen this on several occasions in production since we updated the DotPulsar package to 3.3.1. I cannot recall seeing it on 3.2.1 or earlier.

The application runs in a pod in K8s. When errors like this persist after a number of retries, I stop the application and the K8s deployment restarts it. After several restart attempts and CrashLoopBackOffs, a startup eventually gets past this exception and the application continues normally.

Reproduction Steps

I am not sure how this can be reproduced. I have not seen it in a local environment, only in K8s clusters in production and test environments. I suspect it is related to DotPulsar 3.3.1, but I cannot confirm this.

Expected behavior

Since I am not explicitly in control of any serializers under the hood of DotPulsar, I expect the package not to run into the reported deadlock, if that is indeed what is happening.

Actual behavior

A low-level exception with details about a potential deadlock that I cannot see myself being responsible for.

Regression?

Not sure, but I suspect it has been happening since version 3.3.1 of the DotPulsar package.

Known Workarounds

None that I am aware of.

Configuration

No response

Other information

No response

Jun 28 '24 15:06 htbmw

I am seeing this on DotPulsar 3.2.1 as well, so it is not specific to 3.3.1 as initially reported. This seems to be related to protobuf-net; more information can be found here:

https://stackoverflow.com/a/17096460

Can someone please check what can be done inside DotPulsar to make it thread safe?

Jul 02 '24 13:07 htbmw

Hi @htbmw

Could you please provide more information on your .NET configuration:

Which version of .NET is the code running on?
What OS and version, and what distro if applicable?
What is the architecture (x64, x86, ARM, ARM64)?
Do you know whether it is specific to that configuration?
If you're using Blazor, which web browser(s) do you see this issue in?

Jul 15 '24 07:07 entvex

Hi @entvex, thanks for your request for further details.

Which version of .NET is the code running on? .NET 8
What OS and version, and what distro if applicable? Dockerfile uses this base image to build the final runtime image: mcr.microsoft.com/dotnet/aspnet:8.0-jammy-amd64
What is the architecture (x64, x86, ARM, ARM64)? x64
Do you know whether it is specific to that configuration? Unfortunately I have no idea
If you're using Blazor, which web browser(s) do you see this issue in? Not using Blazor

Jul 15 '24 08:07 htbmw

Hi @htbmw

We have never seen this issue before but would like to help. As stated in the StackOverflow post, using 'PrepareSerializer' might bring about another issue. This seems to be an old issue, so I guess no solution is coming from protobuf-net. We could protect the 'ProtoBuf.Serializer.Serialize' call with a lock, but I think that will hurt performance.

If you can, could you build your own DotPulsar.dll after adding: static Serializer() => Serialize(new BaseCommand()); to 'DotPulsar.Internal.Serializer'? I hope this call will force protobuf-net to create everything it needs for serializing the base command up front, so that we don't see this issue. It's a long shot, but worth a try.
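The warm-up idea can be sketched roughly as follows. This is only an illustration and assumes the protobuf-net NuGet package is referenced. 'DemoCommand' and 'DemoSerializer' are hypothetical stand-ins for DotPulsar's internal 'BaseCommand' and 'DotPulsar.Internal.Serializer', not the library's actual code, and 'SerializeLocked' shows the lock alternative for comparison only.

```csharp
using System.IO;
using ProtoBuf; // protobuf-net NuGet package (assumed to be referenced)

// Hypothetical stand-in for DotPulsar's internal BaseCommand.
[ProtoContract]
public sealed class DemoCommand
{
    [ProtoMember(1)]
    public int Type { get; set; }
}

// Hypothetical stand-in for DotPulsar.Internal.Serializer.
public static class DemoSerializer
{
    private static readonly object Gate = new();

    // The runtime runs a static constructor at most once, before the class is
    // first used, and other threads wait until it has finished. Serializing one
    // instance here is intended to make protobuf-net build its serializer for
    // DemoCommand on a single thread, before any concurrent callers arrive.
    static DemoSerializer() => Serialize(new DemoCommand());

    public static byte[] Serialize(DemoCommand command)
    {
        using var stream = new MemoryStream();
        Serializer.Serialize(stream, command); // ProtoBuf.Serializer
        return stream.ToArray();
    }

    // Alternative: serialize under a lock. This should rule out concurrent
    // calls into protobuf-net, at the cost of making every call sequential.
    public static byte[] SerializeLocked(DemoCommand command)
    {
        lock (Gate)
        {
            return Serialize(command);
        }
    }
}
```

Whether the warm-up actually avoids the reported timeout is not confirmed; it only moves the first serialization to a point where a single thread is guaranteed to be running it.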

Jul 15 '24 09:07 blankensteiner

Hi @htbmw.

Can you please try https://www.nuget.org/packages/DotPulsar/3.3.2-rc.1 and see if it fixes the issue?

Jul 17 '24 07:07 entvex

Hi @entvex, thanks, I will give it a go and report back sometime this week.

Hi @blankensteiner, sorry for not replying sooner. I will give it a go if it is different from the fix that @entvex posted and asked me to test.

Appreciate everyone's help and suggestions so far!

Jul 17 '24 07:07 htbmw

Hi @htbmw It's the same fix :-)

Jul 17 '24 08:07 blankensteiner

@htbmw How did it go?

Sep 25 '24 08:09 entvex

Closing this

Oct 10 '24 08:10 blankensteiner