service-fabric-distributed-cache icon indicating copy to clipboard operation
service-fabric-distributed-cache copied to clipboard

TimeoutException on SetAsync()

Open rwallerms opened this issue 5 years ago • 9 comments

While running load tests against our service, we're noticing that call to SetAsync() will sometimes fail with a TimeoutException, with what appears to be a 5 minute timeout.

System.TimeoutException: This can happen if message is dropped when service is busy or its long running operation and taking more time than configured Operation Timeout.

Tried setting the CancellationToken to limit the duration of the call, but it looks like that is not used in this method.

https://github.com/SoCreate/service-fabric-distributed-cache/blob/bf2db079b615e38c13269443339a94fdb721738c/src/SoCreate.Extensions.Caching.ServiceFabric/ServiceFabricDistributedCache.cs#L80

Our load tests did not seem particularly intensive. They only pushed the machine to around 20% CPU.

rwallerms avatar Jun 12 '20 19:06 rwallerms

Looks like there's a new PR out that should help you narrow down the issue;

https://github.com/SoCreate/service-fabric-distributed-cache/pull/13

We moved from Set to SetAsync() just a couple days ago have now experienced service failures ("Partition is in Quorum Loss") and the app's exit code appears to have been swallowed (""System.Hosting' reported Error for property 'CodePackageActivation:Code:EntryPoint:132315329573570487'. There was an error during CodePackage activation.The process/container terminated with exit code:255.")

Did you find the source of the issue or are you still working?

Ryanman avatar Jun 25 '20 15:06 Ryanman

I was hoping someone from the SoCreate team would chime in with their opinion as to why this might happen and/or whether they can add support for using the CancellationToken to work around.

rwallerms avatar Jun 25 '20 15:06 rwallerms

I haven't gotten the patch yet either, I'll mention I recently was able to watch SetAsync() run through. When performing multiple set operations in a row, Async was hundreds/thousands of times slower than non-async set.

Perhaps there are speed benefits when you queue up these cache sets? I don't have any more information besides the fact I could sit there and watch how slow it was to set individual cache buckets asynchronously.

Ryanman avatar Jun 26 '20 21:06 Ryanman

Have you run this against the latest version of the of the library? How many items are you adding to the cache and how large are the items? Do you have a load test example I could run to duplicate the issue?

lurock avatar Jul 02 '20 19:07 lurock

We’ve been running on 2.0.1. I will upgrade to the latest and try tonight. We start seeing this when running around 100 RPS or higher against a 5 machine cluster (VMs are 4 core, 14GB RAM). We have set HasPersistedState="false". We do a single Read & Write on each call. Typical Read/Write size is ~ 250 bytes.

The load tests I have are specific to our service, so I’m not sure how much value that would be for you. They are just built on VS 2019 Load Test framework so pretty easy to get going with. Happy to help with seeing how we could debug against our service if that’s what you’re thinking.

From: Jami Lurock [email protected] Sent: Thursday, July 2, 2020 12:08 PM To: SoCreate/service-fabric-distributed-cache [email protected] Cc: Ryan Waller [email protected]; Author [email protected] Subject: Re: [SoCreate/service-fabric-distributed-cache] TimeoutException on SetAsync() (#12)

Have you run this against the latest version of the of the library? How many items are you adding to the cache and how large are the items? Do you have a load test example I could run to duplicate the issue?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FSoCreate%2Fservice-fabric-distributed-cache%2Fissues%2F12%23issuecomment-653175467&data=02%7C01%7Crwaller%40microsoft.com%7Cb6bfe23e3549453366f608d81ebb4143%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C637293136895396822&sdata=BVuzsnaVqMBDY0kU8Mx7gSw%2FdBa4%2BMfQfyDeuB1AB%2B0%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAH2U7VRA4L5FV2G6LPAPBATRZTLJPANCNFSM4N4SPSDQ&data=02%7C01%7Crwaller%40microsoft.com%7Cb6bfe23e3549453366f608d81ebb4143%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C637293136895406818&sdata=OzYO2Tc%2B97wh3KNpgUUve8sLWDcps11uNZ5h%2FzeGDeg%3D&reserved=0.

rwallerms avatar Jul 02 '20 23:07 rwallerms

Please try version 2.1.2 that I have just pushed up. I found an threading issue causing some random exceptions. Let me know if this fixes your issues.

lurock avatar Jul 03 '20 02:07 lurock

Upgraded to 2.1.2 and ran a few cycles of tests over the weekend and we are still seeing some Timeout errors, but it doesn’t seem it’s as many as before (sorry, I don’t have all the old data now ). I also notice a lot of latency in the cache calls. These were probably there before but I had just been focusing on the ones that ran longer than 5 minutes and gave an error.

Here's a couple graphs of the latencies we saw on one of the runs. This run was a constant 1000 user load for 7 hours, which resulted in about 30K cache Reads and 30K Writes. Vertical axis is Count, Horizontal is Call Duration in MS, bucketed into 1 second intervals. This just shows the calls that took < 50 seconds. We only had 14 calls in this run that took > 300 seconds (5 minutes).

Cache Read

Note that there is a large amount of calls at 0 (< 1) seconds. This could be because in our ‘sessions’, for the first call in the session, the cache is always empty. Might suggest that if there is no data to transfer, the call is really fast. In fact, 80% of those 14K calls are < 1 ms, and another 14% < 10 ms, the rest < 100 ms, meaning that the delays we see after could be related to data transfer. This is just a guess. We’re not logging the data size so I can’t definitively correlate that.

[cid:[email protected]]

Cache Write

[cid:[email protected]]

From: Ryan Waller Sent: Thursday, July 2, 2020 7:11 PM To: SoCreate/service-fabric-distributed-cache [email protected]; SoCreate/service-fabric-distributed-cache [email protected] Cc: Author [email protected] Subject: RE: [SoCreate/service-fabric-distributed-cache] TimeoutException on SetAsync() (#12)

Will do.

From: Jami Lurock <[email protected]mailto:[email protected]> Sent: Thursday, July 2, 2020 7:10 PM To: SoCreate/service-fabric-distributed-cache <[email protected]mailto:[email protected]> Cc: Ryan Waller <[email protected]mailto:[email protected]>; Author <[email protected]mailto:[email protected]> Subject: Re: [SoCreate/service-fabric-distributed-cache] TimeoutException on SetAsync() (#12)

Please try version 2.1.2 that I have just pushed up. I found an threading issue causing some random exceptions. Let me know if this fixes your issues.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FSoCreate%2Fservice-fabric-distributed-cache%2Fissues%2F12%23issuecomment-653296087&data=02%7C01%7Crwaller%40microsoft.com%7Ccc6bd6a5bac34cea22ae08d81ef6305a%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C637293390007868298&sdata=a4kAnvFbFhRvunyRG4nUiZh%2FfR%2BVgAM9ttsdR593oFg%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAH2U7VQR6JFVPNLNKRLFMMTRZU4XPANCNFSM4N4SPSDQ&data=02%7C01%7Crwaller%40microsoft.com%7Ccc6bd6a5bac34cea22ae08d81ef6305a%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C637293390007868298&sdata=qEm%2B2Nk4BimKGz%2FZpcKJ1rAJBcPJrPS8n5HLiVLi8Nw%3D&reserved=0.

rwallerms avatar Jul 06 '20 21:07 rwallerms

Hi @lurock

we are using SetAsync() heavly it's too slow using vesion 2.1.3 , we are cahcing small data like {"numericId":"300000000000021","expirationTime":"2020-08-24T08:11:31.4126188Z"}

do u have any idea?

mohammedjarwan avatar Aug 27 '20 06:08 mohammedjarwan