memcached-session-manager

Operation timeout

Open nyx93 opened this issue 8 years ago • 11 comments

We're getting a lot of irregular timeouts from the session manager, with no apparent underlying issue. This causes unexpected behavior and looks like a session interruption.

The session manager is used with AWS ElastiCache with this configuration:

<Manager className="de.javakaffee.web.msm.MemcachedBackupSessionManager"
         memcachedNodes="traqpath-test-201.umnnui.cfg.use1.cache.amazonaws.com:11211"
         requestUriIgnorePattern=".*.(ico|jpg|css|js)$" />

The error is:

26-May-2017 20:01:21.789 WARNING [http-bio-443-exec-5027] de.javakaffee.web.msm.MemcachedSessionService.loadFromMemcached Could not load session with id E034B9EF69B457C8998CEDF1823F234C.Prod-TP-3 from memcached.
net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 1,000 ms.
Node status: Connection Status {
    /172.15.1.165:11211 active: true, authed: true, last read: 4,710 ms ago
    /172.15.1.26:11211 active: true, authed: true, last read: 53,833 ms ago
}
    at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1483)
    at de.javakaffee.web.msm.storage.MemcachedStorageClient.get(MemcachedStorageClient.java:62)
    at de.javakaffee.web.msm.MemcachedSessionService.loadFromMemcached(MemcachedSessionService.java:1142)
    at de.javakaffee.web.msm.MemcachedSessionService.findSession(MemcachedSessionService.java:597)
    at de.javakaffee.web.msm.MemcachedBackupSessionManager.findSession(MemcachedBackupSessionManager.java:221)
    at org.apache.catalina.connector.Request.doGetSession(Request.java:2906)
    at org.apache.catalina.connector.Request.getSession(Request.java:2298)
    at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:895)
    at com.etq.reliance.control.HttpProxy.service(HttpProxy.java:112)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:292)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:207)
    at com.etq.reliance.control.filter.SSOAuthenticationFilter.doFilter(SSOAuthenticationFilter.java:42)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:240)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:207)
    at com.etq.reliance.control.filter.gzip.GZIPFilter.doFilter(GZIPFilter.java:82)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:240)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:207)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:240)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:207)
    at com.etq.reliance.traqpath.control.TraqPathFilter.doFilter(TraqPathFilter.java:146)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:240)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:207)
    at com.etq.reliance.traqpath.control.UrlRewriteFilter.doFilter(UrlRewriteFilter.java:44)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:240)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:207)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:212)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106)
    at de.javakaffee.web.msm.RequestTrackingContextValve.invoke(RequestTrackingContextValve.java:99)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:614)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:141)
    at de.javakaffee.web.msm.RequestTrackingHostValve.invoke(RequestTrackingHostValve.java:125)
    at de.javakaffee.web.msm.RequestTrackingHostValve.invoke(RequestTrackingHostValve.java:157)
    at de.javakaffee.web.msm.RequestTrackingHostValve.invoke(RequestTrackingHostValve.java:125)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:79)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:616)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
    at org.apache.catalina.ha.session.JvmRouteBinderValve.invoke(JvmRouteBinderValve.java:192)
    at org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:318)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:522)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1095)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:672)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:279)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Unknown Source)
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node: /172.15.1.165:11211
    at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:167)
    at net.spy.memcached.internal.GetFuture.get(GetFuture.java:69)
    at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1472)
    ... 47 more

Any clue on why this seems to be happening a lot?

nyx93 avatar May 29 '17 09:05 nyx93

I've also run into this problem.

zane-rl avatar Jun 01 '17 21:06 zane-rl

any updates on this???

nyx93 avatar Jun 04 '17 06:06 nyx93

Have you tried increasing the operationTimeout according to the documentation?
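
For example, something like this (untested - 3000 ms is just a value to try, the rest is your existing config):

<Manager className="de.javakaffee.web.msm.MemcachedBackupSessionManager"
         memcachedNodes="traqpath-test-201.umnnui.cfg.use1.cache.amazonaws.com:11211"
         requestUriIgnorePattern=".*.(ico|jpg|css|js)$"
         operationTimeout="3000" />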

magro avatar Jun 06 '17 22:06 magro

@magro I haven't, but isn't 1 second (the default) more than enough for something like this, considering that memcached and the instance I'm using are both on AWS, in the same network? Wouldn't increasing the operation timeout mean the time spent is more than what's acceptable for a cache?

nyx93 avatar Jun 07 '17 06:06 nyx93

Sorry for the long delay! The question is what's causing the timeout. Can you tell what you have checked already (if this is still an issue)?

magro avatar Aug 08 '17 20:08 magro

It is still an issue. Increasing the timeout made it less apparent, but it still happens. I can't tell the reason because there's no obvious scenario to reproduce it. My biggest suspect, though, is having too many requests at the same time (around 200 in 1 second), because that's the most likely scenario for it to happen, even though it doesn't happen every time.

nyx93 avatar Aug 09 '17 06:08 nyx93

Some questions to help with debugging...

Have you checked pauses caused by garbage collection (running Tomcat with verbose GC and checking the GC logs when timeouts happen)?

Is the issue related to the size of the session - does it happen similarly with big and small sessions?

Can you scale ElastiCache, i.e. add more nodes - does this change anything?

Do you have performance metrics for ElastiCache (like max response time)?

Can you reproduce operation timeouts with a benchmark that just uses the memcached client and performs reads/writes (with increasing requests/second and different "session" sizes)?
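
Something along these lines, maybe (an untested sketch that talks to spymemcached directly, so Tomcat and msm are out of the picture; the hostname is the one from your config, and the request count / payload size are just placeholders to vary):

import net.spy.memcached.MemcachedClient;

import java.net.InetSocketAddress;
import java.util.Random;

public class MemcachedBench {
    public static void main(String[] args) throws Exception {
        // Point this at the ElastiCache node under test.
        MemcachedClient client = new MemcachedClient(
                new InetSocketAddress("traqpath-test-201.umnnui.cfg.use1.cache.amazonaws.com", 11211));

        int requests = 10_000;      // raise this to increase the requests/second pressure
        int valueSize = 100 * 1024; // vary this to simulate small vs. big sessions

        byte[] payload = new byte[valueSize];
        new Random().nextBytes(payload);

        long maxGetMillis = 0;
        for (int i = 0; i < requests; i++) {
            String key = "bench-" + i;
            client.set(key, 3600, payload).get(); // wait for the write to complete

            long start = System.nanoTime();
            client.get(key); // this is the call that times out in msm
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            maxGetMillis = Math.max(maxGetMillis, elapsedMillis);
        }

        System.out.println("max get() latency: " + maxGetMillis + " ms");
        client.shutdown();
    }
}

If the standalone client shows the same >1s spikes under load, that would point at the network/ElastiCache side rather than at msm.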

magro avatar Aug 09 '17 15:08 magro

Already checked the GC, not related - sometimes there was a GC running and other times not. It happens when the session size is relatively big. Already tried scaling the nodes, and no change, it did not help. I'll try to reproduce with a benchmark, but as I already said it does not always happen with the same number of requests / session size; a lot of times things go smoothly without the timeout happening at all.

nyx93 avatar Aug 09 '17 15:08 nyx93

Looking forward to the results of the benchmark, perhaps this sheds some light on the issue. Of course it might also be caused by noisy neighbors; that would only show up in a long-running benchmark. The EC2 instance type is then also a variable that would be worth testing. Does ElastiCache provide any SLAs, btw?

magro avatar Aug 09 '17 16:08 magro

AWS doesn't provide any SLA for ElastiCache. The issue is most probably noisy neighbors (not sure how much I can improve on that, though). Will keep you updated on the benchmark.

nyx93 avatar Aug 10 '17 07:08 nyx93

I think noisy neighbors become less likely for bigger instances.

magro avatar Aug 10 '17 08:08 magro