High CPU utilization

Open chandu-atina opened this issue 8 years ago • 5 comments

Observed Chronos reaching 100% CPU utilization on the master node of a three-node cluster. Once that node is restarted, the newly elected master node shows the same high CPU utilization.

I found nothing interesting in the logs.

chandu-atina avatar Mar 08 '17 03:03 chandu-atina

Which version of Chronos is it? This is possibly the same issue as #803.

brndnmtthws avatar Mar 09 '17 18:03 brndnmtthws

I found the following exception when the CPU utilization went high:

[2017-03-07 10:49:53,273] WARN Exception while serving request (org.apache.mesos.chronos.scheduler.api.JobManagementResource:213)
java.util.ConcurrentModificationException
    at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719)
    at java.util.LinkedHashMap$LinkedKeyIterator.next(LinkedHashMap.java:742)
    at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at scala.collection.Iterator$class.foreach(Iterator.scala:743)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1177)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
    at scala.collection.mutable.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:46)
    at scala.collection.SetLike$class.map(SetLike.scala:92)
    at scala.collection.mutable.AbstractSet.map(Set.scala:46)
    at org.apache.mesos.chronos.scheduler.api.JobManagementResource.list(JobManagementResource.scala:206)
    at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    at com.codahale.metrics.jersey.InstrumentedResourceMethodDispatchProvider$TimedRequestDispatcher.dispatch(InstrumentedResourceMethodDispatchProvider.java:30)

chandu-atina avatar Mar 10 '17 05:03 chandu-atina
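For context, here is a minimal, hypothetical Scala sketch (not the Chronos source) of the failure mode the trace points at: JobManagementResource.list maps over a Scala wrapper around a Java LinkedHashMap-backed key set, and if another thread mutates that map mid-iteration, the fail-fast iterator can throw ConcurrentModificationException.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}
import scala.collection.JavaConverters._

object CmeSketch {
  def main(args: Array[String]): Unit = {
    // Shared job map: written by one thread, read by another, with no guard.
    val jobs = new JLinkedHashMap[String, String]()
    (1 to 10000).foreach(i => jobs.put(s"job$i", "spec"))

    // Writer: keeps mutating the map, as the scheduler does when job state changes.
    val writer = new Thread(new Runnable {
      def run(): Unit = {
        var i = 0
        while (true) { jobs.put(s"new$i", "spec"); i += 1 }
      }
    })
    writer.setDaemon(true)
    writer.start()

    // Reader: maps over the wrapped key set, like the list endpoint does.
    // LinkedHashMap's iterator is fail-fast, so this usually dies with
    // java.util.ConcurrentModificationException after a few iterations.
    while (true) {
      jobs.keySet.asScala.map(_.length)
    }
  }
}
```

Running this for a few seconds usually ends with the same exception as in the log above.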

Yikes! Looks like we might be missing a lock somewhere.

brndnmtthws avatar Mar 10 '17 16:03 brndnmtthws
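To illustrate what a missing lock could mean here, a hedged sketch of two common guards for the listing path: hold the scheduler's lock while iterating, or copy the keys into a snapshot under the lock and do the mapping outside it. The names jobGraph and jobsLock are illustrative stand-ins, not the actual Chronos fields.

```scala
import java.util.concurrent.locks.ReentrantLock
import scala.collection.JavaConverters._

class GuardedJobListing(jobGraph: java.util.LinkedHashMap[String, String],
                        jobsLock: ReentrantLock) {

  // Option 1: hold the same lock the scheduler takes while mutating the graph.
  def listUnderLock(): List[String] = {
    jobsLock.lock()
    try jobGraph.keySet.asScala.toList
    finally jobsLock.unlock()
  }

  // Option 2: copy the keys under the lock, then map over the immutable snapshot.
  def listFromSnapshot(): List[Int] = {
    jobsLock.lock()
    val snapshot = try jobGraph.keySet.asScala.toList finally jobsLock.unlock()
    snapshot.map(_.length)
  }
}
```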

I have collected thread dumps and found the following threads consuming the most CPU. This may help pinpoint the issue:

pool-4-thread-1 - priority:5 - threadId:0x00007f0f2811a000 - nativeId:0xb6 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
    at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method)
    at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.liftedTree1$1(JobScheduler.scala:542)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.mainLoop(JobScheduler.scala:540)
    - locked <0x00000000d9332640> (a java.util.concurrent.locks.ReentrantLock)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anon$1.run(JobScheduler.scala:516)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
    - <0x00000000d9309ce0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

pool-4-thread-1 - priority:5 - threadId:0x00007f0f2811a000 - nativeId:0xb6 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
    at org.apache.mesos.state.AbstractState.__fetch(Native Method)
    at org.apache.mesos.state.AbstractState.access$2400(AbstractState.java:34)
    at org.apache.mesos.state.AbstractState$FetchFuture.<init>(AbstractState.java:229)
    at org.apache.mesos.state.AbstractState.fetch(AbstractState.java:42)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.liftedTree1$1(JobScheduler.scala:542)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.mainLoop(JobScheduler.scala:540)
    - locked <0x00000000d9332640> (a java.util.concurrent.locks.ReentrantLock)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anon$1.run(JobScheduler.scala:516)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
    - <0x00000000d9309ce0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

chandu-atina avatar Jul 27 '17 11:07 chandu-atina
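Reading the two samples together: the scheduler's main loop holds a ReentrantLock and blocks on one native fetch per persisted job. A simplified sketch of that shape, using a stand-in trait rather than the real org.apache.mesos.state.State API:

```scala
import java.util.concurrent.Future
import java.util.concurrent.locks.ReentrantLock

// Stand-in for the Mesos state abstraction; the real API differs in detail.
trait KeyValueState {
  def fetch(name: String): Future[Array[Byte]]
}

// With many jobs, the thread spends most of its time inside the native fetch,
// and the lock stays held for the whole reload, so REST calls that touch the
// same state race against it.
class SchedulerLoopSketch(state: KeyValueState, jobNames: Seq[String]) {
  private val lock = new ReentrantLock()

  def reloadJobsOnElection(): Seq[Array[Byte]] = {
    lock.lock()
    try {
      jobNames.map(name => state.fetch(name).get()) // one blocking round trip per job
    } finally {
      lock.unlock()
    }
  }
}
```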

@brndnmtthws Did you get a chance to look at this? Below is the analysis report from multiple thread dumps:

http://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTcvMDcvMjcvLS1kdW1wLnppcC0tMTItNDUtMTQ7Oy0tZHVtcC56aXAtLTEyLTQ1LTE0OzstLWR1bXAuemlwLS0xMi00NS0xNDs7LS1kdW1wLnppcC0tMTItNDUtMTQ7Oy0tZHVtcC56aXAtLTEyLTQ1LTE0

The __fetch() native method is causing the high CPU utilization.

This happens when we randomly restart nodes in the cluster, i.e. when leadership changes from one node to another.

chandu-atina avatar Jul 28 '17 06:07 chandu-atina
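One speculative mitigation, assuming the state backend tolerates several outstanding fetches at once (an assumption, not a confirmed fix): issue all fetches up front and collect the futures afterwards, instead of one blocking round trip per job. A sketch with the same stand-in trait as above:

```scala
import java.util.concurrent.Future

// Stand-in for the Mesos state abstraction, as in the earlier sketch.
trait KeyValueState {
  def fetch(name: String): Future[Array[Byte]]
}

class PipelinedReload(state: KeyValueState, jobNames: Seq[String]) {
  def reload(): Seq[Array[Byte]] = {
    // Start every fetch first, then block on the results.
    val pending = jobNames.map(name => name -> state.fetch(name))
    pending.map { case (_, future) => future.get() }
  }
}
```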