aws-greengrass-nucleus

deployment.DeploymentService/IotJobsHelper: deployment fails and gets stuck "in progress"

Open felixeriksson opened this issue 2 years ago • 7 comments

Describe the bug A deployment failed when rolling out to a core device, seemingly due to a lack of memory. This deployment got stuck at IN_PROGRESS, and subsequent attempts to roll out new deployments got stuck with the status QUEUED. I couldn't find any way to get the core device to respond to subsequent updates through Greengrass, so in the end I manually restarted the device (pulled the power plug), which resolved the problem. These are the logs from the Greengrass nucleus:

2022-04-12T08:49:32.191Z [INFO] (Thread-3) com.aws.greengrass.deployment.IotJobsHelper: Received new deployment notification. Requesting details. {ThingName=GreengrassVibscreenGWPrototype}
2022-04-12T08:49:32.320Z [INFO] (Thread-3) com.aws.greengrass.deployment.IotJobsHelper: Received Iot job description. {Status=QUEUED, queueAt=Tue Apr 12 09:49:30 BST 2022, ThingName=GreengrassVibscreenGWPrototype, JobId=ddebd2d6-1227-4>
2022-04-12T08:49:32.322Z [INFO] (Thread-3) com.aws.greengrass.deployment.IotJobsHelper: Added the job to the queue. {ThingName=GreengrassVibscreenGWPrototype, JobId=ddebd2d6-1227-4a55-a5e4-11635ca3cb53}
2022-04-12T08:49:41.851Z [INFO] (pool-2-thread-19) com.aws.greengrass.deployment.DeploymentService: Received deployment in the queue. {deploymentId=ddebd2d6-1227-4a55-a5e4-11635ca3cb53, DeploymentType=IOT_JOBS, serviceName=DeploymentSe>
2022-04-12T08:49:41.851Z [INFO] (pool-2-thread-19) com.aws.greengrass.deployment.DeploymentService: Received deployment document in queue. {document={"configurationValidationPolicy":{"timeout":60.0},"components":{"TransceiverInterface">
2022-04-12T08:49:42.103Z [INFO] (pool-2-thread-19) com.aws.greengrass.deployment.DeploymentStatusKeeper: Stored deployment status. {DeploymentId=ddebd2d6-1227-4a55-a5e4-11635ca3cb53, DeploymentStatus=IN_PROGRESS}
2022-04-12T08:49:42.103Z [INFO] (pool-2-thread-19) com.aws.greengrass.deployment.IotJobsHelper: Updating status of persisted deployment. {Status=IN_PROGRESS, StatusDetails={}, ThingName=GreengrassVibscreenGWPrototype, JobId=ddebd2d6-12>
2022-04-12T08:49:42.301Z [INFO] (Thread-3) com.aws.greengrass.deployment.IotJobsHelper: Job status update was accepted. {Status=IN_PROGRESS, ThingName=GreengrassVibscreenGWPrototype, JobId=ddebd2d6-1227-4a55-a5e4-11635ca3cb53}
2022-04-12T08:49:42.422Z [INFO] (pool-2-thread-19) com.aws.greengrass.deployment.DeploymentDirectoryManager: Create work directory for new deployment. {deploymentId=888ffd97-606a-4812-ae25-ca66a3eea609, link=/greengrass/v2/deployments/>
2022-04-12T08:49:42.423Z [INFO] (pool-2-thread-19) com.aws.greengrass.deployment.DeploymentDirectoryManager: Persist deployment metadata. {file=/greengrass/v2/deployments/888ffd97-606a-4812-ae25-ca66a3eea609/deployment_metadata.json, d>
2022-04-12T08:49:42.532Z [ERROR] (pool-2-thread-19) com.aws.greengrass.deployment.DeploymentService: service-errored. {serviceName=DeploymentService, currentState=RUNNING}
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.base/java.lang.Thread.start0(Native Method)
        at java.base/java.lang.Thread.start(Thread.java:798)
        at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
        at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1354)
        at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
        at com.aws.greengrass.deployment.DeploymentService.createNewDeployment(DeploymentService.java:499)
        at com.aws.greengrass.deployment.DeploymentService.startup(DeploymentService.java:246)
        at com.aws.greengrass.lifecyclemanager.Lifecycle.lambda$handleStateTransitionStartingToRunningAsync$9(Lifecycle.java:530)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

2022-04-12T08:49:42.536Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=RUNNING, newState=ERRORED}
2022-04-12T08:49:42.537Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=ERRORED, newState=STOPPING}
2022-04-12T08:49:42.539Z [ERROR] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-state-transition-error. {serviceName=DeploymentService, currentState=STOPPING}
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.base/java.lang.Thread.start0(Native Method)
        at java.base/java.lang.Thread.start(Thread.java:798)
        at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
        at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1354)
        at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
        at com.aws.greengrass.lifecyclemanager.Lifecycle.handleCurrentStateStopping(Lifecycle.java:570)
        at com.aws.greengrass.lifecyclemanager.Lifecycle.startStateTransition(Lifecycle.java:315)
        at com.aws.greengrass.lifecyclemanager.Lifecycle.lambda$initLifecycleThread$12(Lifecycle.java:707)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

2022-04-12T08:49:42.540Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-state-transition-retry. {serviceName=DeploymentService, currentState=STOPPING}
2022-04-12T08:49:42.750Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=STOPPING, newState=INSTALLED}
2022-04-12T08:49:42.750Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=INSTALLED, newState=STARTING}
2022-04-12T08:49:42.751Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=STARTING, newState=RUNNING}
2022-04-12T08:50:28.321Z [INFO] (Thread-8) software.amazon.awssdk.eventstreamrpc.RpcServer: Server connection closed code [socket is closed.]: [Id 11709, Class ServerConnection, Refs 1](2022-04-12T08:49:27.916471Z) - <null>. {}
2022-04-12T08:50:28.322Z [INFO] (Thread-8) software.amazon.awssdk.eventstreamrpc.RpcServer: Server connection closed code [socket is closed.]: [Id 11707, Class ServerConnection, Refs 1](2022-04-12T08:48:27.897124Z) - <null>. {}

After this, the logs continue with hundreds of lines of the same type as the last two.

To Reproduce I was running the Python publisher/subscriber example code in two components: https://docs.aws.amazon.com/greengrass/v2/developerguide/ipc-publish-subscribe.html#ipc-publish-subscribe-examples At first it worked fine. Then I changed a small thing in the subscriber's recipe.yaml (the name of the policy) and made a new deployment, and this happened. I've run the same code since, and it seems to work fine, so I don't think the code itself caused the memory problem indicated by the logs above.

Expected behavior If a deployment fails, the core device makes itself available for new deployments.

Actual behavior The core device got stuck with a failed deployment shown as "In progress", blocking new update attempts.

Environment

  • OS: Debian GNU/Linux 11 (bullseye)
  • JDK version: javac 11.0.12
  • Nucleus version: 2.5.5

Additional context We're planning to use Greengrass as the only "point of access" to these devices. If the devices stop responding to new deployments, they will be rendered inaccessible for remote software updates, and we'll need to instruct our users/customers to e.g. reset the devices. This is not desirable.

felixeriksson avatar Apr 12 '22 12:04 felixeriksson

Thank you for reporting this issue to us. We will try to reproduce the issue and see what can be done to keep the Nucleus in a consistent state.

How much memory is available in total on your device?

MikeDombo avatar Apr 12 '22 13:04 MikeDombo

I believe it has 1GB. grep MemTotal /proc/meminfo gives me MemTotal: 931732 kB. Thanks for looking into it!

felixeriksson avatar Apr 12 '22 13:04 felixeriksson

Hey, Felix. Could you also help check the process/thread limit? I was not able to reproduce it with a normal setup; it's kind of tricky to hit the limit at exactly the right moment.

But the issue seems solid. When the process/thread limit is hit (it could also be memory), the Nucleus can't terminate the deployment here without creating a new thread.
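Roughly, the failure mode looks like the sketch below. This is a simplified, hypothetical illustration, not the actual Nucleus code: the throwing ThreadFactory stands in for the kernel refusing to create another native thread, the job has already been reported IN_PROGRESS, and the path that would mark it FAILED also needs a thread, so the cloud-side status never changes.

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StuckDeploymentSketch {

    enum Status { QUEUED, IN_PROGRESS, FAILED }

    public static void main(String[] args) {
        // A ThreadFactory that always throws stands in for the kernel refusing to
        // create another native thread (the Thread.start0 failure in the stack trace).
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                0, 1, 0, TimeUnit.SECONDS, new SynchronousQueue<>(),
                runnable -> { throw new OutOfMemoryError("unable to create native thread"); });

        Status jobStatus = Status.QUEUED;
        try {
            // 1. The IoT job has already been reported IN_PROGRESS to the cloud.
            jobStatus = Status.IN_PROGRESS;
            // 2. The deployment work is handed to a thread pool, which needs a new thread.
            executor.execute(() -> System.out.println("processing deployment"));
        } catch (OutOfMemoryError e) {
            // 3. Marking the job FAILED would itself require more work (a status update,
            //    a lifecycle state transition) that in the real service runs on yet
            //    another thread, which cannot be created either, so Status.FAILED is
            //    never reached and the cloud never hears about the failure.
            System.out.println("could not run or roll back deployment: " + e.getMessage());
        }
        // The job therefore stays IN_PROGRESS in IoT Core, and later deployments queue behind it.
        System.out.println("final local status: " + jobStatus);
        executor.shutdown();
    }
}

Running the sketch prints the OutOfMemoryError message and then "final local status: IN_PROGRESS", which mirrors what the logs above show from the device's point of view.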

I'll bring this to the team for further discussion.

rightpeter avatar Apr 12 '22 20:04 rightpeter

Hi Peter! cat /proc/sys/kernel/threads-max prints 5202. I'm happy you're investigating it; stable updatability of the devices is important to us.
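For completeness, the limits that usually matter for "unable to create native thread" on Linux are total/available memory, the kernel-wide threads-max and pid_max caps, and the per-user process limit (ulimit -u). Below is a small, hypothetical Java helper (not part of Greengrass) that prints most of these in one place:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class NativeThreadLimits {
    public static void main(String[] args) throws IOException {
        // Same figure as `grep MemTotal /proc/meminfo`, plus what is actually available.
        for (String line : Files.readAllLines(Path.of("/proc/meminfo"))) {
            if (line.startsWith("MemTotal") || line.startsWith("MemAvailable")) {
                System.out.println(line);
            }
        }
        // System-wide caps that can make Thread.start() fail even when memory looks fine.
        System.out.println("threads-max: "
                + Files.readString(Path.of("/proc/sys/kernel/threads-max")).trim());
        System.out.println("pid_max:     "
                + Files.readString(Path.of("/proc/sys/kernel/pid_max")).trim());
        // Rough count of live threads in this JVM; each one is a native thread on Linux.
        System.out.println("JVM threads: " + Thread.activeCount());
    }
}

The per-user cap still has to be checked separately with ulimit -u for the user that runs the Nucleus, since that limit also counts threads.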

felixeriksson avatar Apr 13 '22 07:04 felixeriksson

Thanks again for bringing this to our attention. We will investigate a fix.

J-C-L avatar Apr 20 '22 21:04 J-C-L

@felixeriksson, thanks for writing up the issue. We've also struggled with GG as the only "point of access" to the device. In fact, AWS suggested that we install the Systems Manager Agent (SSM Agent) as a sort of 'cheap insurance' that we'd be able to SSH into these devices if we ever had to... which we have used many times. Incidentally, the SSM costs end up being more than GG (a pricing error IMHO), and we routinely see "Greengrass Offline / SSM Online" for the same device. To mitigate the cost issue, we've implemented a cron on/off window so that our SSM billable hours are limited each day. It's a pain, but it gives us a stable window to (usually) go in and diagnose a Greengrass problem. Your mileage may vary...

pmueller1 avatar May 05 '22 17:05 pmueller1

@pmueller1 I see, thanks for sharing your experience. It sounds wise to have a backup access route, although we'd want to avoid it in the long run for both cost and security reasons. We've previously used ngrok to enable SSH access without needing special network configuration where the device is. We'll probably continue doing so, or look at the Systems Manager Agent, for our production devices too, until we're confident it isn't necessary.

felixeriksson avatar May 07 '22 15:05 felixeriksson

We are also experiencing this serious problem - can AWS provide an update on the investigation, please?

cgddrd-gardin avatar Jun 27 '23 16:06 cgddrd-gardin

@cgddrd-gardin Please file a ticket with AWS Support for assistance with any issues. Ensure you are using the very latest version available (Nucleus 2.10.3) first.

Closing this issue, as it lacks the information needed to proceed toward a solution, and we've made many changes regarding deployment consistency since it was filed. Devices with limited available memory may not be able to run Greengrass.

MikeDombo avatar Jun 27 '23 16:06 MikeDombo

Amazing - nothing prompts a 'ticket closure' faster than someone saying they have the same issue :)..

pmueller1 avatar Jun 27 '23 16:06 pmueller1

If you have additional information, please make a new ticket with that information. Thank you.

MikeDombo avatar Jun 27 '23 16:06 MikeDombo