
Dataproc jobs hang on single node Dataproc cluster in rocky-8 images

Open flacode opened this issue 3 years ago • 0 comments

Tests done with single node, image 2.0-rocky8 clusters:

  1. Dataproc cluster without init script: job is successful
  2. Dataproc cluster with init script and job that does not specify gpu resources: hangs
  3. Dataproc cluster with init script and job that specifies gpu resources: hangs

Exception in the nodemanager logs:

2022-05-11 18:53:26,852 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: ResourceHandlerChain.preStart() failed!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Failed to create cgroup at /sys/fs/cgroup/devices/yarn/container_e01_1652294182719_0002_01_000082
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.createCGroup(CGroupsHandlerImpl.java:466)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:109)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:511)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:481)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:491)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:102)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
2022-05-11 18:53:26,853 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Failed to launch container.
java.io.IOException: ResourceHandlerChain.preStart() failed!
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:553)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:481)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:491)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:102)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Failed to create cgroup at /sys/fs/cgroup/devices/yarn/container_e01_1652294182719_0002_01_000082
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.createCGroup(CGroupsHandlerImpl.java:466)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:109)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:511)
        ... 8 more
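
For triage, the path that YARN failed to create can be pulled out of a nodemanager log. The following is a minimal Python sketch that assumes the message format shown in the trace above; the function name `failed_cgroup_paths` is hypothetical and not part of Hadoop or Dataproc.

```python
import re

# Assumption: the failure message has the same wording as in the trace above.
_CGROUP_FAILURE = re.compile(r"Failed to create cgroup at (\S+)")


def failed_cgroup_paths(log_text):
    """Return the distinct cgroup paths that could not be created, in first-seen order."""
    seen = []
    for match in _CGROUP_FAILURE.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen
```

Each failure appears twice in the trace (once in the ERROR entry and once under "Caused by"), which is why the sketch de-duplicates the paths.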

On cluster creation, the directory /sys/fs/cgroup/devices/yarn/ exists on debian10 but does not exist on rocky-8. However, after the cluster has been restarted, the directory exists and the previously hanging jobs complete successfully. The dataproc-cgroup-device-permissions service reports success on both images.

debian10: $ systemctl status dataproc-cgroup-device-permissions

● dataproc-cgroup-device-permissions.service - Set permissions to allow YARN to access device directories
   Loaded: loaded (/etc/systemd/system/dataproc-cgroup-device-permissions.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2022-05-11 20:22:01 UTC; 32min ago
 Main PID: 10147 (code=exited, status=0/SUCCESS)

May 11 20:22:01 dataproc-gpu-debian10-m systemd[1]: Started Set permissions to allow YARN to access device directories.
May 11 20:22:01 dataproc-gpu-debian10-m systemd[1]: dataproc-cgroup-device-permissions.service: Succeeded.

rocky-8: $ systemctl status dataproc-cgroup-device-permissions

● dataproc-cgroup-device-permissions.service - Set permissions to allow YARN to access device directories
   Loaded: loaded (/etc/systemd/system/dataproc-cgroup-device-permissions.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2022-05-11 20:23:38 UTC; 33min ago
 Main PID: 149857 (code=exited, status=0/SUCCESS)

May 11 20:23:38 dataproc-gpu-rocky-m systemd[1]: Started Set permissions to allow YARN to access device directories.
May 11 20:23:38 dataproc-gpu-rocky-m systemd[1]: dataproc-cgroup-device-permissions.service: Succeeded.
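
Because the service status is identical on both images, checking the directory itself is more telling. Here is a minimal Python sketch of such a check, to be run on the node right after cluster creation and again after a restart. The function name `cgroup_dir_state` is hypothetical, and the sketch assumes it runs as the same user as the nodemanager, since `os.access` evaluates permissions for the calling user.

```python
import os


def cgroup_dir_state(path="/sys/fs/cgroup/devices/yarn"):
    """Report whether the directory exists and whether the caller can create entries in it."""
    exists = os.path.isdir(path)
    # Creating a child cgroup directory needs write and search permission on the parent.
    writable = exists and os.access(path, os.W_OK | os.X_OK)
    return {"path": path, "exists": exists, "writable": writable}
```

Calling `cgroup_dir_state()` with no arguments checks the path from the exception. Going by the report above, a freshly created rocky-8 node would be expected to show `exists` as `False`.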

flacode • May 12 '22 15:05