
Forked task failed to remove serialized task context

Open winghc opened this issue 9 years ago • 3 comments

Here is my environment:

Hosts: 3. Nodes: two NodeStarter processes on each host.

Whenever I run the workflow, I get the "Forked task failed to remove serialized task context" error.

After I changed the configuration to one NodeStarter on each host, the whole workflow succeeded.

Please advise. Thanks.

[2016-02-25 14:12:43,905 INFO o.o.p.s.u.TaskLogger] task 1365t7 (Task172) started on m-dn02(node: SSH-dn02-1)
[2016-02-25 14:12:44,686 ERROR o.o.p.s.u.TaskLogger] task 1365t7 (Task172) error
org.ow2.proactive.scheduler.task.exceptions.ForkedJvmProcessException: Failed to execute task in a forked JVM
    at org.ow2.proactive.scheduler.task.executors.ForkedTaskExecutor.createTaskResult(ForkedTaskExecutor.java:164)
    at org.ow2.proactive.scheduler.task.executors.ForkedTaskExecutor.execute(ForkedTaskExecutor.java:133)
    at org.ow2.proactive.scheduler.task.TaskLauncher.doTask(TaskLauncher.java:172)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.objectweb.proactive.core.mop.MethodCall.execute(MethodCall.java:353)
    at org.objectweb.proactive.core.body.request.RequestImpl.serveInternal(RequestImpl.java:214)
    at org.objectweb.proactive.core.body.request.RequestImpl.serve(RequestImpl.java:160)
    at org.objectweb.proactive.core.body.BodyImpl$ActiveLocalBodyStrategy.serveInternal(BodyImpl.java:552)
    at org.objectweb.proactive.core.body.BodyImpl$ActiveLocalBodyStrategy.serve(BodyImpl.java:485)
    at org.objectweb.proactive.core.body.AbstractBody.serve(AbstractBody.java:426)
    at org.objectweb.proactive.Service.blockingServeOldest(Service.java:206)
    at org.objectweb.proactive.Service.blockingServeOldest(Service.java:181)
    at org.objectweb.proactive.Service.fifoServing(Service.java:146)
    at org.objectweb.proactive.core.body.ActiveBody$FIFORunActive.runActivity(ActiveBody.java:337)
    at org.objectweb.proactive.core.body.ActiveBody.run(ActiveBody.java:175)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Forked task failed to remove serialized task context, probably a permission issue on folder /tmp/PA_JVM210940313/SSH-dn02-1/1365/771415615
    ... 18 more
[2016-02-25 14:12:44,687 INFO o.o.p.s.u.TaskLogger] task 1365t7 (Task172) finished with errors

winghc, Feb 25 '16 09:02

Hello,

Can you provide more information?

  1. What results does your workflow return?
  2. What type of languages does your workflow use?
  3. How many tasks are inside the workflow and which task is failing?
  4. Which language executes the failing task?

First thoughts, given the message "Forked task failed to remove serialized task context, probably a permission issue on folder /tmp/PA_JVM210940313/SSH-dn02-1/1365/771415615": Are you running all the nodes (NodeStarter) as the same user? Are you running with RunAsMe? Could it be that the user the task runs as has no write access to the data created by the user running the NodeStarter, or vice versa?
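One way to check the permission hypothesis is to look at who owns the folder named in the error and whether the current user may delete entries in it. The following is only a minimal sketch, assuming a POSIX host with Python 3 available; `describe_dir` and `report_chain` are hypothetical helper names and are not part of ProActive.

```python
import os
import pwd
import stat


def describe_dir(path):
    """Report owner, mode, and whether the current user may delete entries in `path`.

    Deleting an entry needs write + execute permission on the directory that
    contains it. If that directory has the sticky bit set (as /tmp usually
    does), only the entry's owner, the directory's owner, or root may delete it.
    """
    info = os.stat(path)
    try:
        owner = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        # No passwd entry for this uid: fall back to the numeric id.
        owner = str(info.st_uid)
    return {
        "path": path,
        "owner": owner,
        "mode": stat.filemode(info.st_mode),
        "sticky": bool(info.st_mode & stat.S_ISVTX),
        "can_delete_entries": os.access(path, os.W_OK | os.X_OK),
    }


def report_chain(path):
    """Describe `path` and each existing parent directory, innermost first."""
    reports = []
    current = os.path.abspath(path)
    while True:
        if os.path.isdir(current):
            reports.append(describe_dir(current))
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return reports
```

Run as the user that starts the nodes, for example `for r in report_chain("/tmp/PA_JVM210940313/SSH-dn02-1/1365"): print(r)`. An owner that differs from the current user, or `can_delete_entries` being `False` at some level, would support the permission explanation. If everything is writable by the same user, the cause is probably elsewhere.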

tobwiens, Feb 25 '16 09:02

An issue with a similar error message was reported a few days ago (#2468), using the RunAsMe mode.

But the scenario you describe seems different, and it is very curious that the number of ProActive nodes you deploy changes the behavior.

If I understood correctly, you deployed your infrastructure using an SSHInfrastructure (or SSHInfrastructureV2). Or did you start the ProActive nodes manually on each machine with the command <scheduling_folder>/bin/proactive-node?

fviale, Feb 25 '16 10:02

  1. What results does your workflow return? -- The Hive job just submits an ETL task (a Java program) to the Hadoop cluster; the client does not need to do any further work.

  2. What type of languages does your workflow use? -- Native. A bash shell script starts the Java program.

  3. How many tasks are inside the workflow and which task is failing? -- One parent job and 10 sub-jobs running in parallel. Which task fails is completely random.

  4. Which language executes the failing task? -- Both use native tasks that call the Java program.

  5. Did you deploy your infrastructure using an SSHInfrastructure? -- Yes. Both NodeStarters are started as the same user. By the way, I did not select the RunAsMe mode.

winghc, Feb 25 '16 14:02