
ECHILD

Open nilsbecker opened this issue 7 years ago • 10 comments

hi, i was trying functory for parallelizing a program in which the workers may handle large-ish matrices. when the matrix size stays below a certain value, parallelization works. when they exceed it, i get

master: ** PID 65367 killed or stopped! **
master: ** PID 65368 killed or stopped! **
master: ** PID 65369 killed or stopped! **
Fatal error: exception Unix.Unix_error(Unix.ECHILD, "wait", "")

any idea what could go wrong? i tried the same with Parmap before and got a failure at the same size; i thought it had to do with memmapped files there but does functory even use those? another data point: this error (still with Parmap) happened only on os x, not on linux..

nilsbecker avatar Mar 22 '17 11:03 nilsbecker

ok, i finally have a "kind of" reproducible case. in the ocaml toploop, do

#use "topfind";;
#require "lacaml";;
#require "functory";;
Printexc.record_backtrace true;;
open Lacaml.D;;
Functory.Cores.set_number_of_cores 4;;
let a = [2;3;4;3;2;3;3];;
let f size i =
  let m = Mat.random size size in
  let () = getri m in
  let () = Printf.printf "%d" (Mat.dim1 m) in
  let () = Unix.sleep 1 in
  i + size;;
Functory.Cores.map ~f:(f 12) a;;

this works fine for me, also for much higher matrix sizes. however, when i interrupt the parallel computation with ctrl-c from the toplevel (where i expectedly get some exception backtrace), and then try repeatedly to call the last line, then i get erratic behavior: sometimes the parallel computation finishes, sometimes it doesn't.

i don't know if this is really the same as my original problem since there i do not interrupt the computation. but it might be that some exceptions raised internally and then handled confuse the parallel mapping? (just speculating)

nilsbecker avatar Mar 22 '17 13:03 nilsbecker

i now also tried a version without any lacaml, just Unix.sleep. this does not reproduce the behavior: after the first map following the user interrupt, no additional exceptions happen on subsequent tries.

nilsbecker avatar Mar 22 '17 13:03 nilsbecker

With Cores, master and workers communicate using regular files and OCaml's input/output_value functions.

So I see no reason why large values are an issue here.
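To illustrate the mechanism described above, here is a minimal sketch of passing a value through a regular file with output_value and input_value. It is not Functory's actual code: the helper names write_value and read_value, the temporary file prefix, and the 500x500 matrix are assumptions made for the example.

```ocaml
(* Hypothetical helpers, not taken from Functory: marshal a value to a
   regular file and read it back with the standard channel functions. *)
let write_value (path : string) (v : 'a) : unit =
  let oc = open_out_bin path in
  output_value oc v;
  close_out oc

let read_value (path : string) : 'a =
  let ic = open_in_bin path in
  let v = input_value ic in
  close_in ic;
  v

let () =
  (* A largish matrix: 500 x 500 floats. *)
  let big = Array.make_matrix 500 500 1.0 in
  let path = Filename.temp_file "functory_sketch" ".bin" in
  write_value path big;
  (* input_value is not type-safe, so the expected type is annotated. *)
  let (back : float array array) = read_value path in
  Sys.remove path;
  assert (back = big)
```

Nothing in this round trip depends on the size of the value beyond available disk space and memory, which is consistent with the expectation that large values should not be a problem by themselves.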

-- Jean-Christophe

backtracking avatar Mar 23 '17 08:03 backtracking

I could reproduce it, i.e., I indeed get messages such as

master: ** PID 20577 killed or stopped! **

These are workers from the previous computation (the one you stopped with Ctrl-C in the toplevel).

I agree this is annoying, and should be fixed. Yet I always get the correct answer for the next computation. I could not reproduce your non-termination issue.

I thought a simple fix would be to enable catch_break (with "Sys.catch_break true"), since Cores.compute catches any exception and then kills all pending tasks. So I tried that. But it makes no difference, surprisingly.
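For reference, the idea behind that attempt can be sketched as follows. This is only an illustration of Sys.catch_break combined with an exception handler, not the code in Cores.compute; with_break_cleanup and its cleanup callback are hypothetical names standing in for whatever kills the pending workers.

```ocaml
(* Sketch only: [cleanup] is a hypothetical callback, not part of
   Functory's API. *)
let with_break_cleanup ~(cleanup : unit -> unit) (f : unit -> 'a) : 'a =
  (* From here on, SIGINT (Ctrl-C) raises Sys.Break instead of
     terminating the program. Note that this is a global setting. *)
  Sys.catch_break true;
  match f () with
  | result -> result
  | exception e ->
    (* Runs for Sys.Break as well as for any other exception. *)
    cleanup ();
    raise e

let () =
  let cleaned = ref false in
  (try
     with_break_cleanup
       ~cleanup:(fun () -> cleaned := true)
       (fun () -> raise Sys.Break)
   with Sys.Break -> ());
  assert !cleaned
```

The usage at the end simulates the interrupt by raising Sys.Break directly, so it only shows that the handler runs; it does not reproduce the behavior seen with a real Ctrl-C in the toplevel.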

-- Jean-Christophe

backtracking avatar Mar 23 '17 11:03 backtracking

I could not reproduce your non-termination issue.

sorry, i did not mean it hangs. what i see is: when repeatedly trying Functory.Cores.map ~f:(f 12) a;;, sometimes the computation finishes with the correct result, sometimes an ECHILD exception is raised. this erratic behavior does not seem to go away even after repeatedly doing it ten times or more. this only happens if i once use control-c in the very beginning.
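for context, the exception itself is easy to provoke in isolation: Unix.wait raises ECHILD whenever the calling process has no child left to wait for. a minimal sketch, assuming the unix library is loaded (no_children_left is a made-up helper name, and this is not what functory does internally):

```ocaml
(* Needs the unix library (e.g. #require "unix";; in the toplevel).
   Hypothetical helper: returns true if this process has no children
   left to wait for. Note that if a child does exist, this call blocks
   until it terminates and reaps it. *)
let no_children_left () : bool =
  match Unix.wait () with
  | (_pid, _status) -> false   (* a child existed and was just reaped *)
  | exception Unix.Unix_error (Unix.ECHILD, _, _) -> true

let () =
  if no_children_left () then
    print_endline "no child processes to wait for"
```

so an ECHILD out of "wait" means the master expected more children than the OS still knew about at that point, for example because they had already died and been reaped.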

nilsbecker avatar Mar 23 '17 12:03 nilsbecker

I never got the ECHILD in my own experiments (but I was not using lacaml, only a simpler computation with a Unix.sleep 1 like you did).

-- Jean-Christophe

backtracking avatar Mar 23 '17 16:03 backtracking

yes indeed, with only Unix.sleep i, too, only get one set of exceptions after the first interrupt, not the much weirder reappearance of exceptions in later repetitions of the parallel call.

nilsbecker avatar Mar 23 '17 17:03 nilsbecker

i found this:

https://mail.scipy.org/pipermail/numpy-discussion/2012-August/063589.html

which may indicate that use of the Accelerate framework which provides optimized blas and lapack used by lacaml conflicts with the parallelization strategy in functory?

nilsbecker avatar Mar 24 '17 14:03 nilsbecker

ok, i recompiled lacaml with openblas instead of Accelerate. for me this seems to take care of the intermittent crashing of child processes. it really seems that forking + Accelerate on mac is a bad combination. so i think it's enough to fix anything that appears without lacaml, and to tell os x users to beware of Accelerate.

nilsbecker avatar Mar 28 '17 09:03 nilsbecker

I'm glad you found a solution (I had no time to investigate, unfortunately). And I don't have a mac anyway, so I would not have been able to reproduce the issue.

Jean-Christophe

backtracking avatar Mar 28 '17 11:03 backtracking