smartdispatch
Cannot handle huge command.txt files
Attached is an example file from a worker that had a hard time...
Further details to come tomorrow... @mgermain
Congrats! You broke Linux :D After some exploration, I saw that the problem comes from Linux freaking out because the process has been waiting on the lock for too long. I'll see if there is anything I can do.
I think I have a better understanding of the issue now. When the file commands.txt with all the pending commands is huge, e.g. 2.5 GB worth of text, getting the next command to run is highly inefficient.
We currently need to read and rewrite the ~2.5 GB file each time we get the next command to run. This is because we consider the next command to run to be the first line of commands.txt. So we read the first line, read the rest of the file, and then write back all lines except the first one, ouch!
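To make the cost concrete, here is a rough sketch of that pattern (illustrative only, not smart-dispatch's actual code; the function name and default path are made up): popping the first line forces a read and rewrite of essentially the entire file.

```python
# Rough sketch of the current "pop the first command" pattern
# (illustrative only -- not smart-dispatch's actual code).
def pop_first_command(path="commands.txt"):
    with open(path, "r+") as f:
        command = f.readline()    # the next command to run
        remainder = f.read()      # the whole rest of the file (~2.5 GB here)
        f.seek(0)
        f.write(remainder)        # rewrite every remaining command
        f.truncate()              # drop the now-duplicated tail
    return command.rstrip("\n") or None
```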
We should take the last line of commands.txt as the next command to run and simply truncate the file.
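A minimal sketch of what I have in mind, again illustrative rather than actual smart-dispatch code (the function name, chunk size, and the assumption of newline-separated commands are mine): only the tail of the file is read, and the file is truncated in place, so the cost of popping a command no longer grows with the total file size.

```python
import os

# Minimal sketch of the proposed fix (illustrative only): pop the LAST line
# and truncate in place, so each pop touches only a few KB at the end of
# commands.txt instead of rewriting the whole file.
def pop_last_command(path="commands.txt", chunk_size=4096):
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        if pos == 0:
            return None                       # no pending commands left
        tail = b""
        # Read backwards in chunks until the tail holds a complete last line.
        while pos > 0 and b"\n" not in tail.rstrip(b"\n"):
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            tail = f.read(step) + tail
        body = tail.rstrip(b"\n")             # ignore trailing blank lines
        cut = body.rfind(b"\n") + 1           # 0 if this was the only line
        f.truncate(pos + cut)                 # remove the popped command
        return body[cut:].decode() or None
```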
@MarcCote What you say is true, but these are two distinct issues. @gauvinalexandre How big was your command.txt?
@MarcCote I have a vague memory that we had a proper reason to do so in the past. There is also the fact that it was NEVER intended to be used in this way. Finally, I think that having the unfinished_commands.txt and/or having a database might mitigate all this.
Maybe those are two distinct issues. Please @gauvinalexandre let us know :).
@mgermain How is having a lot of commands to "dispatch" not the intended purpose of smart-dispatch? You are right about the database solving the issue I mentioned.
@MarcCote Well, in the case of the 70 brains, SD is in a way used as a substitute for MPI to implement parallelism inside their own program. Maybe another way to say what I said in my previous post is that SD was never designed to launch an actual 50 million jobs at once :P If we can find a way to do it nicely, I'm not against it though.
Hey guys! Yes, I agree it's 2 different issues, but they are related in some way.
My issue was that my tasks had a very short process time, while the command.txt (big, but not gigabytes) was relatively long to write. So the workers were continuously asking for more tasks. Linux then thought they were stuck in a multiprocess concurrent lock because of the OS's default waiting-timeout parameter, so it killed workers to prevent it. In other words, workers were dying of boredom (thanks to @mgermain for figuring this out).
In Max's case, the file takes very long to write. So I guess the same thing will happen in the end: workers killed out of boredom while waiting for millions of tasks to be rewritten millions of times.
So on one side there's overhead from reading the command.txt file too frequently, while on the other there's overhead from writing too much. I think they're related because the ratio of task process time to the number of tasks should stay reasonable; otherwise we'll have problems.
The final message here is that we cannot just throw anything at it yet, until geniuses like @mgermain and @MarcCote find a solution. There's still some smart-dispatch tweaking to do. Thanks again guys!
From what I understand, speeding up the "picking a new command to execute" step would solve both issues. I suggest "taking the last line of commands.txt (a.k.a. pending) as the next command to run and simply truncating the file".