
suboptimal processing speed with mpi-master-slave

Open h836472 opened this issue 5 years ago • 3 comments

Dear Luca-S,

I am writing to ask for your advice on poor processing performance I recently encountered with mpi-master-slave. The task is processing hundreds of thousands of protein sequence files via MAFFT multiple sequence alignment. The script is executed on a supercomputer grid, requesting 50 compute nodes with 32 CPUs each, with one CPU allocated per MPI rank (total number of simultaneous tasks: 1600).

With this setup, only ~5000 files are processed in 60 minutes of runtime. Some component must be saturated, since decreasing the number of nodes to 20 or 10 gives pretty much the same throughput.

I guess that the observed poor performance might be a consequence of an inappropriate sleep time for the master process. (I just now realized that I had modified the sleep from 0.3 s to 0.03 s to achieve a more responsive system, but it is quite possible that the system just got worse with that change.)

Knowing the number of workers, what would be a reasonable sleep time for the master process? Depending on the number of proteins in the processed file, the worker process (MAFFT) runtime varies between ~0.4 seconds and several days.

With kind regards,

Balazs Balint

h836472 avatar Jun 26 '20 07:06 h836472

Hi Balazs Balint, I used this library for my project on a supercomputer as well, and it took some time to make sure the code properly took advantage of all the cores and that performance scaled linearly with the number of resources (I used ~2000 cores).

I guess that the observed poor performance might be a consequence of an inappropriate sleep time for the master process. (I just now realized that I had modified the sleep from 0.3 s to 0.03 s to achieve a more responsive system, but it is quite possible that the system just got worse with that change.)

The master sleep time can be a reason, but it is directly related to the average execution time of the slaves. If the master sleep time is 0.3 sec, then when a slave finishes its job it has to wait 0.15 sec on average for the master to finish sleeping, get the return value from the slave and give the slave's core more work. If the average slave execution time is 5 sec, then the average wasted resource time is 3% (0.15 / 5.0 * 100). If the average slave execution time is 50 sec, then the average wasted resource time is 0.3%, and so on.

But you cannot easily control the slave execution time; that depends mostly on your application and partially on your design. What you can do is set the sleep time to a very small number to achieve the same benefit. Actually, in your case, you can remove the sleep entirely and the master will use one core at 100%, constantly checking on the slaves. This completely avoids the wasted time. In my example I added the sleep for cases where only 4/8/12 cores are used; in those cases you don't want to waste a whole core on the master. That is not your case: one of your 1600 cores can be dedicated to the master at 100%.
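To make the arithmetic concrete, here is a tiny back-of-the-envelope helper (plain Python, not part of mpi-master-slave; it just encodes the "half of the sleep time" reasoning above):

```python
def wasted_fraction(master_sleep_s, avg_slave_runtime_s):
    """Approximate fraction of a slave's time lost waiting for the master.

    A slave that finishes at a random moment during the master's sleep
    waits half the sleep time on average, as in the reasoning above.
    """
    avg_wait = master_sleep_s / 2.0
    return avg_wait / avg_slave_runtime_s

print(wasted_fraction(0.3, 5.0) * 100)   # 3.0  -> the 3% figure above
print(wasted_fraction(0.3, 50.0) * 100)  # 0.3  -> the 0.3% figure above
print(wasted_fraction(0.03, 0.4) * 100)  # 3.75 -> even short ~0.4 s MAFFT jobs with a 0.03 s sleep lose only a few percent
```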

Does all of the above make sense to you? In your case, what do you believe the average slave execution time is?

Is your master code similar to the example shown in the README, or does it do additional computation? This is relevant because if the master spends 1 second on computation before sleeping in every loop iteration, then that 1 second also has to be counted when calculating the wasted time. Does this make sense to you?

In my case I had set my sleep time to 0.01, the master was identical to the example in the README (so no additional computation time for it) and the average slave execution time was ~10 sec for some slaves and ~3 minutes for others. In theory the master sleep time was appropriate for those slaves, but my slaves were still very slow, as in your case. That made me think the problem was in the slave code, not in the master, and that was indeed the case. My problem was that the fast slaves (~10 sec) were waiting for the results coming from the slow ones (~3 min), so the slaves were the bottleneck. After that I added the API that handles the resources (Example 4 and Example 5 in the README) to this project, and that fixed my performance issue.

Anyhow, this is case specific and I cannot generalise for you, but the takeaway is that you have to find out what your slaves are doing: are they busy? If not, what are they waiting for? You need to modify the design of the code so that the slaves have all the resources they need to perform their task when a core is given to them, so that they do not wait for something that is computed only later (see the sketch below). This is use-case-specific troubleshooting and you have to figure out what is preventing your slaves from making use of their CPU core.
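For illustration only, here is a plain-Python sketch of that scheduling principle (this is not the actual Example 4/Example 5 API from the README; the task names and the `deps`/`done` structure are made up). The idea is that the master only hands out a task once everything the task depends on already exists, so a slave never occupies a core waiting for another slave's output:

```python
# Hypothetical task graph: each task lists the task ids it depends on.
tasks = {
    "align_A":  {"deps": [],                     "done": False},
    "align_B":  {"deps": [],                     "done": False},
    "merge_AB": {"deps": ["align_A", "align_B"], "done": False},
}

def ready_tasks():
    """Tasks that are not done and whose dependencies are all complete."""
    return [name for name, t in tasks.items()
            if not t["done"] and all(tasks[d]["done"] for d in t["deps"])]

while not all(t["done"] for t in tasks.values()):
    for name in ready_tasks():
        # In a real application this is where the task would be handed to an
        # idle slave; here we just "run" it immediately to keep the sketch
        # self-contained.
        print("dispatching", name)
        tasks[name]["done"] = True
```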

Knowing the number of workers, what would be a reasonable sleep time for the master process? Depending on the number of proteins in the processed file, the worker process (MAFFT) runtime varies between ~0.4 seconds and several days.

As I said above, you can remove the sleep from the master code entirely. One core will be fully used by the master this way, but that's not a big deal when you have 1600 cores at your disposal. If you remove the sleep you can be sure the problem is not there, and you can investigate other areas if the bottleneck persists.
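If it helps to see the pattern in isolation, below is a generic mpi4py master/worker sketch of a sleep-free master (this is plain mpi4py, not the mpi-master-slave API; the tags, the task list and the `task * task` payload are placeholders). The blocking recv keeps the master core at 100%, but every worker is handed new work the instant it reports back:

```python
# Run with e.g.:  mpiexec -n 4 python no_sleep_master.py
from mpi4py import MPI

TAG_TASK, TAG_RESULT, TAG_STOP = 1, 2, 3

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    tasks = list(range(100))      # stand-in for the list of input files
    results = []
    status = MPI.Status()
    active = 0

    # Prime every worker with a first task (or tell it to stop right away).
    for worker in range(1, size):
        if tasks:
            comm.send(tasks.pop(), dest=worker, tag=TAG_TASK)
            active += 1
        else:
            comm.send(None, dest=worker, tag=TAG_STOP)

    # No sleep: the master reacts the instant any worker reports back.
    while active > 0:
        result = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT, status=status)
        results.append(result)
        worker = status.Get_source()
        active -= 1
        if tasks:
            comm.send(tasks.pop(), dest=worker, tag=TAG_TASK)
            active += 1
        else:
            comm.send(None, dest=worker, tag=TAG_STOP)

    print(f"master collected {len(results)} results")
else:
    status = MPI.Status()
    while True:
        task = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send(task * task, dest=0, tag=TAG_RESULT)  # stand-in for running MAFFT
```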

I hope this helps and let me know how it goes

luca-s avatar Jun 26 '20 11:06 luca-s

Dear Luca-s,

Thank you for your reply, and sorry for the delay in my answer. Indeed, we are using a large number of CPUs, so it is absolutely OK to dedicate one of them to the master, without any sleep. I tried that modification and the throughput seems quite a bit better that way. Luckily, the worker processes are completely independent, so they probably do not wait for each other.

The task runtime itself is very tricky. We start with the smallest data chunks, meaning 0.3-0.5 s runtimes initially. Then, as the processed files grow, so do the runtimes. In the end, even a "several days per task" execution time is possible.

I wonder: do you think there is a number of workers that we should not exceed when using a single master process?

Thank you for your help,

Balazs

h836472 avatar Jul 02 '20 13:07 h836472

I wonder: do you think there is a number of workers that we should not exceed when using a single master process?

In my use case we used up to ~2000 slaves and the processing time decreased linearly with the number of slaves, so we didn't find a bottleneck in the master (which was only taking care of the slaves and did no processing itself, which left it free to scale).

luca-s avatar Jul 20 '20 11:07 luca-s