consider using task push/pull instead of numpy array chunking

Open mkolopanis opened this issue 4 years ago • 8 comments

Task push/pull allows one rank to be the "scheduler", sending tasks to the rest of the compute ranks. This could help balance sims a lot better. See an example.
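
For concreteness, here is the basic shape of the pattern: a minimal mpi4py sketch (not pyuvsim code), with plain integers standing in for the real tasks.

```python
# Minimal push/pull sketch with mpi4py (illustrative only, not pyuvsim
# code). Rank 0 acts as the scheduler, handing out one task per request;
# every other rank loops asking for work until it receives a sentinel.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    tasks = list(range(100))  # stand-in for the real task list
    n_active = comm.Get_size() - 1
    while n_active > 0:
        status = MPI.Status()
        comm.recv(source=MPI.ANY_SOURCE, tag=1, status=status)  # work request
        worker = status.Get_source()
        if tasks:
            comm.send(tasks.pop(), dest=worker, tag=2)
        else:
            comm.send(None, dest=worker, tag=2)  # sentinel: no work left
            n_active -= 1
else:
    while True:
        comm.send(None, dest=0, tag=1)  # ask the scheduler for a task
        task = comm.recv(source=0, tag=2)
        if task is None:
            break
        # ... compute the visibility contribution for this task ...
```

Run with, e.g., `mpiexec -n 4 python push_pull.py`.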

mkolopanis avatar Feb 25 '21 19:02 mkolopanis

@mkolopanis tested this. It does give better load balancing, but seems to be slower. It might still be the right thing for scaling. It requires better pickling of pyuvdata objects (in a branch now).

bhazelton avatar Mar 16 '21 15:03 bhazelton

I'm not convinced that this offers any real advantages over the current setup. I can't find a branch with this new code anywhere -- has it been uploaded to GitHub yet? From the examples I've seen so far, it just seems to replace local task generation (that is, initializing UVTask objects on each process) with initializing them on rank 0 and then sending them out (which means initializing, pickling, sending, and unpickling each one).
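
For contrast, the current setup looks roughly like this (a simplified sketch with hypothetical details, not pyuvsim's actual code):

```python
# Simplified sketch of the current numpy-chunking approach (illustrative
# only): each rank takes its own slice of the task index array and builds
# its UVTask objects locally, so nothing needs to be pickled or sent
# between ranks.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
Npus = comm.Get_size()

Ntasks = 100_000  # total number of tasks across all axes
my_inds = np.array_split(np.arange(Ntasks), Npus)[rank]

for ind in my_inds:
    # build the UVTask for this index locally, then compute it
    ...
```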

It only coincidentally helps load-balancing because it avoids running any tasks on rank 0, which we saw was running to near completion long before other task loops were started. There's still a delay, but it's just hidden because the remaining processes are as balanced as they always were.

aelanman avatar Mar 16 '21 18:03 aelanman

It has not been, it was a local branch I made to test this out.

The load-balancing is not coincidental. It ensures that as long as there are tasks left to compute, the PUs can grab more work; in this scenario a rank cannot finish its local_task_iter and just sit idle. I've seen some (small) evidence for this in smaller simulations, but as we mentioned in the meeting today, we don't have a great idea of how things scale up on very large machines. There may be a slight initial delay, but that's not the same thing as load-balancing. What I did see for a small reference simulation is that the hand-off was taking a while because every process finished faster than the sending rank could give out tasks. I could alleviate this slightly by chunking up the task list so that the compute nodes were not wasting as much time waiting to be assigned work (see the sketch below).
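
The chunked hand-off looks roughly like this (a sketch of the idea, not the actual branch code):

```python
# Sketch of a chunked push/pull (illustrative): the scheduler hands out
# a slice of tasks per request instead of one at a time, so workers
# spend less time blocked waiting on rank 0.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
CHUNK = 32  # tuning knob: bigger chunks mean fewer round trips

if rank == 0:
    tasks = list(range(100_000))  # stand-in task list
    n_active = comm.Get_size() - 1
    while n_active > 0:
        status = MPI.Status()
        comm.recv(source=MPI.ANY_SOURCE, tag=1, status=status)
        worker = status.Get_source()
        if tasks:
            comm.send(tasks[:CHUNK], dest=worker, tag=2)
            del tasks[:CHUNK]
        else:
            comm.send(None, dest=worker, tag=2)  # no work left
            n_active -= 1
else:
    while True:
        comm.send(None, dest=0, tag=1)  # request a chunk
        chunk = comm.recv(source=0, tag=2)
        if chunk is None:
            break
        for task in chunk:
            pass  # compute each task in the chunk
```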

mkolopanis avatar Mar 16 '21 19:03 mkolopanis

Since there is a fixed number of tasks, in theory it shouldn't be necessary to send them out dynamically. However, since tasks along some axes take longer than others, I can see how you'd end up with uneven per-task runtimes and hence lose balance. We should do more tests to see if this is an issue worth sacrificing a processor over. I only brought up the delay because I remember it being the original issue, and it looked like bad load balancing because it made one process finish first while the others were mostly fine. As I said before, we should investigate how much time processors spend idling due to the unpredictable imbalance I mentioned here vs. this still-unsolved initial delay bug.

Another option to make push/pull faster would be to send out the task ids, not UVTask objects. Sending an integer would certainly be faster than sending a serialized object, and each rank would still make its own UVTask instance.
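
On the worker side, that reconstruction could be as cheap as unraveling the flat id back into per-axis indices (a hypothetical sketch; the axis ordering here is assumed, not pyuvsim's actual layout):

```python
# Hypothetical sketch: rebuild a task's per-axis indices from a flat
# integer id, assuming tasks are ordered over (time, baseline, freq).
# Only the integer crosses the network; the UVTask is built locally.
import numpy as np

Ntimes, Nbls, Nfreqs = 10, 28, 64  # assumed simulation dimensions

def task_indices(task_id):
    # invert the flat ordering; every rank can do this independently
    return np.unravel_index(task_id, (Ntimes, Nbls, Nfreqs))

t_i, bl_i, f_i = task_indices(1234)
print(t_i, bl_i, f_i)  # per-axis indices for task 1234
```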

aelanman avatar Mar 16 '21 19:03 aelanman

Yeah, we definitely need more information for both smaller and very, very large numbers of PUs. I still have not figured out the delay thing myself. In my current simulation environment I have sacrificed rank 0 and use it to run a tqdm bar, and there is still a slight delay. I am beginning to wonder if it is some kind of initial calculation that gets cached, but then why does rank 0 not get that delay?
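
That setup is roughly the following (a sketch of the idea, assuming each worker reports back to rank 0 as it finishes a task):

```python
# Sketch of sacrificing rank 0 for a progress bar (illustrative): each
# worker sends a small message per completed task, and rank 0 just
# receives them and ticks a tqdm bar. Requires at least 2 ranks.
from mpi4py import MPI
from tqdm import tqdm

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
Ntasks = 1000  # stand-in for the real task count

if rank == 0:
    with tqdm(total=Ntasks) as pbar:
        for _ in range(Ntasks):
            comm.recv(source=MPI.ANY_SOURCE, tag=3)  # one message per task
            pbar.update(1)
else:
    # round-robin split of tasks over the worker ranks (1..size-1)
    for task in range(rank - 1, Ntasks, comm.Get_size() - 1):
        # ... compute the task ...
        comm.send(None, dest=0, tag=3)  # report completion
```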

Hm, an interesting idea too! Lots to play with if we want to go down this road.

mkolopanis avatar Mar 16 '21 19:03 mkolopanis

I've been having trouble recreating it in an MWE. I thought I had it, but more recently the delay has not appeared. I do still get that if I run time.sleep() after the loop on one rank, all ranks pause for that time. So that's weird.
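
That pause would be expected if any collective call (a Barrier, Gather, etc.) sits after the loop: every rank blocks there until the sleeping rank arrives. A minimal sketch, assuming a Barrier that may or may not match the actual code:

```python
# MWE sketch of the observed pause (assumed shape, not the real code):
# if a collective such as Barrier follows the loop, sleeping on one
# rank stalls every rank at that collective.
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

for _ in range(10):
    pass  # stand-in for the per-rank task loop

if rank == 1:
    time.sleep(5)  # delay only rank 1 after its loop finishes

comm.Barrier()  # all other ranks wait here for the full 5 seconds
if rank == 0:
    print("all ranks past the barrier")
```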

aelanman avatar Mar 16 '21 22:03 aelanman

I've actually added a branch for this now here if you're interested in checking it out.

mkolopanis avatar Mar 17 '21 20:03 mkolopanis

We should run something big to see if there are noticeable differences in performance with this branch.

jpober avatar Jul 06 '21 15:07 jpober