
What about cross-server data transmission overhead?

Open hliangzhao opened this issue 3 years ago • 2 comments

Sorry to bother you again.

In my research area, each stage is scheduled onto some VM node. If its child stages are placed on different VM nodes, the cross-node data transmission overhead should be considered. Thus, minimizing the makespan can be divided into two subgoals: reducing the execution time and reducing the cross-node communication overhead.

But I found that Decima does not consider the transmission time of intermediate data between the parent and child stages of each job. Is this because the scheduling environment is Spark? Or is it because all the jobs run on the same "VM node"?

hliangzhao, Dec 17 '20 02:12

I agree that data locality is an important aspect to optimize. Our simulator didn't capture it explicitly because the particular workload we ran on Spark did not show much difference (all VMs are in a single datacenter, where the high network throughput makes the locality issue minimal).

However, I would say it shouldn't be hard to add the transmission time to the simulator. You can apply a multiplier to the task run time based on the nodes where the parent and child stages are placed.
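A minimal sketch of the multiplier idea, not code from the simulator: the `Stage` class, its fields, and the `cross_node_penalty` value are all hypothetical names and numbers chosen for illustration. It assumes each parent placed on a different node adds a fixed fraction to the task run time.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Hypothetical stage record; not a class from the Decima simulator."""
    name: str
    node: str                   # VM node the stage is placed on
    base_task_duration: float   # task run time without any transfer cost
    parents: list = field(default_factory=list)


def transfer_multiplier(stage, cross_node_penalty=0.2):
    """Return a run-time multiplier >= 1.0.

    Every parent placed on a different node than `stage` adds
    `cross_node_penalty` (an assumed, hand-picked value) to the multiplier.
    """
    remote_parents = sum(1 for p in stage.parents if p.node != stage.node)
    return 1.0 + cross_node_penalty * remote_parents


def effective_task_duration(stage, cross_node_penalty=0.2):
    """Scale the base task duration by the placement-dependent multiplier."""
    return stage.base_task_duration * transfer_multiplier(stage, cross_node_penalty)


# Usage: the same child stage placed locally vs. on another node.
parent = Stage("map", node="vm-0", base_task_duration=10.0)
local_child = Stage("reduce", node="vm-0", base_task_duration=4.0, parents=[parent])
remote_child = Stage("reduce", node="vm-1", base_task_duration=4.0, parents=[parent])
print(effective_task_duration(local_child))
print(effective_task_duration(remote_child))
```

A constant penalty per remote parent is the simplest choice. A more realistic variant might make the penalty depend on the size of the intermediate data or on the bandwidth between the two nodes.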

Also, for RL, you might still want to optimize directly for the end objective rather than dividing the goal into sub-goals and optimizing them individually. It might be difficult to hand-tune the balance between execution time and cross-node communication overhead.

Hope this helps!

hongzimao, Dec 20 '20 04:12

Thanks! This helps a lot!

hliangzhao, Dec 21 '20 08:12