We should shuffle around the realization order to minimize peak memory usage
Consider a pipeline with three outputs f2, g2, h2, which call Funcs f1, g1, h1 respectively. Everything is compute_root. The realization order f1 f2 g1 g2 h1 h2 uses much less intermediate memory than the order f1 g1 h1 f2 g2 h2, because each intermediate's buffer can be freed before the next one is allocated; in the second order, f1, g1, and h1 are all live at once.
We should shuffle the order of realizations at each loop level in schedule_functions to minimize the number of overlapping buffer lifetimes. This could be done by identifying each loop level used in a compute_at, and then, for each one, computing a new realization order for that loop level. It would have to be done at the level of fused groups, not individual Funcs.
This seems like something that should be schedulable, instead of automatic?
Is there ever a situation where we'd choose to use more than the minimum memory?
It also affects locality, so there might be a trade-off here. Also, if the allocations are all dynamically sized, the peak usage (and thus the best order) depends on those sizes, so the compiler won't be able to infer it statically.
You can already sort of schedule it with compute_at(the_func_you_want_to_go_before, Var::outermost()).