dace New and Improved MapFusion

A new and improved version of the map fusion transformation. The transformation is implemented in a class named MapFusionSerial, furthermore the MapFusionParallel transformation is added, that allows to fuse parallel maps together. The new transformation analyses the graph more carefully when it checks if and how it should perform the fusing. Special consideration was given about the correction of memlets. However, there is still some aspects that should be improved and allowed to handle.

The new transformation produces graphs that are slightly different from before, and certain (other) transformations can not handle the resulting SDFG. For that reason a compatibility flag strict_dataflow was introduced. However, by default this flag is disabled. The only place where it is activated is inside the auto optimization function.

Furthermore, the SDFGState._read_and_write_sets() function has been rewritten to handle the new SDFGs, because of some bugs. However, one bug has been kept because of other transformations that would fail otherwise. But it is a bug, tests were written to demonstrate this.

Collection of known issues in other transformation:

Aug 22 '24 13:08 philip-paul-mueller

Currently the auto opt uses strict dataflow

Sep 05 '24 13:09 philip-paul-mueller

I have some trouble understanding the difference between -Serial and -Parallel and why both are necessary?

Sep 25 '24 14:09 acalotoiu

Thanks for reviewing and the wall of text.

To give you some context. Initially I started with JaCe (JAX frontend for DaCe), I applied it to stencils from ICON4Py. However, for some of them DaCe's auto_optimizer, was unable to handle them, either the fusion was not performed, the resulting SDFG was invalid or the computation was wrong. I traced them down to MapFusion, at first I was trying to fix the original implementation, but I had trouble understanding the code at all, so I started to rewrite the transformation.

The main issues I found (not limited to ICON4Py) were:

The subsets (not the .subset member of the Memlet; I mean the concept of where we write to it and from where we read) of the new intermediate array were not computed correctly.
The transformation did not make a difference between .subset and .other_subset of a Memlet and in most cases just accessed .subset which might be wrong. In fact this is a general impression I had that a lot of code simply accesses .subset (which happens to be the right choice in most but not all cases) and does not care about the intrinsic direction of the Memlet.
The check if an intermediate can be removed or must be recreated afterwards was wrong. For this the whole SDFG has to be scanned, there is no way around it, but it was not done.
The .dynamic property of the Memelts where fully ignored.
As a side note, the check for WCR is on line 427
The code that propagates the change (removed intermediate) into the scope was wrong; again .subset was not handled correctly. (Although, I have to say that the current code should also be improved, but just a little.)

I want to point out that this PR adds a lot of tests for MapFusion (approximately 40% of the edits) and the previous version is not able to pass them; roughly 1/3 of them fails.

Regarding the description, I agree the doc string of the class is not that good, however, the code itself is in my view better documented than before, but I have updated the description of the transformation to give a better high level overview, which points to the functions that performs the tests.

I do not know OTFMapFusion and SubgraphFusion very well, however, I have seen that SubgraphFusion is much more general, for example, instead of reducing the intermediate it will move the intermediate data access inside the Map. The only capability I know SubgraphFusion has is that it is able to handle Maps that are parallel. This is a capability that my MapFusion currently lacks (it was originally included in the PR, but removed afterwards).

I think the best way to see my MapFusion is not as something new but just as a new iteration of what was already there, it just performs more analysis to handle more cases than before. This allows it to handle more cases. However, there are still some todo's that are open.

I have to admit that I have not performed any testing of the runtime, but I do not have the impression that it takes much more time than before. The reason is that MapFusion is, beside two exceptions, a very local operation. The first exception is, the check if an intermediate can be removed or not. However, this information is more or less static, so the transformation computes this set at the beginning and then caches it. The downside is that it is hard to tell if the cache should be renewed. However, the cache remains valid as long as no AccessNodes are added. I checked that the use in auto_optimizer is fine. Further, to avoid this I added the assume_always_shared flag. This tells the transformation that every intermediate is shared. Thus no scan is ever needed, however, it will lead to dead dataflow. The second exception is where we have to ensure that no cycles are created, however, this will only explore the dataflow graph locally (everything downstream). Furthermore, when I wrote the thing I tried to order the checks in such a way that the ones that are either cheap or very likely to fail come first.

Dec 17 '24 12:12 philip-paul-mueller

Here is a summary that compares the behaviour of the old MapFusion transformation with the new one. They are based on the unit tests. new_map_fusion_summary_of_changes.pdf

Jan 23 '25 06:01 philip-paul-mueller