onnx-mlir
Better mapping of zAIU ops
Our current mapping of ONNX to zAIU ops is as follows:
- Map all zAIU-eligible operations to the zAIU greedily, adding stick/unstick for inputs/outputs and eventually removing redundant stick/unstick or unstick/stick pairs.
- Migrate zAIU add/sub/mul/div ops back to the CPU if they appear in the pattern
stick -> zAIU (add/sub/mul/div) -> unstick
as this allows us to get rid of a stick/unstick pair. This is currently done only under the optional --enable-zhigh-to-onnx flag.
One issue with this approach is that we often map operations to the zAIU that are too small to benefit from the accelerator. This can be determined by a cost model that estimates, for each operation individually, whether it runs better on the CPU or on the accelerator.
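To make the idea of a per-operation cost model concrete, here is a minimal sketch. It is not the onnx-mlir cost model: the `OpInfo` type, the linear cost formulas, and all constants are hypothetical and chosen only to show how a fixed accelerator overhead makes small operations unprofitable.

```python
from dataclasses import dataclass


@dataclass
class OpInfo:
    name: str
    num_elements: int  # total number of elements the op processes


# Hypothetical constants for illustration only; these are not measured values.
CPU_COST_PER_ELEMENT = 1.0
ZAIU_COST_PER_ELEMENT = 0.1
ZAIU_FIXED_OVERHEAD = 5000.0  # stand-in for invocation and stick/unstick costs


def cpu_cost(op: OpInfo) -> float:
    """Estimated cost of running the op on the CPU (assumed linear in size)."""
    return CPU_COST_PER_ELEMENT * op.num_elements


def zaiu_cost(op: OpInfo) -> float:
    """Estimated cost on the accelerator: fixed overhead plus a per-element term."""
    return ZAIU_FIXED_OVERHEAD + ZAIU_COST_PER_ELEMENT * op.num_elements


def is_beneficial_on_zaiu(op: OpInfo) -> bool:
    """Decide for this op alone, ignoring its neighbors in the graph."""
    return zaiu_cost(op) < cpu_cost(op)
```

With these made-up constants, the accelerator only wins once an op is large enough to amortize the fixed overhead, which is the situation described above for small operations.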
While I agree that, in general, a graph partitioning algorithm might be the most flexible approach, I wonder if we can make our greedy algorithm do better with a simple modification:
- Map beneficial zAIU-eligible operations (as judged by the cost model) to the zAIU greedily, adding stick/unstick for inputs/outputs and eventually removing redundant stick/unstick or unstick/stick pairs.
- Migrate zAIU-eligible CPU ops to the accelerator regardless of cost model if they exhibit this pattern:
unstick -> CPU op -> stick
as it allows us to remove an unstick/stick pair.
- Migrate zAIU add/sub/mul/div ops back to the CPU if they exhibit this pattern:
stick -> zAIU (add/sub/mul/div) -> unstick
as this allows us to get rid of a stick/unstick pair.
Steps 2 and 3 could possibly be done together. Additionally, in Step 1, we should remove the unstick/stick pair in
unstick -> CPU op -> stick
for ops that can be performed on the CPU in the stickified format (I am thinking of all the data movement ops).
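The three proposed steps can be sketched on a simplified model. This is only an illustration under strong assumptions: the graph is a linear chain of ops, graph inputs and outputs live on the CPU, and the `Op` type, its `beneficial` field (standing in for the cost model verdict), and the helper names are all hypothetical rather than onnx-mlir code. Steps 2 and 3 run as separate sweeps here, and the stickified data movement case is not modeled.

```python
from dataclasses import dataclass

CPU, ZAIU = "cpu", "zaiu"
ELEMENTWISE = {"add", "sub", "mul", "div"}


@dataclass
class Op:
    kind: str
    eligible: bool    # can the op run on the accelerator at all?
    beneficial: bool  # hypothetical cost-model verdict for this op alone
    device: str = CPU


def device_at(ops, i):
    # Graph inputs/outputs are assumed to be CPU tensors, so neighbors
    # outside the chain count as CPU.
    return ops[i].device if 0 <= i < len(ops) else CPU


def place(ops):
    # Step 1: greedily map only the ops the cost model considers beneficial.
    for op in ops:
        op.device = ZAIU if op.eligible and op.beneficial else CPU
    # Step 2: unstick -> CPU op -> stick. Move an eligible CPU op to the
    # accelerator regardless of the cost model to drop the conversion pair.
    for i, op in enumerate(ops):
        if (op.device == CPU and op.eligible
                and device_at(ops, i - 1) == ZAIU
                and device_at(ops, i + 1) == ZAIU):
            op.device = ZAIU
    # Step 3: stick -> zAIU (add/sub/mul/div) -> unstick. Move an isolated
    # elementwise op back to the CPU to drop the conversion pair.
    for i, op in enumerate(ops):
        if (op.device == ZAIU and op.kind in ELEMENTWISE
                and device_at(ops, i - 1) == CPU
                and device_at(ops, i + 1) == CPU):
            op.device = CPU
    return ops


def count_conversions(ops):
    # Each CPU<->zAIU boundary along the chain needs one stick or unstick.
    devices = [CPU] + [op.device for op in ops] + [CPU]
    return sum(a != b for a, b in zip(devices, devices[1:]))
```

As a usage example, take the chain matmul, add, matmul, reshape, mul, where the add is too small to be beneficial and the reshape is not eligible. Step 1 places only the two matmuls and the mul on the accelerator, Step 2 pulls the add over, and Step 3 returns the trailing mul to the CPU, leaving a single stick and a single unstick.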