[autoparallel] add pooling metainfo
What’s New?
In this PR, I implement the metainfo generator for pooling operations, including AdaptiveAvgPool and MaxPool. While aligning the estimated memory cost with the measured one, I also found an interesting behavior: `_split` in comm_spec.py is actually triggered twice when a sharding spec like S01 is involved. It splits the tensor along the two device-mesh dimensions one after the other, producing an intermediate piece of memory that can be confusing when you measure memory at runtime.
For example, suppose you have an input of shape [4, 128, 64, 64] with dtype=float32, which takes 8192KB of memory, and you want to shard it on a device mesh of shape (2, 2) with the sharding spec RS01RR. To perform the split, shape consistency will first call `_split` along one mesh dimension, producing a tensor of shape [4, 64, 64, 64]. This consumes an extra 4096KB, because splitting the tensor along dimension 1 yields a non-contiguous tensor that must be materialized as a copy. The second `_split` then produces a tensor of shape [4, 32, 64, 64] to satisfy the spec, allocating another 2048KB, after which the intermediate 4096KB tensor is discarded. As a result, you observe a memory peak of 4096KB while the actual memory allocated is only 2048KB. This was not discovered in previous op patches because their outputs are much larger than their inputs: since we measure the peak and allocated memory over the whole forward phase, the output dwarfs the peak produced by `_split`, which hid this tricky little case.
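To make the numbers above concrete, here is a small back-of-the-envelope sketch in plain Python (no framework code; the shapes and mesh are the ones from the example, and the helper name `size_kb` is made up for illustration):

```python
from math import prod

FLOAT32_BYTES = 4

def size_kb(shape):
    """Memory footprint of a contiguous float32 tensor, in KB."""
    return prod(shape) * FLOAT32_BYTES // 1024

# Input sharded as RS01RR on a (2, 2) device mesh:
# tensor dimension 1 is split by both mesh axes, one _split call per axis.
full = [4, 128, 64, 64]          # 8192KB before splitting
after_first = [4, 64, 64, 64]    # split by mesh axis 0 -> extra copy, 4096KB
after_second = [4, 32, 64, 64]   # split by mesh axis 1 -> final shard, 2048KB

assert size_kb(full) == 8192
assert size_kb(after_first) == 4096   # transient intermediate
assert size_kb(after_second) == 2048  # memory that stays allocated

# The 4096KB intermediate is freed once the second split finishes,
# so runtime profiling sees a 4096KB peak but only 2048KB allocated.
```

This only reproduces the arithmetic of the two-step split; the actual copy happens because slicing along a non-leading dimension produces a non-contiguous view that `_split` materializes.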