
rfcs: proposal for a reusable kernel knob

Open Simonsays095 opened this issue 1 year ago • 18 comments

Link to the rendered document

Simonsays095 avatar Oct 25 '23 22:10 Simonsays095

I've updated the RFC to hopefully clarify some of the questions raised so far and to add some more context to the motivation and goal of the RFC.

Simonsays095 avatar Oct 30 '23 21:10 Simonsays095

Some facts from PyTorch side.

  1. For eager mode, we can always get the specific input shapes and static attribute values. However, any of the inputs to the kernel, including the sizes of activations and weights, and even attributes like strides and padding, could change between calls.
  2. For graph mode, we can get the specific input shapes as example inputs in most cases. Sizes of weights and attributes are static. We can know whether sizes of activations are dynamic or static.

I saw the debates on "Option 1: Using DNNL_RUNTIME_DIM_VAL" and "Option 4: Primitive attribute". Regardless of the option chosen, a good API in my mind is one that 1) allows frameworks to pass all the known information; 2) doesn't expose implementation details; 3) offers finer-grained, primitive-level control rather than only a global setting. For PyTorch, we know whether things are dynamic or static in both eager mode and graph mode. We can also provide example shapes when things are dynamic in most cases. This is the information we can pass to oneDNN. (If needed, we can even pass the range of the shapes in graph mode, but that is more advanced and can be discussed separately.) On the other hand, we don't want to have to know whether a kernel is "reusable" or not; those are implementation details better hidden inside oneDNN.

jgong5 avatar Nov 09 '23 08:11 jgong5
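
For context on "Option 1": below is a minimal sketch (an editorial illustration, not part of the RFC) of how runtime dimensions are expressed with DNNL_RUNTIME_DIM_VAL in the current oneDNN C++ API. The primitive is created once with a placeholder dimension, and the real sizes are supplied only through the memory objects at execution time. The sizes K = 256, N = 512, M = 64 are arbitrary, and only some primitives and configurations (e.g. matmul) support runtime dims.

    #include <vector>
    #include "oneapi/dnnl/dnnl.hpp"

    using namespace dnnl;

    int main() {
        engine eng(engine::kind::cpu, 0);
        stream strm(eng);

        const memory::dim K = 256, N = 512;
        // M is not known at creation time: use DNNL_RUNTIME_DIM_VAL as a
        // placeholder so the same primitive can later run for any M.
        memory::desc src_md({DNNL_RUNTIME_DIM_VAL, K}, memory::data_type::f32,
                memory::format_tag::ab);
        memory::desc wei_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
        memory::desc dst_md({DNNL_RUNTIME_DIM_VAL, N}, memory::data_type::f32,
                memory::format_tag::ab);

        matmul::primitive_desc pd(eng, src_md, wei_md, dst_md);
        matmul prim(pd);

        // At execution time the real M is carried by the memory objects.
        const memory::dim M = 64;
        std::vector<float> src(M * K), wei(K * N), dst(M * N);
        memory::desc src_rt({M, K}, memory::data_type::f32, memory::format_tag::ab);
        memory::desc wei_rt({K, N}, memory::data_type::f32, memory::format_tag::ab);
        memory::desc dst_rt({M, N}, memory::data_type::f32, memory::format_tag::ab);
        memory src_m(src_rt, eng, src.data());
        memory wei_m(wei_rt, eng, wei.data());
        memory dst_m(dst_rt, eng, dst.data());

        prim.execute(strm, {{DNNL_ARG_SRC, src_m}, {DNNL_ARG_WEIGHTS, wei_m},
                {DNNL_ARG_DST, dst_m}});
        strm.wait();
        return 0;
    }

The trade-off debated in this thread is that such a primitive cannot specialize its kernel for the runtime dimension at creation time, which is exactly where the performance gap against fully JIT-specialized kernels comes from.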

Another aspect is weight packing. We'd prefer to keep a single copy of the weights in the framework. What are the implications of this RFC for that?

jgong5 avatar Nov 09 '23 08:11 jgong5

For graph mode, we can get the specific input shapes as example inputs in most cases. Sizes of weights and attributes are static. We can know whether sizes of activations are dynamic or static.

@jgong5, I just want to make sure that I understand what static and dynamic mean here.

  • Static sizes are the sizes that are known at the graph creation stage
  • Dynamic sizes are the sizes that are not known at the graph creation stage

But in both cases it is not guaranteed that the operations in the created graph will always be called for the same shapes. Do I get it right?

densamoilov avatar Nov 10 '23 03:11 densamoilov

But in both cases it is not guaranteed that the operations in the created graph will always be called for the same shapes. Do I get it right?

That's not correct. In graph mode, static sizes won't change at runtime.

jgong5 avatar Nov 10 '23 04:11 jgong5

@jgong5, got it. So we have to assume that the non-static sizes may change at runtime even if in certain cases it may not be true?

densamoilov avatar Nov 10 '23 06:11 densamoilov

@jgong5, got it. So we have to assume that the non-static sizes may change at runtime even if in certain cases it may not be true?

That's correct.

jgong5 avatar Nov 10 '23 09:11 jgong5

@Simonsays095, it looks like the proposed options (except for the DNNL_RUNTIME_DIM_VAL one) will not solve the problem. So far it seems that our best shot is to focus on developing the reusable kernels and closing the performance gap between them and the JIT ones as much as possible. Down the road, based on the feedback from @jgong5, we may want to make those reusable kernels semi-specialized by enabling runtime dimensions so that the frameworks can provide the static sizes.

densamoilov avatar Nov 13 '23 04:11 densamoilov

@jgong5, at this point our main concern (and hence the RFC) is not how to improve the performance of models with truly dynamic shapes (e.g. Hifagan on ATS-M), but how to avoid introducing a big performance regression in models that actually have no dynamic shapes.

Can you please give us an example: are there dynamic shapes in any well-optimized CNN models on ATS-M like RN-50 or Stable Diffusion? If some parameters in some Conv layers are dynamic according to the IPEX knowledge, we would prioritize reusable kernels for those layers, and those layers would see regressions.

Based on that, I can see two scenarios:

  1. If non-static shapes are rare and really occur only in models where some sizes are dynamic, we can stick with option #4. In this case you would need to pass a hint if at least one dimension is non-static, and everything else would be handled by oneDNN. If you don't like the dnnl_primitive_attr_get_hint_reusable_kernels name because it exposes implementation details, we can rename it to something like dnnl_primitive_attr_get_hint_dynamic_shapes (a usage sketch follows this comment).

  2. If a fair number of fixed-shape models are affected by non-static shapes, then the problem can only be addressed by enabling reusable kernels unconditionally, which means their performance has to be on par with JIT kernels. Alternatively, you would need to introduce a knob at the IPEX level to choose between JIT and reusable kernels, and that would be a tunable parameter on the user side.

karturov avatar Nov 16 '23 23:11 karturov
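
To make "Option 4" concrete, here is a rough sketch of how a framework could pass such a hint. The setter used below is hypothetical: it mirrors the dnnl_primitive_attr_get_hint_dynamic_shapes getter name mentioned above and does not exist in oneDNN today, so it is left commented out and the sketch compiles with a default attribute.

    #include "oneapi/dnnl/dnnl.hpp"

    using namespace dnnl;

    // Hypothetical helper: create a matmul primitive descriptor while telling
    // oneDNN that this primitive's shapes may change at runtime, so it can
    // prefer a reusable kernel over a JIT-specialized one.
    matmul::primitive_desc make_dynamic_matmul_pd(const engine &eng,
            const memory::desc &src_md, const memory::desc &wei_md,
            const memory::desc &dst_md) {
        primitive_attr attr;
        // NOTE: not a real oneDNN call; placeholder for the proposed hint.
        // attr.set_hint_dynamic_shapes(true);
        return matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    }

With such a knob the framework only flags whether any dimension may change; the per-dimension hints that @jgong5 argues for below would carry more information at the cost of a larger API surface.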

Can you please give us an example: are there dynamic shapes in any well-optimized CNN models on ATS-M like RN-50 or Stable Diffusion? If some parameters in some Conv layers are dynamic according to the IPEX knowledge, we would prioritize reusable kernels for those layers, and those layers would see regressions.

@karturov In general, the batch dim could be dynamic for conv and gemm (inner-product and batch matmul), and the spatial dims could be dynamic for conv. Examples:

  • Conv in stable diffusion: dynamic batch dim due to dynamic batching, dynamic spatial dim due to changing input and output image sizes.
  • Gemm in stable diffusion and LLMs: dynamic batch dim due to varied sequence length.

If non-static shapes are rare and really occur only in models where some sizes are dynamic, we can stick with option #4. In this case you would need to pass a hint if at least one dimension is non-static, and everything else would be handled by oneDNN.

I wouldn't say non-static shapes are rare. Also, we cannot always know whether the shapes could change or not. As I explained previously, for eager mode we have to assume shapes are changing, and for graph mode we have more info and know which specific dims could change for individual ops, along with sample input shapes. That's why I would prefer a flexible API for frameworks to pass all the known information, e.g., one that can mark individual dims as dynamic and provide size hints for them. Then, oneDNN can decide how to serve things best. A coarser-grained API like a "dynamic hint" is also fine, but that would leave oneDNN less room to specialize for graph mode in the future.

If a fair number of fixed-shape models are affected by non-static shapes, then the problem can only be addressed by enabling reusable kernels unconditionally, which means their performance has to be on par with JIT kernels.

Do you mean to make the performance of these dynamic shape kernels on par with static shape kernels?

jgong5 avatar Nov 17 '23 02:11 jgong5

That's why I would prefer a flexible API for frameworks to pass all the known information, e.g., one that can mark individual dims as dynamic and provide size hints for them.

@jgong5 It makes the kernel implementation more complex (and also results in more scenarios to cover in testing) so we would rather keep it simple at the kernel level: either JIT compile all sizes into the kernel or have an internal list of pre-defined kernels supporting dynamic sizes to dispatch between them.

A coarser-grained API like a "dynamic hint" is also fine, but that would leave oneDNN less room to specialize for graph mode in the future.

Yeah, that's true, but a more fine-grained API is usually harder to develop, support, and use, so it may be better to stick to the simpler version. Another perspective: we don't have performance data to justify investing in a more flexible API.

Do you mean to make the performance of these dynamic shape kernels on par with static shape kernels?

Yes. In the short term, dynamic-shape kernels can't reach the same level of performance as JIT kernels, so they should not be used unless absolutely necessary (e.g. for models that are known to expose dynamic shapes).

echeresh avatar Nov 17 '23 02:11 echeresh
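
As an aside, the kernel-level split described above can be pictured with a simplified dispatch rule (purely illustrative, not actual oneDNN internals): either JIT-compile a kernel with the sizes baked in, or fall back to a pre-built reusable kernel, with the choice driven by a per-problem flag derived from the framework's hint.

    #include <cstdint>

    // Illustrative only; not oneDNN internals.
    struct conv_problem_t {
        int64_t mb, ih, iw, ic, oc; // a few of the sizes a conv problem carries
        bool has_dynamic_dims;      // derived from the framework-provided hint
    };

    enum class kernel_kind_t {
        jit_specialized, // sizes baked into the generated code; best performance
        reusable         // sizes passed as kernel arguments; works for any shape
    };

    // Prefer the reusable kernel only when shapes are expected to change, since
    // (per the discussion) it currently trails the JIT kernel in performance.
    kernel_kind_t choose_kernel(const conv_problem_t &p) {
        return p.has_dynamic_dims ? kernel_kind_t::reusable
                                  : kernel_kind_t::jit_specialized;
    }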

It makes the kernel implementation more complex (and also results in more scenarios to cover in testing) so we would rather keep it simple at the kernel level: either JIT compile all sizes into the kernel or have an internal list of pre-defined kernels supporting dynamic sizes to dispatch between them. Yeah, that's true, but a more fine-grained API is usually harder to develop, support, and use, so it may be better to stick to the simpler version. Another perspective: we don't have performance data to justify investing in a more flexible API.

If these dynamic hints for individual dims are passed as hints, you have the flexibility to choose the algorithms you like. You can keep it simple and either JIT compile all sizes, or use AOT-compiled kernels, or do something more complicated. Finer-grained semantics do not necessarily make your implementation harder.

jgong5 avatar Nov 17 '23 05:11 jgong5

If these dynamic hints for individual dims are passed as hints, you have the flexibility to choose the algorithms you like

That adds complexity on our side, but it is doable. At least I believe we can do better when, for example, only the batch size is dynamic. If these parameters are hints, we still have enough flexibility, in particular to ignore them and still generate JIT kernels when we don't yet have a good reusable kernel for a given combination of dynamic shapes. @echeresh, what are your thoughts?

Conv in stable diffusion: dynamic batch dim due to dynamic batching, dynamic spatial dim due to changing input and output image sizes.

This is a problem, especially for Eager mode. It means that if we implemented the mechanism with per-dimension hints, we would create reusable kernels for all SD Convolutions, and performance would go down. And I guess there would be broad regressions. Hence, again, the recommendations are:

  1. Eager mode: add a global, tunable, disabled-by-default parameter on the IPEX side enabling dynamic shapes. I guess you have a User Guide where you can recommend that users try this in case of low performance.

  2. Graph mode: guarantee that you set a hint only for really changing dimensions, so, for example, only relevant Hifagan layers are affected, but SD/ResNet-50 layers are not affected. If it is not possible, see (1).

  3. Can you implement logic in IPEX detecting that a layer was suffering from a cache miss? In that case, you would trigger/enable the dynamic hint for a model on the fly. For example, you ignore the 1st iteration, and if you see a cache miss at iteration 2+, you enable passing dynamic hints, so all subsequent ops benefit from reusable kernels and hopefully iterations 3+ are good (a sketch of such detection logic follows this comment).

karturov avatar Nov 17 '23 16:11 karturov
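
The detection logic suggested in point (3) could look roughly like the following framework-side bookkeeping (hypothetical, not IPEX or oneDNN code): record the shapes seen per layer and flip a per-layer "dynamic" flag once a second distinct shape shows up after the first iteration.

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical framework-side tracker; not IPEX or oneDNN code.
    class shape_tracker_t {
    public:
        // Returns true once the layer should be treated as dynamic-shaped,
        // i.e. the framework should start passing the dynamic hint for it.
        bool observe(const std::string &layer_id,
                const std::vector<int64_t> &shape, int iteration) {
            auto &seen = seen_shapes_[layer_id];
            seen.insert(shape);
            // Ignore the first (warm-up) iteration; after that, more than one
            // distinct shape means the JIT kernel cache would keep missing.
            if (iteration >= 1 && seen.size() > 1) dynamic_layers_.insert(layer_id);
            return dynamic_layers_.count(layer_id) > 0;
        }

    private:
        std::map<std::string, std::set<std::vector<int64_t>>> seen_shapes_;
        std::set<std::string> dynamic_layers_;
    };

Once observe() returns true for a layer, the framework could re-create the affected primitives with the dynamic hint so that subsequent iterations run on reusable kernels, as suggested above.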

@karturov,

Graph mode: guarantee that you set a hint only for really changing dimensions, so, for example, only relevant Hifagan layers are affected, but SD/ResNet-50 layers are not affected. If it is not possible, see (1)

Based on this recommendation it looks like there is a misunderstanding here. As far as I understand, in the graph mode the framework knows what dimensions are static and what aren't. But it's not necessary that those dimensions that aren't static will actually change at execution time. So even such models as ResNet-50 will be affected.

densamoilov avatar Nov 17 '23 17:11 densamoilov

Based on this recommendation it looks like there is a misunderstanding here. As far as I understand, in the graph mode the framework knows what dimensions are static and what aren't. But it's not necessary that those dimensions that aren't static will actually change at execution time. So even such models as ResNet-50 will be affected.

@densamoilov, I've read @jgong5's comments once again, and I agree with you. It looks like in Eager mode everything is non-static. It means IPEX can't provide any useful information in Eager mode, and the information provided in Graph mode won't be accurate. Hence, regressions are expected in both cases.

karturov avatar Nov 17 '23 18:11 karturov

@karturov,

Graph mode: guarantee that you set a hint only for really changing dimensions, so, for example, only relevant Hifagan layers are affected, but SD/ResNet-50 layers are not affected. If it is not possible, see (1)

Based on this recommendation it looks like there is a misunderstanding here. As far as I understand, in the graph mode the framework knows what dimensions are static and what aren't. But it's not necessary that those dimensions that aren't static will actually change at execution time. So even such models as ResNet-50 will be affected.

Based on this comment from @jgong5 it's not quite true:

For graph mode, we can get the specific input shapes as example inputs in most cases. Sizes of weights and attributes are static. We can know whether sizes of activations are dynamic or static.

Once the framework can distinguish between dynamic- and static-size activations, we can definitely address the following scenario:

  • Graph mode
  • Activations are dynamic sized

By "address" I mean we can provide much lower primitive creation time at the cost of worse kernel performance. This is possible under the assumption the framework will propagate this information (that we are in the graph mode with dynamic size activations) to oneDNN.

One important caveat here is that the distinction between static- and dynamic-size activations must be clear. At this point oneDNN has received requests about only a few models with dynamic sizes, while most models work with static activation sizes. So the expectation is that the framework can reliably detect the first set of models (with dynamic-size activations) without any changes to how we handle the other set (with static-size activations).

echeresh avatar Nov 17 '23 21:11 echeresh

@jgong5, can you please address the confusion?

@karturov,

Graph mode: guarantee that you set a hint only for really changing dimensions, so, for example, only relevant Hifagan layers are affected, but SD/ResNet-50 layers are not affected. If it is not possible, see (1)

Based on this recommendation it looks like there is a misunderstanding here. As far as I understand, in the graph mode the framework knows what dimensions are static and what aren't. But it's not necessary that those dimensions that aren't static will actually change at execution time. So even such models as ResNet-50 will be affected.

Based on this comment from @jgong5 it's not quite true:

For graph mode, we can get the specific input shapes as example inputs in most cases. Sizes of weights and attributes are static. We can know whether sizes of activations are dynamic or static.

densamoilov avatar Nov 17 '23 23:11 densamoilov

@jgong5, can you please address the confusion?

PyTorch graph mode allows users to specify whether the shapes are dynamic or static. In both cases, example inputs with specific shapes are known. For dynamic shapes, the runtime dims might change. For static shapes, those dims won't change and can be specialized during compilation. Does that address your confusion?

jgong5 avatar Nov 20 '23 08:11 jgong5