Time consumption of primitive creation: conv1d (nwc) vs. conv2d (blocked format)
Hi,
We found that the primitive creation time of conv1d (nwc input) is much higher than that of conv2d (blocked format), especially for the first creation.
Though there is a primitive cache, the average creation time of conv1d (nwc) is still higher than that of conv2d (blocked format). On a cache miss, conv1d takes roughly 10x to 100x as long as a conv2d cache miss.
Is this reasonable? Do you have plans to improve primitive creation time? Thank you!
PS: Here is an example of the primitive creation: conv1d-conv2d.zip
Hi, I also changed conv2d's src and dst descriptors to tag::acdb. Its primitive creation time is also much larger than that of conv2d with the blocked format. It seems that primitive creation takes too much time when src and dst use a channels-last format. This will severely hurt performance when using the channels-last path.
```cpp
auto conv_src_md = memory::desc(src_dims, dt::f32, tag::acdb);
auto conv_weights_md = memory::desc(weights_dims, dt::f32, tag::any);
auto conv_dst_md = memory::desc(dst_dims, dt::f32, tag::acdb);
```
Hi @yanbing-j, thank you for your question.
Primitive creation time may vary depending on several factors, including memory format tags. Most of it is spent on kernel generation.
> Though there is a primitive cache, the average creation time of conv1d (nwc) is still higher than that of conv2d (blocked format).
Could you elaborate on this, please? According to our measurements, when a primitive is taken from the cache, creation takes ~5 us for any primitive.
> Is this reasonable? Do you have plans to improve primitive creation time?
I'm not aware of any plans to improve primitive creation time, since it is expected that the first call, which involves kernel generation, takes time.
The oneDNN programming model assumes that users create objects once, keep them alive, and only call their execute functions. There were scenarios where users had a programming model that involved re-creating primitives, and it would have been very costly for them to switch to our model. That's why the primitive (and primitive descriptor) cache was implemented; it reduces the cost of re-creating a primitive to a minimum.
Could you describe the scenario when primitive cache doesn't help you to resolve the performance issue? Thank you.
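Conceptually, a cache like this is keyed by everything that identifies the generated kernel (operation kind, dimensions, data types, format tags), so any combination not seen before pays the full creation cost. The following is a hypothetical Python sketch of the idea, not oneDNN's actual implementation:

```python
# Hypothetical sketch of a shape-keyed primitive cache; NOT oneDNN internals.
class PrimitiveCache:
    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get_or_create(self, key, create_fn):
        # key must capture everything that selects a kernel:
        # op kind, dims, data type, memory format tag, ...
        prim = self._cache.get(key)
        if prim is None:
            self.misses += 1  # full creation cost (kernel generation)
            prim = self._cache[key] = create_fn()
        else:
            self.hits += 1    # cached: creation cost is negligible
        return prim

cache = PrimitiveCache()
make = lambda: object()  # stand-in for expensive primitive creation
a = cache.get_or_create(("conv", (1, 64, 128), "f32", "nwc"), make)
b = cache.get_or_create(("conv", (1, 64, 128), "f32", "nwc"), make)  # hit
c = cache.get_or_create(("conv", (1, 64, 256), "f32", "nwc"), make)  # new W: miss
assert a is b and a is not c
assert (cache.hits, cache.misses) == (1, 2)
```

Note how changing a single spatial dimension produces a new key and therefore a miss, which is why workloads with many distinct input shapes see mostly first-call costs.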
Hi @dzarukin , thank you for your response.
In my test shape, when the cache hits, the average time of conv2d is ~2.3 us and the average time of conv1d is ~4 us. Although conv1d takes 1.74x as long as conv2d, ~5 us for primitive creation seems acceptable.
In my test model, primitive re-creation happens frequently due to dynamic input shapes, so the primitive cache rarely hits. As a result, most creations behave like the first call, which takes too much time. However, the first call of conv2d (blocked format) is much faster than that of conv1d (nwc input). This causes a performance downgrade when we use conv1d instead of conv2d. (Note: PyTorch treats conv1d as conv2d, so conv2d here is the baseline. We are trying to make conv1d use the oneDNN conv1d implementation directly, instead of viewing it as conv2d and using the conv2d implementation.)
Will you optimize the first call of conv1d (nwc) to be at least as fast as that of conv2d (blocked format)?
Hi @yanbing-j, I think I've reproduced the behavior you are observing and am now clarifying it with the team. To make sure we are on the same page: the problem is not 1d vs. 2d. The two cases hit different implementations, and those implementations behave differently. The reason they hit different implementations is the memory layout of the activations: nhwc takes longer than nChw16c. I'll let you know once I dig up something. Thank you.
Hi @dzarukin , thank you so much! Looking forward to good news.
This is expected behavior. Initialization time is implementation-specific and can vary between versions.