
required rank 4 tensor to use channels_last format

NathanJHLee opened this issue 3 years ago · 6 comments

Hi, I am Nathan. Lately I have been trying to speed up libtorch (C++) inference using intel-extension-for-pytorch. I am using the wenet-e2e toolkit for a speech recognition system.

Following your release of intel-extension-for-pytorch 1.11.200, I applied the run file to the original libtorch 1.11.0 CPU build:

bash libintel-ext-pt-shared-with-deps-1.11.200+cpu.run install workspace/libtorch_cpu/

Then I did two things. First (in Python 3.9 with torch 1.10, used for training), since I think libtorch with intel-extension is ready to run in C++, I prepared a TorchScript model:

script_model = torch.jit.script(model)
script_model.save(args.output_file)
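As a sanity check, the export step above can be exercised in isolation. The toy module below is a hypothetical stand-in for the real wenet model, but the `torch.jit.script` / save / load round trip is the same one the C++ side later consumes via `torch::jit::load`:

```python
import io

import torch

class Toy(torch.nn.Module):
    # Hypothetical stand-in for the real model, for illustration only.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2.0

script_model = torch.jit.script(Toy().eval())

# Round-trip through an in-memory buffer; script_model.save(path) is the
# on-disk equivalent that libtorch loads in C++.
buf = io.BytesIO()
torch.jit.save(script_model, buf)
buf.seek(0)
reloaded = torch.jit.load(buf)
```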

Second (in C++):

torch::Tensor feats = torch::zeros({1, num_frames, feature_dim}, torch::kFloat);
feats = feats.to(c10::MemoryFormat::ChannelsLast);  // for intel-extension

Finally, I built the binary and checked the library linkage:

$ ldd e2e-intel-ext | grep torch
libintel-ext-pt-cpu.so => ../e2edecoder/libtorch/lib/libintel-ext-pt-cpu.so (0x00007f638f59f000)
libtorch_cpu.so => ../e2edecoder/libtorch/lib/libtorch_cpu.so (0x00007f63789b6000)
libc10.so => ../e2edecoder/libtorch/lib/libc10.so (0x00007f6392bff000)
libgomp-a34b3233.so.1 => ../e2edecoder/libtorch/lib/libgomp-a34b3233.so.1 (0x00007f6374da8000)

When I run the binary, I get this error (template angle brackets restored; they were stripped when the trace was pasted):

required rank 4 tensor to use channels_last format
Exception raised from empty_tensor_restride at ../c10/core/TensorImpl.h:2145 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa7f4abbf72 in ../e2edecoder/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x5f (0x7fa7f4ab86bf in ../e2edecoder/libtorch/lib/libc10.so)
frame #2: + 0x106709f (0x7fa7db89d09f in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #3: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x80e (0x7fa7db899ffe in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #4: at::detail::empty_cpu(c10::ArrayRef<long>, c10::ScalarType, bool, c10::optional<c10::MemoryFormat>) + 0x41 (0x7fa7db89aa61 in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #5: at::detail::empty_cpu(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x34 (0x7fa7db89aab4 in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #6: at::native::empty_cpu(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1e (0x7fa7dbd6e01e in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #7: + 0x1c2577a (0x7fa7dc45b77a in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #8: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x12c (0x7fa7dc21cfdc in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #9: + 0x1c0c2af (0x7fa7dc4422af in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #10: at::_ops::empty_memory_format::call(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1e0 (0x7fa7dc257e30 in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #11: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x582 (0x7fa7dbd6a632 in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #12: + 0x1d2758a (0x7fa7dc55d58a in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #13: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x139 (0x7fa7dc024229 in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #14: + 0x1c0c6c7 (0x7fa7dc4426c7 in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #15: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x139 (0x7fa7dc024229 in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #16: + 0x28405a6 (0x7fa7dd0765a6 in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #17: + 0x2840aed (0x7fa7dd076aed in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #18: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1aa (0x7fa7dc091b7a in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #19: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x112 (0x7fa7dbd66702 in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #20: + 0x1dbdf30 (0x7fa7dc5f3f30 in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #21: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x1c1 (0x7fa7dc184141 in ../e2edecoder/libtorch/lib/libtorch_cpu.so)
frame #22: + 0x14d50a (0x7fa7f433850a in ../e2edecoderbin/../lib/libe2e-core-cpu.so)
frame #23: wenet::TorchAsrDecoder::AdvanceDecoding() + 0x44d (0x7fa7f433ae3d in ../e2edecoderbin/../lib/libe2e-core-cpu.so)
frame #24: E2EInference::E2EInfer(short const*, int, bool) + 0x31b (0x7fa7f434658b in ../e2edecoderbin/../lib/libe2e-core-cpu.so)
frame #25: ./e2e-intel-ext() [0x41980b]
frame #26: __libc_start_main + 0xf5 (0x7fa7d9a4f555 in /lib64/libc.so.6)
frame #27: ./e2e-intel-ext() [0x41beff]

  1. Could you please look at my error message? What is the problem in my case?
  2. Do the torch version used for training and the libtorch version have to match?
  3. Before saving the TorchScript model, do I have to do the following in Python?
     model = model.to(memory_format=torch.channels_last)
     model = ipex.optimize(model)
  4. Is this only for batch-type inference?
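On question 3, that recipe can be sketched on the Python side as below. The `Net` module is a hypothetical placeholder, and the `intel_extension_for_pytorch` import is guarded so the sketch also runs on stock PyTorch:

```python
import torch

class Net(torch.nn.Module):
    # Hypothetical placeholder for the real model.
    def __init__(self) -> None:
        super().__init__()
        self.conv = torch.nn.Conv2d(1, 4, kernel_size=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)

model = Net().eval()
model = model.to(memory_format=torch.channels_last)  # reorder conv weights to NHWC layout

try:
    import intel_extension_for_pytorch as ipex
    model = ipex.optimize(model)  # apply IPEX weight/operator optimizations, if installed
except ImportError:
    pass  # fall back to stock PyTorch when IPEX is absent

# Inference input is also converted to channels_last (rank-4 NCHW tensor).
x = torch.randn(1, 1, 8, 8).to(memory_format=torch.channels_last)
with torch.no_grad():
    y = model(x)
```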

Thank you for reading.

NathanJHLee avatar Jul 07 '22 02:07 NathanJHLee

torch::Tensor feats =torch::zeros({1, num_frames, feature_dim}, torch::kFloat);

The input tensor is 3-dimensional. It cannot be converted to torch.channels_last, because torch.channels_last only applies to 4-dimensional tensors (NCHW).
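Both the failure and the fix can be reproduced from Python. The shapes below are hypothetical stand-ins for num_frames and feature_dim:

```python
import torch

# Rank-3 tensor, as in the original C++ code: (batch, num_frames, feature_dim).
feats = torch.zeros(1, 100, 80)

raised = False
try:
    feats = feats.to(memory_format=torch.channels_last)
except RuntimeError:
    # "required rank 4 tensor to use channels_last format"
    raised = True

# unsqueeze (not squeeze) adds the missing dimension, giving an NCHW-like rank-4 tensor.
feats4 = feats.unsqueeze(1)  # (1, 1, num_frames, feature_dim)
feats4 = feats4.to(memory_format=torch.channels_last)
```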

EikanWang avatar Jul 10 '22 03:07 EikanWang

Thank you so much. I used unsqueeze to add one more dimension, changed some of the Python code, and it finally works. But it does not feel faster than before. Searching your posts, I found that this is intended for CNN models. My model is a conformer, which includes some CNN modules, so I suspect the conformer only gains a small advantage from intel-extension-for-pytorch. Or is it helpful for other kinds of models too? Could you please answer my question? Thank you.

NathanJHLee avatar Jul 13 '22 23:07 NathanJHLee

In general, IPEX can boost a broad set of workloads, and the performance improvement is significant on top of the JIT graph if the model can be traced as a graph. I'm not familiar with the conformer. Could you please share the model repo with me? I will profile it.
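For reference, a minimal sketch of what "traced as a graph" means here, using a hypothetical one-layer CNN and stock `torch.jit` only (with IPEX installed, `ipex.optimize` would be applied to the module before tracing):

```python
import torch

class TinyCnn(torch.nn.Module):
    # Hypothetical stand-in for the CNN modules inside a conformer block.
    def __init__(self) -> None:
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.conv(x))

model = TinyCnn().eval()
x = torch.randn(1, 3, 16, 16).to(memory_format=torch.channels_last)

with torch.no_grad():
    traced = torch.jit.trace(model, x)   # capture the forward pass as a JIT graph
    traced = torch.jit.freeze(traced)    # inline weights so graph optimizations can apply
    y = traced(x)
```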

EikanWang avatar Jul 14 '22 13:07 EikanWang