TableMASTER-mmocr
how about the speed in inference
Thanks for your great work!
How about the inference speed? I use a GTX-1080 and it costs a few seconds (almost 10 s) per image, so I want to know whether this is normal.
I think 10 s for end-to-end inference (text-line detection, text-line recognition, table structure reconstruction, and the match process) is normal on a GTX-1080.
We did not put the speedup module in the released code; you can refer to our MASTER paper for it. I expect the end-to-end inference can be sped up to 2-4 s.
But I found that the speed bottleneck is not the MASTER encoder backbone but the auto-regressive decoder, whose max length can be up to 500 steps. Do you have any ideas about that?
In the auto-regressive decoder there are many repeated operations. In the MASTER paper we use a memory-cached mechanism to speed up inference, which is extremely effective for long decoding lengths; please check the paper. The O(n^2) complexity can be reduced to O(n).
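For readers who do not have the MASTER paper at hand, here is a minimal sketch of the memory-cached idea for the decoder's self-attention, written in plain PyTorch. The class and argument names (`CachedSelfAttention`, `cache`, `x_t`) are made up for the illustration and are not identifiers from this repository.

```python
import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    """Single-head self-attention with a key/value cache for step-by-step
    auto-regressive decoding. Illustrative only: the real MASTER decoder is
    multi-head and also has cross-attention and feed-forward sub-layers."""

    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x_t, cache=None):
        # x_t: (B, 1, D) -- embedding of the newest token only.
        q = self.q_proj(x_t)
        k_new, v_new = self.k_proj(x_t), self.v_proj(x_t)
        if cache is None:
            k, v = k_new, v_new                        # first step: cache starts here
        else:
            # Append the new key/value instead of re-projecting tokens 1..T.
            k = torch.cat([cache["k"], k_new], dim=1)  # (B, T, D)
            v = torch.cat([cache["v"], v_new], dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v                                 # (B, 1, D)
        return out, {"k": k, "v": v}                   # caller keeps the updated cache
```

Without the cache, step T re-embeds and re-projects all T previous tokens, so N decoding steps cost O(N^2) projections in total; with the cache, each step only projects the newest token, which is where the O(n^2) to O(n) claim comes from.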
In the competition, we used our own internal tool (FastOCR) to implement our algorithm; we did not implement it in the mmocr framework. If you fully understand the MASTER paper, I believe you can implement it.
I have implemented the "memory-cached inference", but the speedup ratio is not as large as I expected. I had already slimmed the decoder down to a single-block transformer decoder, which brought the inference speed to ~2.5 s/img; after the memory-cached optimization it reaches 2.2 s/img, about a 12% improvement. Do you think such a small improvement is normal for a lightweight decoder? I found that a single decode step costs only about 6 ms, and the total 500 steps sum up to ~3 s. Is there any room for further improvement? BTW, I found something unnecessary at code lines 42~43; when I removed those lines, the inference speed reached 1.3 s/img, which is a pretty large gain. If I also add an early-stop mechanism on meeting the EOS symbol, the average time cost drops to a few hundred milliseconds, which would satisfy the requirements of a normal industrial application.
cc @zzhanq @JiaquanYe @delveintodetail
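Since the early-stop trick gives the biggest gain here, a hedged sketch of it is given below. `decoder_step`, `sos_id`, `eos_id` and `max_len` are placeholder names; the actual TableMASTER decoding code is organised differently.

```python
import torch

def greedy_decode_with_early_stop(decoder_step, memory, sos_id, eos_id, max_len=500):
    """Greedy auto-regressive decoding that stops as soon as every sequence
    in the batch has emitted EOS, instead of always running the full
    ``max_len`` (500) steps.  ``decoder_step(tokens, memory)`` is assumed to
    return logits of shape (B, T, vocab)."""
    device = memory.device
    batch = memory.size(0)
    tokens = torch.full((batch, 1), sos_id, dtype=torch.long, device=device)
    finished = torch.zeros(batch, dtype=torch.bool, device=device)
    for _ in range(max_len):
        logits = decoder_step(tokens, memory)            # (B, T, vocab)
        next_tok = logits[:, -1].argmax(dim=-1)          # (B,)
        # Once a sequence is finished, keep padding it with EOS.
        next_tok = torch.where(finished, torch.full_like(next_tok, eos_id), next_tok)
        tokens = torch.cat([tokens, next_tok[:, None]], dim=1)
        finished |= next_tok.eq(eos_id)
        if finished.all():                               # early stop: nothing left to decode
            break
    return tokens
```

With batch size 1, the loop then runs only as many steps as the table actually needs rather than a fixed 500, which is consistent with the few-hundred-millisecond figure quoted above.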
Hi, sonack. Code lines 42-43 are indeed unnecessary; I have fixed this bug. In our experience, "memory-cached inference" speeds things up by about 40-50% when the max length is 100, on a normal-size MASTER decoder, and we haven't tried it on a lightweight decoder. An early-stop mechanism is a useful speed-up for sequence decoding.
I expect the speedup ratio should be much larger than you reported; it highly depends on your implementation. If you prefer, you can send me your code and I will check it for you. If you really care about speed, I would suggest decreasing the resolution of the CNN's output feature map. If currently the input image size is 400×400 and the output resolution is 50×50×C, you can further decrease it to 25×25×C with convolution and pooling; it will largely speed up inference but may decrease performance by about 1%.
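To make the resolution suggestion concrete, here is a hedged sketch of an extra convolution-plus-pooling stage; the channel count and module names are assumptions for illustration, not values from the TableMASTER config.

```python
import torch
import torch.nn as nn

class ExtraDownsample(nn.Module):
    """Halve the encoder feature map once more before decoding: with a
    400x400 input and a 50x50xC feature map, this produces 25x25xC, so the
    decoder attends over 625 positions instead of 2500."""

    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 50x50 -> 25x25
        )

    def forward(self, feat):              # feat: (B, C, 50, 50)
        return self.block(feat)           # ->   (B, C, 25, 25)

feat = torch.randn(1, 512, 50, 50)
print(ExtraDownsample(512)(feat).shape)   # torch.Size([1, 512, 25, 25])
```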
Sorry, I can't provide you the code directly because I did it at my company. Do you have WeChat? Maybe we can discuss it in more depth there. Thank you very much.
Hi, @JiaquanYe @delveintodetail, I tested the inference speed of the full model with max_len=500 and no early stop on EOS:
- Without removing the redundant computation at code lines 42-43, it takes about 11 s/img;
- After removing the redundant computation, about 6 s/img;
- With memory-cached inference added on top, about 4.5 s/img.
For the same config, how fast does your internal memory-cached inference implementation get? I profiled my current memory-cached implementation and the main bottleneck is the K/Q/V matrix computation, so it should not be an issue with the PyTorch implementation details. Everything runs in native PyTorch, without TorchScript or any other engineering optimization.
In the decoder, the computation of K and V is conducted only once (this is important for efficiency), and it is then cached for further use. You don't need to compute it at each time step. I believe this is the problem with your code. Tell me if that fixes the issue. Thanks.
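Here is a sketch of what "computed only once" means for the encoder-decoder (cross) attention; the names `memory`, `precompute` and the projection layers are made up for the illustration and do not reflect the repository's actual class layout.

```python
import torch
import torch.nn as nn

class CachedCrossAttention(nn.Module):
    """Encoder-decoder (cross) attention where K and V come from the encoder
    memory, so they can be projected once before decoding starts and reused
    at every time step."""

    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5
        self.k = self.v = None

    def precompute(self, memory):
        # Called once per image: memory is the encoder output, shape (B, S, D).
        self.k = self.k_proj(memory)
        self.v = self.v_proj(memory)

    def forward(self, x_t):
        # x_t: (B, 1, D) -- only the newest decoder position needs a query.
        q = self.q_proj(x_t)
        attn = torch.softmax(q @ self.k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ self.v
```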
That's not quite right: for the cross-attention between the encoder and the decoder, K and V indeed only need to be computed once, but Q has to be computed at every step (because Q comes from the token predicted at the previous time step; at time step T you only compute the multi-head linear projection of the token predicted at step T-1, rather than recomputing all of steps 1~T), similar to the red-boxed part in the figure below:
[figure: screenshot of the decoder code, with the per-step Q projection boxed in red]
I will try to re-implement it privately later and see whether I can open a PR; I hope we can review it together then. @JiaquanYe @delveintodetail
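To make the point about Q concrete, here is a rough comparison of the two decoding loops; `decoder_full` and `decoder_step` are placeholder callables, not functions from this repository.

```python
import torch

# Naive decoding: at step T the decoder re-embeds and re-projects tokens
# 1..T, so the total work over N steps grows as O(N^2).
def naive_decode(decoder_full, memory, tokens, n_steps):
    for _ in range(n_steps):
        logits = decoder_full(tokens, memory)    # recomputes Q/K/V for 1..T
        tokens = torch.cat([tokens, logits[:, -1:].argmax(-1)], dim=1)
    return tokens

# Incremental decoding: only the newest token is embedded and projected to a
# query; K/V of earlier steps come from the cache (see the sketches above).
def incremental_decode(decoder_step, memory, tokens, n_steps):
    cache = None
    for _ in range(n_steps):
        logits, cache = decoder_step(tokens[:, -1:], memory, cache)  # 1 new token
        tokens = torch.cat([tokens, logits[:, -1:].argmax(-1)], dim=1)
    return tokens
```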
Line 11: q is 1×d, M^k_b is (T-1)×d, ops: (T-1)·d
Line 12: ops: (T-1)·d
Line 13: ops: (T-1)·d + (T-1)·d
Line 14: ops: d·d + (T-1)·d + (T-1)·d
I roughly list the computational operations of each line above. Please check if it is right for your implementation.
I implemented the decoder part of MASTER-pytorch in FasterTransformer; compared to the original PyTorch it is much faster. Example code. It should be easy to add it to TableMASTER with minor changes.
That's a great job! I will try it in TableMASTER.
mark
How exactly should memory-cached inference and EOS early stopping be implemented to speed up inference?
Essentially, the memory cache just keeps the past K/V so they can be reused when computing attention for later queries. I see Hugging Face has also recently supported this; you can take a look. When I came up with this for the MASTER paper I thought of it as a memory cache, but what we actually did is the same thing as today's KV cache. See: https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.use_cache
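For reference, the Hugging Face flag mentioned in the link can be toggled like this; the `gpt2` checkpoint below is just a stand-in model to demonstrate the flag and has nothing to do with MASTER.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# use_cache=True enables the key/value cache during generation, which is the
# same mechanism as the "memory-cached inference" discussed above.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("A table structure decoder", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50, use_cache=True)
print(tokenizer.decode(output_ids[0]))
```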