TableMASTER-mmocr
how about the speed in inference
Thanks for your great work!
How about the inference speed? I use a GTX-1080 and it costs a few seconds (almost 10 s) per image, so I want to know whether this is normal.
I think 10 s for end-to-end inference (text-line detection, text-line recognition, table structure reconstruction, and the match process) is normal on a GTX-1080.
We did not put the speedup module in the released code; you can refer to our MASTER paper for it. I expect the end-to-end inference can be sped up to 2-4 s.
But I found that the speed bottleneck is not the MASTER encoder backbone but the auto-regressive decoder, whose max length can be up to 500 steps. Do you have any ideas about that?
In the auto-regressive decoder there are many repeated operations. In the MASTER paper we use a memory-cached mechanism to speed up inference, which is extremely effective for long decoding lengths; please check the paper. The O(n^2) complexity can be reduced to O(n).
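For readers who do not have the MASTER paper at hand, here is a minimal sketch of the memory-cached idea for the decoder's self-attention, written in plain PyTorch. The class and argument names (`CachedSelfAttention`, `cache`, `x_t`) are made up for the illustration and are not identifiers from this repository.

```python
import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    """Single-head self-attention with a key/value cache for step-by-step
    auto-regressive decoding. Illustrative only: the real MASTER decoder is
    multi-head and also has cross-attention and feed-forward sub-layers."""

    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x_t, cache=None):
        # x_t: (B, 1, D) -- embedding of the newest token only.
        q = self.q_proj(x_t)
        k_new, v_new = self.k_proj(x_t), self.v_proj(x_t)
        if cache is None:
            k, v = k_new, v_new                        # first step: cache starts here
        else:
            # Append the new key/value instead of re-projecting tokens 1..T.
            k = torch.cat([cache["k"], k_new], dim=1)  # (B, T, D)
            v = torch.cat([cache["v"], v_new], dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v                                 # (B, 1, D)
        return out, {"k": k, "v": v}                   # caller keeps the updated cache
```

Without the cache, step T re-embeds and re-projects all T previous tokens, so N decoding steps cost O(N^2) projections in total; with the cache, each step only projects the newest token, which is where the O(n^2) to O(n) claim comes from.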
In the competition, we used our own internal tool (FastOCR) to implement our algorithm; we did not implement it in the mmocr framework. If you fully understand the MASTER paper, I believe you can implement it.
I have implemented the "memory-cached inference", but the speedup ratio is not as large as I expected. I had already slimmed the decoder down to a single-block transformer decoder, which brought the inference speed to ~2.5 s/img; after the memory-cached optimization it reaches 2.2 s/img, about a 12% improvement. Do you think such a small improvement is normal for a lightweight decoder? I found that a single decode step costs only about 6 ms, and the total 500 steps sum up to ~3 s. Is there any room for further improvement? BTW, I found something unnecessary at code lines 42~43; when I removed those lines, the inference speed reached 1.3 s/img, which is a pretty large gain. If I also add an early-stop mechanism on meeting the EOS symbol, the average time cost drops to a few hundred milliseconds, which would satisfy the requirements of a normal industrial application.
cc @zzhanq @JiaquanYe @delveintodetail
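Since the early-stop trick gives the biggest gain here, a hedged sketch of it is given below. `decoder_step`, `sos_id`, `eos_id` and `max_len` are placeholder names; the actual TableMASTER decoding code is organised differently.

```python
import torch

def greedy_decode_with_early_stop(decoder_step, memory, sos_id, eos_id, max_len=500):
    """Greedy auto-regressive decoding that stops as soon as every sequence
    in the batch has emitted EOS, instead of always running the full
    ``max_len`` (500) steps.  ``decoder_step(tokens, memory)`` is assumed to
    return logits of shape (B, T, vocab)."""
    device = memory.device
    batch = memory.size(0)
    tokens = torch.full((batch, 1), sos_id, dtype=torch.long, device=device)
    finished = torch.zeros(batch, dtype=torch.bool, device=device)
    for _ in range(max_len):
        logits = decoder_step(tokens, memory)            # (B, T, vocab)
        next_tok = logits[:, -1].argmax(dim=-1)          # (B,)
        # Once a sequence is finished, keep padding it with EOS.
        next_tok = torch.where(finished, torch.full_like(next_tok, eos_id), next_tok)
        tokens = torch.cat([tokens, next_tok[:, None]], dim=1)
        finished |= next_tok.eq(eos_id)
        if finished.all():                               # early stop: nothing left to decode
            break
    return tokens
```

With batch size 1, the loop then runs only as many steps as the table actually needs rather than a fixed 500, which is consistent with the few-hundred-millisecond figure quoted above.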
Hi, sonack. Code lines 42-43 are indeed unnecessary; I have fixed this bug. In our experience, "memory-cached inference" speeds things up by about 40-50% when the max length is 100, on a normal-size MASTER decoder, and we haven't tried it on a lightweight decoder. An early-stop mechanism is a useful speed-up for sequence decoding.
I expect the speedup ratio should be much larger than you reported; it highly depends on your implementation. If you prefer, you can send me your code and I will check it for you. If you really care about speed, I would suggest decreasing the resolution of the CNN's output feature map. If currently the input image size is 400×400 and the output resolution is 50×50×C, you can further decrease it to 25×25×C with convolution and pooling; it will largely speed up inference but may decrease performance by about 1%.
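To make the resolution suggestion concrete, here is a hedged sketch of an extra convolution-plus-pooling stage; the channel count and module names are assumptions for illustration, not values from the TableMASTER config.

```python
import torch
import torch.nn as nn

class ExtraDownsample(nn.Module):
    """Halve the encoder feature map once more before decoding: with a
    400x400 input and a 50x50xC feature map, this produces 25x25xC, so the
    decoder attends over 625 positions instead of 2500."""

    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 50x50 -> 25x25
        )

    def forward(self, feat):              # feat: (B, C, 50, 50)
        return self.block(feat)           # ->   (B, C, 25, 25)

feat = torch.randn(1, 512, 50, 50)
print(ExtraDownsample(512)(feat).shape)   # torch.Size([1, 512, 25, 25])
```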
Sorry, I can't provide you the code directly because I did it at my company. Do you have WeChat? Maybe we can discuss it in more depth there. Thank you very much.
Hi, @JiaquanYe @delveintodetail, I tested the inference speed of the full model with max_len=500 and no early stop on EOS:
- Without removing the redundant computation at code lines 42-43, it takes about 11 s/img;
- After removing the redundant computation, about 6 s/img;
- With memory-cached inference added on top, about 4.5 s/img.
For the same config, how fast does your internal memory-cached inference implementation get? I profiled my current memory-cached implementation and the main bottleneck is the K/Q/V matrix computation, so it should not be an issue with the PyTorch implementation details. Everything runs in native PyTorch, without TorchScript or any other engineering optimization.
In the decoder, the computation of K and V is conducted only once (this is important for efficiency), and it is then cached for further use. You don't need to compute it at each time step. I believe this is the problem with your code. Tell me if that fixes the issue. Thanks.
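Here is a sketch of what "computed only once" means for the encoder-decoder (cross) attention; the names `memory`, `precompute` and the projection layers are made up for the illustration and do not reflect the repository's actual class layout.

```python
import torch
import torch.nn as nn

class CachedCrossAttention(nn.Module):
    """Encoder-decoder (cross) attention where K and V come from the encoder
    memory, so they can be projected once before decoding starts and reused
    at every time step."""

    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5
        self.k = self.v = None

    def precompute(self, memory):
        # Called once per image: memory is the encoder output, shape (B, S, D).
        self.k = self.k_proj(memory)
        self.v = self.v_proj(memory)

    def forward(self, x_t):
        # x_t: (B, 1, D) -- only the newest decoder position needs a query.
        q = self.q_proj(x_t)
        attn = torch.softmax(q @ self.k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ self.v
```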
That's not quite right: for the cross-attention between the encoder and the decoder, K and V indeed only need to be computed once, but Q has to be computed at every step (because Q comes from the token predicted at the previous time step; at time step T you only compute the multi-head linear projection of the token predicted at step T-1, rather than recomputing all of steps 1~T), similar to the red-boxed part in the figure below:
[figure: screenshot of the decoder code, with the per-step Q projection boxed in red]
I will try to re-implement it privately later and see whether I can open a PR; I hope we can review it together then. @JiaquanYe @delveintodetail
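To make the point about Q concrete, here is a rough comparison of the two decoding loops; `decoder_full` and `decoder_step` are placeholder callables, not functions from this repository.

```python
import torch

# Naive decoding: at step T the decoder re-embeds and re-projects tokens
# 1..T, so the total work over N steps grows as O(N^2).
def naive_decode(decoder_full, memory, tokens, n_steps):
    for _ in range(n_steps):
        logits = decoder_full(tokens, memory)    # recomputes Q/K/V for 1..T
        tokens = torch.cat([tokens, logits[:, -1:].argmax(-1)], dim=1)
    return tokens

# Incremental decoding: only the newest token is embedded and projected to a
# query; K/V of earlier steps come from the cache (see the sketches above).
def incremental_decode(decoder_step, memory, tokens, n_steps):
    cache = None
    for _ in range(n_steps):
        logits, cache = decoder_step(tokens[:, -1:], memory, cache)  # 1 new token
        tokens = torch.cat([tokens, logits[:, -1:].argmax(-1)], dim=1)
    return tokens
```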
Line 11: q is 1×d, M^k_b is (T-1)×d, ops: (T-1)·d
Line 12: ops: (T-1)·d
Line 13: ops: (T-1)·d + (T-1)·d
Line 14: ops: d·d + (T-1)·d + (T-1)·d
I roughly list the computational operations of each line above. Please check if it is right for your implementation.
I implemented the decoder part of MASTER-pytorch in FasterTransformer; compared to the original PyTorch it is much faster. Example code. It should be easy to add it to TableMASTER with minor changes.
That's a great job! I will try it in TableMASTER.
mark
How exactly should memory-cached inference and EOS early stopping be implemented to speed up inference?
Essentially, the memory cache just keeps the past K/V so they can be reused when computing attention for later queries. I see Hugging Face has also recently supported this; you can take a look. When I came up with this for the MASTER paper I thought of it as a memory cache, but what we actually did is the same thing as today's KV cache. See: https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.use_cache
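For reference, the Hugging Face flag mentioned in the link can be toggled like this; the `gpt2` checkpoint below is just a stand-in model to demonstrate the flag and has nothing to do with MASTER.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# use_cache=True enables the key/value cache during generation, which is the
# same mechanism as the "memory-cached inference" discussed above.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("A table structure decoder", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50, use_cache=True)
print(tokenizer.decode(output_ids[0]))
```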