parseq
Could you please provide a detailed explanation of the PARSeq architecture?
I have reviewed your documentation, but I am unable to understand how the prediction is made. Could you explain the entire procedure step by step?
Kindly refer to our ECCV presentation: https://drive.google.com/file/d/11VoZW4QC5tbMwVIjKB44447uTiuCJAAD/view
It contains a brief step-by-step walkthrough of the decoding process.
Thank you. I understand that you are using an ensemble of autoregressive (AR) models to recognize scene text in real time from video. However, I am confused about how the image is encoded and decoded, and how the model is trained.
To understand that process: how is the input image split into features, and how does the processing flow proceed?
I would like to know the process flow for the attached image.
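For orientation, here is a minimal, hypothetical sketch of the usual ViT-style encoding flow (module names, image size, and dimensions below are illustrative assumptions, not the actual PARSeq code): the image is cut into fixed-size patches, each patch is projected to an embedding, and self-attention layers produce one feature vector per patch, which the decoder later attends to.

```python
# Illustrative sketch only (assumed sizes; not the actual PARSeq implementation).
import torch
import torch.nn as nn

class TinyPatchEncoder(nn.Module):
    """Split the image into patches, embed them, and contextualize them with self-attention."""
    def __init__(self, img_size=(32, 128), patch=8, dim=192):
        super().__init__()
        # Patchify + linear projection in one strided convolution.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size[0] // patch) * (img_size[1] // patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(dim, nhead=3, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img):                              # img: (B, 3, 32, 128)
        x = self.proj(img).flatten(2).transpose(1, 2)    # -> (B, n_patches, dim)
        return self.encoder(x + self.pos)                # contextualized patch features

feats = TinyPatchEncoder()(torch.randn(2, 3, 32, 128))
print(feats.shape)  # torch.Size([2, 64, 192]): one feature vector per image patch
```

The decoder then cross-attends to these patch features and predicts the text one character at a time, as walked through in the presentation.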
Thank you for your presentation video. Now I have a better understanding of the architecture.
Let me explain; correct me if I am mistaken.
Using the encoder and decoder, the text is decoded from the image. During training, the predicted output text is compared with the ground-truth input text, and the model is trained from that comparison.
Then, when I provide a new image, the output text will be generated. Am I correct?
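That description matches the usual encoder-decoder training recipe. Below is a hedged sketch of both phases, assuming a generic Transformer decoder on top of patch features like those in the earlier snippet (names and sizes are illustrative, not the PARSeq code): during training the decoder is fed the ground-truth label (teacher forcing) and a cross-entropy loss compares its predictions to that label; at inference only a new image is provided, and characters are generated one by one.

```python
# Illustrative sketch only (assumed charset size and dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim, max_len = 100, 192, 25
embed   = nn.Embedding(vocab_size, dim)
decoder = nn.TransformerDecoderLayer(dim, nhead=3, batch_first=True)
head    = nn.Linear(dim, vocab_size)

def training_step(patch_feats, label_ids):
    """Teacher forcing: feed the ground-truth label, compare predictions against it."""
    tgt_in = embed(label_ids[:, :-1])                    # shifted-right label tokens
    L = tgt_in.size(1)
    causal = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)
    logits = head(decoder(tgt_in, patch_feats, tgt_mask=causal))
    return F.cross_entropy(logits.reshape(-1, vocab_size),
                           label_ids[:, 1:].reshape(-1)) # loss drives the weight updates

@torch.no_grad()
def greedy_decode(patch_feats, bos_id=1):
    """Inference: no input text, only image features; predict one character at a time."""
    ids = torch.full((patch_feats.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = head(decoder(embed(ids), patch_feats))[:, -1]   # next-character logits
        ids = torch.cat([ids, logits.argmax(-1, keepdim=True)], dim=1)
    return ids  # cut at the end-of-sequence token to obtain the final string
```

The sketch mirrors the description above: during training the predicted text is compared to the ground-truth label, and at inference only a new image is provided and the text is generated. PARSeq itself additionally trains with permuted AR attention masks and can refine its output iteratively, as described in the presentation.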
There is a run-time error in your Hugging Face Space.