pixel
Decoder only text reconstruction
In your work, it seems that the decoder is used only to pre-train the model and is then discarded for downstream tasks, where only the encoder is used.
Is there a way for the decoder to predict text and recover the original unicode representation? Or is an OCR model (perhaps trained jointly) required?
Currently, it is not possible to go from reconstructed image patches back to unicode text, so this is an open problem (we mention it briefly in the limitations section of our paper). An OCR decoder could certainly work, and we plan to investigate possible solutions in the future.
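To make the OCR-decoder idea concrete, here is a toy sketch of the final decoding step such a head would need: if an OCR head emitted per-patch character scores, a greedy CTC-style decode (collapse repeats, drop blanks) could map them back to a unicode string. All names and the vocabulary here are hypothetical illustrations, not part of the released model.

```python
# Hypothetical sketch: greedy CTC-style decoding of per-patch character
# scores into text. PIXEL does not ship such an OCR head; this only
# illustrates how patch-level predictions could become a unicode string.

BLANK = 0  # index of the CTC blank symbol in the vocabulary


def greedy_ctc_decode(scores, vocab):
    """Take the argmax symbol per step, collapse consecutive repeats,
    and drop blanks (standard greedy CTC decoding)."""
    best = [max(range(len(step)), key=step.__getitem__) for step in scores]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)


# vocab[0] is the blank; five steps spelling "cat" with one repeat
# and one blank in the middle.
vocab = ["-", "c", "a", "t"]
scores = [
    [0.10, 0.80, 0.05, 0.05],  # 'c'
    [0.10, 0.70, 0.10, 0.10],  # 'c' again -> collapsed
    [0.10, 0.10, 0.70, 0.10],  # 'a'
    [0.90, 0.03, 0.03, 0.04],  # blank -> dropped
    [0.10, 0.10, 0.10, 0.70],  # 't'
]
print(greedy_ctc_decode(scores, vocab))  # -> cat
```

Training such a head jointly (e.g. with a CTC loss against the known source text) is one plausible route, since the ground-truth text is always available during pretraining.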