mPLUG-DocOwl
Unique Aspects of Vision Token Merger in TinyChart
Nice work! The Vision Token Merger in TinyChart resembles methods explored in works like TextHawk and DocKylin. In those works, however, the merging is typically performed after the ViT (Vision Transformer), whereas this paper performs it inside the ViT. I’m curious: does merging inside the ViT offer any particular advantages in terms of performance or interpretability?
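To make the distinction concrete, here is a rough sketch of how I understand the two placements. It assumes a simple schedule that removes a fixed number of tokens `r` after every ViT layer; the function names and the numbers (1024 tokens, 24 layers, `r = 20`) are made up for illustration and are not taken from TinyChart, TextHawk, or DocKylin.

```python
def tokens_per_layer(num_tokens, num_layers, r, merge_inside):
    """Number of tokens each ViT layer processes (toy model, hypothetical schedule)."""
    counts = []
    n = num_tokens
    for _ in range(num_layers):
        counts.append(n)
        if merge_inside:
            # Assumption: r tokens are merged away after every layer.
            n = max(n - r, 1)
    return counts


def attention_cost(counts):
    """Rough proxy for self-attention cost: sum of n^2 over all layers."""
    return sum(n * n for n in counts)


def output_tokens(num_tokens, num_layers, r):
    """Tokens handed to the LLM, assuming both strategies merge down to the same count."""
    return max(num_tokens - num_layers * r, 1)


# Hypothetical settings, chosen only for illustration.
N, L, R = 1024, 24, 20

inside = tokens_per_layer(N, L, R, merge_inside=True)   # shrinks layer by layer
after = tokens_per_layer(N, L, R, merge_inside=False)   # full length in every layer

print("tokens sent to the LLM (both cases):", output_tokens(N, L, R))
print("ViT attention cost, merge inside:", attention_cost(inside))
print("ViT attention cost, merge after: ", attention_cost(after))
```

In this toy model both placements hand the same number of tokens to the LLM, but only the in-ViT variant shortens the sequence that the later ViT layers process. So I would expect at least an efficiency difference inside the encoder; whether there is also an accuracy or interpretability benefit is what I am asking about.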