Unique Aspects of Vision Token Merger in TinyChart

Open EchoDreamer opened this issue 1 year ago • 0 comments

Nice work! The Vision Token Merger used in TinyChart resembles token-merging approaches explored in works like TextHawk and DocKylin. In those works, however, the merging is typically performed after the ViT (Vision Transformer), whereas this paper performs the vision token merging inside the ViT. I’m curious: does merging inside the ViT offer any particular advantages in terms of performance or interpretability, compared with merging after it?
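To make the question concrete, here is a rough toy sketch of the two placements as I understand them. It is my own illustration, not TinyChart's code: `merge_tokens` and `token_counts` are hypothetical helpers, the merging is a simple similarity-based pairing that I picked for the example, the "ViT layers" are identity placeholders, and merging a fixed `r` tokens after every layer is an assumed schedule. It only tracks how many tokens each layer would have to process.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)


def merge_tokens(tokens, r):
    """Hypothetical merger: fold the r most similar (A, B) pairs together.

    Tokens are split into two alternating sets A and B. Each A token picks
    its most similar B token, and the r best pairs are averaged into B.
    Returns len(tokens) - r tokens (token order is not preserved).
    """
    a, b = tokens[0::2], tokens[1::2]
    r = min(r, len(a))
    if r <= 0 or not b:
        return list(tokens)
    best = []
    for i, t in enumerate(a):
        sims = [cosine(t, u) for u in b]
        j = max(range(len(b)), key=sims.__getitem__)
        best.append((sims[j], i, j))
    groups = {j: [b[j]] for j in range(len(b))}
    absorbed = set()
    for _, i, j in sorted(best, reverse=True)[:r]:
        groups[j].append(a[i])
        absorbed.add(i)
    kept_a = [t for i, t in enumerate(a) if i not in absorbed]
    merged_b = [[sum(col) / len(g) for col in zip(*g)] for g in groups.values()]
    return kept_a + merged_b


def token_counts(num_tokens, num_layers, r, inside_vit):
    """Return (tokens seen by each layer, tokens handed to the LLM)."""
    rng = random.Random(0)
    tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(num_tokens)]
    per_layer = []
    for _ in range(num_layers):
        per_layer.append(len(tokens))  # identity placeholder for a ViT layer
        if inside_vit:
            tokens = merge_tokens(tokens, r)  # merge between layers
    if not inside_vit:
        for _ in range(num_layers):  # same total reduction, applied after the ViT
            tokens = merge_tokens(tokens, r)
    return per_layer, len(tokens)


inside = token_counts(64, 4, 8, inside_vit=True)
after = token_counts(64, 4, 8, inside_vit=False)
print("inside ViT:", inside)  # ([64, 56, 48, 40], 32)
print("after ViT: ", after)   # ([64, 64, 64, 64], 32)
```

In this toy setup both variants hand the same number of tokens to the LLM, but only the inside-ViT variant shrinks the sequence that the later ViT layers process. So I can see the compute argument; what I am asking is whether there are benefits beyond that.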

Nov 25 '24 10:11 EchoDreamer