
Add a ViT Encoder to TorchTitan

fduwjj opened this issue 5 months ago • 6 comments

This is the first step toward including more models in torchtitan to demonstrate the composability of pretraining. With Llama 3.2 now out and already available in torchtune, we want to bring multimodal models to torchtitan as well.

After a deep discussion with the team, we believe the goal of torchtitan is to demonstrate distributed training paradigms across different model architectures, and each architecture is quite different. So it makes sense to take the HuggingFace approach, where each model owns its own definition and code: no inheritance and no shared common modules. We will create a new folder named "llama_multimodal", and this PR adds the vision encoder first.
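For a sense of what a self-contained vision encoder module in such a folder could look like, here is a minimal ViT-style sketch. It is only illustrative: the class name, hyperparameters, and layout are assumptions, not the actual torchtitan llama_multimodal code.

```python
# Hypothetical sketch of a standalone ViT-style vision encoder, for illustration
# only; names, hyperparameters, and layout are assumptions and not the actual
# torchtitan llama_multimodal implementation.
import torch
import torch.nn as nn


class VisionEncoder(nn.Module):
    """Minimal ViT encoder: patchify an image, add positional embeddings,
    and run the patch tokens through a stack of transformer layers."""

    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12, n_heads=12):
        super().__init__()
        n_patches = (image_size // patch_size) ** 2
        # Non-overlapping patch embedding implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> patch tokens: (batch, n_patches, dim)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)
        x = x + self.pos_embed
        x = self.blocks(x)
        return self.norm(x)  # per-patch features for the language model to consume


if __name__ == "__main__":
    encoder = VisionEncoder()
    out = encoder(torch.randn(2, 3, 224, 224))
    print(out.shape)  # torch.Size([2, 196, 768])
```

Keeping a module like this entirely inside the model's own folder is the point of the "no shared modules" decision above: parallelism strategies can then be demonstrated per model without cross-model coupling.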

fduwjj · Sep 27 '24 00:09