llama-2-jax
Training From Scratch -- RedPajama
I guess this is more of a general question.
I want to train a 7B LLaMA 2 on some data from scratch. What datasets would you recommend, and what's the best way to tokenize them? Would it be possible to add support for RedPajama or OpenWebText, for example? I have access to a v4-32, so I want to try some from-scratch training but I'm not sure where to begin. Any guidance would be appreciated.
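One preprocessing step that comes up regardless of which dataset you pick (RedPajama, OpenWebText, etc.) is packing tokenized documents into fixed-length training sequences. Below is a minimal sketch of that step, assuming your documents are already tokenized to ID lists (e.g. with the LLaMA tokenizer, whose context length is 4096; the tiny `seq_len=4` and the token IDs here are just for illustration):

```python
import numpy as np

def pack_sequences(token_lists, seq_len, eos_id):
    """Concatenate tokenized docs (each followed by EOS) into one stream,
    then chunk into fixed-length rows, dropping the ragged remainder."""
    stream = []
    for toks in token_lists:
        stream.extend(toks)
        stream.append(eos_id)
    n = (len(stream) // seq_len) * seq_len  # usable length
    return np.asarray(stream[:n], dtype=np.int32).reshape(-1, seq_len)

# Hypothetical pre-tokenized documents; 2 stands in for the EOS ID.
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
batches = pack_sequences(docs, seq_len=4, eos_id=2)
# each row of `batches` is one training example of length 4
```

Packing like this (rather than padding each document) keeps every position in the batch contributing to the loss, which matters for pretraining throughput on a TPU pod slice.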