
Training From Scratch -- RedPajama

Open opooladz opened this issue 11 months ago • 0 comments

I guess more of a general question.

I want to train a 7B LLaMA 2 from scratch on some data. What datasets would you recommend, and what's the best way to tokenize them? Would it be possible to add support for RedPajama or OpenWebText, for example? I have access to a v4-32, so I'd like to try some from-scratch work, but I'm not sure where to begin. Any guidance would be appreciated.

opooladz avatar Mar 12 '24 05:03 opooladz
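For context on the tokenization question: a common preprocessing step when pretraining from scratch is to tokenize each document, join the documents with an EOS separator, and slice the resulting token stream into fixed-length training sequences. The sketch below illustrates that packing step in plain Python; `pack_sequences` and `eos_id` are illustrative names, not part of this repository, and it assumes the documents have already been turned into token-ID lists by whatever tokenizer you choose (e.g. the LLaMA 2 SentencePiece tokenizer).

```python
def pack_sequences(docs, seq_len, eos_id):
    """Concatenate tokenized documents into one stream, separated by an
    EOS token, then split the stream into fixed-length sequences.

    docs:    iterable of lists of token IDs (one list per document)
    seq_len: length of each training sequence
    eos_id:  token ID used to mark document boundaries

    A trailing remainder shorter than seq_len is dropped, so every
    returned sequence has exactly seq_len tokens.
    """
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary
    return [
        stream[i:i + seq_len]
        for i in range(0, len(stream) - seq_len + 1, seq_len)
    ]


# Toy example: two short "documents", packed into length-4 sequences.
docs = [[1, 2, 3], [4, 5]]
batches = pack_sequences(docs, seq_len=4, eos_id=0)
# The stream is [1, 2, 3, 0, 4, 5, 0]; one full sequence fits,
# and the remainder [4, 5, 0] is dropped.
print(batches)  # [[1, 2, 3, 0]]
```

For the datasets themselves, RedPajama and OpenWebText are both hosted on the Hugging Face Hub and can be loaded (or streamed, which matters at RedPajama's scale) via the `datasets` library before applying a packing step like the one above.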