WenJett issues

Results: 2 issues by WenJett

Generation of own dataset with Dolma Tokenizer CLI

Hi, I appreciate the work done so far. With the new release of OLMo 2, the tokenizer used seems to be **allenai_dolma2.json**, but in **prepare_memmap_dataset.py** the tokenizer is **allenai/eleuther-ai-gpt-neox-20b-pii-special**. Understand that...

Tokenizer to be used for generation of data to .npy files

### ❓ The question

Hi, I was unable to reopen the previous issue (https://github.com/allenai/OLMo/issues/790), so I am creating a new issue and copying my response below.

Hi Aman, Thanks for the guidance,...

type/question