2 comments by Cy
This would be very useful :heart: Hope it can be merged soon!
Currently, I'm using this script that I wrote:

```Python
import datasets
from datasets import Dataset
from transformers import AutoTokenizer

num_sequence_wanted = 20000
max_seq_len = 4096
portions = {
    "dclm": 0.472,
    ...
```
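Since the snippet above is cut off, here is a minimal sketch of one step a script like this would plausibly need: turning mixture proportions into a per-source sequence budget. It is not the original script. The helper name `allocate_counts` and the `other_source` entry are assumptions made for illustration; only `num_sequence_wanted = 20000` and `"dclm": 0.472` come from the comment.

```python
def allocate_counts(portions, total):
    """Split `total` sequences across sources in proportion to `portions`.

    Uses largest-remainder rounding so the counts sum exactly to `total`.
    """
    weight_sum = sum(portions.values())
    # Ideal (fractional) share of each source.
    exact = {name: total * w / weight_sum for name, w in portions.items()}
    # Round every share down first.
    counts = {name: int(x) for name, x in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining sequences to the sources with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


num_sequence_wanted = 20000
# Only "dclm": 0.472 appears in the comment; "other_source" is a hypothetical placeholder.
portions = {"dclm": 0.472, "other_source": 0.528}
counts = allocate_counts(portions, num_sequence_wanted)
print(counts)
```

Normalizing by the sum of the weights means the proportions do not have to add up to exactly 1, and the remainder step avoids losing a few sequences to rounding.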