Cy

2 comments by Cy

This would be very useful :heart: Hope it can be merged soon!

Currently, I'm using this script that I wrote:

```Python
import datasets
from datasets import Dataset
from transformers import AutoTokenizer

num_sequence_wanted = 20000
max_seq_len = 4096
portions = {
    "dclm": 0.472,...
```