unigram-tokenization topic

count-tokens-hf-datasets

22 stars, 1 fork

This project shows how to count the total number of training tokens in a large text dataset from 🤗 Datasets using Apache Beam and Dataflow.
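As a rough illustration of the idea, and not the repository's actual code, the sketch below counts tokens per document and sums the counts. It makes several assumptions: `count_tokens` uses a whitespace split as a stand-in for a trained unigram tokenizer, the function names are hypothetical, and the Beam pipeline runs on the default local runner rather than Dataflow. The `apache_beam` import is deferred so the plain-Python path works without Beam installed.

```python
from typing import Iterable, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real unigram tokenizer.
    return len(text.split())


def total_tokens_local(texts: Iterable[str]) -> int:
    # Plain-Python reference: sum of per-document token counts.
    return sum(count_tokens(t) for t in texts)


def total_tokens_beam(texts: List[str]) -> None:
    # Same computation expressed as a Beam pipeline (hypothetical helper).
    # Imported lazily so the module loads without apache_beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:  # default local runner
        (
            pipeline
            | "CreateDocs" >> beam.Create(texts)
            | "CountPerDoc" >> beam.Map(count_tokens)
            | "SumCounts" >> beam.CombineGlobally(sum)
            | "PrintTotal" >> beam.Map(print)
        )


if __name__ == "__main__":
    docs = ["the quick brown fox", "jumps over", "the lazy dog"]
    print(total_tokens_local(docs))
```

On a real dataset, the in-memory list would be replaced by a source that reads dataset shards, and the runner would be set through pipeline options. Counting per document before combining keeps each worker's output small, so the final sum is cheap regardless of dataset size.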