TokenDataset loading from cache
Hi! I have a very simple question: I made an encoded dataset and uploaded it to the "Training from scratch" Colab notebook. How can I train the tokenizer on it? If I just run the "Training the Tokenizer" step with this file, I get the error "AssertionError: files must be a string or a list." But in the Finetuning Colab it works with the file and loads it from cache.
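For reference, a minimal sketch of what the two notebook steps expect (the import paths are my assumption from the aitextgen package layout, and `input_text.txt` is just a placeholder for a raw text file):

```python
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer

# "Training the Tokenizer" wants a path (or list of paths) to raw text;
# anything else trips the assertion "files must be a string or a list."
train_tokenizer("input_text.txt")

# The Finetuning notebook instead loads the compressed, already-encoded cache:
data = TokenDataset("dataset_cache.tar.gz", from_cache=True)
```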
This may be an issue with the new Tokenizers, although that should not be installed with the released aitextgen. Can you include the full stack trace?
I just edit the aitextgen Colab:

- `copy_file_from_gdrive("dataset_cache.tar.gz")`
- `file_name = TokenDataset("dataset_cache.tar.gz", from_cache=True)`
- `train_tokenizer(file_name)`
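Put together as a runnable cell, that sequence looks roughly like the sketch below (the `aitextgen.colab` import path is my assumption). Note that the AssertionError from the original question comes from `train_tokenizer` asserting that its argument is a string or a list of file paths:

```python
from aitextgen.colab import copy_file_from_gdrive  # assumed import path
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer

# Pull the compressed cache into the Colab runtime from Google Drive.
copy_file_from_gdrive("dataset_cache.tar.gz")

# Load the already-encoded dataset from its cache archive.
file_name = TokenDataset("dataset_cache.tar.gz", from_cache=True)

# train_tokenizer checks that `files` is a string or a list, so passing
# the TokenDataset object here is what raises the AssertionError.
train_tokenizer(file_name)
```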