aitextgen
TokenDataset loading from cache
Hi! I have a very simple question: I made an encoded dataset and uploaded it to the Colab notebook "Training from scratch". How can I train the tokenizer on it? If I just run the "Training the Tokenizer" step with this file, I get the error "AssertionError: files must be a string or a list." But in the finetuning Colab it works with the file and loads it from the cache.
This may be an issue with the new Tokenizers, although that should not be installed with the released aitextgen. Can you include the full stack trace?
I just edited the aitextgen Colab like this:
- copy_file_from_gdrive("dataset_cache.tar.gz")
- file_name = TokenDataset("dataset_cache.tar.gz", from_cache=True)
- train_tokenizer(file_name)
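
For reference, a minimal sketch of a possible workaround, assuming the original plain-text file (called `input.txt` below, a hypothetical name) is still available: `train_tokenizer()` expects a string path or a list of paths to raw text files, which is why passing a `TokenDataset` object raises the AssertionError. The cached dataset can still be reloaded separately for model training.

```python
# Sketch only; "input.txt" is an assumed filename for the original raw text.
from aitextgen.colab import copy_file_from_gdrive
from aitextgen.tokenizers import train_tokenizer
from aitextgen.TokenDataset import TokenDataset

# Train the tokenizer on the raw text file (a path string, not a TokenDataset).
copy_file_from_gdrive("input.txt")
train_tokenizer("input.txt")

# Reload the encoded dataset from its cache for the training step.
copy_file_from_gdrive("dataset_cache.tar.gz")
data = TokenDataset("dataset_cache.tar.gz", from_cache=True)
```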
