TokenDataset loading from cache
Hi! I have a very simple question: I made an encoded dataset and uploaded it to the "Training from scratch" Colab notebook. How can I train the tokenizer on it? If I just run the "Training the Tokenizer" step with this file, I get the error "AssertionError: files must be a string or a list." But in the Finetuning Colab it works with the file and loads it from cache.
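For reference, a minimal sketch of what the two notebook steps expect (the import paths are my assumption from the aitextgen package layout, and `input_text.txt` is just a placeholder for a raw text file):

```python
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer

# "Training the Tokenizer" wants a path (or list of paths) to raw text;
# anything else trips the assertion "files must be a string or a list."
train_tokenizer("input_text.txt")

# The Finetuning notebook instead loads the compressed, already-encoded cache:
data = TokenDataset("dataset_cache.tar.gz", from_cache=True)
```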
This may be an issue with the new Tokenizers, although that should not be installed with the released aitextgen. Can you include the full stack trace?
I just edit the aitextgen Colab:

- `copy_file_from_gdrive("dataset_cache.tar.gz")`
- `file_name = TokenDataset("dataset_cache.tar.gz", from_cache=True)`
- `train_tokenizer(file_name)`
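Put together as a runnable cell, that sequence looks roughly like the sketch below (the `aitextgen.colab` import path is my assumption). Note that the AssertionError from the original question comes from `train_tokenizer` asserting that its argument is a string or a list of file paths:

```python
from aitextgen.colab import copy_file_from_gdrive  # assumed import path
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer

# Pull the compressed cache into the Colab runtime from Google Drive.
copy_file_from_gdrive("dataset_cache.tar.gz")

# Load the already-encoded dataset from its cache archive.
file_name = TokenDataset("dataset_cache.tar.gz", from_cache=True)

# train_tokenizer checks that `files` is a string or a list, so passing
# the TokenDataset object here is what raises the AssertionError.
train_tokenizer(file_name)
```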