aitextgen

TokenDataset loading from cache

Open olegchomp opened this issue 4 years ago • 2 comments

Hi! I have a very simple question. I made an encoded dataset and uploaded it to the Colab notebook "Training from scratch". How can I train the tokenizer on it? If I just run the "Training the Tokenizer" step with this file, I get the error "AssertionError: files must be a string or a list." But in the finetuning Colab it works with the file and loads it from cache.

olegchomp, Jul 12 '20 23:07

This may be an issue with the new Tokenizers, although that should not be installed with the released aitextgen. Can you include the full stack trace?

minimaxir, Jul 14 '20 02:07

Just edit the aitextgen Colab like this:

  1. copy_file_from_gdrive("dataset_cache.tar.gz")
  2. file_name = TokenDataset("dataset_cache.tar.gz", from_cache=True)
  3. train_tokenizer(file_name)

olegchomp, Jul 14 '20 10:07
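
One possible reading of the steps above: step 3 passes the TokenDataset object itself to train_tokenizer, while the error message says the argument must be a string or a list. The sketch below does not call aitextgen at all. FakeTokenDataset and train_tokenizer_stub are hypothetical stand-ins that only mirror the type check implied by the error message, and "raw_text.txt" is a made-up file name. It is meant to show why an object is rejected while a path string is accepted, not to reproduce the library's actual code.

```python
from typing import List, Union


class FakeTokenDataset:
    """Hypothetical stand-in for TokenDataset: an object, not a file path."""

    def __init__(self, file_path: str, from_cache: bool = False):
        self.file_path = file_path
        self.from_cache = from_cache


def train_tokenizer_stub(files: Union[str, List[str]]) -> List[str]:
    """Hypothetical stand-in: only mirrors the check named in the error message."""
    assert isinstance(files, (str, list)), "files must be a string or a list."
    # Normalize a single path to a list of paths.
    return [files] if isinstance(files, str) else files


# Steps 2 and 3 from the comment: build a dataset object, then pass it along.
dataset = FakeTokenDataset("dataset_cache.tar.gz", from_cache=True)

try:
    train_tokenizer_stub(dataset)
    error_message = None
except AssertionError as exc:
    error_message = str(exc)

print(error_message)

# Passing a path string (hypothetical raw text file) satisfies the check.
accepted = train_tokenizer_stub("raw_text.txt")
print(accepted)
```

If this reading is right, the cached file is already encoded, so the tokenizer would presumably need to be trained on the original raw text file(s), passed as a path string or a list of paths. That is an assumption based on the error message, not something confirmed in this thread.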