merge_datasets() doesn't work post-numpy migration

Open minimaxir opened this issue 5 years ago • 2 comments

From #6:

/usr/local/lib/python3.6/dist-packages/aitextgen/TokenDataset.py in __init__(self, file_path, vocab_file, merges_file, texts, line_by_line, from_cache, header, save_cache, cache_destination, compress, block_size, tokenized_texts, text_delim, bos_token, eos_token, unk_token, pad_token, progress_bar_refresh_rate, **kwargs)
     75         if tokenized_texts:
     76             self.tokens = tokenized_texts
---> 77             self.num_subsets = self.tokens.shape[0] - block_size
     78             self.block_size = block_size
     79             self.file_path = "merged TokenDataset"

AttributeError: 'list' object has no attribute 'shape'

minimaxir avatar Jul 14 '20 01:07 minimaxir

Is there an easy fix for this? I'm trying to train aitextgen on a large Spanish corpus (20 GB), and I'm checking whether I can prepare multiple small TokenDatasets and then merge them to avoid OOM issues.

mathigatti avatar Sep 07 '20 00:09 mathigatti

Changing self.tokens = tokenized_texts to self.tokens = np.asarray(tokenized_texts) seems to allow token merging. I'd submit a PR, but I'm not very confident that such a simple change is the correct fix.
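
For reference, a minimal standalone sketch of why the conversion helps. It does not use aitextgen itself: count_subsets is a hypothetical helper that only mirrors lines 76-77 of the traceback, and it assumes the merged tokens arrive as a flat list of integer IDs.

```python
import numpy as np


def count_subsets(tokenized_texts, block_size):
    """Hypothetical helper mirroring lines 76-77 of the traceback above."""
    # np.asarray turns a plain list into an ndarray, and returns an
    # existing ndarray unchanged instead of copying it.
    tokens = np.asarray(tokenized_texts)
    # .shape exists on ndarrays but not on lists, so this no longer raises.
    return tokens, tokens.shape[0] - block_size


# Assumed input: merged tokens as a flat Python list of token IDs.
merged = [10, 11, 12, 13, 14, 15, 16]
assert not hasattr(merged, "shape")  # the cause of the AttributeError

tokens, num_subsets = count_subsets(merged, block_size=4)
print(type(tokens).__name__, num_subsets)  # ndarray 3
```

One caveat if the real fix goes this way: once tokenized_texts can be an ndarray, a check like "if tokenized_texts:" raises a ValueError for arrays with more than one element, so an explicit "is not None" test may be safer there.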

BPCZ avatar Oct 20 '20 02:10 BPCZ