aitextgen
merge_datasets() doesn't work post-numpy migration
From #6:
/usr/local/lib/python3.6/dist-packages/aitextgen/TokenDataset.py in __init__(self, file_path, vocab_file, merges_file, texts, line_by_line, from_cache, header, save_cache, cache_destination, compress, block_size, tokenized_texts, text_delim, bos_token, eos_token, unk_token, pad_token, progress_bar_refresh_rate, **kwargs)
     75         if tokenized_texts:
     76             self.tokens = tokenized_texts
---> 77             self.num_subsets = self.tokens.shape[0] - block_size
     78             self.block_size = block_size
     79             self.file_path = "merged TokenDataset"

AttributeError: 'list' object has no attribute 'shape'
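For context, the failure is simply that a plain Python list has no .shape attribute, while a NumPy array does. A minimal reproduction (toy token IDs, not real aitextgen output):

```python
import numpy as np

tokenized_texts = [101, 7592, 2088, 102]  # toy token IDs for illustration
block_size = 2

# This is what line 77 effectively does when tokenized_texts is a list:
try:
    tokenized_texts.shape[0] - block_size
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'shape'

# Wrapping the list in an array restores the expected behavior:
tokens = np.asarray(tokenized_texts)
print(tokens.shape[0] - block_size)  # 2
```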
Is there an easy fix for this? I'm trying to train aitextgen on a large Spanish corpus (~20 GB), and I'm checking whether it's feasible to prepare multiple small TokenDatasets and then merge them, to avoid OOM issues.
Changing self.tokens = tokenized_texts to self.tokens = np.asarray(tokenized_texts) seems to allow token merging. I'd submit a PR, but I'm not confident that such a simple change is the correct fix.
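To show what I mean, here is a heavily simplified sketch of just the branch visible in the traceback, with the proposed one-line change applied (the class body and parameters are reduced to what appears above; this is my proposal, not merged code):

```python
import numpy as np

class TokenDataset:
    """Sketch of only the traceback's branch of TokenDataset.__init__,
    with the proposed np.asarray fix applied."""

    def __init__(self, tokenized_texts=None, block_size=64):
        if tokenized_texts:
            # was: self.tokens = tokenized_texts  (a plain list has no .shape)
            self.tokens = np.asarray(tokenized_texts)
            self.num_subsets = self.tokens.shape[0] - block_size
            self.block_size = block_size
            self.file_path = "merged TokenDataset"

# With a list of 100 token IDs and block_size=64:
ds = TokenDataset(tokenized_texts=list(range(100)), block_size=64)
print(ds.num_subsets)  # 36
```

np.asarray is cheap here since it copies only when the input isn't already an ndarray, so it shouldn't penalize the existing array code path.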