aitextgen
merge_datasets() doesn't work post-numpy migration
From #6:
/usr/local/lib/python3.6/dist-packages/aitextgen/TokenDataset.py in __init__(self, file_path, vocab_file, merges_file, texts, line_by_line, from_cache, header, save_cache, cache_destination, compress, block_size, tokenized_texts, text_delim, bos_token, eos_token, unk_token, pad_token, progress_bar_refresh_rate, **kwargs)
     75         if tokenized_texts:
     76             self.tokens = tokenized_texts
---> 77             self.num_subsets = self.tokens.shape[0] - block_size
     78             self.block_size = block_size
     79             self.file_path = "merged TokenDataset"

AttributeError: 'list' object has no attribute 'shape'
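For context, the failure is simply that a plain Python list has no .shape attribute, while a NumPy array does. A minimal reproduction (toy token IDs, not real aitextgen output):

```python
import numpy as np

tokenized_texts = [101, 7592, 2088, 102]  # toy token IDs for illustration
block_size = 2

# This is what line 77 effectively does when tokenized_texts is a list:
try:
    tokenized_texts.shape[0] - block_size
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'shape'

# Wrapping the list in an array restores the expected behavior:
tokens = np.asarray(tokenized_texts)
print(tokens.shape[0] - block_size)  # 2
```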
Is there an easy fix for this? I'm trying to train aitextgen on a large Spanish corpus (~20 GB), and I'm checking whether it's feasible to prepare multiple small TokenDatasets and then merge them, to avoid OOM issues.
Changing self.tokens = tokenized_texts to self.tokens = np.asarray(tokenized_texts) seems to allow token merging. I'd submit a PR, but I'm not confident that such a simple change is the correct fix.
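To show what I mean, here is a heavily simplified sketch of just the branch visible in the traceback, with the proposed one-line change applied (the class body and parameters are reduced to what appears above; this is my proposal, not merged code):

```python
import numpy as np

class TokenDataset:
    """Sketch of only the traceback's branch of TokenDataset.__init__,
    with the proposed np.asarray fix applied."""

    def __init__(self, tokenized_texts=None, block_size=64):
        if tokenized_texts:
            # was: self.tokens = tokenized_texts  (a plain list has no .shape)
            self.tokens = np.asarray(tokenized_texts)
            self.num_subsets = self.tokens.shape[0] - block_size
            self.block_size = block_size
            self.file_path = "merged TokenDataset"

# With a list of 100 token IDs and block_size=64:
ds = TokenDataset(tokenized_texts=list(range(100)), block_size=64)
print(ds.num_subsets)  # 36
```

np.asarray is cheap here since it copies only when the input isn't already an ndarray, so it shouldn't penalize the existing array code path.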