fast-bert
BertLMDataBunch.from_raw_corpus : `ValueError: num_samples should be a positive integer value, but got num_samples=0`
I am trying to fine-tune a model, but I am encountering a ValueError when creating the DataBunch from the raw corpus.
With the following synthetic data:
text_list = [
    'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.',
    'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in',
    'reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.',
]
databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=text_list,
    tokenizer='bert-base-uncased',
    batch_size_per_gpu=16,
    max_seq_length=128,
    multi_gpu=True,
    model_type='bert',
    logger=logger)
I get the following ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<timed exec> in <module>
~/envs/my_env/lib/python3.7/site-packages/fast_bert/data_lm.py in from_raw_corpus(data_dir, text_list, tokenizer, batch_size_per_gpu, max_seq_length, multi_gpu, test_size, model_type, logger, clear_cache, no_cache)
198 logger=logger,
199 clear_cache=clear_cache,
--> 200 no_cache=no_cache,
201 )
202
~/envs/my_env/lib/python3.7/site-packages/fast_bert/data_lm.py in __init__(self, data_dir, tokenizer, train_file, val_file, batch_size_per_gpu, max_seq_length, multi_gpu, model_type, logger, clear_cache, no_cache)
275 self.train_batch_size = self.batch_size_per_gpu * max(1, self.n_gpu)
276
--> 277 train_sampler = RandomSampler(train_dataset)
278 self.train_dl = DataLoader(
279 train_dataset, sampler=train_sampler, batch_size=self.train_batch_size
~/envs/my_env/lib/python3.7/site-packages/torch/utils/data/sampler.py in __init__(self, data_source, replacement, num_samples)
92 if not isinstance(self.num_samples, int) or self.num_samples <= 0:
93 raise ValueError("num_samples should be a positive integer "
---> 94 "value, but got num_samples={}".format(self.num_samples))
95
96 @property
ValueError: num_samples should be a positive integer value, but got num_samples=0
The intermediate files lm_train.txt and lm_val.txt are created, so I suspect something is going wrong at the level of the tokenizer.
My env has Python 3.7.6 and contains:
pytorch-lamb 1.0.0 pypi_0 pypi
torch 1.4.0 pypi_0 pypi
torchvision 0.5.0 pypi_0 pypi
fast-bert 1.6.2 pypi_0 pypi
tokenizers 0.5.2 pypi_0 pypi
transformers 2.5.1 pypi_0 pypi
Anyway, let me know if you need any further information from my side!
I have the same error. I am trying to run the LM example (language model training from scratch) on a new dataset and hit the same ValueError.
So, in my case the issue was simply that the text was too small.
Basically, the loop at line 137 in fast_bert/data_lm.py
while len(tokenized_text) >= block_size:  # Truncate in blocks of block_size
    self.examples.append(
        tokenizer.build_inputs_with_special_tokens(
            tokenized_text[:block_size]
        )
    )
    tokenized_text = tokenized_text[block_size:]
never executes if len(tokenized_text) is smaller than the given block_size, so self.examples stays empty and the sampler later sees num_samples=0.
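To make this concrete, here is a minimal, self-contained sketch of the same pattern (the names here are illustrative, not the actual fast-bert internals):

block_size = 128
tokenized_text = list(range(40))  # pretend the whole corpus tokenized to only 40 ids

examples = []
while len(tokenized_text) >= block_size:  # loop body never runs for a short corpus
    examples.append(tokenized_text[:block_size])
    tokenized_text = tokenized_text[block_size:]

print(len(examples))  # 0 -> empty dataset -> RandomSampler raises num_samples=0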
Bear in mind the process may also take a really long time, since it runs on a single core. In my case it ended up being 14 hours :D
I used a large raw corpus and got the same error. I tested the same corpus with run_language_modeling.py
from the transformers library and got the same error there. My solution was to set the block size equal to my maximum sentence length, which in my case was 128:
!python run_language_modeling.py \
    --train_data_file=/home/ubuntu/data/sedi1_full.txt \
    --output_dir=./tmp/ \
    --model_type=bert \
    --model_name_or_path=/home/ubuntu/tmp/bertcase_torch/ \
    --mlm \
    --block_size=128 \
    --do_train \
    --eval_all_checkpoints \
    --save_steps=100000
I didn't find a way to set the block_size on the fast-bert language DataBunch, so for now I am going with the transformers solution.
I had the same problem, but I just replaced f.write(text) with f.write(text + '\n') in data_lm.py, and then it worked.
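For clarity, here is a self-contained sketch of what that change amounts to (the real edit lives in fast_bert/data_lm.py where the corpus files are written; the file handling below is illustrative only):

text_list = ['first document', 'second document']

with open('lm_train.txt', 'w') as f:
    for text in text_list:
        f.write(text + '\n')  # was f.write(text); without the newline every
                              # document lands on a single line in lm_train.txt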
@ninasujit2016 what if I change this and still get the same error?
Facing the same error. Any fixes yet?
Clear your cache! This function silently uses the cache if it is available, completely ignoring the data you pass as input. In my case, creating the whole dataset was too slow, so I tried passing just a few lines of text, which created an empty dataset in my cache (because a few lines of text is too small). After that I got this error no matter what data I used, until I cleared the cache.
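For reference, the from_raw_corpus signature in the traceback above shows clear_cache and no_cache arguments, so forcing a rebuild of the cached dataset (reusing the call from the original post) should look roughly like this:

databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=text_list,
    tokenizer='bert-base-uncased',
    batch_size_per_gpu=16,
    max_seq_length=128,
    multi_gpu=True,
    model_type='bert',
    logger=logger,
    clear_cache=True,  # rebuild the cached LM dataset instead of silently reusing it
)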
I strongly recommend activating 'INFO' logging, as follows, so that you can see whether the function uses the cache or not.
import logging

logger = logging.getLogger()  # e.g. the logger you pass to from_raw_corpus
logger.setLevel('INFO')
consoleHandler = logging.StreamHandler()
consoleHandler.setLevel(logging.INFO)
logger.addHandler(consoleHandler)
By the way, I consider this a bug: calling BertLMDataBunch.from_raw_corpus should never silently read from the cache.