fast-bert
What's the meaning of the hyperparameter "text_list=texts"?
Hello! Could you explain what the hyperparameter "text_list=texts" means? For example, section "3. Create DataBunch object" under "Language Model Fine-tuning" in README.md contains this code:
databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=texts,
    tokenizer=args.model_name,
    batch_size_per_gpu=args.train_batch_size,
    max_seq_length=args.max_seq_length,
    multi_gpu=args.multi_gpu,
    model_type=args.model_type,
    logger=logger)
So, is the parameter "texts" a list of word sequences containing all the words in one's own corpus, prepared for masked LM training? If so, might this list be too long? Thank you for your help~
I am having a similar problem here. In my interpretation, text_list is a list where each entry corresponds to one of the sentences I am trying to classify.
The two files lm_train.txt and lm_val.txt are created as expected, but then it either:
- Takes a really long time to complete / looks like it stalls (hours)
- Immediately completes the task (less than a minute)
- Crashes with "num_samples should be a positive integeral value, but got num_samples=0" (the traceback goes to RandomSampler(train_dataset) in data_lm.py)
But perhaps I am misinterpreting what text_list actually is...
Ditto. I assumed it was meant to be a list of the texts (loaded from the files into memory), but it fails with "num_samples should be a positive integeral value, but got num_samples=0".
Setting it to the training file name also fails with the same error.
text_list is a list of the texts (List[str]), where each entry in the list is one of the texts as a string.
If you have too few samples (not much text), you will get the "num_samples should be a positive integeral value, but got num_samples=0" error.
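Before calling from_raw_corpus, it may help to confirm that texts really is a list of non-empty strings and to see roughly how much text it holds. Below is a minimal sketch in plain Python. describe_text_list is a hypothetical helper, not part of fast-bert, and the whitespace word count is only a rough proxy for what the tokenizer will actually produce:

```python
from typing import Dict, List


def describe_text_list(texts: List[str]) -> Dict[str, int]:
    """Hypothetical helper (not part of fast-bert): sanity-check a text_list."""
    # A single string (for example a file name) is not a list of texts.
    if isinstance(texts, str):
        raise TypeError("text_list should be a list of strings, not a single string")
    # Every entry must itself be a string.
    for i, t in enumerate(texts):
        if not isinstance(t, str):
            raise TypeError(f"entry {i} is not a str")
    non_empty = [t for t in texts if t.strip()]
    return {
        "num_texts": len(texts),
        "num_non_empty": len(non_empty),
        # Whitespace split is an assumption-level estimate of corpus size.
        "num_words": sum(len(t.split()) for t in non_empty),
    }


stats = describe_text_list(["first document text", "", "second one"])
print(stats)
```

If num_non_empty or num_words comes out very small, that would be consistent with the "too few samples" explanation above, though this check does not reproduce fast-bert's own logic.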
If you have a text file with one text per line, a quick way to create the text_list object to be loaded into the function:
from numpy import loadtxt
texts = loadtxt("/content/bert.txt", dtype=str, delimiter="\n", unpack=False)
# Validate that it loaded properly
print(texts[0])
print(len(texts))
Then you can pass texts onto BertLMDataBunch.
@trisongz What if my file is too large (700 MB)? I get an OOM error when trying to run your code. Is there any other way to do this?
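One thing that might help with memory, offered as a sketch rather than a tested fix: NumPy string arrays are fixed-width, so every entry is stored at the length of the longest line, which can take far more memory than the file itself. Reading the file line by line into a plain Python list avoids that. read_texts and its max_texts cap are hypothetical names for illustration, and the demo uses a small temporary file in place of the real corpus:

```python
import os
import tempfile
from typing import List, Optional


def read_texts(path: str, max_texts: Optional[int] = None) -> List[str]:
    """Hypothetical helper: read one text per line into a plain list of str."""
    texts: List[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Stop early once the optional cap is reached.
            if max_texts is not None and len(texts) >= max_texts:
                break
            line = line.strip()
            if line:  # skip blank lines
                texts.append(line)
    return texts


# Small demo file standing in for the real corpus.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first text\n\nsecond text\nthird text\n")
    demo_path = tmp.name

texts = read_texts(demo_path)
print(texts[0])
print(len(texts))
os.remove(demo_path)
```

The result is already a List[str], so it can be passed as text_list directly. If the full list still does not fit in memory, max_texts keeps only the first N non-empty lines, at the cost of training on less of the corpus.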