what's the meaning of the hyperparameter "text_list=texts" ?

Open JiangYanting opened this issue 5 years ago • 5 comments

Hello! Could you explain what the hyperparameter "text_list=texts" means? For example, section "3. Create DataBunch object" of "Language Model Fine-tuning" in README.md contains this code:

databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=texts,
    tokenizer=args.model_name,
    batch_size_per_gpu=args.train_batch_size,
    max_seq_length=args.max_seq_length,
    multi_gpu=args.multi_gpu,
    model_type=args.model_type,
    logger=logger)

So, is the parameter "texts" a list of word sequences containing all the words in one's own corpus, prepared for LM masking? Could this list end up being too long? Thank you for your help!

JiangYanting avatar Jan 08 '20 15:01 JiangYanting

I am having a similar problem here. In my interpretation text_list is a list where each entry corresponds to one of the sentences I am trying to classify.

The two files lm_train.txt and lm_val.txt are created as expected but then it either:

  • Takes a really long time to complete / looks like it stalls (hours)
  • Immediately completes the task (less than a minute)
  • Crashes by saying that num_samples should be a positive integeral value, but got num_samples=0 (the traceback goes to RandomSampler(train_dataset) in data_lm.py)

But perhaps I am misinterpreting what text_list actually is ...

Q-lds avatar Feb 27 '20 16:02 Q-lds

Ditto. I assumed it was meant to be a list of the texts (loaded from the files into memory), but it fails with num_samples should be a positive integeral value, but got num_samples=0. Setting it to the training file name also fails with the same error.

ddofer avatar Mar 07 '20 11:03 ddofer

text_list is a list of the texts (List[str]), where each entry in the list is one of the texts as a string. If you have too few samples (not much text), you will get the "num_samples should be a positive integeral value, but got num_samples=0" error.
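
A minimal sketch of what such a list looks like, with a rough size check to run before passing it as text_list. The helper name check_text_list and the whitespace-based word count are illustrative assumptions and not part of fast-bert. The idea that a corpus shorter than about one max_seq_length block produces zero samples is an interpretation of the error above, not something confirmed from the library source.

```python
from typing import List


def check_text_list(texts: List[str], max_seq_length: int) -> int:
    """Validate that texts is a list of strings and return a rough word count.

    Hypothetical helper for illustration; not part of fast-bert.
    """
    if isinstance(texts, str):
        raise TypeError("text_list must be a list of strings, not a single string or file name")
    if not all(isinstance(t, str) for t in texts):
        raise TypeError("every entry of text_list must be a string")

    # Whitespace splitting is only a crude proxy for the real tokenizer.
    total_words = sum(len(t.split()) for t in texts)
    if total_words < max_seq_length:
        # Assumption: too little text may lead to the num_samples=0 error.
        raise ValueError(
            f"only about {total_words} words for max_seq_length={max_seq_length}"
        )
    return total_words


# Each entry is one text as a plain string.
texts = [
    "The first document in the corpus.",
    "A second, somewhat longer document that adds more words.",
]
word_count = check_text_list(texts, max_seq_length=8)
print(word_count)
```

Passing a file name such as "lm_train.txt" to this helper raises a TypeError, which mirrors the point above that text_list expects the texts themselves.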

jkhalsa-arabesque avatar Mar 30 '20 09:03 jkhalsa-arabesque

If you have a text file with one text per line, a quick way to create the text_list object to be loaded into the function:

from numpy import loadtxt

# Load one text per line into an array of strings
texts = loadtxt("/content/bert.txt", dtype=str, delimiter="\n", unpack=False)

# Validate that it loaded properly
print(texts[0])
print(len(texts))

Then you can pass texts onto BertLMDataBunch.
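
As an alternative sketch that uses only the standard library, the file can be read line by line into a plain Python list. This sidesteps two loadtxt behaviors: it treats "#" as a comment marker by default, and a NumPy string array is fixed-width, so every entry is padded to the length of the longest line. Newer NumPy releases may also reject a newline delimiter. The function name load_text_list and the temporary demo file are hypothetical and only for illustration.

```python
import os
import tempfile
from typing import List


def load_text_list(path: str) -> List[str]:
    """Read one text per line, dropping blank lines (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Demo with a small temporary file standing in for the real corpus file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first text\n\nsecond text with a # character\n")
    demo_path = tmp.name

texts = load_text_list(demo_path)
os.remove(demo_path)

# Validate that it loaded properly
print(texts[0])
print(len(texts))
```

The result is already a List[str], so it can be passed as text_list without conversion.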

trisongz avatar Apr 03 '20 00:04 trisongz

@trisongz What if my file is too large (700 MB)? I get an OOM error when I try to run your code. Is there another way to do this?

krannnn avatar Jun 01 '20 11:06 krannnn