BERT-Keyword-Extractor

Pre-trained model

[Open] kailashkarthik9 opened this issue 5 years ago • 5 comments

Do you have a pre-trained model that we can use in a downstream system? It would be awesome if you could provide us with that!

kailashkarthik9 · Feb 24 '20

The model will be automatically downloaded through the pytorch_pretrained_bert package.
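For example (a minimal sketch; pytorch_pretrained_bert downloads the weights on first use and caches them locally, by default under ~/.pytorch_pretrained_bert):

from pytorch_pretrained_bert import BertTokenizer, BertForTokenClassification

# The first from_pretrained() call fetches the base BERT weights from the web
# and caches them locally; later calls reuse the cached copy. Note this is the
# generic bert-base-uncased, not a keyword-extraction fine-tune.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=3)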

vigosser · Mar 05 '20

@vigosser - I am pretty new to this. My understanding is that the model should be taken from pytorch_pretrained_bert, but I am getting the following error:

python3 keyword-extractor.py --sentence "BERT is a great model."

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Traceback (most recent call last):
  File "keyword-extractor.py", line 40, in <module>
    keywordextract(args.sentence, args.path)
  File "keyword-extractor.py", line 28, in keywordextract
    model = torch.load(path)
  File "/home/jalal/.local/lib/python3.6/site-packages/torch/serialization.py", line 584, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/jalal/.local/lib/python3.6/site-packages/torch/serialization.py", line 234, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/jalal/.local/lib/python3.6/site-packages/torch/serialization.py", line 215, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'model.pt'

It would help if you could tell me what to do with this.

Vineeth-fw · May 12 '20

Hey, the model is pretrained. Remove the path, the args, and the model = torch.load(path) line from the program; something like this will help:

from pytorch_pretrained_bert import BertTokenizer, BertConfig
from pytorch_pretrained_bert import BertForTokenClassification, BertAdam
import numpy as np  # needed for np.argmax below
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# BIO tagging scheme: B = beginning of a keyword, I = inside one, O = outside
tag2idx = {'B': 0, 'I': 1, 'O': 2}
tags_vals = ['B', 'I', 'O']

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=len(tag2idx))
model.to(device)  # keep the model on the same device as the input tensors

def keyword(sentence):
    text = sentence
    tkns = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tkns)
    segments_ids = [0] * len(tkns)
    tokens_tensor = torch.tensor([indexed_tokens]).to(device)
    segments_tensors = torch.tensor([segments_ids]).to(device)
    # model = torch.load(path)  # no longer needed: the model is built above
    model.eval()
    prediction = []
    logit = model(tokens_tensor, token_type_ids=None, attention_mask=segments_tensors)
    logit = logit.detach().cpu().numpy()
    prediction.extend([list(p) for p in np.argmax(logit, axis=2)])
    # print every token predicted B (0) or I (1), i.e. part of a keyword
    for k, j in enumerate(prediction[0]):
        if j == 1 or j == 0:
            print(tokenizer.convert_ids_to_tokens(tokens_tensor[0].to('cpu').numpy())[k], j)

keyword('The solution is based upon an abstract representation of the mobile object system.')
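Note that this builds bert-base-uncased with a freshly initialized classification head, so the B/I/O predictions are untrained. If you have produced a model.pt yourself with the repo's training script, you could load it instead; a sketch, assuming the checkpoint was saved with torch.save(model, 'model.pt'), as the torch.load(path) call in keyword-extractor.py implies:

# Sketch: load a fine-tuned checkpoint instead of the untrained head above.
# Assumes model.pt holds the whole pickled model (torch.save(model, 'model.pt')).
model = torch.load('model.pt', map_location=device)
model.eval()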

SaiSandeepKantareddy · May 24 '20

But this would use the default bert-base-uncased, no?

It would also be nice to provide the model you fine-tuned yourself on the SemEval 2010 dataset!

MasterScrat · Aug 15 '20

Don't worry too much about this pretrained model. The dataset that trained this model was made 10 years ago, and if you look at the original source (or the folder in this repo), you will find that the dataset is quite small (~5M) and the quality is not very good: no continuous sentences, quite a bit of garbage such as parenthetical citations that were never cleaned out, etc.

You could use some more up-to-date data if you need it. I'm sure that in 2020 there are much larger and relatively clean datasets for you to use.

pandalalalala · Nov 21 '20