BERT-Keyword-Extractor
Pre-trained model
Do you have a pre-trained model that we can use for a downstream system? It would be awesome if you could provide us with that!
The model will be downloaded automatically through the pytorch_pretrained_bert package.
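For reference, the weights and vocabulary are fetched from the web and cached locally the first time `from_pretrained` is called, so nothing has to be downloaded by hand. A minimal check, assuming a working internet connection:

```python
from pytorch_pretrained_bert import BertTokenizer, BertModel

# First call downloads and caches the vocab/weights; later calls reuse the cache.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
```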
@vigosser - I am pretty new to this. My understanding is that the model should come from pytorch_pretrained_bert, but I am getting the following error:
```
python3 keyword-extractor.py --sentence "BERT is a great model."
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Traceback (most recent call last):
  File "keyword-extractor.py", line 40, in <module>
    keywordextract(args.sentence, args.path)
  File "keyword-extractor.py", line 28, in keywordextract
    model = torch.load(path)
  File "/home/jalal/.local/lib/python3.6/site-packages/torch/serialization.py", line 584, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/jalal/.local/lib/python3.6/site-packages/torch/serialization.py", line 234, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/jalal/.local/lib/python3.6/site-packages/torch/serialization.py", line 215, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'model.pt'
```
It would help if you could tell me what to do with this.
Hey, the model is pretrained. Remove the path argument and the `model = torch.load(path)` call from the program; that will help. Something like this:
```python
from pytorch_pretrained_bert import BertTokenizer, BertConfig
from pytorch_pretrained_bert import BertForTokenClassification, BertAdam
import torch
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tag2idx = {'B': 0, 'I': 1, 'O': 2}
tags_vals = ['B', 'I', 'O']

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(tag2idx))
model.to(device)  # move the model to the same device as the input tensors

def keyword(sentence):
    text = sentence
    tkns = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tkns)
    segments_ids = [0] * len(tkns)
    tokens_tensor = torch.tensor([indexed_tokens]).to(device)
    segments_tensors = torch.tensor([segments_ids]).to(device)
    # model = torch.load(path)  # no longer needed; the pretrained model above is used instead
    model.eval()
    prediction = []
    logit = model(tokens_tensor, token_type_ids=None, attention_mask=segments_tensors)
    logit = logit.detach().cpu().numpy()
    prediction.extend([list(p) for p in np.argmax(logit, axis=2)])
    # print tokens tagged B (0) or I (1), i.e. the predicted keyword tokens
    for k, j in enumerate(prediction[0]):
        if j == 1 or j == 0:
            print(tokenizer.convert_ids_to_tokens(tokens_tensor[0].to('cpu').numpy())[k], j)

keyword('The solution is based upon an abstract representation of the mobile object system.')
```
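As a side note, the `model.pt` the script was looking for would only exist if you had fine-tuned and saved a checkpoint yourself. A minimal sketch of that round trip with plain `torch.save`/`torch.load` (the `model.pt` name just matches the script's default path):

```python
# After fine-tuning, save the whole model object:
torch.save(model, 'model.pt')

# Later runs can then load the fine-tuned weights instead of bert-base-uncased:
model = torch.load('model.pt')
model.eval()
```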
But this would use the default `bert-base-uncased`, no? What would be nice would be to also provide the model you fine-tuned yourself on the SemEval 2010 dataset!
Don't worry too much about this pretrained model. The dataset this model was trained on was created 10 years ago, and if you take a look at the original source (or the folder in this repo), you will find that the dataset is quite small (~5M) and the quality is not very good: no continuous sentences, quite a bit of garbage such as parenthetical citations that were never cleaned up, etc.
You could use more up-to-date data if you need it; I'm sure that in 2020 there are very large and relatively clean datasets for you to use.
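If you do build your own dataset, the main work is converting it into the same B/I/O token labels used above. A rough sketch, where `make_bio_labels` is just an illustrative helper and not part of this repo:

```python
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

def make_bio_labels(sentence, keyphrases):
    """Tag each wordpiece B (keyphrase start), I (inside), or O (outside)."""
    tokens = tokenizer.tokenize(sentence)
    labels = ['O'] * len(tokens)
    for phrase in keyphrases:
        p_toks = tokenizer.tokenize(phrase)
        # mark every occurrence of the keyphrase's wordpieces in the sentence
        for i in range(len(tokens) - len(p_toks) + 1):
            if tokens[i:i + len(p_toks)] == p_toks:
                labels[i] = 'B'
                for j in range(i + 1, i + len(p_toks)):
                    labels[j] = 'I'
    return tokens, labels

tokens, labels = make_bio_labels(
    'BERT is a great model for keyword extraction.', ['keyword extraction'])
print(list(zip(tokens, labels)))
```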