instructor-embedding
Evaluation settings of INSTRUCTOR
Hello! I have a very puzzling question that I would like to ask. Since your model is fine-tuned with instructions, why not use instructions during benchmark evaluations (e.g. MTEB)?
Hi, the instructions are included in the evaluation. You may refer to Table 1 in our paper.
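For reference, instruction-text pairs are passed to the model roughly like this (a minimal sketch following the repository README; the checkpoint name and texts below are only illustrative):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
# Each input is an [instruction, sentence] pair; the instruction is handled inside encode().
embeddings = model.encode([[instruction, sentence]])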
As far as I understand, when evaluating on MTEB with your code, the following lines are used:
model = INSTRUCTOR(args.model_name, cache_folder=args.cache_dir)
evaluation = MTEB(tasks=[args.task_name], task_langs=["en"])
evaluation.run(model, output_folder=args.output_dir, eval_splits=[args.split], args=args, overwrite_results=True)
During the execution of evaluation.run(), it utilizes INSTRUCTOR.encode() to encode the input sentences. However, when I print the sentences passed to INSTRUCTOR.encode() before tokenization, it appears that the corresponding task instructions are not added to these sentences.
I'm not sure if my understanding and evaluation method are correct. I would greatly appreciate it if you could provide me with answers. Thank you very much.
Hi, could you share the script you use to print out the sentences? Also, make sure you have correctly installed the InstructorEmbedding library.
I just use the source code on GitHub:
https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L478-L565
def encode(self, sentences,
           batch_size: int = 32,
           show_progress_bar: bool = None,
           output_value: str = 'sentence_embedding',
           convert_to_numpy: bool = True,
           convert_to_tensor: bool = False,
           device: str = None,
           normalize_embeddings: bool = False):
    """
    Computes sentence embeddings

    :param sentences: the sentences to embed
    :param batch_size: the batch size used for the computation
    :param show_progress_bar: Output a progress bar when encode sentences
    :param output_value: Default sentence_embedding, to get sentence embeddings. Can be set to token_embeddings to get wordpiece token embeddings. Set to None, to get all output values
    :param convert_to_numpy: If true, the output is a list of numpy vectors. Else, it is a list of pytorch tensors.
    :param convert_to_tensor: If true, you get one large tensor as return. Overwrites any setting from convert_to_numpy
    :param device: Which torch.device to use for the computation
    :param normalize_embeddings: If set to true, returned vectors will have length 1. In that case, the faster dot-product (util.dot_score) instead of cosine similarity can be used.

    :return:
        By default, a list of tensors is returned. If convert_to_tensor, a stacked tensor is returned. If convert_to_numpy, a numpy matrix is returned.
    """
    self.eval()
    if show_progress_bar is None:
        show_progress_bar = False

    if convert_to_tensor:
        convert_to_numpy = False

    if output_value != 'sentence_embedding':
        convert_to_tensor = False
        convert_to_numpy = False

    input_was_string = False
    if isinstance(sentences, str) or not hasattr(sentences, '__len__'):  # Cast an individual sentence to a list with length 1
        sentences = [sentences]
        input_was_string = True

    if device is None:
        device = self._target_device

    self.to(device)

    all_embeddings = []
    if isinstance(sentences[0], list):
        lengths = []
        for sen in sentences:
            lengths.append(-self._text_length(sen[1]))
        length_sorted_idx = np.argsort(lengths)
    else:
        length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
    sentences_sorted = [sentences[idx] for idx in length_sorted_idx]

    for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):
        sentences_batch = sentences_sorted[start_index:start_index+batch_size]
        features = self.tokenize(sentences_batch)
        features = batch_to_device(features, device)

        with torch.no_grad():
            out_features = self.forward(features)

            if output_value == 'token_embeddings':
                embeddings = []
                for token_emb, attention in zip(out_features[output_value], out_features['attention_mask']):
                    last_mask_id = len(attention)-1
                    while last_mask_id > 0 and attention[last_mask_id].item() == 0:
                        last_mask_id -= 1
                    embeddings.append(token_emb[0:last_mask_id+1])
            elif output_value is None:  # Return all outputs
                embeddings = []
                for sent_idx in range(len(out_features['sentence_embedding'])):
                    row = {name: out_features[name][sent_idx] for name in out_features}
                    embeddings.append(row)
            else:  # Sentence embeddings
                embeddings = out_features[output_value]
                embeddings = embeddings.detach()
                if normalize_embeddings:
                    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

                # fixes for #522 and #487 to avoid oom problems on gpu with large datasets
                if convert_to_numpy:
                    embeddings = embeddings.cpu()

            all_embeddings.extend(embeddings)

    all_embeddings = [all_embeddings[idx] for idx in np.argsort(length_sorted_idx)]
I print sentences in encode() at about Line 512.
For the case of MTEB, make sure that the library is correctly installed by following https://github.com/HKUNLP/instructor-embedding#mteb.
Sorry, I tried to install this previously but it failed with the following message:
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/tmp/tmp73vjuhzp/output.json
So I could only pip install mteb following the instructions at https://github.com/embeddings-benchmark/mteb. Maybe this is the reason for the problem?
Yes, we should install the customized mteb package for correct evaluation.
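One quick way to sanity-check this after reinstalling is to print what reaches encode(); here is a minimal sketch reusing the evaluation snippet above (args is assumed to be the same argument object as in that snippet, the wrapper is only an illustrative debugging trick, and the exact call path can differ per task):

from InstructorEmbedding import INSTRUCTOR
from mteb import MTEB

model = INSTRUCTOR(args.model_name, cache_folder=args.cache_dir)

# Wrap encode() to print the raw inputs before tokenization.
# With the customized mteb package they should arrive as [instruction, text] pairs.
original_encode = model.encode

def logging_encode(sentences, *enc_args, **enc_kwargs):
    print(sentences[:2])
    return original_encode(sentences, *enc_args, **enc_kwargs)

model.encode = logging_encode

evaluation = MTEB(tasks=[args.task_name], task_langs=["en"])
evaluation.run(model, output_folder=args.output_dir, eval_splits=[args.split], args=args, overwrite_results=True)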
Thanks! I've corrected my evaluation method following your customized mteb package. The performance of my INSTRUCTOR replication has improved but is still lower than yours. I still have some detailed questions:
- Do you consistently ignore (mask out) the token embeddings of the instructions during both the training and evaluation processes?
- Do you consistently use mean pooling as the pooling method during both the training and evaluation processes?
Thank you again for your patient response.
Yes, we use mean pooling in both the training and evaluation processes.
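To make the idea behind the two questions above concrete, here is a minimal sketch of mean pooling that excludes instruction tokens (this is only an illustration of the concept, not the exact code path in instructor.py; shapes and the helper name are assumptions):

import torch

def masked_mean_pool(token_embeddings, attention_mask, instruction_len):
    """Mean-pool token embeddings, ignoring the first `instruction_len` (instruction) tokens."""
    mask = attention_mask.clone().float()
    mask[:instruction_len] = 0.0                   # drop instruction tokens from the pool
    mask = mask.unsqueeze(-1)                      # (seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=0)  # (hidden,)
    counts = mask.sum(dim=0).clamp(min=1e-9)       # number of pooled tokens, avoid division by zero
    return summed / counts

# Illustrative shapes: 10 tokens, 768-dim embeddings, first 4 tokens belong to the instruction.
emb = masked_mean_pool(torch.randn(10, 768), torch.ones(10), instruction_len=4)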
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/tmp/tmp73vjuhzp/output.json
I guess this issue might be due to the exit() statement at https://github.com/HKUNLP/instructor-embedding/blob/main/evaluation/MTEB/setup.py#L42. @hongjin-su, could you kindly check this?
Hi, you may check the permissions of /tmp or /tmp/tmp73vjuhzp.