
Evaluate GritLM-7B on MTEB datasets

Open · ThisisXXZ opened this issue 1 year ago · 9 comments

I am trying to evaluate GritLM-7B on MTEB datasets using the provided script.

#!/bin/bash

python /home/e/e1347696/unified_encoder_decoder/src/eval/MTEB/eval_mteb.py \
    --model_name_or_path /home/e/e1347696/unified_encoder_decoder/model/GritLM-7B \
    --output_folder /home/e/e1347696/unified_encoder_decoder/src/results/GritLM-7B-mteb \
    --task_types Classification,Clustering,PairClassification,Reranking,Retrieval,STS,Summarization \
    --batch_size 32

However, it seems to have been evaluated on only the following datasets:

  • AmazonCounterfactualClassification
  • AmazonReviewsClassification
  • MassiveIntentClassification
  • MassiveScenarioClassification
  • MTOPDomainClassification
  • MTOPIntentClassification
  • STS17
  • STS22

The other datasets appear to be skipped. The output log is shown below:

Created GritLM: torch.bfloat16 dtype, mean pool, embedding mode, bbcc attn
GritLM-7B instruction for AmazonCounterfactualClassification:  <|user|>
Classify a given Amazon customer review text as either counterfactual or not-counterfactual
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - AmazonCounterfactualClassification, s2s, multilingual 1 / 4 Subsets


GritLM-7B instruction for AmazonReviewsClassification:  <|user|>
Classify the given Amazon review into its appropriate rating category
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - AmazonReviewsClassification, s2s, multilingual 1 / 6 Subsets


Skipping task: MasakhaNEWSClassification
GritLM-7B instruction for MassiveIntentClassification:  <|user|>
Given a user utterance as query, find the user intents
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MassiveIntentClassification, s2s, multilingual 1 / 51 Subsets


GritLM-7B instruction for MassiveScenarioClassification:  <|user|>
Given a user utterance as query, find the user scenarios
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MassiveScenarioClassification, s2s, multilingual 1 / 51 Subsets


GritLM-7B instruction for MTOPDomainClassification:  <|user|>
Classify the intent domain of the given utterance in task-oriented conversation
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MTOPDomainClassification, s2s, multilingual 1 / 6 Subsets


GritLM-7B instruction for MTOPIntentClassification:  <|user|>
Classify the intent of the given utterance in task-oriented conversation
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MTOPIntentClassification, s2s, multilingual 1 / 6 Subsets


Skipping task: MultiHateClassification
Skipping task: MultilingualSentimentClassification
Skipping task: NusaX-senti
Skipping task: SIB200Classification
Skipping task: SouthAfricanLangClassification
Skipping task: MasakhaNEWSClusteringP2P
Skipping task: MasakhaNEWSClusteringS2S
Skipping task: SIB200ClusteringS2S
Skipping task: BelebeleRetrieval
Skipping task: MIRACLRetrieval
Skipping task: MIRACLRetrievalHardNegatives
Skipping task: MLQARetrieval
Skipping task: MultiLongDocRetrieval
Skipping task: WikipediaRetrievalMultilingual
Skipping task: XMarket
Skipping task: XQuADRetrieval
Skipping task: OpusparcusPC
Skipping task: PawsXPairClassification
Skipping task: RTE3
Skipping task: XNLI
Skipping task: MIRACLReranking
Skipping task: WikipediaRerankingMultilingual
Skipping task: SemRel24STS
GritLM-7B instruction for STS17:  <|user|>
Retrieve semantically similar text.
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
STS
    - STS17, s2s, multilingual 1 / 11 Subsets


Skipping task: STS22.v2
GritLM-7B instruction for STS22:  <|user|>
Retrieve semantically similar text.
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
STS
    - STS22, p2p, multilingual 1 / 18 Subsets


Skipping task: STSBenchmarkMultilingualSTS

And the error log contains some warning such as:

The `batch_size` argument is deprecated and will be removed in the next release. Please use `encode_kwargs = {'batch_size': ...}` to set the batch size instead.
Failed to extract metadata from model: 'GritLM' object has no attribute 'model_card_data'. Upgrading to sentence-transformers v3.0.0 or above is recommended.
The `task_langs` argument is deprecated and will be removed in the next release. Please use `tasks = mteb.get_tasks(... languages = [...])` to filter tasks instead. Note that this uses 3 letter language codes (ISO 639-3).
Passing task names as strings is deprecated and will be removed in the next release. Please use `tasks = mteb.get_tasks(tasks=[...])` method to get tasks instead.
The `batch_size` argument is deprecated and will be removed in the next release. Please use `encode_kwargs = {'batch_size': ...}` to set the batch size instead.
Failed to extract metadata from model: 'GritLM' object has no attribute 'model_card_data'. Upgrading to sentence-transformers v3.0.0 or above is recommended.
Dataset 'STS22' is superseeded by 'STS22.v2', you might consider using the newer version of the dataset.

I would really appreciate it if you could help me with this. Thank you so much!

ThisisXXZ · Nov 04 '24

This is on purpose & happens here: https://github.com/ContextualAI/gritlm/blob/7c06435de9ccf69de73290a1b08b5bca641c7ff4/evaluation/eval_mteb.py#L1177. It only evaluates the 56 main MTEB EN datasets & skips the others.

The warnings are fine.
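(For intuition, the skip is just a membership check against the main English task list; a minimal sketch of that pattern with illustrative names, not the script's actual identifiers:

# Illustrative sketch only; the real list & check live around eval_mteb.py#L1177.
TASK_LIST_EN = {
    "AmazonCounterfactualClassification",
    "Banking77Classification",
    # ... the remaining main MTEB English tasks, 56 in total
}

for name in ["AmazonCounterfactualClassification", "MIRACLRetrieval"]:
    if name not in TASK_LIST_EN:
        print(f"Skipping task: {name}")
    else:
        print(f"Evaluating task: {name}")

)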

Muennighoff · Nov 04 '24

> This is on purpose & happens here
>
> https://github.com/ContextualAI/gritlm/blob/7c06435de9ccf69de73290a1b08b5bca641c7ff4/evaluation/eval_mteb.py#L1177
>
> It only evaluates the 56 main MTEB EN datasets & skips others. The warnings are fine.

Thank you very much! I noticed that only 8 tasks were evaluated, 6 of them classification tasks and 2 of them STS tasks. I'd like to evaluate GritLM-7B on all the tasks mentioned in the paper and compare the results. Could you please guide me on how to proceed? Here are the results:

[image]

I want to compare them with the paper, but I don't see any clustering, reranking, or retrieval tasks.

[image]

Thank you so much! Sorry if I asked something dumb, I'm new to this field :cat:

ThisisXXZ · Nov 05 '24

Oh sorry, it seems like the latest version of MTEB had some changes that render the eval script in this repository outdated.

I just changed the requirements of the repo to install a different mteb version here: https://github.com/ContextualAI/gritlm/pull/58. Can you try downgrading your mteb to the version in that PR (pip install mteb==1.4.0) & check that it works?

(If you want to use the latest mteb, it should also work via something like the below:

# !pip install mteb gritlm
import mteb

model_name = "GritLM/GritLM-7B"
revision = "13f00a0e36500c80ce12870ea513846a066004af"
model = mteb.get_model(model_name, revision=revision)  # load GritLM through mteb's model registry
benchmark = mteb.get_benchmark("MTEB(eng, classic)")   # the 56 classic MTEB English tasks
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model)

)
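(On the `batch_size` deprecation warning above: recent mteb versions take the batch size via `encode_kwargs` on `run`. A minimal sketch continuing from the snippet above; the output folder name is illustrative:

# Continues from the snippet above; on recent mteb versions the batch size
# goes through encode_kwargs rather than a top-level batch_size argument.
results = evaluation.run(
    model,
    output_folder="results/GritLM-7B-mteb",  # illustrative path
    encode_kwargs={"batch_size": 32},
)

)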

Muennighoff · Nov 05 '24

> Oh sorry, it seems like the latest version of MTEB had some changes that render the eval script in this repository outdated.
>
> I just changed the requirements of the repo to install a different mteb version here: #58. Can you try downgrading your mteb to the version in that PR (pip install mteb==1.4.0) & check that it works?
>
> (If you want to use the latest mteb, it should also work via something like the below:
>
> # !pip install mteb gritlm
> import mteb
> model_name = "GritLM/GritLM-7B"
> revision = "13f00a0e36500c80ce12870ea513846a066004af"
> model = mteb.get_model(model_name, revision=revision)
> benchmark = mteb.get_benchmark("MTEB(eng, classic)")
> evaluation = mteb.MTEB(tasks=benchmark)
> results = evaluation.run(model)
>
> )

It has started evaluating the other datasets, thanks! Also, I'd like to know: is a single A100-80GB GPU sufficient to evaluate MTEB?

ThisisXXZ · Nov 06 '24

I think that is sufficient; it will just take a while (especially the retrieval datasets).

Muennighoff · Nov 06 '24

> I think that is sufficient; it will just take a while (especially the retrieval datasets).

Hi! The evaluation proceeds fine until the MindSmallReranking dataset. I'm using mteb==1.4.0 and datasets==3.0.2; the complete error message is shown below:

Failed to load JSON from file 'gzip://train.jsonl::/home/e/e1347696/.cache/huggingface/hub/datasets--mteb--mind_small/snapshots/3bdac13927fdc888b903db93b2ffdbd90b295a69/train.jsonl.gz' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
Error while evaluating MindSmallReranking: An error occurred while generating the dataset
Traceback (most recent call last):
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 160, in _generate_tables
    df = pandas_read_json(f)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 38, in pandas_read_json
    return pd.read_json(path_or_buf, **kwargs)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 815, in read_json
    return json_reader.read()
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1025, in read
    obj = self._get_object_parser(self.data)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1187, in parse
    self._parse()
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1403, in _parse
    ujson_loads(json, precise_float=self.precise_float), dtype=None
ValueError: Unexpected character found when decoding 'true'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 1853, in _prepare_split_single
    for _, table in generator:
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 163, in _generate_tables
    raise e
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 137, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/e/e1347696/unified_encoder_decoder/eval/MTEB/eval_mteb.py", line 1202, in <module>
    evaluation.run(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 336, in run
    raise e
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 302, in run
    task.load_data(eval_splits=task_eval_splits, **kwargs)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/mteb/abstasks/AbsTask.py", line 37, in load_data
    self.dataset = datasets.load_dataset(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/load.py", line 2154, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 999, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 1740, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 1896, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

ThisisXXZ · Nov 08 '24

Looks like a corrupted download. You can try deleting /home/e/e1347696/.cache/huggingface/hub/datasets--mteb--mind_small & letting it re-download, or else directly download the files from https://huggingface.co/datasets/mteb/mind_small/tree/3bdac13927fdc888b903db93b2ffdbd90b295a69
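(If clearing the cache by hand doesn't help, `datasets` can also be told to re-fetch; a minimal sketch using standard `load_dataset` options:

from datasets import load_dataset

# Bypass the possibly corrupted cached copy and download the files again.
ds = load_dataset(
    "mteb/mind_small",
    revision="3bdac13927fdc888b903db93b2ffdbd90b295a69",
    download_mode="force_redownload",
)

)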

Muennighoff · Nov 08 '24

> Looks like a corrupted download. You can try deleting /home/e/e1347696/.cache/huggingface/hub/datasets--mteb--mind_small & letting it re-download, or else directly download the files from https://huggingface.co/datasets/mteb/mind_small/tree/3bdac13927fdc888b903db93b2ffdbd90b295a69

I've tried cleaning the cache, but the error persists. I found a closed issue in the MTEB repo that shares the same problem. Do I need to downgrade datasets to evaluate MindSmallReranking?

Thank you so much!

ThisisXXZ · Nov 08 '24

Hm yeah, maybe try downgrading.
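(A hedged example of such a downgrade; the version boundary is an assumption based on the datasets==3.0.2 mentioned above, not a confirmed fix:

# Assumption: the JSON parse regression appeared with datasets 3.x,
# so pinning below 3.0 may restore the older loading path.
pip install "datasets<3.0.0"

)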

Muennighoff · Nov 08 '24