Taqyim تقييم
A library for evaluating Arabic NLP datasets on ChatGPT models.
Installation
pip install -e .
Example
import taqyim as tq
pipeline = tq.Pipeline(
    eval_name="ajgt-test",
    dataset_name="arbml/ajgt_ubc_split",
    task_class="classification",
    task_description="Sentiment Analysis",
    input_column_name="content",
    target_column_name="label",
    prompt="Predict the sentiment",
    api_key="<openai-key>",
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=1,
)
# run the evaluation
pipeline.run()
# show the output data frame
pipeline.show_results()
# show the eval metrics
pipeline.get_final_report()
Run on custom dataset
custom_dataset.ipynb contains a complete example of how to run an evaluation on a custom dataset.
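As a rough sketch (the column names `content` and `label` mirror the AJGT example above; the exact format the notebook expects may differ), a custom dataset can be prepared as a CSV with one input column and one target column, then loaded as a Hugging Face dataset:

```python
import csv
import os
import tempfile

# A tiny two-column dataset matching the column names used above.
rows = [
    {"content": "خدمة ممتازة", "label": "Positive"},
    {"content": "تجربة سيئة", "label": "Negative"},
]

# Write it to a CSV file with a header row.
path = os.path.join(tempfile.mkdtemp(), "train.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["content", "label"])
    writer.writeheader()
    writer.writerows(rows)

# The CSV can then be loaded as a Hugging Face dataset, e.g.:
# from datasets import load_dataset
# ds = load_dataset("csv", data_files={"train": path, "test": path})
```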
Parameters
| Parameter | Description |
|---|---|
| `eval_name` | choose an eval name |
| `task_class` | class name from the supported class names |
| `task_description` | short description of the task |
| `dataset_name` | dataset name for evaluation |
| `subset` | subset name, if the dataset has subsets |
| `train_split` | train split name in the dataset |
| `test_split` | test split name in the dataset |
| `input_column_name` | input column name in the dataset |
| `target_column_name` | target column name in the dataset |
| `prompt` | the prompt to be fed to the model |
| `api_key` | OpenAI API key |
| `preprocessing_fn` | function used to preprocess inputs and targets |
| `threads` | number of threads used to fetch the API |
| `threads_timeout` | thread timeout |
| `max_samples` | maximum number of samples used for evaluation from the dataset |
| `model_name` | either `gpt-3.5-turbo-0301` or `gpt-4-0314` |
| `temperature` | temperature passed to the model, between 0 and 2; higher temperature means more random results |
| `num_few_shot` | number of few-shot samples to be used for evaluation |
| `resume_from_record` | if `True`, continue the run from the first sample that has no results |
| `seed` | seed to reproduce the results |
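For example, `preprocessing_fn` could map numeric labels to the strings the prompt asks for. The signature sketched here (a function applied to a sample dict) is an assumption; check the task classes for the exact expected form:

```python
def map_labels(sample):
    # Hypothetical helper: convert numeric sentiment labels to strings
    # so the model's textual predictions can be compared directly.
    label_names = {0: "Negative", 1: "Positive"}
    sample["label"] = label_names[sample["label"]]
    return sample

# It could then be passed to the pipeline, e.g.:
# pipeline = tq.Pipeline(..., preprocessing_fn=map_labels)
```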
Supported Classes and Tasks
- `Classification`: classification tasks, see classification.py
- `Pos_Tagging`: part-of-speech tagging tasks, see pos_tagging.py
- `Translation`: machine translation tasks, see translation.py
- `Summarization`: summarization tasks, see summarization.py
- `MCQ`: multiple choice question answering, see mcq.py
- `Rating`: rating the outputs of multiple LLMs, see rating.py
- `Diacritization`: diacritization tasks, see diacritization.py
Evaluation on Arabic Tasks
| Tasks | Dataset | Size | Metrics | GPT-3.5 | GPT-4 | SoTA |
|---|---|---|---|---|---|---|
| Summarization | EASC | 153 | RougeL | 23.5 | 18.25 | 13.3 |
| PoS Tagging | PADT | 680 | Accuracy | 75.91 | 86.29 | 96.83 |
| Classification | AJGT | 360 | Accuracy | 86.94 | 90.30 | 96.11 |
| Transliteration | BOLT Egyptian✢ | 6,653 | BLEU | 13.76 | 27.66 | 65.88 |
| Translation | UN v1 | 4,000 | BLEU | 35.05 | 38.83 | 53.29 |
| Paraphrasing | APB | 1,010 | BLEU | 4.295 | 6.104 | 17.52 |
| Diacritization | WikiNews✢✢ | 393 | WER/DER | 32.74/10.29 | 38.06/11.64 | 4.49/1.21 |
✢ BOLT requires an LDC subscription.
✢✢ WikiNews is not public; contact the authors to access the dataset.
Citation
@misc{alyafeai2023taqyim,
title={Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models},
author={Zaid Alyafeai and Maged S. Alshaibani and Badr AlKhamissi and Hamzah Luqman and Ebrahim Alareqi and Ali Fadel},
year={2023},
eprint={2306.16322},
archivePrefix={arXiv},
primaryClass={cs.CL}
}