Taqyim
                                
Python interface for evaluation on ChatGPT models
Taqyim تقييم
    
A library for evaluating Arabic NLP datasets on ChatGPT models.
Installation
pip install -e .
Example
import taqyim as tq
pipeline = tq.Pipeline(
    eval_name="ajgt-test",
    dataset_name="arbml/ajgt_ubc_split",
    task_class="classification",
    task_description="Sentiment Analysis",
    input_column_name="content",
    target_column_name="label",
    prompt="Predict the sentiment",
    api_key="<openai-key>",
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=1,
)
# run the evaluation
pipeline.run()
# show the output data frame
pipeline.show_results()
# show the eval metrics
pipeline.get_final_report()
Run on custom dataset
custom_dataset.ipynb has a complete example on how to run evaluation on a custom dataset.
Parameters
- `eval_name`: name for this evaluation run
- `task_class`: class name, one of the supported class names
- `task_description`: short string describing the task
- `dataset_name`: name of the dataset used for evaluation
- `subset`: subset name, if the dataset has subsets
- `train_split`: name of the train split in the dataset
- `test_split`: name of the test split in the dataset
- `input_column_name`: name of the input column in the dataset
- `target_column_name`: name of the target column in the dataset
- `prompt`: the prompt fed to the model
- `api_key`: OpenAI API key
- `preprocessing_fn`: function used to process inputs and targets
- `threads`: number of threads used to call the API
- `threads_timeout`: thread timeout
- `max_samples`: maximum number of samples from the dataset used for evaluation
- `model_name`: either `gpt-3.5-turbo-0301` or `gpt-4-0314`
- `temperature`: temperature passed to the model, between 0 and 2; higher values give more random results
- `num_few_shot`: number of few-shot samples used for evaluation
- `resume_from_record`: if `True`, the run continues from the sample that has no results
- `seed`: seed used to reproduce the results
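As a rough illustration of how the optional parameters above might be combined, the sketch below collects them in a plain dictionary and runs a few sanity checks derived only from the descriptions in the list. The helper `check_config` is hypothetical and not part of Taqyim, and every value other than those already used in the Example section is illustrative. It also assumes that `tq.Pipeline` accepts all of the listed names as keyword arguments, as it does for the ones in the example.

```python
# Keyword arguments named in the parameter list; values are illustrative only.
config = {
    "eval_name": "ajgt-test",
    "dataset_name": "arbml/ajgt_ubc_split",
    "task_class": "classification",
    "task_description": "Sentiment Analysis",
    "input_column_name": "content",
    "target_column_name": "label",
    "prompt": "Predict the sentiment",
    "api_key": "<openai-key>",
    "train_split": "train",
    "test_split": "test",
    "model_name": "gpt-3.5-turbo-0301",
    "max_samples": 100,
    "temperature": 0.0,          # low temperature for more repeatable outputs
    "num_few_shot": 3,
    "threads": 4,
    "seed": 42,
    "resume_from_record": True,  # continue an interrupted run
}


def check_config(cfg):
    """Hypothetical helper: return a list of problems found in cfg."""
    problems = []
    if cfg["model_name"] not in ("gpt-3.5-turbo-0301", "gpt-4-0314"):
        problems.append("model_name is not one of the two listed models")
    if not 0 <= cfg["temperature"] <= 2:
        problems.append("temperature must be between 0 and 2")
    for key in ("max_samples", "num_few_shot", "threads"):
        if not isinstance(cfg[key], int) or cfg[key] < 0:
            problems.append(f"{key} must be a non-negative integer")
    return problems


problems = check_config(config)
print(problems or "config looks consistent")

# With the package installed and a real key, the run would then be:
# import taqyim as tq
# pipeline = tq.Pipeline(**config)
# pipeline.run()
```

Keeping the arguments in one dictionary makes it easy to rerun the same evaluation with a different `model_name` or `seed` while leaving everything else unchanged.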
Supported Classes and Tasks
- `Classification`: classification tasks; see classification.py.
- `Pos_Tagging`: part-of-speech tagging tasks; see pos_tagging.py.
- `Translation`: machine translation; see translation.py.
- `Summarization`: text summarization; see summarization.py.
- `MCQ`: multiple-choice question answering; see mcq.py.
- `Rating`: rating the outputs of multiple LLMs; see rating.py.
- `Diacritization`: text diacritization; see diacritization.py.
Evaluation on Arabic Tasks
| Task | Dataset | Size | Metric | GPT-3.5 | GPT-4 | SoTA |
|---|---|---|---|---|---|---|
| Summarization | EASC | 153 | ROUGE-L | 23.5 | 18.25 | 13.3 |
| PoS Tagging | PADT | 680 | Accuracy | 75.91 | 86.29 | 96.83 |
| Classification | AJGT | 360 | Accuracy | 86.94 | 90.30 | 96.11 |
| Transliteration | BOLT Egyptian✢ | 6,653 | BLEU | 13.76 | 27.66 | 65.88 |
| Translation | UN v1 | 4,000 | BLEU | 35.05 | 38.83 | 53.29 |
| Paraphrasing | APB | 1,010 | BLEU | 4.295 | 6.104 | 17.52 |
| Diacritization | WikiNews✢✢ | 393 | WER/DER | 32.74/10.29 | 38.06/11.64 | 4.49/1.21 |
✢ BOLT requires an LDC subscription.
✢✢ WikiNews is not public; contact the authors to access the dataset.
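The WER/DER column in the table reports word error rate and diacritic error rate for diacritization. The sketch below shows one common way to compute the two numbers; it is not taken from Taqyim's code, and the exact variant used for the reported figures (for example, whether case endings are counted) may differ. The function names are hypothetical, and the sketch assumes the prediction has the same words and base letters as the reference, differing only in diacritics.

```python
# Arabic diacritic marks: tanween, short vowels, shadda, sukun (U+064B..U+0652).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Return a list of (base_letter, diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append((ch, ""))
        elif pairs:
            # Attach the mark to the most recent base letter.
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
    return pairs


def wer_der(reference, prediction):
    """Return (WER, DER) as percentages for two diacritized strings."""
    ref_words, pred_words = reference.split(), prediction.split()
    if not ref_words or len(ref_words) != len(pred_words):
        raise ValueError("sketch assumes the same, non-zero number of words")
    word_errors = char_errors = total_chars = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_pairs = split_diacritics(ref_word)
        pred_pairs = split_diacritics(pred_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in pred_pairs]:
            raise ValueError("sketch assumes identical base letters")
        # Sort marks so that shadda+vowel order does not count as an error.
        wrong = sum(
            sorted(r) != sorted(p)
            for (_, r), (_, p) in zip(ref_pairs, pred_pairs)
        )
        char_errors += wrong
        total_chars += len(ref_pairs)
        word_errors += wrong > 0  # a word is wrong if any letter is wrong
    return 100 * word_errors / len(ref_words), 100 * char_errors / total_chars


reference = "كَتَبَ الوَلَدُ"
# Simulate a model mistake by swapping every damma for a fatha.
prediction = reference.replace("\u064F", "\u064E")
wer, der = wer_der(reference, prediction)
print(f"WER={wer:.2f} DER={der:.2f}")
```

Lower is better for both metrics, which is why the SoTA entry for diacritization is the smallest pair in its row.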
Citation
@misc{alyafeai2023taqyim,
      title={Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models}, 
      author={Zaid Alyafeai and Maged S. Alshaibani and Badr AlKhamissi and Hamzah Luqman and Ebrahim Alareqi and Ali Fadel},
      year={2023},
      eprint={2306.16322},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}