TypeEvalPy
A Micro-benchmarking Framework for Python Type Inference Tools
📌 Features:
- 📜 Contains 154 code snippets to test and benchmark.
- 🏷 Offers 845 type annotations across a diverse set of Python functionalities.
- 📂 Organized into 18 distinct categories targeting various Python features.
- 🚢 Seamlessly manages the execution of containerized tools.
- 🔄 Efficiently transforms inferred types into a standardized format.
- 📊 Automatically produces meaningful metrics for in-depth assessment and comparison.
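The last two points boil down to comparing, location by location, a tool's inferred type with the ground-truth annotation. The snippet below is a minimal sketch of that exact-match idea; the record layout is hypothetical and does not reflect TypeEvalPy's actual schema or implementation.

```python
# Minimal sketch of an exact-match comparison, for illustration only.
# The record layout here is hypothetical and is NOT TypeEvalPy's actual schema.
from collections import Counter

ground_truth = [
    {"file": "functions.py", "line": 3, "kind": "return", "type": "int"},
    {"file": "functions.py", "line": 3, "kind": "parameter", "type": "str"},
    {"file": "functions.py", "line": 5, "kind": "variable", "type": "list"},
]

inferred = [  # output of some type inference tool, already normalized
    {"file": "functions.py", "line": 3, "kind": "return", "type": "int"},
    {"file": "functions.py", "line": 5, "kind": "variable", "type": "dict"},
]

def exact_matches(truth, predictions):
    """Count predictions that agree exactly with the ground truth, grouped by kind."""
    predicted = {(p["file"], p["line"], p["kind"]): p["type"] for p in predictions}
    hits = Counter()
    for fact in truth:
        if predicted.get((fact["file"], fact["line"], fact["kind"])) == fact["type"]:
            hits[fact["kind"]] += 1
    return hits

print(exact_matches(ground_truth, inferred))  # Counter({'return': 1})
```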
🛠️ Supported Tools
Supported :white_check_mark: | In-progress :wrench: | Planned :bulb: |
---|---|---|
HeaderGen | Intellij PSI | MonkeyType |
Jedi | Pyre | Pyannotate |
Pyright | PySonar2 | |
HiTyper | Pytype | |
Scalpel | TypeT5 | |
Type4Py | | |
GPT-4 | | |
Ollama | | |
🏆 TypeEvalPy Leaderboard
Below is a comparison showcasing exact matches across different tools, coupled with top_n
predictions for ML-based tools.
Rank | 🛠️ Tool | Top-n | Function Return Type | Function Parameter Type | Local Variable Type | Total |
---|---|---|---|---|---|---|
1 | HeaderGen | 1 | 186 | 56 | 322 | 564 |
2 | Jedi | 1 | 122 | 0 | 293 | 415 |
3 | Pyright | 1 | 100 | 8 | 297 | 405 |
4 | HiTyper | 1 / 3 / 5 | 163 / 173 / 175 | 27 / 37 / 37 | 179 / 225 / 229 | 369 / 435 / 441 |
5 | HiTyper (static) | 1 | 141 | 7 | 102 | 250 |
6 | Scalpel | 1 | 155 | 32 | 6 | 193 |
7 | Type4Py | 1 / 3 / 5 | 39 / 103 / 109 | 19 / 31 / 31 | 99 / 167 / 174 | 157 / 301 / 314 |
(Auto-generated based on the analysis run on 20 Oct 2023)
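For the tools that report several Top-n values (HiTyper and Type4Py), a prediction is counted as correct when the ground-truth type appears among the tool's n most likely candidates, which is why the totals grow with n. Below is a minimal, hypothetical sketch of that counting rule, not TypeEvalPy's actual evaluation code.

```python
# Hypothetical sketch of top-n matching: a prediction counts as an exact match
# if the ground-truth type appears among the tool's n highest-ranked candidates.
def top_n_match(ground_truth_type, ranked_candidates, n):
    return ground_truth_type in ranked_candidates[:n]

candidates = ["str", "int", "list"]       # candidates ranked by model confidence
print(top_n_match("int", candidates, 1))  # False: "int" is not the top-1 prediction
print(top_n_match("int", candidates, 3))  # True: "int" is within the top 3
```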
🏆🤖 TypeEvalPy LLM Leaderboard
Below is a comparison showcasing exact matches for LLMs.
Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total |
---|---|---|---|---|---|
1 | GPT-4 | 225 | 85 | 465 | 775 |
2 | Finetuned:GPT 3.5 | 209 | 85 | 436 | 730 |
3 | codellama:13b-instruct | 199 | 75 | 425 | 699 |
4 | GPT 3.5 Turbo | 188 | 73 | 429 | 690 |
5 | codellama:34b-instruct | 190 | 52 | 425 | 667 |
6 | phind-codellama:34b-v2 | 182 | 60 | 399 | 641 |
7 | codellama:7b-instruct | 171 | 72 | 384 | 627 |
8 | dolphin-mistral | 184 | 76 | 356 | 616 |
9 | codebooga | 186 | 56 | 354 | 596 |
10 | llama2:70b | 168 | 55 | 342 | 565 |
11 | HeaderGen | 186 | 56 | 321 | 563 |
12 | wizardcoder:13b-python | 170 | 74 | 317 | 561 |
13 | llama2:13b | 153 | 40 | 283 | 476 |
14 | mistral:instruct | 155 | 45 | 250 | 450 |
15 | mistral:v0.2 | 155 | 45 | 248 | 448 |
16 | vicuna:13b | 153 | 35 | 260 | 448 |
17 | vicuna:33b | 133 | 29 | 267 | 429 |
18 | Jedi | 122 | 0 | 293 | 415 |
19 | Pyright | 100 | 8 | 297 | 405 |
19 | wizardcoder:7b-python | 103 | 48 | 254 | 405 |
20 | llama2:7b | 140 | 34 | 216 | 390 |
21 | HiTyper | 163 | 27 | 179 | 369 |
22 | wizardcoder:34b-python | 140 | 43 | 178 | 361 |
23 | orca2:7b | 117 | 27 | 184 | 328 |
24 | vicuna:7b | 131 | 17 | 172 | 320 |
25 | orca2:13b | 113 | 19 | 166 | 298 |
26 | Scalpel | 155 | 32 | 6 | 193 |
27 | Type4Py | 39 | 19 | 99 | 157 |
28 | tinyllama | 3 | 0 | 23 | 26 |
29 | phind-codellama:34b-python | 5 | 0 | 15 | 20 |
30 | codellama:13b-python | 0 | 0 | 0 | 0 |
31 | codellama:34b-python | 0 | 0 | 0 | 0 |
32 | codellama:7b-python | 0 | 0 | 0 | 0 |
(Auto-generated based on the analysis run on 14 Jan 2024)
:whale: Running with Docker
1️⃣ Clone the repo
git clone https://github.com/secure-software-engineering/TypeEvalPy.git
2️⃣ Build Docker image
docker build -t typeevalpy .
3️⃣ Run TypeEvalPy
🕒 Takes about 30 minutes on the first run to build the Docker containers.
📂 Results will be generated in the results folder within the root directory of the repository. Each results folder will have a timestamp, allowing you to easily track and compare different runs.
Correlation of Generated CSV Files to Tables in the ICSE Paper
Here is how the auto-generated CSV tables relate to the paper's tables:
- Table 1 in the paper is derived from three auto-generated CSV tables:
  - paper_table_1.csv: details exact matches by type category.
  - paper_table_2.csv: lists exact matches for the 18 micro-benchmark categories.
  - paper_table_3.csv: provides Sound and Complete values for tools.
- Table 2 in the paper is based on the following CSV table:
  - paper_table_5.csv: shows exact matches with top_n values for machine learning tools.
- Additionally, there are CSV tables that are not included in the paper:
  - paper_table_4.csv: contains Sound and Complete values for the 18 micro-benchmark categories.
  - paper_table_6.csv: features a sensitivity analysis.
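Because each run writes these CSV tables into a timestamped folder under results, they are easy to load for further comparison. Below is a minimal sketch; the exact location of the CSVs inside a run folder may vary between versions, so adjust the path to your layout.

```python
# Minimal sketch: pick the most recent run folder under ./results and load one of
# the auto-generated CSV tables with pandas. The CSV's location inside the run
# folder is an assumption; adjust the path to match your results layout.
from pathlib import Path

import pandas as pd

runs = [p for p in Path("results").iterdir() if p.is_dir()]
latest_run = max(runs, key=lambda p: p.stat().st_mtime)  # newest timestamped run
table_1 = pd.read_csv(latest_run / "paper_table_1.csv")  # exact matches by type category
print(table_1.head())
```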
docker run \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ./results:/app/results \
typeevalpy
🔧 Optionally, run analysis on specific tools:
docker run \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ./results:/app/results \
typeevalpy --runners headergen scalpel
🛠️ Available options: headergen, pyright, scalpel, jedi, hityper, type4py, hityperdl
🤖 Running TypeEvalPy with LLMs
TypeEvalPy integrates with LLMs through Ollama, streamlining their management. Begin by setting up your environment:
- Create Configuration File: Copy the config_template.yaml from the src directory and rename it to config.yaml.
In config.yaml, configure the following:
- openai_key: your key for accessing OpenAI's models.
- ollama_url: the URL for your Ollama instance. For simplicity, we recommend deploying Ollama using their Docker container. Get started with Ollama here.
- prompt_id: set this to questions_based_2 for optimal performance, based on our tests.
- ollama_models: select a list of model tags from the Ollama library. For better operation, ensure the model is pre-downloaded with the ollama pull command.
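For reference, a minimal config.yaml might look like the sketch below. The keys follow the list above; every value is a placeholder, the Ollama URL assumes the default local port, and writing the file from Python is purely for illustration, since editing the copied config_template.yaml by hand works just as well.

```python
# Hypothetical example of a minimal config.yaml, written from Python purely for
# illustration. All values are placeholders; the file location (src/config.yaml)
# and the default Ollama port are assumptions, so adjust them to your setup.
from pathlib import Path

CONFIG_YAML = """\
openai_key: sk-...                   # your OpenAI API key
ollama_url: http://localhost:11434   # URL of your Ollama instance
prompt_id: questions_based_2         # recommended prompt, per the list above
ollama_models:                       # model tags from the Ollama library
  - codellama:13b-instruct
  - mistral:instruct
"""

Path("src/config.yaml").write_text(CONFIG_YAML)
```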
With config.yaml configured, run the following command:
docker run \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ./results:/app/results \
typeevalpy --runners ollama
Running From Source...
1. 📥 Installation
- Clone the repo
git clone https://github.com/secure-software-engineering/TypeEvalPy.git
- Install Dependencies and Set Up a Virtual Environment
Run the following commands to create and activate a virtual environment and install the dependencies:
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
2. 🚀 Usage: Running the Analysis
- Navigate to the src Directory
cd src
- Execute the Analyzer
Run the following command to start the benchmarking process on all tools:
python main_runner.py
or run the analysis on specific tools:
python main_runner.py --runners headergen scalpel
🤝 Contributing
Thank you for your interest in contributing! To add support for a new tool, please use the Docker templates provided in our repository. After implementing and testing your tool, submit a pull request (PR) with a descriptive message. Our maintainers will review your submission and merge it.
To get started with integrating your tool, please follow the guide here: docs/Tool_Integration_Guide.md
⭐️ Show Your Support
Give a ⭐️ if this project helped you!