LLM Structured Output Benchmarks
Benchmark various LLM Structured Output frameworks (Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, LMFormatEnforcer, etc.) on tasks like multi-label classification, named entity recognition, and synthetic data generation.
Benchmark Results [2024-08-25]
- Multi-label classification

| Framework | Model | Reliability | Latency p95 (s) |
| --- | --- | --- | --- |
| Fructose | gpt-4o-mini-2024-07-18 | 1.000 | 1.138 |
| Modelsmith | gpt-4o-mini-2024-07-18 | 1.000 | 1.184 |
| OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 1.201 |
| Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.206 |
| Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 7.606* |
| LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 3.649* |
| Llamaindex | gpt-4o-mini-2024-07-18 | 0.996 | 0.853 |
| Marvin | gpt-4o-mini-2024-07-18 | 0.988 | 1.338 |
| Mirascope | gpt-4o-mini-2024-07-18 | 0.985 | 1.531 |

- Named Entity Recognition

| Framework | Model | Reliability | Latency p95 (s) | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 3.459 | 0.834 | 0.748 | 0.789 |
| LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 6.573* | 0.701 | 0.262 | 0.382 |
| Instructor | gpt-4o-mini-2024-07-18 | 0.998 | 2.438 | 0.776 | 0.768 | 0.772 |
| Mirascope | gpt-4o-mini-2024-07-18 | 0.989 | 3.879 | 0.768 | 0.738 | 0.752 |
| Llamaindex | gpt-4o-mini-2024-07-18 | 0.979 | 5.771 | 0.792 | 0.310 | 0.446 |
| Marvin | gpt-4o-mini-2024-07-18 | 0.979 | 3.270 | 0.822 | 0.776 | 0.798 |

- Synthetic Data Generation

| Framework | Model | Reliability | Latency p95 (s) | Variety |
| --- | --- | --- | --- | --- |
| Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.923 | 0.750 |
| Marvin | gpt-4o-mini-2024-07-18 | 1.000 | 1.496 | 0.010 |
| Llamaindex | gpt-4o-mini-2024-07-18 | 1.000 | 1.003 | 0.020 |
| Modelsmith | gpt-4o-mini-2024-07-18 | 0.970 | 2.324 | 0.835 |
| Mirascope | gpt-4o-mini-2024-07-18 | 0.790 | 3.383 | 0.886 |
| Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.350 | 3.577* | 1.000 |
| OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 0.650 | 1.431 | 0.877 |
| LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.650 | 2.561* | 0.662 |
\* Latencies marked with an asterisk were measured locally on an NVIDIA GeForce RTX 4080 Super GPU.
Run the benchmark
- Install the requirements: `pip install -r requirements.txt`
- Set the OpenAI API key: `export OPENAI_API_KEY=sk-...`
- Run the benchmark: `python -m main run-benchmark`
- Raw results are stored in the `results` directory.
- Generate the results:
  - Multilabel classification: `python -m main generate-results`
  - NER: `python -m main generate-results --task ner`
  - Synthetic data generation: `python -m main generate-results --task synthetic_data_generation`
- To get help on the command-line arguments, add `--help` after the command, e.g., `python -m main run-benchmark --help`.
Benchmark methodology
- Multi-label classification:
  - Task: Given a text, predict the labels associated with it.
  - Data:
    - Base data: Alexa intent detection dataset.
    - The benchmark is run on synthetic data generated by `python -m data_sources.generate_dataset generate-multilabel-data`. The synthetic data is generated by sampling and combining rows from the base data so that each row has multiple classes, following a chosen distribution over the number of classes per row. See `python -m data_sources.generate_dataset generate-multilabel-data --help` for more details.
  - Prompt: `"Classify the following text: {text}"`
  - Evaluation Metrics (a minimal computation sketch of these metrics follows this list):
    - Reliability: The percentage of times the framework returns valid labels without errors, computed as the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percentage of successful runs for each row.
- Named Entity Recognition:
  - Task: Given a text, extract the entities present in it.
  - Data:
    - Base data: Synthetic PII Finance dataset.
    - The benchmark is run on data sampled by `python -m data_sources.generate_dataset generate-ner-data`. The data is sampled from the base data so that the number of entities per row follows a chosen distribution. See `python -m data_sources.generate_dataset generate-ner-data --help` for more details.
  - Prompt: `Extract and resolve a list of entities from the following text: {text}`
  - Evaluation Metrics:
    - Reliability: The percentage of times the framework returns valid entities without errors, computed as the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
    - Precision: The micro average of the precision of the framework on the data.
    - Recall: The micro average of the recall of the framework on the data.
    - F1 Score: The micro average of the F1 score of the framework on the data.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percentage of successful runs for each row.
- Synthetic Data Generation:
  - Task: Generate synthetic data that conforms to a Pydantic data model schema.
  - Data: A two-level nested User details Pydantic schema.
  - Prompt: `Generate a random person's information. The name must be chosen at random. Make it something you wouldn't normally choose.`
  - Evaluation Metrics:
    - Reliability: The percentage of times the framework returns valid outputs without errors, computed as the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
    - Variety: The percentage of generated names that are unique across all generated names.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percentage of successful runs.
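The metric definitions above can be summarized in a short, illustrative sketch. This is not the repository's actual implementation; the function names, signatures, and the use of `numpy` are assumptions for illustration only.

```python
# Illustrative sketch only -- mirrors the metric definitions above,
# not the repository's actual code.
import numpy as np


def reliability(percent_successful: list[float]) -> float:
    """Average of the per-row percent_successful values."""
    return float(np.mean(percent_successful))


def latency_p95(latencies_s: list[float]) -> float:
    """95th percentile of per-call latencies, in seconds."""
    return float(np.percentile(latencies_s, 95))


def micro_precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all entities (NER task)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def variety(names: list[str]) -> float:
    """Fraction of generated names that are unique (synthetic data generation task)."""
    return len(set(names)) / len(names) if names else 0.0
```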
Adding new data
- Create a new pandas dataframe pickle file with the following columns (a minimal sketch follows this list):
  - `text`: The text to be sent to the framework.
  - `labels`: List of labels associated with the text.
  - See `data/multilabel_classification.pkl` for an example.
- Add the path to the new pickle file in the `./config.yaml` file under the `source_data_pickle_path` key for all the frameworks you want to test.
- Run the benchmark using `python -m main run-benchmark` to test the new data on all the frameworks!
- Generate the results using `python -m main generate-results`.
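For illustration, a new source-data pickle could be built as follows. This is a hypothetical sketch: the file name `data/my_new_dataset.pkl` and the label values are placeholders, not files or labels shipped with the repo.

```python
# Hypothetical sketch of creating a new source-data pickle with the expected columns.
import pandas as pd

df = pd.DataFrame(
    {
        "text": [
            "wake me up at five am this week",
            "play some relaxing jazz and dim the lights",
        ],
        "labels": [
            ["set_alarm"],                  # placeholder label names
            ["play_music", "dim_lights"],
        ],
    }
)
df.to_pickle("data/my_new_dataset.pkl")
# Then point source_data_pickle_path in ./config.yaml at "data/my_new_dataset.pkl".
```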
Adding a new framework
The easiest way to create a new framework is to reference the `./frameworks/instructor_framework.py` file. Detailed steps are as follows, with a minimal skeleton sketched after the list:
1. Create a `.py` file in the `frameworks` directory with the name of the framework, e.g., `instructor_framework.py` for the Instructor framework.
2. In this `.py` file, create a class that inherits `BaseFramework` from `frameworks.base`.
3. The class should define an `init` method that initializes the base class. Here are the arguments the base class expects:
   - `task` (str): The task that the framework is being tested on. Obtained from the `./config.yaml` file. Allowed values are `"multilabel_classification"` and `"ner"`.
   - `prompt` (str): Prompt template used. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `llm_model` (str): LLM model to be used. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `llm_model_family` (str): LLM model family to be used. Currently supported values are `"openai"` and `"transformers"`. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `retries` (int): Number of retries for the framework. Default is 0. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `source_data_pickle_path` (str): Path to the source data pickle file. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `sample_rows` (int): Number of rows to sample from the source data. Useful for testing on a smaller subset of data. Default is 0, which uses all rows in `source_data_pickle_path` for the benchmarking. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `response_model` (Any): The response model to be used. Internally passed by the benchmarking script.
4. The class should define a `run` method that takes the following arguments:
   - `task`: The task that the framework is being tested on. Obtained from the `task` in the `./config.yaml` file, e.g., `"multilabel_classification"`.
   - `n_runs`: Number of times to repeat each text.
   - `expected_response`: Output expected from the framework. Use a default value of `None`.
   - `inputs`: A dictionary of `{"text": str}` where `str` is the text to be sent to the framework. Use a default value of an empty dictionary `{}`.
5. This `run` method should create another `run_experiment` function that takes `inputs` as an argument, runs that input through the framework, and returns the output.
6. The `run_experiment` function should be annotated with the `@experiment` decorator from `frameworks.base`, with `n_runs`, `expected_response`, and `task` as arguments.
7. The `run` method should call the `run_experiment` function and return the four outputs: `predictions`, `percent_successful`, `metrics`, and `latencies`.
8. Import this new class in `frameworks/__init__.py`.
9. Add a new entry in the `./config.yaml` file with the name of the class as the key. The yaml entry can have the following fields:
   - `task`: The task that the framework is being tested on. Allowed values are `"multilabel_classification"` and `"ner"`.
   - `n_runs`: Number of times to repeat each text.
   - `init_kwargs`: All the arguments that need to be passed to the `init` method of the class, including those mentioned in step 3 above.
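Putting the steps together, a minimal skeleton might look like the sketch below. Only `BaseFramework`, the `@experiment` decorator, the `init`/`run` arguments, and the four return values come from the steps above; the class name `MyFramework`, the `self.client` attribute, the `generate(...)` call inside `run_experiment`, and attribute names such as `self.prompt` and `self.response_model` are placeholders for whichever library you are adding.

```python
# frameworks/my_framework.py -- hypothetical skeleton, not a working integration.
from typing import Any

from frameworks.base import BaseFramework, experiment


class MyFramework(BaseFramework):
    def __init__(self, *args, **kwargs) -> None:
        # Step 3: initialize the base class with the init_kwargs from ./config.yaml
        # (task, prompt, llm_model, llm_model_family, retries,
        #  source_data_pickle_path, sample_rows, response_model, ...).
        super().__init__(*args, **kwargs)
        # Placeholder: set up your framework's client here, e.g.
        # self.client = SomeStructuredOutputClient(model=self.llm_model)

    def run(self, task: str, n_runs: int, expected_response: Any = None, inputs: dict = {}):
        # Steps 5-6: wrap a single framework call and decorate it with @experiment
        # (keyword arguments assumed here; empty-dict default per the steps above).
        @experiment(n_runs=n_runs, expected_response=expected_response, task=task)
        def run_experiment(inputs):
            # Placeholder call -- replace with your framework's real API and
            # return an instance of self.response_model.
            return self.client.generate(
                prompt=self.prompt.format(**inputs),
                response_model=self.response_model,
            )

        # Step 7: the decorated function yields the four benchmark outputs.
        predictions, percent_successful, metrics, latencies = run_experiment(inputs)
        return predictions, percent_successful, metrics, latencies
```

The matching `./config.yaml` entry (step 9) could then look like this, with illustrative values:

```yaml
MyFramework:
  task: "multilabel_classification"
  n_runs: 10
  init_kwargs:
    prompt: "Classify the following text: {text}"
    llm_model: "gpt-4o-mini-2024-07-18"
    llm_model_family: "openai"
    source_data_pickle_path: "data/multilabel_classification.pkl"
```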
Roadmap
- Framework related tasks:
| Framework | Multi-label classification | Named Entity Recognition | Synthetic Data Generation |
| --- | --- | --- | --- |
| OpenAI Structured Output | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
| Instructor | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
| Mirascope | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
| Fructose | ✅ OpenAI | 🚧 In Progress | 🚧 In Progress |
| Marvin | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
| Llamaindex | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
| Modelsmith | ✅ OpenAI | 🚧 In Progress | ✅ OpenAI |
| Outlines | ✅ HF Transformers | 🚧 In Progress | ✅ HF Transformers |
| LM format enforcer | ✅ HF Transformers | ✅ HF Transformers | ✅ HF Transformers |
| Jsonformer | ❌ No Enum Support | 💭 Planning | 💭 Planning |
| Strictjson | ❌ Non-standard schema | ❌ Non-standard schema | ❌ Non-standard schema |
| Guidance | 💭 Planning | 💭 Planning | 💭 Planning |
| DsPy | 💭 Planning | 💭 Planning | 💭 Planning |
| Langchain | 💭 Planning | 💭 Planning | 💭 Planning |

- Others
- [x] Latency metrics
- [ ] CI/CD pipeline for benchmark run automation
- [ ] Async run
Contribution guidelines
Contributions are welcome! Here are the steps to contribute:
- Please open an issue with any new framework you would like to add. This will help avoid duplication of effort.
- Once the issue is assigned to you, please submit a PR with the new framework!
Citation
To cite LLM Structured Output Benchmarks in your work, please use the following BibTeX reference:

```bibtex
@software{marie_stephen_leo_2024_12327267,
  author    = {Marie Stephen Leo},
  title     = {{stephenleo/llm-structured-output-benchmarks: Release for Zenodo}},
  month     = jun,
  year      = 2024,
  publisher = {Zenodo},
  version   = {v0.0.1},
  doi       = {10.5281/zenodo.12327267},
  url       = {https://doi.org/10.5281/zenodo.12327267}
}
```
Feedback
If this work helped you in any way, please consider giving this repository a ⭐ as feedback so I can spend more time on this project.