Awesome Knowledge Distillation of LLM Papers
A Survey on Knowledge Distillation of Large Language Models
Xiaohan Xu1   Ming Li2   Chongyang Tao3   Tao Shen4   Reynold Cheng1   Jinyang Li1   Can Xu5   Dacheng Tao6   Tianyi Zhou2  
1 The University of Hong Kong    2 University of Maryland    3 Microsoft    4 University of Technology Sydney    5 Peking University    6 The University of Sydney

A collection of papers related to knowledge distillation of large language models (LLMs). If you want to use LLMs to improve the training of your own smaller models, or to use self-generated knowledge for self-improvement, take a look at this collection.
We update this collection weekly. Welcome to star ⭐️ this repo to keep track of the updates.
❗️Legal Consideration: It is crucial to note the legal implications of utilizing LLM outputs, such as those from ChatGPT (Restrictions), Llama (License), etc. We strongly advise users to adhere to the terms of use specified by the model providers, such as restrictions on developing competitive products.
💡 News
- 2024-2-20: 📃 We released the survey paper "A Survey on Knowledge Distillation of Large Language Models". We welcome you to read and cite it, and we look forward to your feedback and suggestions.

Update Log
- 2024-3-19: Added 14 papers.
Contributing to This Collection
Feel free to open an issue or PR, or e-mail [email protected], [email protected], [email protected], and [email protected] if you find any missing taxonomies or papers. We will keep updating this collection and the survey.
📝 Introduction
KD of LLMs: This survey delves into knowledge distillation (KD) techniques in Large Language Models (LLMs), highlighting KD's crucial role in transferring advanced capabilities from proprietary LLMs like GPT-4 to open-source counterparts such as LLaMA and Mistral. We also explore how KD enables the compression and self-improvement of open-source LLMs by using them as teachers.
KD and Data Augmentation: Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs' performance. By leveraging DA to generate context-rich, skill-specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts.
Taxonomy: Our analysis is meticulously structured around three foundational pillars: algorithm, skill, and verticalization -- providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields.
KD Algorithms: For KD algorithms, we categorize the process into two principal steps: "Knowledge Elicitation", which focuses on eliciting knowledge from teacher LLMs, and "Distillation Algorithms", which focus on injecting this knowledge into student models. A minimal code sketch of the elicitation step follows the figure below.

Figure: An illustration of different knowledge elicitation methods from teacher LLMs.
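To make the elicitation step concrete, here is a minimal sketch of the simplest method, labeling, in which a proprietary teacher LLM annotates unlabeled inputs to build a distillation dataset. It assumes the official `openai` Python client; the teacher model name, prompts, and inputs are illustrative placeholders rather than choices from any particular paper.

```python
# Hedged sketch: knowledge elicitation via labeling (illustrative only).
# Assumes the official `openai` client and OPENAI_API_KEY in the environment;
# the teacher model name, prompt, and inputs are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()
TEACHER_MODEL = "gpt-4"  # placeholder for a proprietary teacher LLM

unlabeled_inputs = [
    "Explain why the sky is blue.",
    "Summarize the causes of the French Revolution.",
]

def elicit_labels(inputs):
    """Ask the teacher to answer each input, producing (instruction, output)
    pairs that later serve as supervised training data for a student."""
    pairs = []
    for text in inputs:
        response = client.chat.completions.create(
            model=TEACHER_MODEL,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": text},
            ],
        )
        pairs.append({"instruction": text,
                      "output": response.choices[0].message.content})
    return pairs

distillation_data = elicit_labels(unlabeled_inputs)
```

The same loop generalizes to the other elicitation methods: expansion seeds the prompt with demonstrations so the teacher generates new data, and feedback asks the teacher to critique or score student outputs instead of answering directly.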
Skill Distillation: We delve into the enhancement of specific cognitive abilities, such as context following, alignment, agent, NLP task specialization, and multi-modality.
Verticalization Distillation: We explore the practical implications of KD across diverse fields, including law, medical & healthcare, finance, science, and miscellaneous domains.
Note that both Skill Distillation and Verticalization Distillation employ the Knowledge Elicitation and Distillation Algorithms described under KD Algorithms, so the categories overlap. This overlap, however, offers complementary perspectives on the same papers.
Why KD of LLMs?
In the era of LLMs, KD of LLMs plays the following crucial roles:

Role | Description | Trend |
---|---|---|
① Advancing Smaller Models | Transferring advanced capabilities from proprietary LLMs to open-source LLMs or other smaller models. | Most common |
② Compression | Compressing open-source LLMs to make them more efficient and practical. | More popular with the prosperity of open-source LLMs |
③ Self-Improvement | Refining open-source LLMs' performance by leveraging their own knowledge, i.e. self-knowledge. | New trend to make open-source LLMs more competitive |
📒 Table of Contents
- KD Algorithms
  - Knowledge Elicitation
    - Labeling
    - Expansion
    - Curation
    - Feature
    - Feedback
    - Self-Knowledge
  - Distillation Algorithms
    - Supervised Fine-Tuning
    - Divergence and Similarity
    - Reinforcement Learning
    - Rank Optimization
- Skill Distillation
  - Context Following
    - Instruction Following
    - Multi-turn Dialogue
    - RAG Capability
  - Alignment
    - Thinking Pattern
    - Preference
    - Value
  - Agent
    - Tool Using
    - Planning
  - NLP Task Specialization
    - NLU
    - NLG
    - Information Retrieval
    - Recommendation
    - Text Generation Evaluation
    - Code
  - Multi-Modality
  - Summary Table
- Verticalization Distillation
  - Law
  - Medical & Healthcare
  - Finance
  - Science
  - Misc.
- Encoder-based KD
- Citation
KD Algorithms
Knowledge Elicitation
Labeling
Expansion
Curation
Feature
Feedback
Self-Knowledge
Distillation Algorithms
Supervised Fine-Tuning
Due to the large number of works applying supervised fine-tuning, we only list the most representative ones here.
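As a concrete, deliberately simplified illustration of this step, the sketch below fine-tunes a student on teacher-elicited (instruction, output) pairs with the standard next-token cross-entropy objective. It assumes Hugging Face `transformers`; the `gpt2` student and the single training pair are placeholders, not the recipe of any listed paper.

```python
# Hedged sketch: supervised fine-tuning a student on teacher-elicited pairs.
# Assumes Hugging Face `transformers`; model and data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in student model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Teacher-generated pairs, e.g. produced by the labeling sketch above.
pairs = [{"instruction": "Explain why the sky is blue.",
          "output": "Sunlight scatters off air molecules..."}]

model.train()
for pair in pairs:
    text = pair["instruction"] + "\n" + pair["output"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # With labels == input_ids, the model internally shifts the targets by
    # one position and computes the next-token cross-entropy loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```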
Divergence and Similarity
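A common instance of this family minimizes the forward KL divergence between the teacher's and student's token distributions, in the spirit of classic logit distillation. The PyTorch sketch below is illustrative and assumes both models share a tokenizer, so their vocabulary logits are comparable.

```python
# Hedged sketch: forward-KL distillation between teacher and student logits.
# Assumes a shared tokenizer/vocabulary; shapes and tensors are illustrative.
import torch
import torch.nn.functional as F

def forward_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student), averaged over all token positions.

    Both logits tensors have shape (batch, seq_len, vocab_size).
    """
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
    t = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)
    # F.kl_div expects log-probabilities as input and probabilities as target;
    # the temperature**2 factor keeps gradient scale comparable across T.
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

# Toy usage with random tensors standing in for real model outputs:
student_logits = torch.randn(2, 16, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 16, 32000)
forward_kl_loss(student_logits, teacher_logits, temperature=2.0).backward()
```

Similarity-based variants replace the divergence term with, e.g., a cosine or L2 distance between teacher and student hidden states.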
Reinforcement Learning
Rank Optimization
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Evidence-Focused Fact Summarization for Knowledge-Augmented Zero-Shot Question Answering | arXiv | 2024-03 | | |
KnowTuning: Knowledge-aware Fine-tuning for Large Language Models | arXiv | 2024-02 | Github | |
Self-Rewarding Language Models | arXiv | 2024-01 | Github | |
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models | arXiv | 2024-01 | Github | Data |
Zephyr: Direct Distillation of Language Model Alignment | arXiv | 2023-10 | Github | Data |
CycleAlign: Iterative Distillation from Black-box LLM to White-box Models for Better Human Alignment | arXiv | 2023-10 |
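Several rank-optimization works above (e.g., Zephyr, Self-Rewarding Language Models) build on Direct Preference Optimization (DPO)-style objectives that push the student to rank a preferred response above a rejected one. Below is a minimal sketch of the core DPO loss on sequence log-probabilities; it is illustrative, not the exact recipe of any listed paper.

```python
# Hedged sketch of a DPO-style rank-optimization loss (illustrative only).
# Inputs are summed per-sequence log-probabilities under the trained policy
# and a frozen reference model; the toy values below are made up.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """-log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))), averaged."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage: the policy already prefers the chosen response slightly.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```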
Skill Distillation
Context Following
Instruction Following
Multi-turn Dialogue
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Zephyr: Direct Distillation of LM Alignment | arXiv | 2023-10 | Github | Data |
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ICLR | 2023-09 | Github | Data |
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations | arXiv | 2023-05 | Github | Data |
Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data | EMNLP | 2023-04 | Github | Data |
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality | - | 2023-03 | Github | Data |
RAG Capability
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | NeurIPS | 2023-10 | Github | Data |
SAIL: Search-Augmented Instruction Learning | arXiv | 2023-05 | Github | Data |
Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks | NeurIPS | 2023-05 | Github | Data |
Alignment
Thinking Pattern
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Aligning Large and Small Language Models via Chain-of-Thought Reasoning | EACL | 2024-03 | Github | |
Divide-or-Conquer? Which Part Should You Distill Your LLM? | arXiv | 2024-02 | ||
Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning | arXiv | 2024-02 | Github | Data |
Can LLMs Speak For Diverse People? Tuning LLMs via Debate to Generate Controllable Controversial Statements | arXiv | 2024-02 | Github | Data |
Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering | arXiv | 2023-11 | Github | |
Orca 2: Teaching Small Language Models How to Reason | arXiv | 2023-11 | ||
Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning | NeurIPS Workshop | 2023-10 | Github | Data |
Orca: Progressive Learning from Complex Explanation Traces of GPT-4 | arXiv | 2023-06 | ||
SelFee: Iterative Self-Revising LLM Empowered by Self-Feedback Generation | arXiv | 2023-05 |
Preference
Title | Venue | Date | Code | Data |
---|---|---|---|---|
UltraFeedback: Boosting Language Models with High-quality Feedback | arXiv | 2023-10 | Github | Data |
Zephyr: Direct Distillation of LM Alignment | arXiv | 2023-10 | Github | Data |
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | arXiv | 2023-09 | ||
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ICLR | 2023-09 | Github | Data |
RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment | arXiv | 2023-07 | Github | |
Aligning Large Language Models through Synthetic Feedback | EMNLP | 2023-05 | Github | Data |
Reward Design with Language Models | ICLR | 2023-03 | Github | |
Training Language Models with Language Feedback at Scale | arXiv | 2023-03 | ||
Constitutional AI: Harmlessness from AI Feedback | arXiv | 2022-12 |
Value
Title | Venue | Date | Code | Data |
---|---|---|---|---|
UltraFeedback: Boosting Language Models with High-quality Feedback | arXiv | 2023-10 | Github | Data |
RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment | arXiv | 2023-07 | Github | |
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision | NeurIPS | 2023-05 | Github | Data |
Training Socially Aligned Language Models on Simulated Social Interactions | arXiv | 2023-05 | ||
Constitutional AI: Harmlessness from AI Feedback | arXiv | 2022-12 |
Agent
Tool Using
Planning
NLP Task Specialization
NLU
NLG
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Tailoring Self-Rationalizers with Multi-Reward Distillation | arXiv | 2023-11 | Github | Data |
RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation | arXiv | 2023-10 | Github | |
Neural Machine Translation Data Generation and Augmentation using ChatGPT | arXiv | 2023-07 | ||
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes | ICLR | 2023-06 | ||
Can LLMs generate high-quality synthetic note-oriented doctor-patient conversations? | arXiv | 2023-06 | Github | Data |
InheritSumm: A General, Versatile and Compact Summarizer by Distilling from GPT | EMNLP | 2023-05 | ||
Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing | arXiv | 2023-05 | Github | |
Data Augmentation for Radiology Report Simplification | Findings of EACL | 2023-04 | Github | |
Want To Reduce Labeling Cost? GPT-3 Can Help | Findings of EMNLP | 2021-08 |
Information Retrieval
Recommendation
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Can Small Language Models be Good Reasoners for Sequential Recommendation? | arXiv | 2024-03 | ||
Large Language Model Augmented Narrative Driven Recommendations | arXiv | 2023-06 | ||
Recommendation as Instruction Following: A Large Language Model Empowered Recommendation Approach | arXiv | 2023-05 | ||
ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models | WSDM | 2023-05 | Github | Data |
Text Generation Evaluation
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models | ICLR | 2023-10 | Github | Data |
TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks | arXiv | 2023-10 | Github | Data |
Generative Judge for Evaluating Alignment | ICLR | 2023-10 | Github | Data |
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization | arXiv | 2023-06 | Github | Data |
INSTRUCTSCORE: Explainable Text Generation Evaluation with Fine-grained Feedback | EMNLP | 2023-05 | Github | Data |
Code
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Magicoder: Source Code Is All You Need | arXiv | 2023-12 | Github | Data Data |
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation | arXiv | 2023-12 | ||
Instruction Fusion: Advancing Prompt Evolution through Hybridization | arXiv | 2023-12 | ||
MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning | arXiv | 2023-11 | Github | Data Data |
LLM-Assisted Code Cleaning For Training Accurate Code Generators | arXiv | 2023-11 | ||
Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation | EMNLP | 2023-10 | Github | |
Code Llama: Open Foundation Models for Code | arXiv | 2023-08 | Github | |
Distilled GPT for Source Code Summarization | arXiv | 2023-08 | Github | Data |
Textbooks Are All You Need | arXiv | 2023-06 | ||
Code Alpaca: An Instruction-following LLaMA model for code generation | - | 2023-03 | Github | Data |
Multi-Modality
Summary Table

Figure: A summary of representative works about skill distillation.
Verticalization Distillation
Law
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Fuzi | - | 2023-08 | Github | |
ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases | arXiv | 2023-06 | Github | |
Lawyer LLaMA Technical Report | arXiv | 2023-05 | Github | Data |
Medical & Healthcare
Finance
Title | Venue | Date | Code | Data |
---|---|---|---|---|
XuanYuan 2.0: A Large Chinese Financial Chat Model with Hundreds of Billions Parameters | CIKM | 2023-05 |
Science
Misc.
Title | Venue | Date | Code | Data |
---|---|---|---|---|
OWL: A Large Language Model for IT Operations | arXiv | 2023-09 | Github | Data |
EduChat: A Large-Scale Language Model-based Chatbot System for Intelligent Education | arXiv | 2023-08 | Github | Data |
Encoder-based KD
Note: Our survey mainly focuses on generative LLMs, so encoder-based KD is not included in the survey. However, we are also interested in this topic and will continue to update the latest works in this area.
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling | Findings of ACL | 2023-08 | ||
Better Together: Jointly Using Masked Latent Semantic Modeling and Masked Language Modeling for Sample Efficient Pre-training | CoNLL | 2023-08 |
Citation
If you find this repository helpful, please consider citing the following paper:
@misc{xu2024survey,
      title={A Survey on Knowledge Distillation of Large Language Models},
      author={Xiaohan Xu and Ming Li and Chongyang Tao and Tao Shen and Reynold Cheng and Jinyang Li and Can Xu and Dacheng Tao and Tianyi Zhou},
      year={2024},
      eprint={2402.13116},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}