
Awesome Evaluation of Visual Generation


This repository collects works on the evaluation of visual generation models, including evaluation metrics, models, and systems.

(Figure: overall structure of this repository)

Overview

What You'll Find Here

Within this repository, we collect works that aim to answer critical questions in the field of evaluating visual generation, such as:

  • Model Evaluation: How does one determine the quality of a specific image or video generation model?
  • Sample/Content Evaluation: What methods can be used to evaluate the quality of a particular generated image or video?
  • User Control Consistency Evaluation: How well do generated images and videos align with the user controls or inputs?

Updates

This repository is updated periodically. If you have suggestions for additional resources, updates on methodologies, or fixes for broken links, please feel free to:

  • raise an Issue,
  • nominate awesome related works via Pull Requests, or
  • contact us via email (ZIQI002 at e dot ntu dot edu dot sg).

Table of Contents

  • 1. Evaluation Metrics of Generative Models
    • 1.1. Evaluation Metrics of Image Generation
    • 1.2. Evaluation Metrics of Video Generation
    • 1.3. Evaluation Metrics for Latent Representation
  • 2. Evaluation Metrics of Condition Consistency
    • 2.1. Evaluation Metrics of Multi-Modal Condition Consistency
    • 2.2. Evaluation Metrics of Image Similarity
  • 3. Evaluation Systems of Generative Models
    • 3.1. Evaluation of Unconditional Image Generation
    • 3.2. Evaluation of Text-to-Image Generation
    • 3.3. Evaluation of Text-Based Image Editing
    • 3.4. Evaluation of Video Generation
    • 3.5. Evaluation of Text-to-Motion Generation
    • 3.6. Evaluation of Model Trustworthiness
    • 3.7. Evaluation of Entity Relation
  • 4. Improving Visual Generation with Evaluation / Feedback / Reward
  • 5. Quality Assessment for AIGC
  • 6. Study and Rethinking
  • 7. Other Useful Resources

1. Evaluation Metrics of Generative Models

1.1. Evaluation Metrics of Image Generation

| Metric | Paper | Code |
|---|---|---|
| Inception Score (IS) | Improved Techniques for Training GANs (NeurIPS 2016) | |
| Fréchet Inception Distance (FID) | GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017) | Code |
| Kernel Inception Distance (KID) | Demystifying MMD GANs (ICLR 2018) | Code |
| CLIP-FID | The Role of ImageNet Classes in Fréchet Inception Distance (ICLR 2023) | Code |
| Precision-and-Recall | Assessing Generative Models via Precision and Recall (NeurIPS 2018); Improved Precision and Recall Metric for Assessing Generative Models (NeurIPS 2019) | Code |
| Rényi Kernel Entropy (RKE) | An Information-Theoretic Evaluation of Generative Models in Learning Multi-modal Distributions (NeurIPS 2023) | Code |
| CLIP Maximum Mean Discrepancy (CMMD) | Rethinking FID: Towards a Better Evaluation Metric for Image Generation (CVPR 2024) | Code |

NOTE: evaluates text-to-image generation using vision-language models (VLMs).

NOTE: introduces the RND metric.

NOTE: Fréchet Joint Distance (FJD) assesses image quality, conditional consistency, and intra-conditioning diversity within a single metric.
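FID and its variants reduce to the Fréchet distance between two Gaussians fitted to feature sets. The following is a minimal numpy sketch of that distance; it omits the feature-extraction step (the real metric uses Inception-v3 pool3 features of real vs. generated images), and any `(N, D)` feature arrays stand in here for illustration:

```python
import numpy as np

def _sqrtm_psd(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    w = np.clip(w, 0.0, None)  # clip tiny negative eigenvalues from numerical noise
    return (v * np.sqrt(w)) @ v.T

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((Ca Cb)^{1/2}) is computed as Tr((Ca^{1/2} Cb Ca^{1/2})^{1/2}),
    # which keeps every intermediate matrix symmetric PSD.
    sqrt_a = _sqrtm_psd(cov_a)
    tr_covmean = np.trace(_sqrtm_psd(sqrt_a @ cov_b @ sqrt_a))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_covmean)
```

Identical feature sets give a distance of zero, and a mean shift of m in every one of the D feature dimensions adds D·m² to the distance; in practice one would use a maintained implementation (e.g. the official FID code linked above) rather than this sketch.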

1.2. Evaluation Metrics of Video Generation

| Metric | Paper | Code |
|---|---|---|
| FID-vid | GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017) | |
| Fréchet Video Distance (FVD) | Towards Accurate Generative Models of Video: A New Metric & Challenges (arXiv 2018); FVD: A New Metric for Video Generation (ICLR 2019 DeepGenStruct Workshop) | Code |

1.3. Evaluation Metrics for Latent Representation

2. Evaluation Metrics of Condition Consistency

2.1. Evaluation Metrics of Multi-Modal Condition Consistency

| Metric | Condition | Pipeline | Code | References |
|---|---|---|---|---|
| CLIP Score (a.k.a. CLIPSIM) | Text | cosine similarity between the CLIP image and text embeddings | Code; PyTorch Lightning | CLIP Paper (ICML 2021). Metric first used in the CLIPScore Paper (arXiv 2021); the GODIVA Paper (arXiv 2021) applies it to video evaluation. |
| Mask Accuracy | Segmentation Mask | predict the segmentation mask and compute pixel-wise accuracy against the ground-truth segmentation mask | any segmentation method for your setting | |
| DINO Similarity | Image of a Subject (human / object, etc.) | cosine similarity between the DINO embeddings of the generated image and the condition image | Code | DINO paper. Metric proposed in DreamBooth. |

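CLIP Score boils down to a scaled, clipped cosine similarity between an image embedding and a text embedding. A minimal sketch of that computation, assuming the embeddings have already been produced by CLIP's image and text encoders (plain numpy vectors stand in here), and following the common 100× scaling used by popular implementations:

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """CLIPScore-style similarity: 100 * cosine similarity, clipped at 0.

    image_emb / text_emb would normally come from CLIP's image and text
    encoders; any equal-length vectors are accepted for illustration.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return max(100.0 * float(image_emb @ text_emb), 0.0)
```

A perfectly aligned pair scores 100, an orthogonal or opposed pair scores 0; for video evaluation (CLIPSIM), the same score is typically averaged over sampled frames.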

2.2. Evaluation Metrics of Image Similarity

| Metric | Paper | Code |
|---|---|---|
| Learned Perceptual Image Patch Similarity (LPIPS) | The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (CVPR 2018) | Code; Website |
| Structural Similarity Index (SSIM) | Image quality assessment: from error visibility to structural similarity (TIP 2004) | Code |
| Peak Signal-to-Noise Ratio (PSNR) | - | Code |
| Multi-Scale Structural Similarity Index (MS-SSIM) | Multiscale structural similarity for image quality assessment (SSC 2004) | PyTorch-Metrics |
| Feature Similarity Index (FSIM) | FSIM: A Feature Similarity Index for Image Quality Assessment (TIP 2011) | Code |
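Of the metrics above, PSNR is the simplest: a log-scaled inverse of the mean squared error between two images. A minimal sketch, assuming 8-bit images with a peak value of 255:

```python
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB (higher = more similar)."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

PSNR is purely pixel-wise, which is why perceptual metrics such as SSIM and LPIPS are usually reported alongside it.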

The community has also been using DINO or CLIP features to measure the semantic similarity of two images / frames.

There are also recent works proposing new methods to measure visual similarity (more will be added).

3. Evaluation Systems of Generative Models

3.1. Evaluation of Unconditional Image Generation

3.2. Evaluation of Text-to-Image Generation

NOTE: introduces the NewEpisode benchmark.

NOTE: introduces the GroundingScore metric.

NOTE: emotion accuracy, semantic clarity, and semantic diversity are not core contributions of this paper.

NOTE: proposes an evaluation approach for an early-stopping criterion in T2I customization.

NOTE: introduces the new Cross-Model Distance metric.

3.3. Evaluation of Text-Based Image Editing

NOTE: introduces a novel automatic mask-based evaluation metric tailored to various object-centric editing scenarios.

NOTE: introduces the manipulative precision metric.

3.4. Evaluation of Video Generation

3.4.1. Evaluation of Text-to-Video Generation

3.4.2. Evaluation of Image-to-Video Generation

3.4.3. Evaluation of Talking Face Generation

3.5. Evaluation of Text-to-Motion Generation

3.6. Evaluation of Model Trustworthiness

3.6.1. Evaluation of Visual-Generation-Model Trustworthiness

NOTE: introduces the novel CAS and BAV metrics.

3.6.2. Evaluation of Non-Visual-Generation-Model Trustworthiness

Not for visual generation, but covering related evaluations of other models, such as LLMs.

3.7. Evaluation of Entity Relation

4. Improving Visual Generation with Evaluation / Feedback / Reward

5. Quality Assessment for AIGC

5.1. Image Quality Assessment for AIGC

5.2. Aesthetic Predictors for Generated Images

6. Study and Rethinking

6.1. Evaluation of Evaluations

6.2. Survey

6.3. Study

6.4. Competition

7. Other Useful Resources

  • Stanford Course: CS236 "Deep Generative Models" - Lecture 15 "Evaluation of Generative Models" [slides]