AI-paper-digest

Published 20 hours ago


Paper List

2501

VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

Unifying Specialized Visual Encoders for Video Language Models

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Nested Attention: Semantic-aware Attention Values for Concept Personalization

SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration

ProgCo: Program Helps Self-Correction of Large Language Models

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization

A3: Android Agent Arena for Mobile GUI Agents

Graph Generative Pre-trained Transformer

Dynamic Scaling of Unit Tests for Code Reward Modeling

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Population Aware Diffusion for Time Series Generation

Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding

Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

MLLM-as-a-Judge for Image Safety without Human Labeling

LTX-Video: Realtime Video Latent Diffusion

2412

PERSE: Personalized 3D Generative Avatars from A Single Portrait

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Aviary: training language agents on challenging scientific tasks

PyG-SSL: A Graph Self-Supervised Learning Toolkit

Facilitating large language model Russian adaptation with Learned Embedding Propagation

Training Software Engineering Agents and Verifiers with SWE-Gym

Edicho: Consistent Image Editing in the Wild

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

MapQaTor: A System for Efficient Annotation of Map Query Datasets

Efficiently Serving LLM Reasoning Programs with Certaindex

Slow Perception: Let's Perceive Geometric Figures Step-by-step

Bringing Objects to Life: 4D generation from 3D objects

On the Compositional Generalization of Multimodal LLMs for Medical Imaging

OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

From Elements to Design: A Layered Approach for Automatic Graphic Design Composition

Toward Adaptive Reasoning in Large Language Models with Thought Rollback

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

Xmodel-2 Technical Report

Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

Introduction to Graph Neural Networks: A Starting Point for Machine Learning Engineers

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Token-Budget-Aware LLM Reasoning

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

GeAR: Graph-enhanced Agent for Retrieval-augmented Generation

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation

DepthLab: From Partial to Complete

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

WavePulse: Real-time Content Analytics of Radio Livestreams

Large Motion Video Autoencoding with Cross-modal Video VAE

Automating the Search for Artificial Life with Foundation Models

PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion

ResearchTown: Simulator of Human Research Community

The Superposition of Diffusion Models Using the Itô Density Estimator

In Case You Missed It: ARC 'Challenge' Is Not That Challenging

Deliberation in Latent Space via Differentiable Cache Augmentation

YuLan-Mini: An Open Data-efficient Language Model

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

VidTwin: Video VAE with Decoupled Structure and Dynamics

SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images

PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought

A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression

Diving into Self-Evolving Training for Multimodal Reasoning

Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

Better Think with Tables: Leveraging Tables to Enhance Large Language Model Comprehension

Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching

GraphAgent: Agentic Graph Language Assistant

System-2 Mathematical Reasoning via Enriched Instruction Tuning

Revisiting In-Context Learning with Long Context Language Models

OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning

OpenAI o1 System Card

NILE: Internal Consistency Alignment in Large Language Models

LearnLM: Improving Gemini for Learning

Offline Reinforcement Learning for LLM Multi-Step Reasoning

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Fietje: An open, efficient LLM for Dutch

SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval

Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency

LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Rethinking Uncertainty Estimation in Natural Language Generation

Parallelized Autoregressive Visual Generation

Outcome-Refining Process Supervision for Code Generation

Qwen2.5 Technical Report

AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps

IDOL: Instant Photorealistic 3D Human Creation from a Single Image

RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response

Progressive Multimodal Reasoning via Active Retrieval

ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

How to Synthesize Text Data without Model Collapse?

TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Agent-SafetyBench: Evaluating the Safety of LLM Agents

Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion

A Survey on LLM Inference-Time Self-Improvement

PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

AniDoc: Animation Creation Made Easier

Learning from Massive Human Videos for Universal Humanoid Pose Control

FashionComposer: Compositional Fashion Image Generation

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities

Alignment faking in large language models

CAD-Recode: Reverse Engineering CAD Code from Point Clouds

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer

Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment

AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

GUI Agents: A Survey

DateLogicQA: Benchmarking Temporal Biases in Large Language Models

Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents

Move-in-2D: 2D-Conditioned Human Motion Generation

Are Your LLMs Capable of Stable Reasoning?

VidTok: A Versatile and Open-Source Video Tokenizer

OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain

Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

MIVE: New Design and Benchmark for Multi-Instance Video Editing

Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models

ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers

When to Speak, When to Abstain: Contrastive Decoding with Abstention

Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers

How to Choose a Threshold for an Evaluation Metric for Large Language Models

Causal Diffusion Transformers for Generative Modeling

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

Wonderland: Navigating 3D Scenes from a Single Image

IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

The Open Source Advantage in Large Language Models (LLMs)

Cost-Effective Label-free Node Classification with LLMs

Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning

Precise Length Control in Large Language Models

A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges

Stepwise Reasoning Error Disruption Attack of LLMs

RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation

Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture

ColorFlow: Retrieval-Augmented Image Sequence Colorization

Just a Simple Transformation is Enough for Data Protection in Vertical Federated Learning

SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors

Sequence Matters: Harnessing Video Models in 3D Super-Resolution

MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes

Whisper-GPT: A Hybrid Representation Audio Large Language Model

Reliable, Reproducible, and Really Fast Leaderboards with Evalica

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

Smaller Language Models Are Better Instruction Evolvers

DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes

Superhuman performance of a large language model on the reasoning tasks of a physician

VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Generative AI in Medicine

SCBench: A KV Cache-Centric Analysis of Long-Context Methods

BrushEdit: All-In-One Image Inpainting and Editing

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Can LLMs Convert Graphs to Text-Attributed Graphs?

Large Action Models: From Inception to Implementation

Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images

Byte Latent Transformer: Patches Scale Better Than Tokens

Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models

Bridging AI and Science: Implications from a Large-Scale Literature Analysis of AI4Science

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

GenEx: Generating an Explorable World

LoRACLR: Contrastive Adaptation for Customization of Diffusion Models

SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers

AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion

Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction

JuStRank: Benchmarking LLM Judges for System Ranking

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

Learned Compression for Compressed Learning

Word Sense Linking: Disambiguating Outside the Sandbox

DisPose: Disentangling Pose Guidance for Controllable Human Image Animation

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages

Arbitrary-steps Image Super-resolution via Diffusion Inversion

RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

Phi-4 Technical Report

Large Concept Models: Language Modeling in a Sentence Representation Space

Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

VisionArena: 230K Real World User-VLM Conversations with Preference Labels

StreamChat: Chatting with Streaming Video

ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation

Multimodal Latent Language Modeling with Next-Token Diffusion

FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models

LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations

StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements

Learning Flow Fields in Attention for Controllable Person Image Generation

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

POINTS1.5: Building a Vision-Language Model towards Real World Applications

SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs

Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision?

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation

Video Motion Transfer with Diffusion Transformers

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation

StyleMaster: Stylize Your Video with Artistic Generation and Translation

STIV: Scalable Text and Image Conditioned Video Generation

Granite Guardian

ObjCtrl-2.5D: Training-free Object Control with Camera Poses

The Pitfalls of Memorization: When Memorization Hurts Generalization

FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

Mobile Video Diffusion

FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing

Causal World Representation in the GPT Model

Contextualized Counterspeech: Strategies for Adaptation, Personalization, and Evaluation

Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation

HARP: Hesitation-Aware Reframing in Transformer Inference Pass

A New Federated Learning Framework Against Gradient Inversion Attacks

MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation

Asynchronous LLM Function Calling

AutoReason: Automatic Few-Shot Reasoning Decomposition

Fully Open Source Moxin-7B Technical Report

CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Training Large Language Models to Reason in a Continuous Latent Space

ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

EMOv2: Pushing 5M Vision Model Frontier

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

MoViE: Mobile Diffusion for Video Editing

ProcessBench: Identifying Process Errors in Mathematical Reasoning

Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation

Normalizing Flows are Capable Generative Models

Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction

GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis

KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models

Does RLHF Scale? Exploring the Impacts From Data, Model, and Method

PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations

Chimera: Improving Generalist Model with Domain-Specific Experts

Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

RL Zero: Zero-Shot Language to Behaviors without any Supervision

Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

APOLLO: SGD-like Memory, AdamW-level Performance

Reinforcement Learning: An Overview

Mind the Time: Temporally-Controlled Multi-Event Video Generation

CompCap: Improving Multimodal Large Language Models with Composite Captions

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Evaluating and Aligning CodeLLMs on Human Preference

Exponential Speedups by Rerooting Levin Tree Search

LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation

The Prompt Canvas: A Literature-Based Practitioner Guide for Creating Effective Prompts in Large Language Models

Frontier Models are Capable of In-context Scheming

Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference

DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling

Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction

EXAONE 3.5: Series of Large Language Models for Real-world Use Cases

Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment

PanoDreamer: 3D Panorama Synthesis from a Single Image

LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

REGENT: A Retrieval-Augmented Generalist Agent That Can Act In-Context in New Environments

Hidden in the Noise: Two-Stage Robust Watermarking for Images

REL: Working out is all you need

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification

NVILA: Efficient Frontier Visual Language Models

VisionZip: Longer is Better but Not Necessary in Vision Language Models

4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Discriminative Fine-tuning of LVLMs

Challenges in Trustworthy Human Evaluation of Chatbots

Densing Law of LLMs

SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion

HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing

SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction

AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

Monet: Mixture of Monosemantic Experts for Transformers

MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities

ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality

Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement

A Noise is Worth Diffusion Guidance

Towards Data Governance of Frontier AI Models

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Evaluating Language Models as Synthetic Data Generators

MV-Adapter: Multi-view Consistent Image Generation Made Easy

How to Correctly do Semantic Backpropagation on Language-based Agentic Systems

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

PaliGemma 2: A Family of Versatile VLMs for Transfer

Imagine360: Immersive 360 Video Generation from Perspective Anchor

Perception Tokens Enhance Visual Reasoning in Multimodal Language Models

NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images

Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion

CleanDIFT: Diffusion Features without Noise

2DGS-Room: Seed-Guided 2D Gaussian Splatting with Geometric Constrains for High-Fidelity Indoor Scene Reconstruction

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Weighted-Reward Preference Optimization for Implicit Model Fusion

Robust Multi-bit Text Watermark with LLM-based Paraphrasers

Mimir: Improving Video Diffusion Models for Precise Text Understanding

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models

SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance

Scaling Image Tokenizers with Grouped Spherical Quantization

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation

DataLab: A Unified Platform for LLM-Powered Business Intelligence

Personalized Multimodal Large Language Models: A Survey

OmniCreator: Self-Supervised Unified Generation with Universal Editing

NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training

Free Process Rewards without Process Labels

MALT: Improving Reasoning with Multi-Agent LLM Training

Towards Universal Soccer Video Understanding

VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Structured 3D Latents for Scalable and Versatile 3D Generation

Negative Token Merging: Image-based Adversarial Feature Guidance

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

Yi-Lightning Technical Report

OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

One Shot, One Talk: Whole-body Talking Avatar from a Single Image

Towards Adaptive Mechanism Activation in Language Agent

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

o1-Coder: an o1 Replication for Coding

2411

AlphaTablets: A Generic Plane Representation for 3D Planar Reconstruction from Monocular Videos

Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

On Domain-Specific Post-Training for Multimodal Large Language Models

DeMo: Decoupled Momentum Optimization

Reverse Thinking Makes LLMs Stronger Reasoners

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

KV Shifting Attention Enhances Language Modeling

A dynamic parallel method for performance optimization on hybrid CPUs

DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing

Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models

Trajectory Attention for Fine-grained Video Motion Control

GRAPE: Generalizing Robot Policy via Preference Alignment

Video Depth without Video Models

Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models

MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation

ICLERB: In-Context Learning Embedding and Reranker Benchmark

MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for Tabular Applications

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

SpotLight: Shadow-Guided Object Relighting via Diffusion

Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling

FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS

Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding

TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models

Large Language Model-Brained GUI Agents: A Survey

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

Training Noise Token Pruning

ROICtrl: Boosting Instance Control for Visual Generation

MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation

LongKey: Keyphrase Extraction for Long Documents

Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens

Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

SketchAgent: Language-Driven Sequential Sketch Generation

Learning 3D Representations from Procedural 3D Programs

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation

DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting

SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting

Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment

ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Star Attention: Efficient LLM Inference over Long Sequences

Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models

SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

Pathways on the Image Manifold: Image Editing via Video Generation

Controllable Human Image Generation with Personalized Multi-Garments

Visual Counter Turing Test (VCT^2): Discovering the Challenges for AI-Generated Image Detection and Introducing Visual AI Index (V_AI)

Factorized Visual Tokenization and Generation

DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?

SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis

Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey

From CISC to RISC: language-model guided assembly transpilation

One Diffusion to Generate Them All

MH-MoE: Multi-Head Mixture-of-Experts

SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

Cautious Optimizers: Improving Training with One Line of Code

Predicting Emergent Capabilities by Finetuning

VisualLens: Personalization through Visual History

LLMs Do Not Think Step-by-step In Implicit Reasoning

Best of Both Worlds: Advantages of Hybrid Graph Sequence Models

AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset

Knowledge Transfer Across Modalities with Natural Language Supervision

A Survey on LLM-as-a-Judge

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

A No Free Lunch Theorem for Human-AI Collaboration

DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving

Material Anything: Generating Materials for Any 3D Object via Diffusion

WildLMa: Long Horizon Loco-Manipulation in the Wild

Measuring Bullshit in the Language Games played by ChatGPT

TÜLU 3: Pushing Frontiers in Open Language Model Post-Training

VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

OminiControl: Minimal and Universal Control for Diffusion Transformer

One to rule them all: natural language to bind communication, perception and action

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Style-Friendly SNR Sampler for Style-Driven Generation

Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

TEXGen: a Generative Diffusion Model for Mesh Textures

MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts

Understanding LLM Embeddings for Regression

SegBook: A Simple Baseline and Cookbook for Volumetric Medical Image Segmentation

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

MyTimeMachine: Personalized Facial Age Transformation

The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz

Associative Knowledge Graphs for Efficient Sequence Storage and Retrieval

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Stable Flow: Vital Layers for Training-Free Image Editing

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions

Multimodal Autoregressive Pre-training of Large Vision Encoders

Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Natural Language Reinforcement Learning

Novel View Extrapolation with Video Diffusion Priors

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Hymba: A Hybrid-head Architecture for Small Language Models

BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Are Large Language Models Memorizing Bug Benchmarks?

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images

Patience Is The Key to Large Language Model Reasoning

ORID: Organ-Regional Information Driven Framework for Radiology Report Generation

MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning

A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection

Human-In-the-Loop Software Development Agents

Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline

Stylecodes: Encoding Stylistic Information For Image Generation

Soft Robotic Dynamic In-Hand Pen Spinning

Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models

RedPajama: an Open Dataset for Training Large Language Models

Ultra-Sparse Memory Network

Building Trust: Foundations of Security, Safety and Transparency in AI

Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages

ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

Continuous Speculative Decoding for Autoregressive Image Generation

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

AIGS: Generating Science from AI-Powered Automated Falsification

Generative World Explorer

Bi-Mamba: Towards Accurate 1-Bit State Space Models

Drowning in Documents: Consequences of Scaling Reranker Inference

Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering

LLäMmlein: Compact and Competitive German-Only Language Models from Scratch

StableV2V: Stabilizing Shape Consistency in Video-to-Video Editing

VeGaS: Video Gaussian Splatting

SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

AnimateAnything: Consistent and Controllable Animation for Video Generation

Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Does Prompt Formatting Have Any Impact on LLM Performance?

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Number it: Temporal Grounding Videos like Flipping Manga

The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use

An Empirical Study on LLM-based Agents for Automated Bug Fixing

Evaluating the role of 'Constitutions' for learning from AI feedback

SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning

Generative Agent Simulations of 1,000 People

Xmodel-1.5: An 1B-scale Multilingual LLM

SlimLM: An Efficient Small Language Model for On-Device Document Assistance

MagicQuill: An Intelligent Interactive Image Editing System

Adaptive Decoding via Latent Preference Optimization

LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering

Cut Your Losses in Large-Vocabulary Language Models

Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples

CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

FinRobot: AI Agent for Equity Research and Valuation with Large Language Models

Evaluating World Models with LLM for Decision Making

Can sparse autoencoders be used to decompose and interpret steering vectors?

Sharingan: Extract User Action Sequence from Desktop Recordings

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

Motion Control for Enhanced Complex Action Video Generation

PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation

Large Language Models Can Self-Improve in Long-context Reasoning

Scaling Properties of Diffusion Models for Perceptual Tasks

GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation

Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data

Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Top-$nσ$: Not All Logits Are You Need

Direct Preference Optimization Using Sparse Feature-Level Constraints

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Using Generative AI and Multi-Agents to Provide Automatic Feedback

Toward Optimal Search and Retrieval for RAG

Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

Watermark Anything with Localized Messages

Tooling or Not Tooling? The Impact of Tools on Language Agents for Chemistry Problem Solving

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

The Super Weight in Large Language Models

SAMPart3D: Segment Any Part in 3D Objects

Counterfactual Generation from Language Models

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

Stronger Models are NOT Stronger Teachers for Instruction Tuning

Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

Designing Reliable Experiments with Generative Agent-Based Modeling: A Comprehensive Guide Using Concordia by Google DeepMind

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement

Hermes: A Large Language Model Framework on the Journey to Autonomous Networks

KMM: Key Frame Mask Mamba for Extended Motion Generation

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction

Acoustic Volume Rendering for Neural Impulse Response Fields

Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models

IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

GFT: Graph Foundation Model with Transferable Tree Vocabulary

Game-theoretic LLM: Agent Workflow for Negotiation Games

Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation

NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts

Autoregressive Models in Vision: A Survey

GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models

LLMs as Method Actors: A Model for Prompt Engineering and Architecture

StdGEN: Semantic-Decomposed 3D Character Generation from Single Images

Improving the detection of technical debt in Java source code with an enriched dataset

Balancing Pipeline Parallelism with Vocabulary Parallelism

A Taxonomy of AgentOps for Enabling Observability of Foundation Model based Agents

Hardware and Software Platform Inference

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

Analyzing The Language of Visual Tokens

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation

LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities

BitNet a4.8: 4-bit Activations for 1-bit LLMs

CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM

M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model

DELIFT: Data Efficient Language model Instruction Fine Tuning

GazeGen: Gaze-Driven User Interaction for Visual Content Generation

Scaling Laws for Precision

Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

Self-Consistency Preference Optimization

RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models

M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models

Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models

Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination

Number Cookbook: Number Understanding of Language Models and How to Improve It

From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness

Inference Optimal VLMs Need Only One Visual Token but Larger Models

Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation?

GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

A Mamba Foundation Model for Time Series Forecasting

Correlation of Object Detection Performance with Visual Saliency and Depth Estimation

Mixtures of In-Context Learners

Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge

Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study

Adaptive Length Image Tokenization via Recurrent Allocation

Attacking Vision-Language Computer Agents via Pop-ups

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Thinking Forward and Backward: Effective Backward Planning with Large Language Models

DreamPolish: Domain Score Distillation With Progressive Geometry Generation

Sample-Efficient Alignment for LLMs

LLaMo: Large Language Model-based Molecular Graph Assistant

Randomized Autoregressive Visual Generation

CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes

Face Anonymization Made Simple

Zipfian Whitening

Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations

Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models

Human-inspired Perspectives: A Survey on AI Long-term Memory

E2E-AFG: An End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation

Self-Evolved Reward Learning for LLMs

Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation

GRS-QA -- Graph Reasoning-Structured Question Answering Dataset

Constant Acceleration Flow

SambaMixer: State of Health Prediction of Li-ion Batteries using Mamba State Space Models

Project Sid: Many-agent simulations toward AI civilization

WikiNER-fr-gold: A Gold-Standard NER Corpus

Personalization of Large Language Models: A Survey

2410

Teaching Embodied Reinforcement Learning Agents: Informativeness and Diversity of Language Use

Learning Video Representations without Natural Videos

DELTA: Dense Efficient Long-range 3D Tracking for any video

SelfCodeAlign: Self-Alignment for Code Generation

Constraint Back-translation Improves Complex Instruction Following of Large Language Models

GPT or BERT: why not both?

Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks

Language Models can Self-Lengthen to Generate Long Texts

BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments

GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

In-Context LoRA for Diffusion Transformers

What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation

Controlling Language and Diffusion Models by Transporting Activations

HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models

Stealing User Prompts from Mixture of Experts

Toxicity of the Commons: Curating Open-Source Pre-Training Data

A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents

AAAR-1.0: Assessing AI's Potential to Assist Research

A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks

Survey of User Interface Design and Interaction Techniques in Generative AI Applications

Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset

Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning

ADAM: An Embodied Causal Agent in Open-World Environments

Standardization Trends on Safety and Trustworthiness Technology for Advanced AI

ProMoE: Fast MoE-based LLM Serving using Proactive Caching

Mapping the Neuro-Symbolic AI Landscape by Architectures: A Handbook on Augmenting Deep Learning Through Symbolic Reasoning

Distinguishing Ignorance from Error in LLM Hallucinations

BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays

Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications

Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning

Minimum Entropy Coupling with Bottleneck

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse

GPT-4o System Card

Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics

LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

LongReward: Improving Long-context Large Language Models with AI Feedback

LoRA vs Full Fine-tuning: An Illusion of Equivalence

Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline

MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

Language Models And A Second Opinion Use Case: The Pocket Professional

GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation

AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

Fast Best-of-N Decoding via Speculative Rejection

MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

Neural Fields in Robotics: A Survey

AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

A Survey of Small Language Models

The Geometry of Concepts: Sparse Autoencoder Feature Structure

Counting Ability of Large Language Models and Impact of Tokenization

OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization

Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models

Engineering Trustworthy AI: A Developer Guide for Empirical Risk Minimization

FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

A prescriptive theory for brain-like inference

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning

Designing LLM-Agents with Personalities: A Psychometric Approach

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

PDL: A Declarative Prompt Programming Language

Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback

Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design

Teach Multimodal LLMs to Comprehend Electrocardiographic Images

O1 Replication Journey: A Strategic Progress Report -- Part 1

Framer: Interactive Frame Interpolation

MotionCLR: Motion Generation and Training-free Editing via Understanding Attention Mechanisms

CAMEL-Bench: A Comprehensive Arabic LMM Benchmark

Unbounded: A Generative Infinite Game of Character Life Simulation

Stable Consistency Tuning: Understanding and Improving Consistency Models

Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling

Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations

Distill Visual Chart Reasoning Ability from LLMs to MLLMs

Should We Really Edit Language Models? On the Evaluation of Edited Language Models

Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances

Why Does the Effective Context Length of LLMs Fall Short?

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation

Data Scaling Laws in Imitation Learning for Robotic Manipulation

AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant

Taipan: Efficient and Expressive State Space Language Models with Selective Attention

Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

SMITE: Segment Me In TimE

LOGO -- Long cOntext aliGnment via efficient preference Optimization

CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models

Dialog2Flow: Pre-training Soft-Contrastive Action-Driven Sentence Embeddings for Automatic Dialog Flow Extraction

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

The Nature of Mathematical Modeling and Probabilistic Optimization Engineering in Generative AI

Large Language Models Reflect the Ideology of their Creators

WAFFLE: Multi-Modal Model for Automated Front-End Development

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits

ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment

DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes

Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

WorldSimBench: Towards Video Generation Models as World Simulators

TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts

CLEAR: Character Unlearning in Textual and Visual Modalities

LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering

Scalable Ranked Preference Optimization for Text-to-Image Generation

Value Residual Learning For Alleviating Attention Concentration In Transformers

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models

Lightweight Neural App Control

ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation

SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

Frontiers in Intelligent Colonoscopy

MiniPLM: Knowledge Distillation for Pre-Training Language Models

Aligning Large Language Models via Self-Steering Optimization

Math Neurosurgery: Isolating Language Models' Math Reasoning Abilities Using Only Forward Passes

A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration

Pantograph: A Machine-to-Machine Interaction Interface for Advanced Theorem Proving, High Level Reasoning, and Data Extraction in Lean 4

Promoting cross-modal representations to improve multimodal foundation models for physiological signals

LLM-based Optimization of Compound AI Systems: A Survey

FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors

Reflection-Bench: probing AI intelligence with reflection

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors

Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos

CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

Can Knowledge Editing Really Correct Hallucinations?

Pre-training Distillation for Large Language Models: A Design Space Exploration

Improve Vision Language Model Chain-of-thought Reasoning

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Learning How to Vote With Principles: Axiomatic Insights Into the Collective Decisions of Neural Networks

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Continuous Speech Synthesis using per-token Latent Diffusion

Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering

Mitigating Object Hallucination via Concentric Causal Attention

Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation

AutoTrain: No-code training for state-of-the-art models

Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement

Language Models are Symbolic Learners in Arithmetic

M-RewardBench: Evaluating Reward Models in Multilingual Settings

Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training

Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

DM-Codec: Distilling Multimodal Representations for Speech Tokenization

How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold

Baichuan Alignment Technical Report

SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation

Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts

BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search

Teaching Models to Balance Resisting and Accepting Persuasion

How Do Training Methods Influence the Utilization of Vision Models?

Interpretable end-to-end Neurosymbolic Reinforcement Learning agents

Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas

Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning

In-context learning and Occam's razor

UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models

Towards Cross-Cultural Machine Translation with Retrieval-Augmented Generation from Multilingual Knowledge Graphs

FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model

ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

$γ$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

Can MLLMs Understand the Deep Implication Behind Chinese Images?

Retrospective Learning from Interactions

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

A Unified View of Delta Parameter Editing in Post-Trained Large-Scale Models

VidPanos: Generative Panoramic Videos from Casual Panning Videos

DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Harnessing Webpage UIs for Text-Rich Visual Understanding

BenTo: Benchmark Task Reduction with In-Context Transferability

Looking Inward: Language Models Can Learn About Themselves by Introspection

PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment

DPLM-2: A Multimodal Diffusion Protein Language Model

MobA: A Two-Level Agent System for Efficient Mobile Task Automation

MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Movie Gen: A Cast of Media Foundation Models

Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion

A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning

MedINST: Meta Dataset of Biomedical Instructions

Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

SBI-RAG: Enhancing Math Word Problem Solving for Students through Schema-Based Instruction and Retrieval-Augmented Generation

SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

Roadmap towards Superhuman Speech Understanding using Large Language Models

Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy

Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation

Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

AERO: Softmax-Only LLMs for Efficient Private Inference

MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

A Survey on Data Synthesis and Augmentation for Large Language Models

Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media

Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

JudgeBench: A Benchmark for Evaluating LLM-based Judges

Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models

WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

Exploring Model Kinship for Merging Large Language Models

Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL

Revealing the Barriers of Language Agents in Planning

ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

Tracking Universal Features Through Fine-Tuning and Model Merging

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking

Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance

A Prompt-Based Knowledge Graph Foundation Model for Universal In-Context Reasoning

Divide-Verify-Refine: Aligning LLM Responses with Complex Instructions

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming

OMCAT: Omni Context Aware Transformer

CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning

Neural Metamorphosis

MoH: Multi-Head Attention as Mixture-of-Head Attention

CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

Improving Long-Text Alignment for Text-to-Image Diffusion Models

NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models

Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices

MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

Zero-shot Model-based Reinforcement Learning using Large Language Models

MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI

GS^3: Efficient Relighting with Triple Gaussian Splatting

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence

SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI

Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

Agent-as-a-Judge: Evaluate Agents with Agents

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free

HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

AFlow: Automating Agentic Workflow Generation

Large Language Model Evaluation via Matrix Nuclear-Norm

Thinking LLMs: General Instruction Following with Thought Generation

Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains

SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning

Empirical Study of Mutual Reinforcement Effect and Application in Few-shot Text Classification Tasks via Prompt

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

Agentic Information Retrieval

EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation

Toward General Instruction-Following Alignment for Retrieval-Augmented Generation

FlatQuant: Flatness Matters for LLM Quantization

Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models

Rethinking Data Selection at Scale: Random Selection is Almost All You Need

MiRAGeNews: Multimodal Realistic AI-Generated News Detection

Mentor-KD: Making Small Language Models Better Multi-step Reasoners

MedMobile: A mobile-sized language model with expert-level clinical capabilities

Semantic Score Distillation Sampling for Compositional Text-to-3D Generation

SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights

Towards Trustworthy Knowledge Graph Reasoning: An Uncertainty Aware Perspective

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

Baichuan-Omni Technical Report

KV Prediction for Improved Time to First Token

Agents Thinking Fast and Slow: A Talker-Reasoner Architecture

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models

MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code

ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion

Agent S: An Open Agentic Framework that Uses Computers Like a Human

DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

Progressive Autoregressive Video Diffusion Models

Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System

Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations

Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Benchmarking Agentic Workflow Generation

TVBench: Redesigning Video-Language Evaluation

DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities

MotionGS: Exploring Explicit Motion Guidance for Deformable 3D Gaussian Splatting

Smart Audit System Empowered by LLM

Mechanistic Permutability: Match Features Across Layers

I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow

WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents

DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow

MM-Ego: Towards Building Egocentric Multimodal LLMs

Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models

IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

TextToon: Real-Time Text Toonify Head Avatar from Single Video

Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

Personalized Visual Instruction Tuning

I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Let's Ask GNN: Empowering Large Language Model for Graph In-Context Learning

Pixtral 12B

Retrieval-Augmented Decision Transformer: External Memory for In-context RL

Data Selection via Optimal Control for Language Models

TinyEmo: Scaling down Emotional Reasoning via Metric Projection

Emergent properties with repeated examples

PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness

CursorCore: Assist Programming through Aligning Anything

Jointly Generating Multi-view Consistent PBR Textures using Collaborative Control

Self-Boosting Large Language Models with Synthetic Preference Data

Seeker: Enhancing Exception Handling in Code with LLM-based Multi-Agent Approach

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders

Towards Natural Image Matting in the Wild via Real-Scenario Prior

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA

Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning

Does Spatial Cognition Emerge in Frontier Models?

Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders

LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

Accelerated Preference Optimization for Large Language Model Alignment

Story-Adapter: A Training-free Iterative Framework for Long Story Visualization

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way

Multimodal Situational Safety

Temporal Reasoning Transfer from Text to Video

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Diversity-Rewarded CFG Distillation

Aria: An Open Multimodal Native Mixture-of-Experts Model

Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

Pyramidal Flow Matching for Efficient Video Generative Modeling

MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning

ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler

TRACE: Temporal Grounding Video LLM via Causal Event Modeling

Vector-ICL: In-context Learning with Continuous Vector Representations

Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition

TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Falcon Mamba: The First Competitive Attention-free 7B Language Model

AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs

Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models

PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

Differential Transformer

SePPO: Semi-Policy Preference Optimization for Diffusion Alignment

GLEE: A Unified Framework and Benchmark for Language-based Economic Environments

SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

Presto! Distilling Steps and Layers for Accelerating Music Generation

Scalable and Accurate Graph Reasoning with LLM-based Multi-Agents

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification

Named Clinical Entity Recognition Benchmark

OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

LPZero: Language Model Zero-cost Proxy Search from Zero

Intriguing Properties of Large Language and Vision Models

TLDR: Token-Level Detective Reward Model for Large Vision Language Models

Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization

MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

UniMuMo: Unified Text, Music and Motion Generation

Hyper-multi-step: The Truth Behind Difficult Long-context Tasks

VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

Inference Scaling for Long-Context Retrieval Augmented Generation

Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning

LongGenBench: Long-context Generation Benchmark

Grounding Language in Multi-Perspective Referential Communication

GraphRouter: A Graph-based Router for LLM Selections

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs

What Matters for Model Merging at Scale?

NRGBoost: Energy-Based Generative Boosted Trees

MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

ToolGen: Unified Tool Retrieval and Calling via Generation

Zebra: In-Context and Generative Pretraining for Solving Parametric PDEs

EBES: Easy Benchmarking for Event Sequences

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Autonomous Character-Scene Interaction Synthesis from Text Instruction

Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise

LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations

Erasing Conceptual Knowledge from Language Models

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

Contrastive Localized Language-Image Pre-Training

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Large Language Models as Markov Chains

Video Instruction Tuning With Synthetic Data

LLaVA-Critic: Learning to Evaluate Multimodal Models

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

ControlAR: Controllable Image Generation with Autoregressive Models

Selective Attention Improves Transformer

Distilling an End-to-End Voice Assistant Without Instruction Training Data

FAN: Fourier Analysis Networks

NL-Eye: Abductive NLI for Images

Intelligence at the Edge of Chaos

Contextual Document Embeddings

Mixed-Session Conversation with Egocentric Memory

Response Tuning: Aligning Large Language Models without Instruction

MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation

Collective Critics for Creative Story Generation

Learning the Latent Rules of a Game from Data: A Chess Story

Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond

MIGA: Mixture-of-Experts with Group Aggregation for Stock Market Prediction

Efficient Source-Free Time-Series Adaptation via Parameter Subspace Disentanglement

L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?

MVGS: Multi-view-regulated Gaussian Splatting for Novel View Synthesis

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning

SciPrompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis

When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models

Quantifying Generalization Complexity for Large Language Models

Not All LLM Reasoners Are Created Equal

LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks

ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation

HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration

Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding

FactAlign: Long-form Factuality Alignment of Large Language Models

VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment

3DGS-DET: Empower 3D Gaussian Splatting with Boundary Guidance and Box-Focused Sampling for 3D Object Detection

InfiniPot: Infinite Context Processing on Memory-Constrained LLMs

Selective Aggregation for Low-Rank Adaptation in Federated Learning

Closed-loop Long-horizon Robotic Planning via Equilibrium Sequence Modeling

Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models

CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction

HelpSteer2-Preference: Complementing Ratings with Preferences

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

Were RNNs All We Needed?

RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

Addition is All You Need for Energy-efficient Language Models

Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation

What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration

SyntheOcc: Synthesize Geometric-Controlled Street View Images through 3D Semantic MPIs

Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models

ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

2409

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

DressRecon: Freeform 4D Human Reconstruction from Monocular Video

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models

LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation

Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers

Instance-adaptive Zero-shot Chain-of-Thought Prompting

The Perfect Blend: Redefining RLHF with Mixture of Judges

Old Optimizer, New Norm: An Anthology

Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models

Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis

Visual Context Window Extension: A New Perspective for Long Video Understanding

RoCoTex: A Robust Method for Consistent Texture Synthesis with Diffusion Models

Image Copy Detection for Diffusion Models

Law of the Weakest Link: Cross Capabilities of Large Language Models

Illustrious: an Open Advanced Illustration Model

On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability

Can Models Learn Skill Composition from Examples?

Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code

IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding

Hyper-Connections

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

DiaSynth -- Synthetic Dialogue Generation Framework

PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

LML: Language Model Learning a Dataset for Data-Augmented Prediction

Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models

Emu3: Next-Token Prediction is All You Need

MinerU: An Open-Source Solution for Precise Document Content Extraction

A Survey on the Honesty of Large Language Models

Cottention: Linear Transformers With Cosine Attention

KALE-LM: Unleash The Power Of AI For Science Via Knowledge And Logic Enhanced Large Model

Evaluation of OpenAI o1: Opportunities and Challenges of AGI

Data-Prep-Kit: getting your data ready for LLM application development

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction

Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

MIO: A Foundation Model on Multimodal Tokens

Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study

Pixel-Space Post-Training of Latent Diffusion Models

Modulated Intervention Preference Optimization (MIPO): Keep the Easy, Refine the Difficult

Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models

MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models

Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis

HDFlow: Enhancing LLM Complex Problem-Solving with Hybrid Thinking and Dynamic Workflows

Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction

Disco4D: Disentangled 4D Human Generation and Animation from a Single Image

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion

Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents

Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors

Game4Loc: A UAV Geo-Localization Benchmark from Game Data

MSI-Agent: Incorporating Multi-Scale Insight into Embodied Agents for Superior Planning and Decision-Making

TalkinNeRF: Animatable Neural Fields for Full-Body Talking Humans

Synchronize Dual Hands for Physics-Based Dexterous Guitar Playing

HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

MonoFormer: One Transformer for Both Diffusion and Autoregression

EuroLLM: Multilingual Language Models for Europe

MaskBit: Embedding-free Image Generation via Bit Tokens

HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

Seeing Faces in Things: A Model and Dataset for Pareidolia

Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Improvements to SDXL in NovelAI Diffusion V3

SLIMER-IT: Zero-Shot NER on Italian Language

Small Language Models: Survey, Measurements, and Insights

Making Text Embedders Few-Shot Learners

Reward-Robust RLHF in LLMs

A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

OmniBench: Towards The Future of Universal Omni-Language Models

Archon: An Architecture Search Framework for Inference-Time Techniques

Boosting Healthcare LLMs Through Retrieved Context

AIM 2024 Sparse Neural Rendering Challenge: Dataset and Benchmark

Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely

Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

Enabling Ultra-Dense, Open-RAN, Vehicular Networks with Non-Linear MIMO Processing

Instruction Following without Instruction Tuning

The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends

Present and Future Generalization of Synthetic Image Detectors

Tabular Data Generation using Binary Diffusion

A Case Study of Web App Coding with OpenAI Reasoning Models

Colorful Diffuse Intrinsic Image Decomposition in the Wild

Temporally Aligned Audio for Video with Autoregression

V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians

Prithvi WxC: Foundation Model for Weather and Climate

Portrait Video Editing Empowered by Multimodal Generative Priors

Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts

LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench

Imagine yourself: Tuning-Free Personalized Image Generation

MuCodec: Ultra Low-Bitrate Music Codec

An adapted large language model facilitates multiple medical tasks in diabetes care

RRM: Robust Reward Model Training Mitigates Reward Hacking

Can we only use guideline instead of shot in prompt?

CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

LVCD: Reference-based Lineart Video Colorization with Diffusion Models

MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines

MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions

3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation

Training Language Models to Self-Correct via Reinforcement Learning

Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

Language Models Learn to Mislead Humans via RLHF

Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning

StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation

InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Latent Generation

FlexiTex: Enhancing Texture Generation with Visual Guidance

Vista3D: Unravel the 3D Darkside of a Single Image

DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2.5-Coder Technical Report

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

A Controlled Study on Long Context Extension and Generalization in LLMs

Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

GRIN: GRadient-INformed MoE

LLMs + Persona-Plug = Personalized LLMs

Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

Jailbreaking Large Language Models with Symbolic Mathematics

Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

NVLM: Open Frontier-Class Multimodal LLMs

Cesàro operators on the space of analytic functions with logarithmic growth

OSV: One Step is Enough for High-Quality Image to Video Generation

Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

OmniGen: Unified Image Generation

Hackphyr: A Local Fine-Tuned LLM Agent for Network Security Environments

Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse

LLM-as-a-Judge & Reward Model: What They Can and Cannot Do

SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction

Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning

Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

Agile Continuous Jumping in Discontinuous Terrains

Single-Layer Learnable Activation for Implicit Neural Representation (SL$^{2}$A-INR)

PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

Kolmogorov-Arnold Transformer

On the limits of agency in agent-based models

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Schrodinger's Memory: Large Language Models

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

On the Diagram of Thought

SFR-RAG: Towards Contextually Faithful LLMs

Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison

Towards Diverse and Efficient Audio Captioning via Diffusion Models

Implicit Neural Representations with Fourier Kolmogorov-Arnold Networks

Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Agents in Software Engineering: Survey, Landscape, and Vision

SGFormer: Single-Layer Graph Transformers with Approximation-Free Linear Complexity

A Diffusion Approach to Radiance Field Relighting using Multi-Illumination Synthesis

Exploring Graph Structure Comprehension Ability of Multimodal Large Language Models: Case Studies

InstantDrag: Improving Interactivity in Drag-based Image Editing

B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests

DrawingSpinUp: 3D Animation from Single Character Drawings

Apollo: Band-sequence Modeling for High-Quality Audio Restoration

Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection

SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

Robust Dual Gaussian Splatting for Immersive Human-centric Volumetric Videos

DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors

Click2Mask: Local Editing with Dynamic Mask Generation

FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder

IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation

Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

Instant Facial Gaussians Translator for Relightable and Interactable Facial Rendering

Agent Workflow Memory

MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

PiTe: Pixel-Temporal Alignment for Large Video-Language Model

Gated Slot Attention for Efficient Linear-Time Sequence Modeling

MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis

What is the Role of Small Models in the LLM Era: A Survey

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

gsplat: An Open-Source Library for Gaussian Splatting

Generative Hierarchical Materials Search

ProteinBench: A Holistic Evaluation of Protein Foundation Models

LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation

LLaMA-Omni: Seamless Speech Interaction with Large Language Models

SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation

INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding

Can Large Language Models Unlock Novel Scientific Research Ideas?

Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

SongCreator: Lyrics-based Universal Song Generation

Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments

Evaluating Multiview Object Consistency in Humans and Image Models

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Are Large Language Models a Threat to Programming Platforms? An Exploratory Study

Benchmarking Chinese Knowledge Rectification in Large Language Models

Evidence from fMRI Supports a Two-Phase Abstraction Process in Language Models

LLMs Will Always Hallucinate, and We Need to Live With This

MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery

A framework to compute resonances arising from multiple scattering

SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning

Elsevier Arena: Human Evaluation of Chemistry/Biology/Health Foundational Large Language Models

Insights from Benchmarking Frontier Language Models on Web App Code Generation

Can OOD Object Detectors Learn from Foundation Models?

OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs

Achieving Peak Performance for Large Language Models: A Systematic Review

POINTS: Improving Your Vision-language Model with Affordable Strategies

Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance

Theory, Analysis, and Best Practices for Sigmoid Self-Attention

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak

UniDet3D: Multi-dataset Indoor 3D Object Detection

GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

Self-Harmonized Chain of Thought

Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task

How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild

Attention Heads of Large Language Models: A Survey

Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation

CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation

FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Sketch: A Toolkit for Streamlining LLM Operations

ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding

Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation

GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding

Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard

Large Language Model-Based Agents for Software Engineering: A Survey

LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Configurable Foundation Models: Building LLMs from a Modular Perspective

Bioinformatics Retrieval Augmentation Data (BRAD) Digital Assistant

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

Building Math Agents with Multi-Turn Iterative Preference Learning

Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges

Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining

FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation

LinFusion: 1 GPU, 1 Minute, 16K Image

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text

Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models

OLMoE: Open Mixture-of-Experts Language Models

FuzzCoder: Byte-level Fuzzing Test via Large Language Model

In Defense of RAG in the Era of Long-Context Language Models

Kvasir-VQA: A Text-Image Pair GI Tract Dataset

GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI

Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain

Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing

Bicrucial $k$-power-free permutations

OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

Affordance-based Robot Manipulation with Flow Matching

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation

Statically Contextualizing Large Language Models with Typed Holes

Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries

ContextCite: Attributing Model Generation to Context

Diffusion Policy Policy Optimization

FLUX that Plays Music

Compositional 3D-aware Video Generation with LLM Director

LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization

The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts

Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders

A Survey for Large Language Models in Biomedicine

On-Device Language Models: A Comprehensive Review

2408

ConvKGYarn: Spinning Configurable and Scalable Conversational Knowledge Graph QA Datasets with Large Language Models

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers

InkubaLM: A small language model for low-resource African languages

Beyond Preferences in AI Alignment

MemLong: Memory-Augmented Retrieval for Long Text Modeling

SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

CSGO: Content-Style Composition in Text-to-Image Generation

Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Examination of Code generated by Large Language Models

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

CogVLM2: Visual Language Models for Image and Video Understanding

SurveySum: A Dataset for Summarizing Multiple Scientific Articles into a Survey Section

Law of Vision Representation in MLLMs

Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems

LoraMap: Harnessing the Power of LoRA Connections

Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions

VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images

3D Reconstruction with Spatial Memory

Scaling Up Diffusion and Flow-based XGBoost Models

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

TEDRA: Text-based Editing of Dynamic and Photoreal Actors

ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution

Distribution Backtracking Builds A Faster Convergence Trajectory for One-step Diffusion Distillation

In-Context Imitation Learning via Next-Token Prediction

Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models

CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Persuasion Games using Large Language Models

Knowledge Navigator: LLM-guided Browsing Framework for Exploratory Search in Scientific Literature

Automatic Differential Diagnosis using Transformer-Based Multi-Label Sequence Classification

Efficient LLM Scheduling by Learning to Rank

Towards Realistic Example-based Modeling via 3D Gaussian Stitching

StyleRemix: Interpretable Authorship Obfuscation via Distillation and Perturbation of Style Elements

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation

SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding

Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models

ReMamba: Equip Mamba with Effective Long-Sequence Modeling

GIFT-SW: Gaussian noise Injected Fine-Tuning of Salient Weights for LLMs

AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems

Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

The Mamba in the Llama: Distilling and Accelerating Hybrid Models

BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

The VoxCeleb Speaker Recognition Challenge: A Retrospective

Diffusion Models Are Real-Time Game Engines

Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation

Platypus: A Generalized Specialist Model for Reading Text in Various Forms

CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

Text2SQL is Not Enough: Unifying AI and Databases with TAG

Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold

CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation

Artificial intelligence for science: The easy and hard problems

Agentic Retrieval-Augmented Generation for Time Series Analysis

A Practitioner's Guide to Continual Multimodal Pretraining

K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences

SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

Foundation Models for Music: A Survey

MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement

SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

Learning to Move Like Professional Counter-Strike Players

MobileQuant: Mobile-friendly Quantization for On-device Language Models

GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars

Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models

LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs

Training-free Long Video Generation with Chain of Diffusion Model Experts

TVG: A Training-free Transition Video Generation Method with Diffusion Models

LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!

Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation

A Web-Based Solution for Federated Learning with LLM-Based Automation

FLoD: Integrating Flexible Level of Detail into 3D Gaussian Splatting for Customizable Rendering

T3M: Text Guided 3D Human Motion Synthesis from Speech

Memory-Efficient LLM Training with Online Subspace Descent

Building and better understanding vision-language models: insights and future directions

DreamCinema: Cinematic Transfer with Free Camera and 3D Character

Controllable Text Generation for Large Language Models: A Survey

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Real-Time Video Generation with Pyramid Attention Broadcast

Jamba-1.5: Hybrid Transformer-Mamba Models at Scale

Sapiens: Foundation for Human Vision Models

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design

Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese

CODE: Confident Ordinary Differential Editing

Subsurface Scattering for 3D Gaussian Splatting

Scalable Autoregressive Image Generation with Mamba

SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models

ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM

Evidence-backed Fact Checking using RAG and Few-Shot In-Context Learning with LLMs

Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Hermes 3 Technical Report

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

LLM Pruning and Distillation in Practice: The Minitron Approach

Critique-out-Loud Reward Models

FocusLLM: Scaling LLM's Context by Parallel Decoding

Efficient Detection of Toxic Prompts in Large Language Models

FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting

The Vizier Gaussian Process Bandit Algorithm

TrackGo: A Flexible and Efficient Method for Controllable Video Generation

Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation

TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

StructuredRAG: JSON Response Formatting with Large Language Models

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos

HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments

To Code, or Not To Code? Exploring Impact of Code in Pre-training

ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining

Flexora: Flexible Low Rank Adaptation for Large Language Models

Quantum Artificial Intelligence: A Brief Survey

Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information

Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution

MambaEVT: Event Stream based Visual Object Tracking using State Space Model

MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model

SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views

Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices

Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data

ShortCircuit: AlphaZero-Driven Circuit Design

Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

TraDiffusion: Trajectory-Based Training-Free Image Generation

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering

Challenges and Responses in the Practice of Large Language Models

Segment Anything with Multiple Modalities

Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models

Graph Retrieval-Augmented Generation: A Survey

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

PEDAL: Enhancing Greedy Decoding with Large Language Models using Diverse Exemplars

Backward-Compatible Aligned Representations via an Orthogonal Transformation Layer

JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning

Automated Design of Agentic Systems

TurboEdit: Instant text-based image editing

Can Large Language Models Understand Symbolic Graphics Programs?

The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community

BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

Heavy Labels Out! Dataset Distillation with Label Space Lightening

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Towards flexible perception with visual memory

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm

RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization

MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing

FuseChat: Knowledge Fusion of Chat Models

Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning

Fine-tuning Large Language Models with Human-inspired Learning Strategies in Medical Question Answering

Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability

PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

3D Gaussian Editing with A Single Image

Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space

Aquila2 Technical Report

Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM

Generative Photomontage

InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning

LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

Imagen 3

OpenResearcher: Unleashing AI for Accelerated Scientific Research

Layerwise Recurrent Router for Mixture-of-Experts

SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields

DC3DO: Diffusion Classifier for 3D Objects

Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models

TacSL: A Library for Visuotactile Sensor Simulation and Learning

UniT: Unified Tactile Representation for Robot Learning

Design Proteins Using Large Language Models: Enhancements and Comparative Analyses

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Body Transformer: Leveraging Robot Embodiment for Policy Learning

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

MovieSum: An Abstractive Summarization Dataset for Movie Screenplays

FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework

Med42-v2: A Suite of Clinical LLMs

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

ControlNeXt: Powerful and Efficient Control for Image and Video Generation

HeadGAP: Few-shot 3D Head Avatar via Generalizable Gaussian Priors

UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization

Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation

ZePo: Zero-Shot Portrait Stylization with Faster Sampling

DeepSpeak Dataset v1.0

VITA: Towards Open-Source Interactive Omni Multimodal LLM

Kalman-Inspired Feature Propagation for Video Face Super-Resolution

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning

A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?

MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models

Generating novel experimental hypotheses from language models: A case study on cross-dative generalization

Retrieval-augmented code completion for local projects using large language models

An Empirical Study on Challenges for LLM Developers

HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion

MulliVC: Multi-lingual Voice Conversion With Cycle Consistency

Understanding the Performance and Estimating the Cost of LLM Fine-Tuning

ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

Transformer Explainer: Interactive Learning of Text-Generative Models

Better Alignment with Instruction Back-and-Forth Translation

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Conversational Prompt Engineering

Advancing Molecular Machine (Learned) Representations with Stereoelectronics-Infused Molecular Graphs

Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection

EfficientRAG: Efficient Retriever for Multi-Hop Question Answering

Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation

UNLEARN Efficient Removal of Knowledge in Large Language Models

Task-oriented Sequential Grounding in 3D Scenes

Fast Sprite Decomposition from Animated Graphics

CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases

Achieving Human Level Competitive Robot Table Tennis

Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models

Compact 3D Gaussian Splatting for Static and Dynamic Radiance Fields

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

EXAONE 3.0 7.8B Instruction Tuned Language Model

MoExtend: Tuning New Experts for Modality and Task Extension

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

RayGauss: Volumetric Gaussian-Based Ray Casting for Photorealistic Novel View Synthesis

LLaVA-OneVision: Easy Visual Task Transfer

CoverBench: A Challenging Benchmark for Complex Claim Verification

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation

Synthesizing Text-to-SQL Data from Weak and Strong LLMs

IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts

An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion

MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Delivery of DART Impact Ejecta to Mars and Earth: Opportunity for Meteor Observations

Learning to Predict Program Execution by Modeling Dynamic Dependency on Code Graphs

Diffusion Models as Data Mining Tools

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Self-Taught Evaluators

Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Language Model Can Listen While Speaking

BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba

MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future

Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models

Operationalizing Contextual Integrity in Privacy-Conscious Assistants

ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation

ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs

Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs

Conditional LoRA Parameter Generation

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling

RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

A Survey of Mamba

The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

POA: Pre-training Once for Models of All Sizes

Medical SAM 2: Segment medical images as video via Segment Anything Model 2

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model

Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy Curvature of Attention

Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model

TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models

SAM 2: Segment Anything in Images and Videos

Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses

Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation

Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names

Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation

OmniParser for Pure Vision Based GUI Agent

Finch: Prompt-guided Key-Value Cache Compression

Gemma 2: Improving Open Language Models at a Practical Size

Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

ReLiK: Retrieve and LinK, Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget

2407

Projected Language Models: A Large Model Pre-Segmented Into Smaller Ones

Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

The Llama 3 Herd of Models

Berkeley Humanoid: A Research Platform for Learning-based Control

ShieldGemma: Generative AI Content Moderation Based on Gemma

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Open-Vocabulary Audio-Visual Semantic Segmentation

Adaptive Retrieval-Augmented Generation for Conversational Systems

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Expressive Whole-Body 3D Gaussian Avatar

Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent

TAROT: Task-Oriented Authorship Obfuscation Using Policy Optimization Methods

Data Contamination Report from the 2024 CONDA Shared Task

Fine-gained Zero-shot Video Sampling

Cost-Effective Hallucination Detection for LLMs

Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning

Apple Intelligence Foundation Language Models

ThinK: Thinner Key Cache by Query-Driven Pruning

Matting by Generation

AI-Assisted Generation of Difficult Math Questions

How to Measure the Intelligence of Large Language Models?

Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning

JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources

Meltemi: The first open Large Language Model for Greek

Adapting Safe-for-Work Classifier for Malaysian Language Text: Enhancing Alignment in LLM-Ops Framework

Harvesting Textual and Structured Data from the HAL Publication Repository

Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings

Can LLMs be Fooled? Investigating Vulnerabilities in LLMs

Machine Unlearning in Generative AI: A Survey

Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation

Generating Gender Alternatives in Machine Translation

A Large Encoder-Decoder Family of Foundation Models For Chemical Language

MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

Theia: Distilling Diverse Vision Foundation Models for Robot Learning

Diffusion Feedback Helps CLIP See Better

rLLM: Relational Table Learning with LLMs

ByteCheckpoint: A Unified Checkpointing System for LLM Development

RelBench: A Benchmark for Deep Learning on Relational Databases

ImagiNet: A Multi-Content Dataset for Generalizable Synthetic Image Detection via Contrastive Learning

Mixture of Nested Experts: Adaptive Processing of Visual Tokens

FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention

Sentiment Analysis of Lithuanian Online Reviews Using Large Language Models

ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost

Improving Retrieval Augmented Language Model with Self-Reasoning

VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge

Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture

SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain

Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle

Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models

A Generic Review of Integrating Artificial Intelligence in Cognitive Behavioral Therapy

Integrating Large Language Models into a Tri-Modal Architecture for Automated Depression Classification

MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

WalkTheDog: Cross-Morphology Motion Alignment via Phase Manifolds

Floating No More: Object-Ground Reconstruction from a Single Image

Wolf: Captioning Everything with a World Summarization Framework

SHIC: Shape-Image Correspondences with no Keypoint Supervision

Lessons from Learning to Spin "Pens"

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

VSSD: Vision Mamba with Non-Causal State Space Duality

Model-driven Heart Rate Estimation and Heart Murmur Detection based on Phonocardiogram

The Art of Refusal: A Survey of Abstention in Large Language Models

PersonaGym: Evaluating Persona Agents and LLMs

Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads

Recursive Introspection: Teaching Language Model Agents How to Self-Improve

Exploring Scaling Trends in LLM Robustness

The FIGNEWS Shared Task on News Media Narratives

Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

LKCell: Efficient Cell Nuclei Instance Segmentation with Large Convolution Kernels

Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption

BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Very Large-Scale Multi-Agent Simulation in AgentScope

Text-Driven Neural Collaborative Filtering Model for Paper Source Tracing

LAMBDA: A Large Model Based Data Agent

AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency

$VILA^2$: VILA Augmented VILA

HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

3D Question Answering for City Scene Understanding

PERSONA: A Reproducible Testbed for Pluralistic Alignment

ViPer: Visual Personalization of Generative Models via Individual Preference Learning

Scalify: scale propagation for efficient low-precision LLM training

Solving The Travelling Salesman Problem Using A Single Qubit

DreamCar: Leveraging Car-specific Prior for in-the-wild 3D Car Reconstruction

Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

Train-Attention: Meta-Learning Where to Focus in Continual Knowledge Learning

Generation Constraint Scaling Can Mitigate Hallucination

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

A Simulation Benchmark for Autonomous Racing with Large-Scale Human Data

KAN or MLP: A Fairer Comparison

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence

Course-Correction: Safety Alignment Using Synthetic Preferences

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Enhancing LLM's Cognition via Structurization

Cross Anything: General Quadruped Robot Navigation through Complex Terrains

PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing

MOMAland: A Set of Benchmarks for Multi-Objective Multi-Agent Reinforcement Learning

TAPTRv2: Attention-based Position Update Improves Tracking Any Point

OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

Graph-Structured Speculative Decoding

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

DDK: Distilling Domain Knowledge for Efficient Large Language Models

BoostMVSNeRFs: Boosting MVS-based NeRFs to Generalizable View Synthesis in Large-scale Scenes

Artist: Aesthetically Controllable Text-Driven Stylization without Training

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning

Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models

Discrete Flow Matching

SIGMA: Sinkhorn-Guided Masked Video Modeling

Local All-Pair Correspondence for Point Tracking

MAVEN-Fact: A Large-scale Event Factuality Detection Dataset

LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation

ThermalNeRF: Thermal Radiance Fields

VideoGameBunny: Towards vision assistants for video games

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model

HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions

A Survey on Employing Large Language Models for Text-to-SQL Tasks

MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Knowledge Mechanisms in Large Language Models: A Survey and Perspective

GET-Zero: Graph Embodiment Transformer for Zero-shot Embodiment Generalization

Temporal Residual Jacobians For Rig-free Motion Transfer

Consent in Crisis: The Rapid Decline of the AI Data Commons

POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation

Compact Language Models via Pruning and Knowledge Distillation

BOND: Aligning LLMs with Best-of-N Distillation

NNsight and NDIF: Democratizing Access to Foundation Model Internals

Internal Consistency and Self-Feedback in Large Language Models: A Survey

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

The Vision of Autonomic Computing: Can LLMs Make It a Reality?

Stable Audio Open

Efficient Audio Captioning with Encoder-Level Knowledge Distillation

SparseCraft: Few-Shot Neural Reconstruction through Stereopsis Guided Geometric Linearization

EVLM: An Efficient Vision-Language Model for Visual Understanding

Visual Text Generation in the Wild

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

PlacidDreamer: Advancing Harmony in Text-to-3D Generation

Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle

Visual Haystacks: Answering Harder Questions About Sets of Images

Shape of Motion: 4D Reconstruction from a Single Video

Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

Scaling Granite Code Models to 128K Context

Understanding Reference Policies in Direct Preference Optimization

Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation

Prover-Verifier Games improve legibility of LLM outputs

Weak-to-Strong Reasoning

A Comparative Study on Automatic Coding of Medical Letters with Explainability

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Qalam: A Multimodal LLM for Arabic Optical Character and Handwriting Recognition

Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation

CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis

Robust ASR Error Correction with Conservative Data Filtering

PM-LLM-Benchmark: Evaluating Large Language Models on Process Mining Tasks

SciCode: A Research Coding Benchmark Curated by Scientists

Pre-Trained Foundation Model representations to uncover Breathing patterns in Speech

A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks

Retrieval-Enhanced Machine Learning: Synthesis and Opportunities

BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

IMAGDressing-v1: Customizable Virtual Dressing

Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

Patch-Level Training for Large Language Models

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

E5-V: Universal Embeddings with Multimodal Large Language Models

Audio Conditioning for Music Generation via Discrete Bottleneck Features

Case2Code: Learning Inductive Reasoning with Synthetic Data

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models

Splatfacto-W: A Nerfstudio Implementation of Gaussian Splatting for Unconstrained Photo Collections

GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression

The Art of Saying No: Contextual Noncompliance in Language Models

Exploring Advanced Large Language Models with LLMsuite

Does Refusal Training in LLMs Generalize to the Past Tense?

Efficient Training with Denoised Neural Weights

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

Zero-shot Cross-Lingual Transfer for Synthetic Data Generation in Grammatical Error Detection

Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors

Click-Gaussian: Interactive Segmentation to Any 3D Gaussians

Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

CCoE: A Compact LLM with Collaboration of Experts

Scaling Diffusion Transformers to 16 Billion Parameters

FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation

Grasping Diverse Objects with Simulated Humanoids

From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus

Make-An-Agent: A Generalizable Policy Network Generator with Behavior-Prompted Diffusion

Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

Fast Matrix Multiplications for Lookup Table-Quantized LLMs

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models

GRUtopia: Dream General Robots in a City at Scale

DataDream: Few-shot Guided Dataset Generation

Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation

Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval

Qwen2-Audio Technical Report

Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning

Qwen2 Technical Report

The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization

Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

xLSTMTime: Long-term Time Series Forecasting With xLSTM

Practical Unlearning for Large Language Models

Learning to Refuse: Towards Mitigating Privacy Risks in LLMs

Video Occupancy Models

StyleSplat: 3D Object Style Transfer with Gaussian Splatting

Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures

Human-like Episodic Memory for Infinite Context LLMs

MUSCLE: A Model Update Strategy for Compatible LLM Evolution

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

GAVEL: Generating Games Via Evolution and Language Models

Transformer Layers as Painters

H2O-Danube3 Technical Report

Context Embeddings for Efficient Answer Generation in RAG

Accuracy is Not All You Need

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

New Desiderata for Direct Preference Optimization

SpreadsheetLLM: Encoding Spreadsheets for Large Language Models

AUITestAgent: Automatic Requirements Oriented GUI Function Testing

TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models

Characterizing Prompt Compression Methods for Long Context Inference

Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing

MAVIS: Mathematical Visual Instruction Tuning

Video Diffusion Alignment via Reward Gradients

Real-Time Anomaly Detection and Reactive Planning with Large Language Models

Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

Map It Anywhere (MIA): Empowering Bird's Eye View Mapping using Large-scale Public Data

GTA: A Benchmark for General Tool Agents

OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects

Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models

SEED-Story: Multimodal Long Story Generation with Large Language Model

Generalizable Implicit Motion Modeling for Video Frame Interpolation

Towards Building Specialized Generalist AI with System 1 and System 2 Fusion

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Autoregressive Speech Synthesis without Vector Quantization

Converging Paradigms: The Synergy of Symbolic and Connectionist AI in LLM-Empowered Autonomous Agents

WildGaussians: 3D Gaussian Splatting in the Wild

Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

Gradient Boosting Reinforcement Learning

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Toto: Time Series Optimized Transformer for Observability

Controlling Space and Time with Diffusion Models

BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark

PaliGemma: A versatile 3B VLM for transfer

VEnhancer: Generative Space-Time Enhancement for Video Generation

On Leakage of Code Generation Evaluation Datasets

SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning

Video-to-Audio Generation with Hidden Alignment

CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging

Inference Performance Optimization for Large Language Models on CPUs

Scaling Up Personalized Aesthetic Assessment via Task Vector Customization

Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities

Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps

Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Self-Recognition in Language Models

RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

Vision language models are blind

RRM: Relightable assets using Radiance guided Material extraction

MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

VIMI: Grounding Video Generation through Multi-modal Instruction

A Survey on Mixture of Experts

Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Compositional Video Generation as Flow Equalization

On Speeding Up Language Model Evaluation

ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation

From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty

PAS: Data-Efficient Plug-and-Play Prompt Augmentation System

Distilling System 2 into System 1

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Large Language Models Understand Layouts

Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

Retrieved In-Context Principles from Previous Mistakes

An accurate detection is not all you need to combat label noise in web-noisy datasets

RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models

How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions

Granular Privacy Control for Geolocation with Vision Language Models

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

Associative Recurrent Memory Transformer

Revealing the Utilized Rank of Subspaces of Learning in Neural Networks

ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models

On scalable oversight with weak LLMs judging strong LLMs

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

PartCraft: Crafting Creative Objects by Parts

AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents

ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

Mixture of A Million Experts

DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

CRiM-GS: Continuous Rigid Motion-Aware Gaussian Splatting from Motion Blur Images

BM25S: Orders of magnitude faster lexical search via eager sparse scoring

AgentInstruct: Toward Generative Teaching with Agentic Flows

HEMM: Holistic Evaluation of Multimodal Foundation Models

Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents

Self-Evaluation as a Defense Against Adversarial Attacks on LLMs

How Does Quantization Affect Multilingual LLMs?

TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

Investigating Decoder-only Large Language Models for Speech-to-text Translation

GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models

Knowledge Composition using Task Vectors with Learned Anisotropic Scaling

PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

No Training, No Problem: Rethinking Classifier-Free Guidance for Diffusion Models

Reasoning in Large Language Models: A Geometric Perspective

A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Magic Insert: Style-Aware Drag-and-Drop

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Consistency Flow Matching: Defining Straight Flows with Velocity Consistency

TokenPacker: Efficient Visual Projector for Multimodal LLM

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models

Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

μ-Bench: A Vision-Language Benchmark for Microscopy Understanding

xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

AI Agents That Matter

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

RegMix: Data Mixture as Regression for Language Model Pre-training

LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Agentless: Demystifying LLM-based Software Engineering Agents

DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

ColPali: Efficient Document Retrieval with Vision Language Models

Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER

MIRAI: Evaluating LLM Agents for Event Forecasting

Searching for Best Practices in Retrieval-Augmented Generation

$\text{Memory}^3$: Language Modeling with Explicit Memory

Eliminating Position Bias of Language Models: A Mechanistic Approach

PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs

Towards Robust Speech Representation Learning for Thousands of Languages

InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation

Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

LiteSearch: Efficacious Tree Search for LLM

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models

UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI

T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

2406

Meta Large Language Model Compiler: Foundation Models of Compiler Optimization

Gemma 2: Improving Open Language Models at a Practical Size

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language

EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Applying RLAIF for Code Generation with API-usage in Lightweight LLMs

Understanding and Mitigating Language Confusion in LLMs

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models

Wavelets Are All You Need for Autoregressive Image Generation

Direct Preference Knowledge Distillation for Large Language Models

ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning

What Matters in Detecting AI-Generated Videos like Sora?

Instance-Optimal Private Density Estimation in the Wasserstein Distance

Investigating How Large Language Models Leverage Internal Knowledge to Perform Complex Reasoning

Dataset Size Recovery from LoRA Weights

ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

The Remarkable Robustness of LLMs: Stages of Inference?

TabReD: A Benchmark of Tabular Machine Learning in-the-Wild

Efficient World Models with Context-Aware Tokenization

LiveBench: A Challenging, Contamination-Free LLM Benchmark

From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation

Revealing Fine-Grained Values and Opinions in Large Language Models

Aligning Teacher with Student Preferences for Tailored Training Data Generation

Simulating Classroom Education with LLM-Empowered Agents

T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation

MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation

RouteLLM: Learning to Route LLMs with Preference Data

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Symbolic Learning Enables Self-Evolving Agents

MatchTime: Towards Automatic Soccer Game Commentary Generation

ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enhanced Quality

A Closer Look into Mixture-of-Experts in Large Language Models

ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models

Poisoned LangChain: Jailbreak LLMs by LangChain

ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs

Octo-planner: On-device Language Model for Planner-Action Agents

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Fast and Uncertainty-Aware SVBRDF Recovery from Multi-View Capture using Frequency Domain Analysis

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

DiffusionPDE: Generative PDE-Solving Under Partial Observation

MotionBooth: Motion-Aware Customized Text-to-Video Generation

Following Length Constraints in Instructions

Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients

Aligning Diffusion Models with Noise-Conditioned Perception

LongIns: A Challenging Long-context Instruction-based Exam for LLMs

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

Multi-property Steering of Large Language Models with Dynamic Activation Composition

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Benchmarking Mental State Representations in Language Models

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

D2LLM: Decomposed and Distilled Large Language Models for Semantic Search

Unlocking Continual Learning Abilities in Language Models

Large Language Models Assume People are More Rational than We Really are

Understanding and Diagnosing Deep Reinforcement Learning

FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Long Context Transfer from Language to Vision

RaTEScore: A Metric for Radiology Report Generation

ClotheDreamer: Text-Guided Garment Generation with 3D Gaussians

Adam-mini: Use Fewer Learning Rates To Gain More

OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?

WARP: On the Benefits of Weight Averaged Rewarded Policies

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models

Scaling Laws for Linear Complexity Language Models

Repulsive Score Distillation for Diverse Sampling of Diffusion Models

Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

On the Transformations across Reward Model, Parameter Update, and In-Context Prompt

EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records

VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals

Video-Infinity: Distributed Long Video Generation

Confidence Regulation Neurons in Language Models

Preference Tuning For Toxicity Mitigation Generalizes Across Languages

Evaluating D-MERIT of Partial-annotation on Information Retrieval

Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

What Matters in Transformers? Not All Attention is Needed

Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking

Image Conductor: Precision Control for Interactive Video Synthesis

LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

Cognitive Map for Language Models: Optimal Planning via Verbally Representing the World Model

MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems

Towards Retrieval Augmented Generation over Large Video Libraries

MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

ToVo: Toxicity Taxonomy via Voting

Efficient Continual Pre-training by Mitigating the Stability Gap

How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions

Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation

Can LLMs Learn by Teaching? A Preliminary Study

Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

IRASim: Learning Interactive Real-Robot Action Simulators

Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Instruction Pre-Training: Language Models are Supervised Multitask Learners

Jailbreaking as a Reward Misspecification Problem

$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials

LiveMind: Low-latency Large Language Models with Simultaneous Inference

Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning

Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task

ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

Towards Event-oriented Long Video Understanding

How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics

Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models

PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

Adaptable Logical Control for Large Language Models

StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images

Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation

Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations

Improving Visual Commonsense in Language Models via Multiple Image Generation

Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

4K4DGen: Panoramic 4D Generation at 4K Resolution

EvTexture: Event-driven Texture Enhancement for Video Super-Resolution

Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View Images

VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models

Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models

DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion Models

GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks

Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation

VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing

From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries

Adversarial Attacks on Multimodal Agents

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Measuring Psychological Depth in Language Models

Estimating Knowledge in Large Language Models Without Generating a Single Token

Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models

Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP

RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation

Low-Resource Machine Translation through the Lens of Personalized Federated Learning

HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors

PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers

Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment

JEN-1 DreamStyler: Customized Musical Concept Learning via Pivotal Parameters Tuning

VoCo-LLaMA: Towards Vision Compression with Large Language Models

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

TroL: Traversal of Layers for Large Language and Vision Models

Interface Design for Self-Supervised Speech Models

BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM

Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Learning Molecular Representation in a Cell

Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning

$τ$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models

Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts

Large Scale Transfer Learning for Tabular Data via Language Modeling

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark

AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology

Mixture-of-Subspaces in Low-Rank Adaptation

DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

mDPO: Conditional Preference Optimization for Multimodal Large Language Models

Unveiling Encoder-Free Vision-Language Models

WPO: Enhancing RLHF with Weighted Preference Optimization

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

VideoLLM-online: Online Video Large Language Model for Streaming Video

RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

DataComp-LM: In search of the next generation of training sets for language models

Task Me Anything

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Measuring memorization in RLHF for code completion

Nemotron-4 340B Technical Report

Tokenization Falling Short: The Curse of Tokenization

Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming

DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling

Long Code Arena: a Set of Benchmarks for Long-Context Code Models

Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

HARE: HumAn pRiors, a key to small language model Efficiency

Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report

Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis

A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion

Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation

AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation

The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing

From Pixels to Prose: A Large Dataset of Dense Image Captions

L4GM: Large 4D Gaussian Reconstruction Model

Beyond Words: On Large Language Models Actionability in Mission-Critical Risk Analysis

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Make It Count: Text-to-Image Generation with an Accurate Number of Objects

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering

MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Training-free Camera Control for Video Generation

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

GaussianSR: 3D Gaussian Super-Resolution with 2D Diffusion Priors

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

GEB-1.3B: Open Lightweight Large Language Model

Bootstrapping Language Models with DPO Implicit Rewards

Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Decoding the Diversity: A Review of the Indic AI Research Landscape

Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness

Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Depth Anything V2

Interpreting the Weight Space of Customized Diffusion Models

Explore the Limits of Omni-modal Pretraining at Scale

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

LRM-Zero: Training Large Reconstruction Models with Synthesized Data

Understanding Hallucinations in Diffusion Models through Mode Interpolation

CMC-Bench: Towards a New Paradigm of Visual Signal Compression

Transformers meet Neural Algorithmic Reasoners

Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

OpenVLA: An Open-Source Vision-Language-Action Model

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning

AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis

Cognitively Inspired Energy-Based World Models

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

HelpSteer2: Open-source dataset for training top-performing reward models

Vivid-ZOO: Multi-View Video Generation with Diffusion Model

Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

DiTFastAttn: Attention Compression for Diffusion Transformer Models

RVT-2: Learning Precise Manipulation from Few Demonstrations

Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

Real3D: Scaling Up Large Reconstruction Models with Real-World Images

What If We Recaption Billions of Web Images with LLaMA-3?

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Discovering Preference Optimization Algorithms with and for Large Language Models

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation

Is Programming by Example solved by LLMs?

Can Large Language Models Analyze Software Failures in the News? An End-to-End Automated Pipeline with FAIL

Transformer-based Model for ASR N-Best Rescoring and Rewriting

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio

Multimodal Table Understanding

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

Large Language Model Unlearning via Embedding-Corrupted Prompts

Designing a Dashboard for Transparency and Control of Conversational AI

Hierarchical Patch Diffusion Models for High-Resolution Video Generation

AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

An Image is Worth 32 Tokens for Reconstruction and Generation

Zero-shot Image Editing with Reference Imitation

Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

Simple and Effective Masked Diffusion Language Models

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Neural Gaffer: Relighting Any Object via Diffusion

Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement

TextGrad: Automatic "Differentiation" via Text

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Estimating the Hallucination Rate of Generative AI

McEval: Massively Multilingual Code Evaluation

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B

World Models with Hints of Large Language Models for Goal Achieving

Needle In A Multimodal Haystack

Merging Improves Self-Critique Against Jailbreak Attacks

TernaryLLM: Ternarized Large Language Model

Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement

Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Synthetic Query Generation using Large Language Models for Virtual Assistants

SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound

The Prompt Report: A Systematic Survey of Prompting Techniques

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering

Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

IllumiNeRF: 3D Relighting without Inverse Rendering

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing

Towards a Personal Health Large Language Model

Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning

How Far Can Transformers Reason? The Locality Barrier and Inductive Scratchpad

VCR: Visual Caption Restoration

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

On the Minimal Degree Bias in Generalization on the Unseen for non-Boolean Functions

Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching

Tx-LLM: A Large Language Model for Therapeutics

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

MaskLID: Code-Switching Language Identification through Iterative Masking

Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis

ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models

Vript: A Video Is Worth Thousands of Words

ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

Attention as a Hypernetwork

Unified Text-to-Image Generation and Retrieval

MLCM: Multistep Consistency Distillation of Latent Diffusion Model

GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement

Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

MotionClone: Training-Free Motion Cloning for Controllable Video Generation

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

Hibou: A Family of Foundational Vision Transformers for Pathology

SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

CRAG -- Comprehensive RAG Benchmark

Mixture-of-Agents Enhances Large Language Model Capabilities

Learning Task Decomposition to Assist Humans in Competitive Programming

Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach

Proofread: Fixes All Errors with One Tap

NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

Time Sensitive Knowledge Editing through Efficient Finetuning

GenAI Arena: An Open Evaluation Platform for Generative Models

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Large Language Model Confidence Estimation via Black-Box Access

Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

Simplified and Generalized Masked Diffusion for Discrete Data

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

SF-V: Single Forward Video Generation Model

Chimera: Effectively Modeling Multivariate Time Series with 2-Dimensional State Space Models

Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

VideoTetris: Towards Compositional Text-to-Video Generation

Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models

Open-Endedness is Essential for Artificial Superhuman Intelligence

Hypernetworks for Personalizing ASR to Atypical Speech

Confabulation: The Surprising Value of Large Language Model Hallucinations

AgentGym: Evolving Large Language Model-based Agents across Diverse Environments

Are We Done with MMLU?

Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation

Evaluating the World Model Implicit in a Generative Model

Enhancing CTC-based speech recognition with diverse modeling units

Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion

LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes

PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs

PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

Xmodel-LM Technical Report

Item-Language Model for Conversational Recommendation

RATT: A Thought Structure for Coherent and Correct LLM Reasoning

Block Transformer: Global-to-Local Language Modeling for Fast Inference

To Believe or Not to Believe Your LLM

Parrot: Multilingual Visual Instruction Tuning

Scalable MatMul-free Language Modeling

RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

Guiding a Diffusion Model with a Bad Version of Itself

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Improved Modelling of Federated Datasets using Mixtures-of-Dirichlet-Multinomials

Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning

I4VGen: Image as Stepping Stone for Text-to-Video Generation

OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models

Self-Improving Robust Preference Optimization

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

The Geometry of Categorical and Hierarchical Concepts in Large Language Models

Learning Temporally Consistent Video Depth from Video Diffusion Priors

pOps: Photo-Inspired Diffusion Operators

Towards Scalable Automated Alignment of LLMs: A Survey

Luna: An Evaluation Foundation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

Show, Don't Tell: Aligning Language Models with Demonstrated Feedback

Improving GFlowNets for Text-to-Image Diffusion Alignment

Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning

$μ$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

2405

PaliGemma

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling

SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales

4Diffusion: Multi-view Video Diffusion Model for 4D Generation

Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Xwin-LM: Strong and Scalable Alignment Practice for LLMs

GECO: Generative Image-to-3D within a SECOnd

DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

Grokfast: Accelerated Grokking by Amplifying Slow Gradients

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning

PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting

Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories

DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark

Why Larger Language Models Do In-context Learning Differently?

Contrasting Multiple Representations with the Multi-Marginal Matching Gap

Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

NPGA: Neural Parametric Gaussian Avatars

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Nearest Neighbor Speculative Decoding for LLM Generation and Attribution

Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF

Offline Regularised Reinforcement Learning for Large Language Models Alignment

EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

LLMs achieve adult human performance on higher-order theory of mind tasks

T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

Contextual Position Encoding: Learning to Count What's Important

Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities

Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

GFlow: Recovering 4D World from Monocular Video

3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting

Phased Consistency Model

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models

Faithful Logical Reasoning via Symbolic Chain-of-Thought

4-bit Shampoo for Memory-Efficient Network Training

2BP: 2-Stage Backpropagation

VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections

Yuan 2.0-M32: Mixture of Experts with Attention Router

Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation

Matryoshka Multimodal Models

Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control

Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer

THREAD: Thinking Deeper with Recursive Spawning

Transformers Can Do Arithmetic with the Right Embeddings

Trans-LoRA: towards data-free Transferable Parameter Efficient Finetuning

An Introduction to Vision-Language Modeling

Position: Foundation Agents as the Paradigm Shift for Decision Making

Part123: Part-aware 3D Reconstruction from a Single-view Image

Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

The Road Less Scheduled

Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

Are Long-LLMs A Necessity For Long-Context Tasks?

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models

OptLLM: Optimal Assignment of Queries to Large Language Models

HDR-GS: Efficient High Dynamic Range Novel View Synthesis at 1000x Speed via Gaussian Splatting

Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

Aya 23: Open Weight Releases to Further Multilingual Progress

AGRaME: Any-Granularity Ranking with Multi-Vector Embeddings

CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner

Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining

AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct

NeRF-Casting: Improved View-Dependent Appearance with Consistent Reflections

Improved Distribution Matching Distillation for Fast Image Synthesis

Tele-Aloha: A Low-budget and High-authenticity Telepresence System Using Sparse RGB Cameras

Not All Language Model Features Are Linear

Semantica: An Adaptable Image-Conditioned Diffusion Model

Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling

Lessons from the Trenches on Reproducible Evaluation of Language Models

SimPO: Simple Preference Optimization with a Reference-Free Reward

RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models

DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

Agent Planning with World Knowledge Model

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

Distributed Speculative Inference of Large Language Models

Attention as an RNN

Affine-based Deformable Attention and Selective Fusion for Semi-dense Matching

ReVideo: Remake a Video with Motion and Content Control

Thermodynamic Natural Gradient Descent

Dense Connector for MLLMs

CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers

KPConvX: Modernizing Kernel Point Convolution with Kernel Attention

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

Personalized Residuals for Concept-Driven Text-to-Image Generation

Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models

Retrieval-Augmented Language Model for Extreme Multi-Label Knowledge Graph Link Prediction

Quantifying Emergence in Large Language Models

Diffusion for World Modeling: Visual Details Matter in Atari

Your Transformer is Secretly Linear

Octo: An Open-Source Generalist Robot Policy

Training Data Attribution via Approximate Unrolled Differentiation

MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning

Imp: Highly Capable Large Multimodal Models for Mobile Devices

On Efficient and Statistical Quality Estimation for Data Annotation

Information Leakage from Embedding in Large Language Models

SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization

FIFO-Diffusion: Generating Infinite Videos from Text without Training

Dreamer XL: Towards High-Resolution Text-to-3D Generation via Trajectory Score Matching

Towards Modular LLMs by Building and Reusing a Library of LoRAs

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Observational Scaling Laws and the Predictability of Language Model Performance

Efficient Multimodal Large Language Models: A Survey

INDUS: Effective and Efficient Language Models for Scientific Applications

Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Dynamic data sampler for cross-language transfer learning in large language models

Grounded 3D-LLM with Referent Tokens

Toon3D: Seeing Cartoons from a New Perspective

TRANSIC: Sim-to-Real Policy Transfer by Learning from Online Correction

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

How Far Are We From AGI

Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Many-Shot In-Context Learning in Multimodal Foundation Models

LoRA Learns Less and Forgets Less

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

Naturalistic Music Decoding from EEG Data via Latent Diffusion Models

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

Risks and Opportunities of Open-Source Generative AI

Understanding the performance gap between online and offline alignment algorithms

No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding

SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

SpeechVerse: A Large-scale Generalizable Audio Language Model

Compositional Text-to-Image Generation with Dense Blob Representations

Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning

A Survey of Large Language Models for Graphs

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

The Platonic Representation Hypothesis

PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Zero-Shot Tokenizer Transfer

RLHF Workflow: From Reward Modeling to Online RLHF

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

PromptLink: Leveraging Large Language Models for Cross-Source Biomedical Concept Linking

LogoMotion: Visually Grounded Code Generation for Content-Aware Animation

Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training

SUTRA: Scalable Multilingual Language Model Architecture

Large Language Models as Planning Domain Generators

Linearizing Large Language Models

Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval

A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

You Only Cache Once: Decoder-Decoder Architectures for Language Models

ChuXin: 1.6B Technical Report

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

xLSTM: Extended Long Short-Term Memory

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Granite Code Models: A Family of Open Foundation Models for Code Intelligence

ContextQ: Generated Questions to Support Meaningful Parent-Child Dialogue While Co-Reading

AlphaMath Almost Zero: Process Supervision without Process

MAmmoTH2: Scaling Instructions from the Web

Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond

Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

Parameter-Efficient Fine-Tuning with Discrete Fourier Transform

Is Flash Attention Stable?

What matters when building vision-language models?

Optimization without Retraction on the Random Generalized Stiefel Manifold

Customizing Text-to-Image Models with a Single Image Pair

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

FLAME: Factuality-Aware Alignment for Large Language Models

NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment

WildChat: 1M ChatGPT Interaction Logs in the Wild

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

LLM-AD: Large Language Model based Audio Description System

LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

Spectrally Pruned Gaussian Fields with Neural Compensation

Self-Play Preference Optimization for Language Model Alignment

Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3

A Note on Large Sums of Divisor-Bounded Multiplicative Functions

BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge

STT: Stateful Tracking with Transformers for Autonomous Driving

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

Constrained Decoding for Secure Code Generation

A Primer on the Inner Workings of Transformer-based Language Models

SPAFIT: Stratified Progressive Adaptation Fine-tuning for Pre-trained Large Language Models

In-Context Learning with Long-Context Models: An In-Depth Exploration

Automatic Creative Selection with Cross-Modal Matching

2404

OpenEQA: Embodied Question Answering in the Era of Foundation Models

CodeGemma: Open Code Models Based on Gemma

Lightplane: Highly-Scalable Components for Neural 3D Fields

MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model

Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting

KAN: Kolmogorov-Arnold Networks

DOCCI: Descriptions of Connected and Contrasting Images

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Better & Faster Large Language Models via Multi-token Prediction

Iterative Reasoning Preference Optimization

When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively

GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

Extending Llama-3's Context Ten-Fold Overnight

RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing

MicroDreamer: Zero-shot 3D Generation in ~20 Seconds by Score-based Iterative Reconstruction

InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation

Octopus v4: Graph of language models

SAGS: Structure-Aware 3D Gaussian Splatting

In-Context Symbolic Regression: Leveraging Large Language Models for Function Discovery

Hallucination of Multimodal Large Language Models: A Survey

Stylus: Automatic Adapter Selection for Diffusion Models

Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

PECC: Problem Extraction and Coding Challenges

ChatGPT as an inventor: Eliciting the strengths and weaknesses of current large language models against humans in engineering design

Capabilities of Gemini Models in Medicine

LEGENT: Open Platform for Embodied Agents

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

BlenderAlchemy: Editing 3D Graphics with Vision-Language Models

MaPa: Text-driven Photorealistic Material Painting for 3D Shapes

Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM

Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study

PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

Make Your LLM Fully Utilize the Context

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving

Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding

Tele-FLM Technical Report

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Interactive3D: Create What You Want by Interactive 3D Generation

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

NeRF-XL: Scaling NeRFs with Multiple GPUs

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

MaGGIe: Masked Guided Gradual Human Instance Matting

MoDE: CLIP Data Experts via Clustering

Editable Image Elements for Controllable Synthesis

PuLID: Pure and Lightning ID Customization via Contrastive Alignment

Leveraging Large Language Models for Multimodal Search

MotionMaster: Training-free Camera Motion Transfer For Video Generation

BASS: Batched Attention-optimized Speculative Sampling

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning

XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference

Label-Efficient Sleep Staging Using Transformers Pre-trained with Position Prediction

Multi-Head Mixture-of-Experts

Transformers Can Represent n-gram Language Models

Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models

FlashSpeech: Efficient Zero-Shot Speech Synthesis

Pegasus-1 Technical Report

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

Align Your Steps: Optimizing Sampling Schedules in Diffusion Models

SnapKV: LLM Knows What You are Looking for Before Generation

Learning H-Infinity Locomotion Control

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

A Multimodal Automated Interpretability Agent

Better Synthetic Data by Retrieving and Transforming Existing Datasets

Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer

A Survey on Efficient Inference for Large Language Models

MultiBooth: Towards Generating All Your Concepts in an Image from Text

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

How Good Are Low-bit Quantized LLAMA3 Models? An Empirical Study

Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis

Music Consistency Models

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

FlowMind: Automatic Workflow Generation with LLMs

PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation

Stronger Random Baselines for In-Context Learning

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Towards Reliable Latent Knowledge Estimation in LLMs: In-Context Learning vs. Prompting Based Factual Knowledge Extraction

LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency

How Far Can We Go with Practical Function-Level Program Repair?

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation

Does Gaussian Splatting need SFM Initialization?

HalluciBot: Is There No Such Thing as a Bad Question?

BLINK: Multimodal Large Language Models Can See but Not Perceive

Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

MeshLRM: Large Reconstruction Model for High-Quality Mesh

From r to Q*: Your Language Model is Secretly a Q-Function

AniClipart: Clipart Animation with Text-to-Video Priors

Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment

Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Introducing v0.5 of the AI Safety Benchmark from MLCommons

OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data

EdgeFusion: On-Device Text-to-Image Generation

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Dynamic Typography: Bringing Words to Life

The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey

MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation

Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent

Many-Shot In-Context Learning

A Survey on Retrieval-Augmented Text Generation for Large Language Models

HumMUSS: Human Motion Understanding using State Space Models

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs

Social Choice for AI Alignment: Dealing with Diverse Human Feedback

Private Vector Mean Estimation in the Shuffle Model: Optimal Rates Require Many Messages

How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs’ internal prior

Scaling Instructable Agents Across Many Simulated Worlds

Chinchilla Scaling: A replication attempt

Taming Latent Diffusion Model for Neural Radiance Field Inpainting

MMInA: Benchmarking Multihop Multimodal Internet Agents

HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing

CTRL-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Compression Represents Intelligence Linearly

Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video

Learn Your Reference Model for Real Good Alignment

State Space Model for New-Generation Network Alternative to Transformers: A Survey

CompGS: Efficient 3D Scene Representation via Compressed Gaussian Splatting

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

TransformerFAM: Feedback attention is working memory

LLM In-Context Recall is Prompt Dependent

On Speculative Decoding for Multimodal Large Language Models

The Illusion of State in State-Space Models

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

COCONut: Modernizing COCO Segmentation

Probing the 3D Awareness of Visual Foundation Models

Pre-training Small Base LMs with Fewer Tokens

On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation

Dataset Reset Policy Optimization for RLHF

MonoPatchNeRF: Improving Neural Radiance Fields with Patch-based Monocular Guidance

Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

Reducing hallucination in structured outputs via Retrieval-Augmented Generation

Conformal Prediction via Regression-as-Classification

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

LLoCO: Learning Long Contexts Offline

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

OSWORLD: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

RHO-1: Not All Tokens Are What You Need

HGRN2: Gated Linear RNNs with State Expansion

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Sparse Laneformer

Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models

Audio Dialogues: Dialogues dataset for audio and music understanding

From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples

Best Practices and Lessons Learned on Synthetic Data for Language Models

Transferable and Principled Efficiency for Open-Vocabulary Segmentation

JetMoE: Reaching Llama2 Performance with 0.1M Dollars

BISCUIT: Scaffolding LLM-Generated Code with Ephemeral UIs in Computational Notebooks

BRAVE: Broadening the visual encoding of vision-language models

RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation

DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting

Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

Adapting LLaMA Decoder to Vision Transformer

RULER: What's the Real Context Size of Your Long-Context Language Models?

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Reconstructing Hand-Held Objects in 3D

pfl-research: simulation framework for accelerating research in Private Federated Learning

Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

MuPT: A Generative Symbolic Music Pretrained Transformer

OmniFusion Technical Report

Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models

Revising Densification in Gaussian Splatting

Hash3D: Training-free Acceleration for 3D Generation

Privacy Preserving Prompt Engineering: A Survey

THOUGHTSCULPT: Reasoning with Intermediate Revision and Search

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents

Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

CodecLM: Aligning Language Models with Tailored Synthetic Data

SambaLingo: Teaching Large Language Models New Languages

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

Evaluating Mathematical Reasoning Beyond Accuracy

MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

YaART: Yet Another ART Rendering Technology

UniFL: Improve Stable Diffusion via Unified Feedback Learning

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators

Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers

ByteEdit: Boost, Comply and Accelerate Generative Image Editing

BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

DATENeRF: Depth-Aware Text-based Editing of NeRFs

Q-PEFT: Query-dependent Parameter Efficient Fine-tuning for Text Reranking with Large Language Models

Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models

Aligning Diffusion Models by Optimizing Human Utility

PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations

Koala: Key frame-conditioned long video-LLM

SpatialTracker: Tracking Any 2D Pixels in 3D Space

Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Robust Gaussian Splatting

Social Skill Training with Large Language Models

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

BuDDIE: A Business Document Dataset for Multi-task Information Extraction

Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data

CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

Stream of Search (SoS): Learning to Search in Language

RL for Consistency Models: Faster Reward Guided Text-to-Image Generation

CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent

Training LLMs over Neurally Compressed Text

ReFT: Representation Finetuning for Language Models

PointInfinity: Resolution-Invariant Point Diffusion Models

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?

Scaling Up Video Summarization Pretraining with Large Language Models

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference

PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline

On the Scalability of Diffusion-based Text-to-Image Generation

Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Advancing LLM Reasoning Generalists with Preference Trees

Long-context LLMs Struggle with Long In-context Learning

HyperCLOVA X Technical Report

Poro 34B and the Blessing of Multilinguality

Octopus v2: On-device language model for super agent

Entity Disambiguation via Fusion Entity Decoding

LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models

Are large language models superhuman chemists?

LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

HairFastGAN: Realistic and Robust Hair Transfer with a Fast Encoder-Based Approach

2403

ReALM: Reference Resolution As Language Modeling

Gecko: Versatile Text Embeddings Distilled from Large Language Models

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

DiJiang: Efficient Large Language Models through Compact Kernelization

Jamba: A Hybrid Transformer-Mamba Language Model

Localizing Paragraph Memorization in Language Models

Model Stock: All we need is just a few fine-tuned models

sDPO: Don’t Use Your Data All at Once

Learning From Correctness Without Prompting Makes LLM Efficient Reasoner

Towards a World-English Language Model for On-Device Virtual Assistants

BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models

The Unreasonable Ineffectiveness of the Deeper Layers

Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Embedding Pose Graph, Enabling 3D Foundation Model Capabilities with a Compact Representation

Arcee’s MergeKit: A Toolkit for Merging Large Language Models

Evolutionary Optimization of Model Merging Recipes

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

LLM as a System Service on Mobile Devices

RAFT: Adapting Language Model to Domain Specific RAG

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

WavCraft: Audio Editing and Generation with Large Language Models

Gemma: Open Models Based on Gemini Research and Technology

A Direct Algorithm for Multi-Gyroscope Infield Calibration

Process Modeling With Large Language Models

VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Poly-View Contrastive Learning

Is Cosine-Similarity of Embeddings Really About Similarity?

How Far Are We from Intelligent Visual Deductive Reasoning?

Learning to Decode Collaboratively with Multiple Language Models

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

On a Neural Implementation of Brenier's Polar Factorization

LAB: Large-Scale Alignment for ChatBots

CLLMs: Consistency Large Language Models

2402

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT

Training Neural Networks from Scratch with Parallel Low-Rank Adapters

FuseChat: Knowledge Fusion of Chat Models

OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining

Divide-or-Conquer? Which Part Should You Distill Your LLM?

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models

OmniPred: Language Models as Universal Regressors

Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance

KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge

A Survey on Knowledge Distillation of Large Language Models

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation

OneBit: Towards Extremely Low-bit Large Language Models

LaCo: Large Language Model Pruning via Layer Collapse

Speculative Streaming: Fast LLM Inference without Auxiliary Models

Masked Attention is All You Need for Graphs

TOAD: Task-Oriented Automatic Dialogs with Diverse Response Styles

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

DoRA: Weight-Decomposed Low-Rank Adaptation

Higher Layers Need More LoRA Experts

On Computationally Efficient Multi-Class Calibration

X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models with Applications in Protein Mechanics and Molecular Design

Accurate LoRA-Finetuning Quantization of LLMs via Information Retention

More Agents Is All You Need

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

DISTILLM: Towards Streamlined Distillation for Large Language Models

ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs

Careful with that Scalpel: Improving Gradient Surgery with an EMA

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

Executable Code Actions Elicit Better LLM Agents

2401

DressCode: Autoregressively Sewing and Generating Garments from Text Guidance

Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

SliceGPT: Compress Large Language Models by Deleting Rows and Columns

Omnipredictors for Regression and the Approximate Rank of Convex Functions

Demystifying Chains, Trees, and Graphs of Thoughts

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tuning Language Models by Proxy

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Heterogeneous LoRA for Federated Fine-tuning of On-Device Foundation Models

Extreme Compression of Large Language Models via Additive Quantization

Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages

LLaMA Pro: Progressive LLaMA with Block Expansion

RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models

2312

SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling

Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

How Smooth Is Attention?

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

KGLens: Towards Efficient and Effective Knowledge Probing of Large Language Models with Knowledge Graphs

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Retrieval-Augmented Generation for Large Language Models: A Survey

Conformer-Based Speech Recognition On Extreme Edge-Computing Devices

LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin

An LLM Compiler for Parallel Function Calling

Revisiting Non-separable Binary Classification and its Applications in Anomaly Detection

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices

2311

Diffusion Models Without Attention

Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models

Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

Swallowing the Bitter Pill: Simplified Scalable Conformer Generation

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

RELIC: Investigating Large Language Model Responses using Self-Consistency

Direct2.5: Diverse 3D Content Creation via Multi-view 2.5D Diffusion

PaSS: Parallel Speculative Sampling

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

MultiLoRA: Democratizing LoRA for Better Multi-Task Learning

PINE: Efficient Norm-Bound Verification for Secret-Shared Vectors

Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying

PLUG: Leveraging Pivot Language in Cross-Lingual Instruction Tuning

Transfer Learning for Structured Pruning under Limited Task Data

Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization

Prompt Sketching for Large Language Models

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch

Server-side Rescoring of Spoken Entity-centric Knowledge Queries for Virtual Assistants

FlashDecoding++: Faster Large Language Model Inference on GPUs

Efficient LLM Inference on CPUs

2310

EELBERT: Tiny Models through Dynamic Embeddings

Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE

LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery

FP8-LM: Training FP8 Large Language Models

Large Language Models as Generalizable Policies for Embodied Tasks

LLM-FP4: 4-Bit Floating-Point Quantized Transformers

QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

Matryoshka Diffusion Models

We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

SPEED: Speculative Pipelined Execution for Efficient Decoding

VeRA: Vector-based Random Matrix Adaptation

BitNet: Scaling 1-bit Transformers for Large Language Models

When Can Transformers Reason With Abstract Symbols?

LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models

Pseudo-Generalized Dynamic View Synthesis from a Video

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Generative Modeling with Phase Stochastic Bridges

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling

Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages

ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

Improved Baselines with Visual Instruction Tuning

Large Language Models as Analogical Reasoners

Compressing LLMs: The Truth is Rarely Pure and Never Simple

Federated Learning with Differential Privacy for End-to-End Speech Recognition

Towards Automated Accessibility Report Generation for Mobile Apps

2309

Efficient Streaming Language Models with Attention Sinks

Guiding Instruction-based Image Editing via Multimodal Large Language Models

Vision Transformers Need Registers

QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Efficient Memory Management for Large Language Model Serving with PagedAttention

Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning

From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

LLMCad: Fast and Scalable On-device Large Language Model Inference

2308

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library

Fast Feedforward Networks

EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

NimbRo wins ANA Avatar XPRIZE Immersive Telepresence Competition: Human-Centric Evaluation and Lessons Learned

Reinforced Self-Training (ReST) for Language Modeling

A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations

Accelerating LLM Inference with Staged Speculative Decoding

AgentBench: Evaluating LLMs as Agents

Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty

2307

Samplable Anonymous Aggregation for Private Federated Data Analysis

ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

2306

MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators

MiniLLM: Knowledge Distillation of Large Language Models

MOFI: Learning Image Representation from Noisy Entity Annotated Images

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

TIES-Merging: Resolving Interference When Merging Models

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Bytes Are All You Need: Transformers Operating Directly On File Bytes

2305

LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Scaling Data-Constrained Language Models

Manifold Diffusion Fields

QLoRA: Efficient Finetuning of Quantized LLMs

Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

RWKV: Reinventing RNNs for the Transformer Era

Accurate Knowledge Distillation with n-best Reranking

LLM-Pruner: On the Structural Pruning of Large Language Models

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

Fast Distributed Inference Serving for Large Language Models

Shap-E: Generating Conditional 3D Implicit Functions

2304

Are Emergent Abilities of Large Language Models a Mirage?

Visual Instruction Tuning

2303

A Survey of Large Language Models

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Sigmoid Loss for Language Image Pre-Training

Sparks of Artificial General Intelligence: Early experiments with GPT-4

ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

2302

Full Stack Optimization of Transformer Inference: a Survey

Active Prompting with Chain-of-Thought for Large Language Models

RETVec: Resilient and Efficient Text Vectorizer

Offsite-Tuning: Transfer Learning without Full Model

Accelerating Large Language Model Decoding with Speculative Sampling

2301

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

Muse: Text-To-Image Generation via Masked Generative Transformers

2212

Large Language Models Are Reasoning Teachers

2211

Fast Inference from Transformers via Speculative Decoding

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production

2210

Deploying a Retrieval based Response Model for Task Oriented Dialogues

ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs

Less is More: Task-aware Layer-wise Distillation for Language Model Compression

2209

FP8 Formats for Deep Learning

2208

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

2207

Confident Adaptive Language Modeling

2206

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

NIPQ: Noise proxy-based Integrated Pseudo-Quantization

2205

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning

Towards Understanding Grokking: An Effective Theory of Representation Learning

Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

2204

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

2203

A Survey of Multi-Tenant Deep Learning Inference on GPU

2202

cosFormer: Rethinking Softmax in Attention

2201

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

2112

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2110

Scalable Smartphone Cluster for Deep Learning

Understanding Dimensional Collapse in Contrastive Self-supervised Learning

2106

LibShalom: Optimizing Small and Irregular-Shaped Matrix Multiplications on ARMv8 Multi-Cores

LoRA: Low-Rank Adaptation of Large Language Models

XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation

2105

A Survey of Data Augmentation Approaches for NLP

2104

RoFormer: Enhanced Transformer with Rotary Position Embedding

The Power of Scale for Parameter-Efficient Prompt Tuning

2101

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Prefix-Tuning: Optimizing Continuous Prompts for Generation

2010

Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

TurboTransformers: An Efficient GPU Serving System For Transformer Models

2009

Flexible Performant GEMM Kernels on GPUs

2007

Soft Labeling Affects Out-of-Distribution Detection of Deep Neural Networks

2005

Language Models are Few-Shot Learners

BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs

2004

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

2003

Transformer++

2002

GLU Variants Improve Transformer

1910

Depth-Adaptive Transformer

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

1909

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

TinyBERT: Distilling BERT for Natural Language Understanding

1908

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

1907

RoBERTa: A Robustly Optimized BERT Pretraining Approach

1906

How multilingual is Multilingual BERT?

1905

HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization

1810

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

1805

Online normalizer calculation for softmax

1803

NVIDIA Tensor Core Programmability, Performance & Precision

1706

Attention Is All You Need

1506

Pointer Networks

About

nlp, ai, papers, nlp-papers, llm, nlp-paper-summarization

18 Stars

2 Forks

Watchers

Owner
