Paper List
2501
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
Unifying Specialized Visual Encoders for Video Language Models
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Nested Attention: Semantic-aware Attention Values for Concept Personalization
SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
ProgCo: Program Helps Self-Correction of Large Language Models
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
A3: Android Agent Arena for Mobile GUI Agents
Graph Generative Pre-trained Transformer
Dynamic Scaling of Unit Tests for Code Reward Modeling
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Population Aware Diffusion for Time Series Generation
Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding
Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
MLLM-as-a-Judge for Image Safety without Human Labeling
LTX-Video: Realtime Video Latent Diffusion
2412
PERSE: Personalized 3D Generative Avatars from A Single Portrait
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
Aviary: training language agents on challenging scientific tasks
PyG-SSL: A Graph Self-Supervised Learning Toolkit
Facilitating large language model Russian adaptation with Learned Embedding Propagation
Training Software Engineering Agents and Verifiers with SWE-Gym
Edicho: Consistent Image Editing in the Wild
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
MapQaTor: A System for Efficient Annotation of Map Query Datasets
Efficiently Serving LLM Reasoning Programs with Certaindex
Slow Perception: Let's Perceive Geometric Figures Step-by-step
Bringing Objects to Life: 4D generation from 3D objects
On the Compositional Generalization of Multimodal LLMs for Medical Imaging
OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
Toward Adaptive Reasoning in Large Language Models with Thought Rollback
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
Xmodel-2 Technical Report
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
Introduction to Graph Neural Networks: A Starting Point for Machine Learning Engineers
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Token-Budget-Aware LLM Reasoning
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
GeAR: Graph-enhanced Agent for Retrieval-augmented Generation
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
DepthLab: From Partial to Complete
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
WavePulse: Real-time Content Analytics of Radio Livestreams
Large Motion Video Autoencoding with Cross-modal Video VAE
Automating the Search for Artificial Life with Foundation Models
PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion
ResearchTown: Simulator of Human Research Community
The Superposition of Diffusion Models Using the Itô Density Estimator
In Case You Missed It: ARC 'Challenge' Is Not That Challenging
Deliberation in Latent Space via Differentiable Cache Augmentation
YuLan-Mini: An Open Data-efficient Language Model
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
VidTwin: Video VAE with Decoupled Structure and Dynamics
SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
Diving into Self-Evolving Training for Multimodal Reasoning
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
Better Think with Tables: Leveraging Tables to Enhance Large Language Model Comprehension
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
GraphAgent: Agentic Graph Language Assistant
System-2 Mathematical Reasoning via Enriched Instruction Tuning
Revisiting In-Context Learning with Long Context Language Models
OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
OpenAI o1 System Card
NILE: Internal Consistency Alignment in Large Language Models
LearnLM: Improving Gemini for Learning
Offline Reinforcement Learning for LLM Multi-Step Reasoning
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning
Fietje: An open, efficient LLM for Dutch
SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
Rethinking Uncertainty Estimation in Natural Language Generation
Parallelized Autoregressive Visual Generation
Outcome-Refining Process Supervision for Code Generation
Qwen2.5 Technical Report
AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
IDOL: Instant Photorealistic 3D Human Creation from a Single Image
RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
Progressive Multimodal Reasoning via Active Retrieval
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
How to Synthesize Text Data without Model Collapse?
TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Agent-SafetyBench: Evaluating the Safety of LLM Agents
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
A Survey on LLM Inference-Time Self-Improvement
PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
AniDoc: Animation Creation Made Easier
Learning from Massive Human Videos for Universal Humanoid Pose Control
FashionComposer: Compositional Fashion Image Generation
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities
Alignment faking in large language models
CAD-Recode: Reverse Engineering CAD Code from Point Clouds
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
GUI Agents: A Survey
DateLogicQA: Benchmarking Temporal Biases in Large Language Models
Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
Move-in-2D: 2D-Conditioned Human Motion Generation
Are Your LLMs Capable of Stable Reasoning?
VidTok: A Versatile and Open-Source Video Tokenizer
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
MIVE: New Design and Benchmark for Multi-Instance Video Editing
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
When to Speak, When to Abstain: Contrastive Decoding with Abstention
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
How to Choose a Threshold for an Evaluation Metric for Large Language Models
Causal Diffusion Transformers for Generative Modeling
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
Wonderland: Navigating 3D Scenes from a Single Image
IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
The Open Source Advantage in Large Language Models (LLMs)
Cost-Effective Label-free Node Classification with LLMs
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
Precise Length Control in Large Language Models
A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges
Stepwise Reasoning Error Disruption Attack of LLMs
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
ColorFlow: Retrieval-Augmented Image Sequence Colorization
Just a Simple Transformation is Enough for Data Protection in Vertical Federated Learning
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors
Sequence Matters: Harnessing Video Models in 3D Super-Resolution
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
Whisper-GPT: A Hybrid Representation Audio Large Language Model
Reliable, Reproducible, and Really Fast Leaderboards with Evalica
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
Smaller Language Models Are Better Instruction Evolvers
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
Superhuman performance of a large language model on the reasoning tasks of a physician
VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Generative AI in Medicine
SCBench: A KV Cache-Centric Analysis of Long-Context Methods
BrushEdit: All-In-One Image Inpainting and Editing
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Can LLMs Convert Graphs to Text-Attributed Graphs?
Large Action Models: From Inception to Implementation
Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images
Byte Latent Transformer: Patches Scale Better Than Tokens
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
Bridging AI and Science: Implications from a Large-Scale Literature Analysis of AI4Science
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
GenEx: Generating an Explorable World
LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
JuStRank: Benchmarking LLM Judges for System Ranking
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
Learned Compression for Compressed Learning
Word Sense Linking: Disambiguating Outside the Sandbox
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages
Arbitrary-steps Image Super-resolution via Diffusion Inversion
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
Phi-4 Technical Report
Large Concept Models: Language Modeling in a Sentence Representation Space
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
VisionArena: 230K Real World User-VLM Conversations with Preference Labels
StreamChat: Chatting with Streaming Video
ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation
Multimodal Latent Language Modeling with Next-Token Diffusion
FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
Learning Flow Fields in Attention for Controllable Person Image Generation
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
POINTS1.5: Building a Vision-Language Model towards Real World Applications
SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs
Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision?
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation
Video Motion Transfer with Diffusion Transformers
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
StyleMaster: Stylize Your Video with Artistic Generation and Translation
STIV: Scalable Text and Image Conditioned Video Generation
Granite Guardian
ObjCtrl-2.5D: Training-free Object Control with Camera Poses
The Pitfalls of Memorization: When Memorization Hurts Generalization
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Mobile Video Diffusion
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
Causal World Representation in the GPT Model
Contextualized Counterspeech: Strategies for Adaptation, Personalization, and Evaluation
Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation
HARP: Hesitation-Aware Reframing in Transformer Inference Pass
A New Federated Learning Framework Against Gradient Inversion Attacks
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
Asynchronous LLM Function Calling
AutoReason: Automatic Few-Shot Reasoning Decomposition
Fully Open Source Moxin-7B Technical Report
CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation
Training Large Language Models to Reason in a Continuous Latent Space
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
EMOv2: Pushing 5M Vision Model Frontier
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
MoViE: Mobile Diffusion for Video Editing
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
Normalizing Flows are Capable Generative Models
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis
KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models
Does RLHF Scale? Exploring the Impacts From Data, Model, and Method
PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations
Chimera: Improving Generalist Model with Domain-Specific Experts
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
RL Zero: Zero-Shot Language to Behaviors without any Supervision
Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
APOLLO: SGD-like Memory, AdamW-level Performance
Reinforcement Learning: An Overview
Mind the Time: Temporally-Controlled Multi-Event Video Generation
CompCap: Improving Multimodal Large Language Models with Composite Captions
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Evaluating and Aligning CodeLLMs on Human Preference
Exponential Speedups by Rerooting Levin Tree Search
LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
The Prompt Canvas: A Literature-Based Practitioner Guide for Creating Effective Prompts in Large Language Models
Frontier Models are Capable of In-context Scheming
Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference
DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling
Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
PanoDreamer: 3D Panorama Synthesis from a Single Image
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
REGENT: A Retrieval-Augmented Generalist Agent That Can Act In-Context in New Environments
Hidden in the Noise: Two-Stage Robust Watermarking for Images
REL: Working out is all you need
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification
NVILA: Efficient Frontier Visual Language Models
VisionZip: Longer is Better but Not Necessary in Vision Language Models
4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Establishing Task Scaling Laws via Compute-Efficient Model Ladders
Discriminative Fine-tuning of LVLMs
Challenges in Trustworthy Human Evaluation of Chatbots
Densing Law of LLMs
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing
SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
Monet: Mixture of Monosemantic Experts for Transformers
MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities
ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality
Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
A Noise is Worth Diffusion Guidance
Towards Data Governance of Frontier AI Models
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
Evaluating Language Models as Synthetic Data Generators
MV-Adapter: Multi-view Consistent Image Generation Made Easy
How to Correctly do Semantic Backpropagation on Language-based Agentic Systems
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
PaliGemma 2: A Family of Versatile VLMs for Transfer
Imagine360: Immersive 360 Video Generation from Perspective Anchor
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images
Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion
CleanDIFT: Diffusion Features without Noise
2DGS-Room: Seed-Guided 2D Gaussian Splatting with Geometric Constraints for High-Fidelity Indoor Scene Reconstruction
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
Weighted-Reward Preference Optimization for Implicit Model Fusion
Robust Multi-bit Text Watermark with LLM-based Paraphrasers
Mimir: Improving Video Diffusion Models for Precise Text Understanding
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models
SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance
Scaling Image Tokenizers with Grouped Spherical Quantization
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
DataLab: A Unified Platform for LLM-Powered Business Intelligence
Personalized Multimodal Large Language Models: A Survey
OmniCreator: Self-Supervised Unified Generation with Universal Editing
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
Free Process Rewards without Process Labels
MALT: Improving Reasoning with Multi-Agent LLM Training
Towards Universal Soccer Video Understanding
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Structured 3D Latents for Scalable and Versatile 3D Generation
Negative Token Merging: Image-based Adversarial Feature Guidance
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
Yi-Lightning Technical Report
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
One Shot, One Talk: Whole-body Talking Avatar from a Single Image
Towards Adaptive Mechanism Activation in Language Agent
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting
o1-Coder: an o1 Replication for Coding
2411
AlphaTablets: A Generic Plane Representation for 3D Planar Reconstruction from Monocular Videos
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
On Domain-Specific Post-Training for Multimodal Large Language Models
DeMo: Decoupled Momentum Optimization
Reverse Thinking Makes LLMs Stronger Reasoners
Scaling Transformers for Low-Bitrate High-Quality Speech Coding
LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification
KV Shifting Attention Enhances Language Modeling
A dynamic parallel method for performance optimization on hybrid CPUs
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing
Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models
Trajectory Attention for Fine-grained Video Motion Control
GRAPE: Generalizing Robot Policy via Preference Alignment
Video Depth without Video Models
Puzzle: Distillation-Based NAS for Inference-Optimized LLMs
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
ICLERB: In-Context Learning Embedding and Reranker Benchmark
MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for Tabular Applications
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
SpotLight: Shadow-Guided Object Relighting via Diffusion
Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling
FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion
Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS
Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
Large Language Model-Brained GUI Agents: A Survey
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Training Noise Token Pruning
ROICtrl: Boosting Instance Control for Visual Generation
MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation
LongKey: Keyphrase Extraction for Long Documents
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
SketchAgent: Language-Driven Sequential Sketch Generation
Learning 3D Representations from Procedural 3D Programs
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Identity-Preserving Text-to-Video Generation by Frequency Decomposition
AnchorCrafter: Animate CyberAnchors Selling Your Products via Human-Object Interacting Video Generation
DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting
SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting
Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
Star Attention: Efficient LLM Inference over Long Sequences
Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
Pathways on the Image Manifold: Image Editing via Video Generation
Controllable Human Image Generation with Personalized Multi-Garments
Visual Counter Turing Test (VCT^2): Discovering the Challenges for AI-Generated Image Detection and Introducing Visual AI Index (V_AI)
Factorized Visual Tokenization and Generation
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis
Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey
From CISC to RISC: language-model guided assembly transpilation
One Diffusion to Generate Them All
MH-MoE: Multi-Head Mixture-of-Experts
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
Cautious Optimizers: Improving Training with One Line of Code
Predicting Emergent Capabilities by Finetuning
VisualLens: Personalization through Visual History
LLMs Do Not Think Step-by-step In Implicit Reasoning
Best of Both Worlds: Advantages of Hybrid Graph Sequence Models
AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset
Knowledge Transfer Across Modalities with Natural Language Supervision
A Survey on LLM-as-a-Judge
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
A No Free Lunch Theorem for Human-AI Collaboration
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
Material Anything: Generating Materials for Any 3D Object via Diffusion
WildLMa: Long Horizon Loco-Manipulation in the Wild
Measuring Bullshit in the Language Games played by ChatGPT
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training
VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
OminiControl: Minimal and Universal Control for Diffusion Transformer
One to rule them all: natural language to bind communication, perception and action
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Style-Friendly SNR Sampler for Style-Driven Generation
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
TEXGen: a Generative Diffusion Model for Mesh Textures
MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts
Understanding LLM Embeddings for Regression
SegBook: A Simple Baseline and Cookbook for Volumetric Medical Image Segmentation
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
MyTimeMachine: Personalized Facial Age Transformation
The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz
Associative Knowledge Graphs for Efficient Sequence Storage and Retrieval
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Stable Flow: Vital Layers for Training-Free Image Editing
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Multimodal Autoregressive Pre-training of Large Vision Encoders
Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Natural Language Reinforcement Learning
Novel View Extrapolation with Video Diffusion Priors
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
Hymba: A Hybrid-head Architecture for Small Language Models
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
Are Large Language Models Memorizing Bug Benchmarks?
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images
Patience Is The Key to Large Language Model Reasoning
ORID: Organ-Regional Information Driven Framework for Radiology Report Generation
MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning
A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
Human-In-the-Loop Software Development Agents
Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline
Stylecodes: Encoding Stylistic Information For Image Generation
Soft Robotic Dynamic In-Hand Pen Spinning
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
RedPajama: an Open Dataset for Training Large Language Models
Ultra-Sparse Memory Network
Building Trust: Foundations of Security, Safety and Transparency in AI
Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
Continuous Speculative Decoding for Autoregressive Image Generation
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
AIGS: Generating Science from AI-Powered Automated Falsification
Generative World Explorer
Bi-Mamba: Towards Accurate 1-Bit State Space Models
Drowning in Documents: Consequences of Scaling Reranker Inference
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
LLäMmlein: Compact and Competitive German-Only Language Models from Scratch
StableV2V: Stablizing Shape Consistency in Video-to-Video Editing
VeGaS: Video Gaussian Splatting
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
AnimateAnything: Consistent and Controllable Animation for Video Generation
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Does Prompt Formatting Have Any Impact on LLM Performance?
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Number it: Temporal Grounding Videos like Flipping Manga
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
An Empirical Study on LLM-based Agents for Automated Bug Fixing
Evaluating the role of 'Constitutions' for learning from AI feedback
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
Generative Agent Simulations of 1,000 People
Xmodel-1.5: An 1B-scale Multilingual LLM
SlimLM: An Efficient Small Language Model for On-Device Document Assistance
MagicQuill: An Intelligent Interactive Image Editing System
Adaptive Decoding via Latent Preference Optimization
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
Cut Your Losses in Large-Vocabulary Language Models
Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples
CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
FinRobot: AI Agent for Equity Research and Valuation with Large Language Models
Evaluating World Models with LLM for Decision Making
Can sparse autoencoders be used to decompose and interpret steering vectors?
Sharingan: Extract User Action Sequence from Desktop Recordings
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
Motion Control for Enhanced Complex Action Video Generation
PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation
Large Language Models Can Self-Improve in Long-context Reasoning
Scaling Properties of Diffusion Models for Perceptual Tasks
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data
Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows
Top-$nσ$: Not All Logits Are You Need
Direct Preference Optimization Using Sparse Feature-Level Constraints
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
Using Generative AI and Multi-Agents to Provide Automatic Feedback
Toward Optimal Search and Retrieval for RAG
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
Watermark Anything with Localized Messages
Tooling or Not Tooling? The Impact of Tools on Language Agents for Chemistry Problem Solving
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
The Super Weight in Large Language Models
SAMPart3D: Segment Any Part in 3D Objects
Counterfactual Generation from Language Models
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
Stronger Models are NOT Stronger Teachers for Instruction Tuning
Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models
Designing Reliable Experiments with Generative Agent-Based Modeling: A Comprehensive Guide Using Concordia by Google DeepMind
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
Hermes: A Large Language Model Framework on the Journey to Autonomous Networks
KMM: Key Frame Mask Mamba for Extended Motion Generation
ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction
Acoustic Volume Rendering for Neural Impulse Response Fields
Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models
IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
GFT: Graph Foundation Model with Transferable Tree Vocabulary
Game-theoretic LLM: Agent Workflow for Negotiation Games
Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation
NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts
Autoregressive Models in Vision: A Survey
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
LLMs as Method Actors: A Model for Prompt Engineering and Architecture
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
Improving the detection of technical debt in Java source code with an enriched dataset
Balancing Pipeline Parallelism with Vocabulary Parallelism
A Taxonomy of AgentOps for Enabling Observability of Foundation Model based Agents
Hardware and Software Platform Inference
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
Analyzing The Language of Visual Tokens
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
BitNet a4.8: 4-bit Activations for 1-bit LLMs
CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model
DELIFT: Data Efficient Language model Instruction Fine Tuning
GazeGen: Gaze-Driven User Interaction for Visual Content Generation
Scaling Laws for Precision
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
Self-Consistency Preference Optimization
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
Number Cookbook: Number Understanding of Language Models and How to Improve It
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness
Inference Optimal VLMs Need Only One Visual Token but Larger Models
Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation?
GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
A Mamba Foundation Model for Time Series Forecasting
Correlation of Object Detection Performance with Visual Saliency and Depth Estimation
Mixtures of In-Context Learners
Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge
Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study
Adaptive Length Image Tokenization via Recurrent Allocation
Attacking Vision-Language Computer Agents via Pop-ups
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Thinking Forward and Backward: Effective Backward Planning with Large Language Models
DreamPolish: Domain Score Distillation With Progressive Geometry Generation
Sample-Efficient Alignment for LLMs
LLaMo: Large Language Model-based Molecular Graph Assistant
Randomized Autoregressive Visual Generation
CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes
Face Anonymization Made Simple
Zipfian Whitening
Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models
Human-inspired Perspectives: A Survey on AI Long-term Memory
E2E-AFG: An End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation
Self-Evolved Reward Learning for LLMs
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
GRS-QA -- Graph Reasoning-Structured Question Answering Dataset
Constant Acceleration Flow
SambaMixer: State of Health Prediction of Li-ion Batteries using Mamba State Space Models
Project Sid: Many-agent simulations toward AI civilization
WikiNER-fr-gold: A Gold-Standard NER Corpus
Personalization of Large Language Models: A Survey
2410
Teaching Embodied Reinforcement Learning Agents: Informativeness and Diversity of Language Use
Learning Video Representations without Natural Videos
DELTA: Dense Efficient Long-range 3D Tracking for any video
SelfCodeAlign: Self-Alignment for Code Generation
Constraint Back-translation Improves Complex Instruction Following of Large Language Models
GPT or BERT: why not both?
Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks
Language Models can Self-Lengthen to Generate Long Texts
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
In-Context LoRA for Diffusion Transformers
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation
Controlling Language and Diffusion Models by Transporting Activations
HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models
Stealing User Prompts from Mixture of Experts
Toxicity of the Commons: Curating Open-Source Pre-Training Data
A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents
AAAR-1.0: Assessing AI's Potential to Assist Research
A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks
Survey of User Interface Design and Interaction Techniques in Generative AI Applications
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset
Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning
ADAM: An Embodied Causal Agent in Open-World Environments
Standardization Trends on Safety and Trustworthiness Technology for Advanced AI
ProMoE: Fast MoE-based LLM Serving using Proactive Caching
Mapping the Neuro-Symbolic AI Landscape by Architectures: A Handbook on Augmenting Deep Learning Through Symbolic Reasoning
Distinguishing Ignorance from Error in LLM Hallucinations
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays
Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications
Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning
Minimum Entropy Coupling with Bottleneck
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
GPT-4o System Card
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
LongReward: Improving Long-context Large Language Models with AI Feedback
LoRA vs Full Fine-tuning: An Illusion of Equivalence
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation
AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks
Language Models And A Second Opinion Use Case: The Pocket Professional
GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation
AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions
Fast Best-of-N Decoding via Speculative Rejection
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale
Neural Fields in Robotics: A Survey
AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels
A Survey of Small Language Models
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Counting Ability of Large Language Models and Impact of Tokenization
OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization
Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models
Engineering Trustworthy AI: A Developer Guide for Empirical Risk Minimization
FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality
A prescriptive theory for brain-like inference
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning
Designing LLM-Agents with Personalities: A Psychometric Approach
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
PDL: A Declarative Prompt Programming Language
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
Teach Multimodal LLMs to Comprehend Electrocardiographic Images
O1 Replication Journey: A Strategic Progress Report -- Part 1
Framer: Interactive Frame Interpolation
MotionCLR: Motion Generation and Training-free Editing via Understanding Attention Mechanisms
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
Unbounded: A Generative Infinite Game of Character Life Simulation
Stable Consistency Tuning: Understanding and Improving Consistency Models
Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling
Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Should We Really Edit Language Models? On the Evaluation of Edited Language Models
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances
Why Does the Effective Context Length of LLMs Fall Short?
Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch
DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation
Data Scaling Laws in Imitation Learning for Robotic Manipulation
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant
Taipan: Efficient and Expressive State Space Language Models with Selective Attention
Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
SMITE: Segment Me In TimE
LOGO -- Long cOntext aliGnment via efficient preference Optimization
CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models
Dialog2Flow: Pre-training Soft-Contrastive Action-Driven Sentence Embeddings for Automatic Dialog Flow Extraction
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
The Nature of Mathematical Modeling and Probabilistic Optimization Engineering in Generative AI
Large Language Models Reflect the Ideology of their Creators
WAFFLE: Multi-Modal Model for Automated Front-End Development
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes
Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration
WorldSimBench: Towards Video Generation Models as World Simulators
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
CLEAR: Character Unlearning in Textual and Visual Modalities
LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering
Scalable Ranked Preference Optimization for Text-to-Image Generation
Value Residual Learning For Alleviating Attention Concentration In Transformers
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models
Lightweight Neural App Control
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
Frontiers in Intelligent Colonoscopy
MiniPLM: Knowledge Distillation for Pre-Training Language Models
Aligning Large Language Models via Self-Steering Optimization
Math Neurosurgery: Isolating Language Models' Math Reasoning Abilities Using Only Forward Passes
A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration
Pantograph: A Machine-to-Machine Interaction Interface for Advanced Theorem Proving, High Level Reasoning, and Data Extraction in Lean 4
Promoting cross-modal representations to improve multimodal foundation models for physiological signals
LLM-based Optimization of Compound AI Systems: A Survey
FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors
Reflection-Bench: probing AI intelligence with reflection
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors
Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
Can Knowledge Editing Really Correct Hallucinations?
Pre-training Distillation for Large Language Models: A Design Space Exploration
Improve Vision Language Model Chain-of-thought Reasoning
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Learning How to Vote With Principles: Axiomatic Insights Into the Collective Decisions of Neural Networks
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Continuous Speech Synthesis using per-token Latent Diffusion
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Mitigating Object Hallucination via Concentric Causal Attention
Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation
AutoTrain: No-code training for state-of-the-art models
Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement
Language Models are Symbolic Learners in Arithmetic
M-RewardBench: Evaluating Reward Models in Multilingual Settings
Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
DM-Codec: Distilling Multimodal Representations for Speech Tokenization
How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold
Baichuan Alignment Technical Report
SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts
BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search
Teaching Models to Balance Resisting and Accepting Persuasion
How Do Training Methods Influence the Utilization of Vision Models?
Interpretable end-to-end Neurosymbolic Reinforcement Learning agents
Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas
Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning
In-context learning and Occam's razor
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
Towards Cross-Cultural Machine Translation with Retrieval-Augmented Generation from Multilingual Knowledge Graphs
FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
$γ-$MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
Can MLLMs Understand the Deep Implication Behind Chinese Images?
Retrospective Learning from Interactions
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
A Unified View of Delta Parameter Editing in Post-Trained Large-Scale Models
VidPanos: Generative Panoramic Videos from Casual Panning Videos
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
Harnessing Webpage UIs for Text-Rich Visual Understanding
BenTo: Benchmark Task Reduction with In-Context Transferability
Looking Inward: Language Models Can Learn About Themselves by Introspection
PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment
DPLM-2: A Multimodal Diffusion Protein Language Model
MobA: A Two-Level Agent System for Efficient Mobile Task Automation
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation
Movie Gen: A Cast of Media Foundation Models
Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning
MedINST: Meta Dataset of Biomedical Instructions
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems
SBI-RAG: Enhancing Math Word Problem Solving for Students through Schema-Based Instruction and Retrieval-Augmented Generation
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
Roadmap towards Superhuman Speech Understanding using Large Language Models
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy
Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
AERO: Softmax-Only LLMs for Efficient Private Inference
MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization
A Survey on Data Synthesis and Augmentation for Large Language Models
Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media
Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
JudgeBench: A Benchmark for Evaluating LLM-based Judges
Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
Exploring Model Kinship for Merging Large Language Models
Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL
Revealing the Barriers of Language Agents in Planning
ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
Tracking Universal Features Through Fine-Tuning and Model Merging
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
A Prompt-Based Knowledge Graph Foundation Model for Universal In-Context Reasoning
Divide-Verify-Refine: Aligning LLM Responses with Complex Instructions
TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming
OMCAT: Omni Context Aware Transformer
CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning
Neural Metamorphosis
MoH: Multi-Head Attention as Mixture-of-Head Attention
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
Improving Long-Text Alignment for Text-to-Image Diffusion Models
NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models
Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Zero-shot Model-based Reinforcement Learning using Large Language Models
MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
GS^3: Efficient Relighting with Triple Gaussian Splatting
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Agent-as-a-Judge: Evaluate Agents with Agents
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention
AFlow: Automating Agentic Workflow Generation
Large Language Model Evaluation via Matrix Nuclear-Norm
Thinking LLMs: General Instruction Following with Thought Generation
Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Animate-X: Universal Character Image Animation with Enhanced Motion Representation
Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains
SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning
Empirical Study of Mutual Reinforcement Effect and Application in Few-shot Text Classification Tasks via Prompt
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
Agentic Information Retrieval
EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation
Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
FlatQuant: Flatness Matters for LLM Quantization
Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment
LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models
Rethinking Data Selection at Scale: Random Selection is Almost All You Need
MiRAGeNews: Multimodal Realistic AI-Generated News Detection
Mentor-KD: Making Small Language Models Better Multi-step Reasoners
MedMobile: A mobile-sized language model with expert-level clinical capabilities
Semantic Score Distillation Sampling for Compositional Text-to-3D Generation
SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights
Towards Trustworthy Knowledge Graph Reasoning: An Uncertainty Aware Perspective
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression
Baichuan-Omni Technical Report
KV Prediction for Improved Time to First Token
Agents Thinking Fast and Slow: A Talker-Reasoner Architecture
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models
MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion
Agent S: An Open Agentic Framework that Uses Computers Like a Human
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
Progressive Autoregressive Video Diffusion Models
Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations
Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Benchmarking Agentic Workflow Generation
TVBench: Redesigning Video-Language Evaluation
DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities
MotionGS: Exploring Explicit Motion Guidance for Deformable 3D Gaussian Splatting
Smart Audit System Empowered by LLM
Mechanistic Permutability: Match Features Across Layers
I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow
WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents
DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow
MM-Ego: Towards Building Egocentric Multimodal LLMs
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
TextToon: Real-Time Text Toonify Head Avatar from Single Video
Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
Personalized Visual Instruction Tuning
I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Let's Ask GNN: Empowering Large Language Model for Graph In-Context Learning
Pixtral 12B
Retrieval-Augmented Decision Transformer: External Memory for In-context RL
Data Selection via Optimal Control for Language Models
TinyEmo: Scaling down Emotional Reasoning via Metric Projection
Emergent properties with repeated examples
PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness
CursorCore: Assist Programming through Aligning Anything
Jointly Generating Multi-view Consistent PBR Textures using Collaborative Control
Self-Boosting Large Language Models with Synthetic Preference Data
Seeker: Enhancing Exception Handling in Code with LLM-based Multi-Agent Approach
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders
Towards Natural Image Matting in the Wild via Real-Scenario Prior
ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA
Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning
Does Spatial Cognition Emerge in Frontier Models?
Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning
Accelerated Preference Optimization for Large Language Model Alignment
Story-Adapter: A Training-free Iterative Framework for Long Story Visualization
BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way
Multimodal Situational Safety
Temporal Reasoning Transfer from Text to Video
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
Diversity-Rewarded CFG Distillation
Aria: An Open Multimodal Native Mixture-of-Experts Model
Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG
Pyramidal Flow Matching for Efficient Video Generative Modeling
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance
T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design
Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning
ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
Vector-ICL: In-context Learning with Continuous Vector Representations
Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition
TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Falcon Mamba: The First Competitive Attention-free 7B Language Model
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models
PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
Differential Transformer
SePPO: Semi-Policy Preference Optimization for Diffusion Alignment
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Presto! Distilling Steps and Layers for Accelerating Music Generation
Scalable and Accurate Graph Reasoning with LLM-based Multi-Agents
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification
Named Clinical Entity Recognition Benchmark
OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction
LPZero: Language Model Zero-cost Proxy Search from Zero
Intriguing Properties of Large Language and Vision Models
TLDR: Token-Level Detective Reward Model for Large Vision Language Models
Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization
MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs
UniMuMo: Unified Text, Music and Motion Generation
Hyper-multi-step: The Truth Behind Difficult Long-context Tasks
VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide
Inference Scaling for Long-Context Retrieval Augmented Generation
Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning
LongGenBench: Long-context Generation Benchmark
Grounding Language in Multi-Perspective Referential Communication
GraphRouter: A Graph-based Router for LLM Selections
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs
What Matters for Model Merging at Scale?
NRGBoost: Energy-Based Generative Boosted Trees
MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
ToolGen: Unified Tool Retrieval and Calling via Generation
Zebra: In-Context and Generative Pretraining for Solving Parametric PDEs
EBES: Easy Benchmarking for Event Sequences
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
Autonomous Character-Scene Interaction Synthesis from Text Instruction
Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise
LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Erasing Conceptual Knowledge from Language Models
Loong: Generating Minute-level Long Videos with Autoregressive Language Models
Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
Contrastive Localized Language-Image Pre-Training
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Large Language Models as Markov Chains
Video Instruction Tuning With Synthetic Data
LLaVA-Critic: Learning to Evaluate Multimodal Models
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
ControlAR: Controllable Image Generation with Autoregressive Models
Selective Attention Improves Transformer
Distilling an End-to-End Voice Assistant Without Instruction Training Data
FAN: Fourier Analysis Networks
NL-Eye: Abductive NLI for Images
Intelligence at the Edge of Chaos
Contextual Document Embeddings
Mixed-Session Conversation with Egocentric Memory
Response Tuning: Aligning Large Language Models without Instruction
MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation
Collective Critics for Creative Story Generation
Learning the Latent Rules of a Game from Data: A Chess Story
Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond
MIGA: Mixture-of-Experts with Group Aggregation for Stock Market Prediction
Efficient Source-Free Time-Series Adaptation via Parameter Subspace Disentanglement
L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?
MVGS: Multi-view-regulated Gaussian Splatting for Novel View Synthesis
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning
SciPrompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation
EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis
When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models
Quantifying Generalization Complexity for Large Language Models
Not All LLM Reasoners Are Created Equal
LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks
ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation
HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
FactAlign: Long-form Factuality Alignment of Large Language Models
VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment
3DGS-DET: Empower 3D Gaussian Splatting with Boundary Guidance and Box-Focused Sampling for 3D Object Detection
InfiniPot: Infinite Context Processing on Memory-Constrained LLMs
Selective Aggregation for Low-Rank Adaptation in Federated Learning
Closed-loop Long-horizon Robotic Planning via Equilibrium Sequence Modeling
Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models
CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction
HelpSteer2-Preference: Complementing Ratings with Preferences
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
Were RNNs All We Needed?
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning
MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Addition is All You Need for Energy-efficient Language Models
Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation
What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration
SyntheOcc: Synthesize Geometric-Controlled Street View Images through 3D Semantic MPIs
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models
ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer
2409
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
DressRecon: Freeform 4D Human Reconstruction from Monocular Video
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models
LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation
Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers
Instance-adaptive Zero-shot Chain-of-Thought Prompting
The Perfect Blend: Redefining RLHF with Mixture of Judges
Old Optimizer, New Norm: An Anthology
Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models
Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis
Visual Context Window Extension: A New Perspective for Long Video Understanding
RoCoTex: A Robust Method for Consistent Texture Synthesis with Diffusion Models
Image Copy Detection for Diffusion Models
Law of the Weakest Link: Cross Capabilities of Large Language Models
Illustrious: an Open Advanced Illustration Model
On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability
Can Models Learn Skill Composition from Examples?
Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding
Hyper-Connections
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
DiaSynth -- Synthetic Dialogue Generation Framework
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
LML: Language Model Learning a Dataset for Data-Augmented Prediction
Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models
Emu3: Next-Token Prediction is All You Need
MinerU: An Open-Source Solution for Precise Document Content Extraction
A Survey on the Honesty of Large Language Models
Cottention: Linear Transformers With Cosine Attention
KALE-LM: Unleash The Power Of AI For Science Via Knowledge And Logic Enhanced Large Model
Evaluation of OpenAI o1: Opportunities and Challenges of AGI
Data-Prep-Kit: getting your data ready for LLM application development
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
MIO: A Foundation Model on Multimodal Tokens
Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study
Pixel-Space Post-Training of Latent Diffusion Models
Modulated Intervention Preference Optimization (MIPO): Keep the Easy, Refine the Difficult
Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models
MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis
HDFlow: Enhancing LLM Complex Problem-Solving with Hybrid Thinking and Dynamic Workflows
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion
Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors
Game4Loc: A UAV Geo-Localization Benchmark from Game Data
MSI-Agent: Incorporating Multi-Scale Insight into Embodied Agents for Superior Planning and Decision-Making
TalkinNeRF: Animatable Neural Fields for Full-Body Talking Humans
Synchronize Dual Hands for Physics-Based Dexterous Guitar Playing
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale
Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
MonoFormer: One Transformer for Both Diffusion and Autoregression
EuroLLM: Multilingual Language Models for Europe
MaskBit: Embedding-free Image Generation via Bit Tokens
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling
Seeing Faces in Things: A Model and Dataset for Pareidolia
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
Improvements to SDXL in NovelAI Diffusion V3
SLIMER-IT: Zero-Shot NER on Italian Language
Small Language Models: Survey, Measurements, and Insights
Making Text Embedders Few-Shot Learners
Reward-Robust RLHF in LLMs
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?
OmniBench: Towards The Future of Universal Omni-Language Models
Archon: An Architecture Search Framework for Inference-Time Techniques
Boosting Healthcare LLMs Through Retrieved Context
AIM 2024 Sparse Neural Rendering Challenge: Dataset and Benchmark
Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely
Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
Enabling Ultra-Dense, Open-RAN, Vehicular Networks with Non-Linear MIMO Processing
Instruction Following without Instruction Tuning
The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends
Present and Future Generalization of Synthetic Image Detectors
Tabular Data Generation using Binary Diffusion
A Case Study of Web App Coding with OpenAI Reasoning Models
Colorful Diffuse Intrinsic Image Decomposition in the Wild
Temporally Aligned Audio for Video with Autoregression
V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians
Prithvi WxC: Foundation Model for Weather and Climate
Portrait Video Editing Empowered by Multimodal Generative Priors
Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
Imagine yourself: Tuning-Free Personalized Image Generation
MuCodec: Ultra Low-Bitrate Music Codec
An adapted large language model facilitates multiple medical tasks in diabetes care
RRM: Robust Reward Model Training Mitigates Reward Hacking
Can we only use guideline instead of shot in prompt?
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
LVCD: Reference-based Lineart Video Colorization with Diffusion Models
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions
3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
Training Language Models to Self-Correct via Reinforcement Learning
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt
Language Models Learn to Mislead Humans via RLHF
Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning
StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Latent Generation
FlexiTex: Enhancing Texture Generation with Visual Guidance
Vista3D: Unravel the 3D Darkside of a Single Image
DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Qwen2.5-Coder Technical Report
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
A Controlled Study on Long Context Extension and Generalization in LLMs
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
GRIN: GRadient-INformed MoE
LLMs + Persona-Plug = Personalized LLMs
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey
Jailbreaking Large Language Models with Symbolic Mathematics
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
NVLM: Open Frontier-Class Multimodal LLMs
Cesàro operators on the space of analytic functions with logarithmic growth
OSV: One Step is Enough for High-Quality Image to Video Generation
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
OmniGen: Unified Image Generation
Hackphyr: A Local Fine-Tuned LLM Agent for Network Security Environments
Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
LLM-as-a-Judge & Reward Model: What They Can and Cannot Do
SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction
Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning
Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
Agile Continuous Jumping in Discontinuous Terrains
Single-Layer Learnable Activation for Implicit Neural Representation (SL^2A-INR)
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
Kolmogorov-Arnold Transformer
On the limits of agency in agent-based models
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Schrodinger's Memory: Large Language Models
Trustworthiness in Retrieval-Augmented Generation Systems: A Survey
On the Diagram of Thought
SFR-RAG: Towards Contextually Faithful LLMs
Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison
Towards Diverse and Efficient Audio Captioning via Diffusion Models
Implicit Neural Representations with Fourier Kolmogorov-Arnold Networks
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation
Agents in Software Engineering: Survey, Landscape, and Vision
SGFormer: Single-Layer Graph Transformers with Approximation-Free Linear Complexity
A Diffusion Approach to Radiance Field Relighting using Multi-Illumination Synthesis
Exploring Graph Structure Comprehension Ability of Multimodal Large Language Models: Case Studies
InstantDrag: Improving Interactivity in Drag-based Image Editing
B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests
DrawingSpinUp: 3D Animation from Single Character Drawings
Apollo: Band-sequence Modeling for High-Quality Audio Restoration
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer
Robust Dual Gaussian Splatting for Immersive Human-centric Volumetric Videos
DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors
Click2Mask: Local Editing with Dynamic Mask Generation
FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder
IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
Instant Facial Gaussians Translator for Relightable and Interactable Facial Rendering
Agent Workflow Memory
MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
Gated Slot Attention for Efficient Linear-Time Sequence Modeling
MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis
What is the Role of Small Models in the LLM Era: A Survey
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
gsplat: An Open-Source Library for Gaussian Splatting
Generative Hierarchical Materials Search
ProteinBench: A Holistic Evaluation of Protein Foundation Models
LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation
INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding
Can Large Language Models Unlock Novel Scientific Research Ideas?
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
SongCreator: Lyrics-based Universal Song Generation
Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments
Evaluating Multiview Object Consistency in Humans and Image Models
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Are Large Language Models a Threat to Programming Platforms? An Exploratory Study
Benchmarking Chinese Knowledge Rectification in Large Language Models
Evidence from fMRI Supports a Two-Phase Abstraction Process in Language Models
LLMs Will Always Hallucinate, and We Need to Live With This
MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery
A framework to compute resonances arising from multiple scattering
SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning
Elsevier Arena: Human Evaluation of Chemistry/Biology/Health Foundational Large Language Models
Insights from Benchmarking Frontier Language Models on Web App Code Generation
Can OOD Object Detectors Learn from Foundation Models?
OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs
Achieving Peak Performance for Large Language Models: A Systematic Review
POINTS: Improving Your Vision-language Model with Affordable Strategies
Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance
Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak
UniDet3D: Multi-dataset Indoor 3D Object Detection
GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Self-Harmonized Chain of Thought
Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task
How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data
WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild
Attention Heads of Large Language Models: A Survey
Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation
CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation
FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation
From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Sketch: A Toolkit for Streamlining LLM Operations
ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding
Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation
GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding
Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard
Large Language Model-Based Agents for Software Engineering: A Survey
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Configurable Foundation Models: Building LLMs from a Modular Perspective
Bioinformatics Retrieval Augmentation Data (BRAD) Digital Assistant
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Towards a Unified View of Preference Learning for Large Language Models: A Survey
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
Building Math Agents with Multi-Turn Iterative Preference Learning
Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation
LinFusion: 1 GPU, 1 Minute, 16K Image
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text
Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models
OLMoE: Open Mixture-of-Experts Language Models
FuzzCoder: Byte-level Fuzzing Test via Large Language Model
In Defense of RAG in the Era of Long-Context Language Models
Kvasir-VQA: A Text-Image Pair GI Tract Dataset
GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI
Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
Bicrucial $k$-power-free permutations
OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model
Affordance-based Robot Manipulation with Flow Matching
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation
Statically Contextualizing Large Language Models with Typed Holes
Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
ContextCite: Attributing Model Generation to Context
Diffusion Policy Policy Optimization
FLUX that Plays Music
Compositional 3D-aware Video Generation with LLM Director
LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts
Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders
A Survey for Large Language Models in Biomedicine
On-Device Language Models: A Comprehensive Review
2408
ConvKGYarn: Spinning Configurable and Scalable Conversational Knowledge Graph QA Datasets with Large Language Models
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters
VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers
InkubaLM: A small language model for low-resource African languages
Beyond Preferences in AI Alignment
MemLong: Memory-Augmented Retrieval for Long Text Modeling
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model
CSGO: Content-Style Composition in Text-to-Image Generation
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
Examination of Code generated by Large Language Models
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
CogVLM2: Visual Language Models for Image and Video Understanding
SurveySum: A Dataset for Summarizing Multiple Scientific Articles into a Survey Section
Law of Vision Representation in MLLMs
Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems
LoraMap: Harnessing the Power of LoRA Connections
Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
3D Reconstruction with Spatial Memory
Scaling Up Diffusion and Flow-based XGBoost Models
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
TEDRA: Text-based Editing of Dynamic and Photoreal Actors
ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution
Distribution Backtracking Builds A Faster Convergence Trajectory for One-step Diffusion Distillation
In-Context Imitation Learning via Next-Token Prediction
Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models
CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Persuasion Games using Large Language Models
Knowledge Navigator: LLM-guided Browsing Framework for Exploratory Search in Scientific Literature
Automatic Differential Diagnosis using Transformer-Based Multi-Label Sequence Classification
Efficient LLM Scheduling by Learning to Rank
Towards Realistic Example-based Modeling via 3D Gaussian Stitching
StyleRemix: Interpretable Authorship Obfuscation via Distillation and Perturbation of Style Elements
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation
SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
ReMamba: Equip Mamba with Effective Long-Sequence Modeling
GIFT-SW: Gaussian noise Injected Fine-Tuning of Salient Weights for LLMs
AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems
Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline
The VoxCeleb Speaker Recognition Challenge: A Retrospective
Diffusion Models Are Real-Time Game Engines
Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation
Platypus: A Generalized Specialist Model for Reading Text in Various Forms
CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis
Text2SQL is Not Enough: Unifying AI and Databases with TAG
Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold
CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation
Artificial intelligence for science: The easy and hard problems
Agentic Retrieval-Augmented Generation for Time Series Analysis
A Practitioner's Guide to Continual Multimodal Pretraining
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
Foundation Models for Music: A Survey
MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
Learning to Move Like Professional Counter-Strike Players
MobileQuant: Mobile-friendly Quantization for On-device Language Models
GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars
Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models
LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs
Training-free Long Video Generation with Chain of Diffusion Model Experts
TVG: A Training-free Transition Video Generation Method with Diffusion Models
LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time
DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation
A Web-Based Solution for Federated Learning with LLM-Based Automation
FLoD: Integrating Flexible Level of Detail into 3D Gaussian Splatting for Customizable Rendering
T3M: Text Guided 3D Human Motion Synthesis from Speech
Memory-Efficient LLM Training with Online Subspace Descent
Building and better understanding vision-language models: insights and future directions
DreamCinema: Cinematic Transfer with Free Camera and 3D Character
Controllable Text Generation for Large Language Models: A Survey
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
Real-Time Video Generation with Pyramid Attention Broadcast
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
Sapiens: Foundation for Human Vision Models
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design
Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese
CODE: Confident Ordinary Differential Editing
Subsurface Scattering for 3D Gaussian Splatting
Scalable Autoregressive Image Generation with Mamba
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM
Evidence-backed Fact Checking using RAG and Few-Shot In-Context Learning with LLMs
Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Hermes 3 Technical Report
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs
LLM Pruning and Distillation in Practice: The Minitron Approach
Critique-out-Loud Reward Models
FocusLLM: Scaling LLM's Context by Parallel Decoding
Efficient Detection of Toxic Prompts in Large Language Models
FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting
The Vizier Gaussian Process Bandit Algorithm
TrackGo: A Flexible and Efficient Method for Controllable Video Generation
Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation
TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models
StructuredRAG: JSON Response Formatting with Large Language Models
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning
Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
To Code, or Not To Code? Exploring Impact of Code in Pre-training
ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining
Flexora: Flexible Low Rank Adaptation for Large Language Models
Quantum Artificial Intelligence: A Brief Survey
Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search
Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information
Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution
MambaEVT: Event Stream based Visual Object Tracking using State Space Model
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views
Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices
Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data
ShortCircuit: AlphaZero-Driven Circuit Design
Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
TraDiffusion: Trajectory-Based Training-Free Image Generation
Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering
Challenges and Responses in the Practice of Large Language Models
Segment Anything with Multiple Modalities
Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models
Graph Retrieval-Augmented Generation: A Survey
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
PEDAL: Enhancing Greedy Decoding with Large Language Models using Diverse Exemplars
Backward-Compatible Aligned Representations via an Orthogonal Transformation Layer
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning
Automated Design of Agentic Systems
TurboEdit: Instant text-based image editing
Can Large Language Models Understand Symbolic Graphics Programs?
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community
BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
Heavy Labels Out! Dataset Distillation with Label Space Lightening
FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance
Towards flexible perception with visual memory
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization
MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing
FuseChat: Knowledge Fusion of Chat Models
Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning
Fine-tuning Large Language Models with Human-inspired Learning Strategies in Medical Question Answering
Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability
PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation
3D Gaussian Editing with A Single Image
Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space
Aquila2 Technical Report
Seeing and Understanding: Bridging Vision with Chemical Knowledge Via ChemVLM
Generative Photomontage
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
Imagen 3
OpenResearcher: Unleashing AI for Accelerated Scientific Research
Layerwise Recurrent Router for Mixture-of-Experts
SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields
DC3DO: Diffusion Classifier for 3D Objects
Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
TacSL: A Library for Visuotactile Sensor Simulation and Learning
UniT: Unified Tactile Representation for Robot Learning
Design Proteins Using Large Language Models: Enhancements and Comparative Analyses
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Body Transformer: Leveraging Robot Embodiment for Policy Learning
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
MovieSum: An Abstractive Summarization Dataset for Movie Screenplays
FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework
Med42-v2: A Suite of Clinical LLMs
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
ControlNeXt: Powerful and Efficient Control for Image and Video Generation
HeadGAP: Few-shot 3D Head Avatar via Generalizable Gaussian Priors
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation
ZePo: Zero-Shot Portrait Stylization with Faster Sampling
DeepSpeak Dataset v1.0
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Kalman-Inspired Feature Propagation for Video Face Super-Resolution
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning
A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?
MooER: LLM-based Speech Recognition and Translation Models from Moore Threads
Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models
Generating novel experimental hypotheses from language models: A case study on cross-dative generalization
Retrieval-augmented code completion for local projects using large language models
An Empirical Study on Challenges for LLM Developers
HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion
MulliVC: Multi-lingual Voice Conversion With Cycle Consistency
Understanding the Performance and Estimating the Cost of LLM Fine-Tuning
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics
Transformer Explainer: Interactive Learning of Text-Generative Models
Better Alignment with Instruction Back-and-Forth Translation
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches
Conversational Prompt Engineering
Advancing Molecular Machine (Learned) Representations with Stereoelectronics-Infused Molecular Graphs
Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP
LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection
EfficientRAG: Efficient Retriever for Multi-Hop Question Answering
Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation
UNLEARN Efficient Removal of Knowledge in Large Language Models
Task-oriented Sequential Grounding in 3D Scenes
Fast Sprite Decomposition from Animated Graphics
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases
Achieving Human Level Competitive Robot Table Tennis
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
Compact 3D Gaussian Splatting for Static and Dynamic Radiance Fields
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks
Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation
EXAONE 3.0 7.8B Instruction Tuned Language Model
MoExtend: Tuning New Experts for Modality and Task Extension
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
RayGauss: Volumetric Gaussian-Based Ray Casting for Photorealistic Novel View Synthesis
LLaVA-OneVision: Easy Visual Task Transfer
CoverBench: A Challenging Benchmark for Complex Claim Verification
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation
Synthesizing Text-to-SQL Data from Weak and Strong LLMs
IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine
Delivery of DART Impact Ejecta to Mars and Earth: Opportunity for Meteor Observations
Learning to Predict Program Execution by Modeling Dynamic Dependency on Code Graphs
Diffusion Models as Data Mining Tools
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Self-Taught Evaluators
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
Language Model Can Listen While Speaking
BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization
RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation
From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Operationalizing Contextual Integrity in Privacy-Conscious Assistants
ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation
GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Conditional LoRA Parameter Generation
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
A Survey of Mamba
The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines
POA: Pre-training Once for Models of All Sizes
Medical SAM 2: Segment medical images as video via Segment Anything Model 2
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model
Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy Curvature of Attention
Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model
TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models
SAM 2: Segment Anything in Images and Videos
Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning
SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses
Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion
In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation
Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names
Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation
OmniParser for Pure Vision Based GUI Agent
Finch: Prompt-guided Key-Value Cache Compression
Gemma 2: Improving Open Language Models at a Practical Size
Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
ReLiK: Retrieve and LinK, Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget
2407
Projected Language Models: A Large Model Pre-Segmented Into Smaller Ones
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
The Llama 3 Herd of Models
Berkeley Humanoid: A Research Platform for Learning-based Control
ShieldGemma: Generative AI Content Moderation Based on Gemma
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Open-Vocabulary Audio-Visual Semantic Segmentation
Adaptive Retrieval-Augmented Generation for Conversational Systems
Tora: Trajectory-oriented Diffusion Transformer for Video Generation
Expressive Whole-Body 3D Gaussian Avatar
Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
TAROT: Task-Oriented Authorship Obfuscation Using Policy Optimization Methods
Data Contamination Report from the 2024 CONDA Shared Task
Fine-grained Zero-shot Video Sampling
Cost-Effective Hallucination Detection for LLMs
Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning
Apple Intelligence Foundation Language Models
ThinK: Thinner Key Cache by Query-Driven Pruning
Matting by Generation
AI-Assisted Generation of Difficult Math Questions
How to Measure the Intelligence of Large Language Models?
Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning
JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources
Meltemi: The first open Large Language Model for Greek
Adapting Safe-for-Work Classifier for Malaysian Language Text: Enhancing Alignment in LLM-Ops Framework
Harvesting Textual and Structured Data from the HAL Publication Repository
Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings
Can LLMs be Fooled? Investigating Vulnerabilities in LLMs
Machine Unlearning in Generative AI: A Survey
Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation
Generating Gender Alternatives in Machine Translation
A Large Encoder-Decoder Family of Foundation Models For Chemical Language
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
Theia: Distilling Diverse Vision Foundation Models for Robot Learning
Diffusion Feedback Helps CLIP See Better
rLLM: Relational Table Learning with LLMs
ByteCheckpoint: A Unified Checkpointing System for LLM Development
RelBench: A Benchmark for Deep Learning on Relational Databases
ImagiNet: A Multi-Content Dataset for Generalizable Synthetic Image Detection via Contrastive Learning
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention
Sentiment Analysis of Lithuanian Online Reviews Using Large Language Models
ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation
ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2
Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost
Improving Retrieval Augmented Language Model with Self-Reasoning
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture
SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain
Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle
Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models
A Generic Review of Integrating Artificial Intelligence in Cognitive Behavioral Therapy
Integrating Large Language Models into a Tri-Modal Architecture for Automated Depression Classification
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
WalkTheDog: Cross-Morphology Motion Alignment via Phase Manifolds
Floating No More: Object-Ground Reconstruction from a Single Image
Wolf: Captioning Everything with a World Summarization Framework
SHIC: Shape-Image Correspondences with no Keypoint Supervision
Lessons from Learning to Spin "Pens"
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
VSSD: Vision Mamba with Non-Causal State Space Duality
Model-driven Heart Rate Estimation and Heart Murmur Detection based on Phonocardiogram
The Art of Refusal: A Survey of Abstention in Large Language Models
PersonaGym: Evaluating Persona Agents and LLMs
Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning
VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads
Recursive Introspection: Teaching Language Model Agents How to Self-Improve
Exploring Scaling Trends in LLM Robustness
The FIGNEWS Shared Task on News Media Narratives
Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
LKCell: Efficient Cell Nuclei Instance Segmentation with Large Convolution Kernels
Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
Very Large-Scale Multi-Agent Simulation in AgentScope
Text-Driven Neural Collaborative Filtering Model for Paper Source Tracing
LAMBDA: A Large Model Based Data Agent
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency
$VILA^2$: VILA Augmented VILA
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
3D Question Answering for City Scene Understanding
PERSONA: A Reproducible Testbed for Pluralistic Alignment
ViPer: Visual Personalization of Generative Models via Individual Preference Learning
Scalify: scale propagation for efficient low-precision LLM training
Solving The Travelling Salesman Problem Using A Single Qubit
DreamCar: Leveraging Car-specific Prior for in-the-wild 3D Car Reconstruction
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
Train-Attention: Meta-Learning Where to Focus in Continual Knowledge Learning
Generation Constraint Scaling Can Mitigate Hallucination
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
A Simulation Benchmark for Autonomous Racing with Large-Scale Human Data
KAN or MLP: A Fairer Comparison
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
Course-Correction: Safety Alignment Using Synthetic Preferences
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
Enhancing LLM's Cognition via Structurization
Cross Anything: General Quadruped Robot Navigation through Complex Terrains
PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
MOMAland: A Set of Benchmarks for Multi-Objective Multi-Agent Reinforcement Learning
TAPTRv2: Attention-based Position Update Improves Tracking Any Point
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person
Graph-Structured Speculative Decoding
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
DDK: Distilling Domain Knowledge for Efficient Large Language Models
BoostMVSNeRFs: Boosting MVS-based NeRFs to Generalizable View Synthesis in Large-scale Scenes
Artist: Aesthetically Controllable Text-Driven Stylization without Training
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning
Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models
Discrete Flow Matching
SIGMA: Sinkhorn-Guided Masked Video Modeling
Local All-Pair Correspondence for Point Tracking
MAVEN-Fact: A Large-scale Event Factuality Detection Dataset
LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation
ThermalNeRF: Thermal Radiance Fields
VideoGameBunny: Towards vision assistants for video games
MIBench: Evaluating Multimodal Large Language Models over Multiple Images
CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model
HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions
A Survey on Employing Large Language Models for Text-to-SQL Tasks
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
GET-Zero: Graph Embodiment Transformer for Zero-shot Embodiment Generalization
Temporal Residual Jacobians For Rig-free Motion Transfer
Consent in Crisis: The Rapid Decline of the AI Data Commons
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation
Compact Language Models via Pruning and Knowledge Distillation
BOND: Aligning LLMs with Best-of-N Distillation
NNsight and NDIF: Democratizing Access to Foundation Model Internals
Internal Consistency and Self-Feedback in Large Language Models: A Survey
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
The Vision of Autonomic Computing: Can LLMs Make It a Reality?
Stable Audio Open
Efficient Audio Captioning with Encoder-Level Knowledge Distillation
SparseCraft: Few-Shot Neural Reconstruction through Stereopsis Guided Geometric Linearization
EVLM: An Efficient Vision-Language Model for Visual Understanding
Visual Text Generation in the Wild
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
PlacidDreamer: Advancing Harmony in Text-to-3D Generation
Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle
Visual Haystacks: Answering Harder Questions About Sets of Images
Shape of Motion: 4D Reconstruction from a Single Video
Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion
Scaling Granite Code Models to 128K Context
Understanding Reference Policies in Direct Preference Optimization
Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation
Prover-Verifier Games improve legibility of LLM outputs
Weak-to-Strong Reasoning
A Comparative Study on Automatic Coding of Medical Letters with Explainability
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Qalam: A Multimodal LLM for Arabic Optical Character and Handwriting Recognition
Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation
CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis
Robust ASR Error Correction with Conservative Data Filtering
PM-LLM-Benchmark: Evaluating Large Language Models on Process Mining Tasks
SciCode: A Research Coding Benchmark Curated by Scientists
Pre-Trained Foundation Model representations to uncover Breathing patterns in Speech
A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks
Retrieval-Enhanced Machine Learning: Synthesis and Opportunities
BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
IMAGDressing-v1: Customizable Virtual Dressing
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Patch-Level Training for Large Language Models
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
E5-V: Universal Embeddings with Multimodal Large Language Models
Audio Conditioning for Music Generation via Discrete Bottleneck Features
Case2Code: Learning Inductive Reasoning with Synthetic Data
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
Splatfacto-W: A Nerfstudio Implementation of Gaussian Splatting for Unconstrained Photo Collections
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression
The Art of Saying No: Contextual Noncompliance in Language Models
Exploring Advanced Large Language Models with LLMsuite
Does Refusal Training in LLMs Generalize to the Past Tense?
Efficient Training with Denoised Neural Weights
NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Zero-shot Cross-Lingual Transfer for Synthetic Data Generation in Grammatical Error Detection
Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors
Click-Gaussian: Interactive Segmentation to Any 3D Gaussians
Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
CCoE: A Compact LLM with Collaboration of Experts
Scaling Diffusion Transformers to 16 Billion Parameters
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
Animate3D: Animating Any 3D Model with Multi-view Video Diffusion
DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation
Grasping Diverse Objects with Simulated Humanoids
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients
YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus
Make-An-Agent: A Generalizable Policy Network Generator with Behavior-Prompted Diffusion
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models
GRUtopia: Dream General Robots in a City at Scale
DataDream: Few-shot Guided Dataset Generation
Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation
Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval
Qwen2-Audio Technical Report
Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning
Qwen2 Technical Report
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
LAB-Bench: Measuring Capabilities of Language Models for Biology Research
Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models
xLSTMTime: Long-term Time Series Forecasting With xLSTM
Practical Unlearning for Large Language Models
Learning to Refuse: Towards Mitigating Privacy Risks in LLMs
Video Occupancy Models
StyleSplat: 3D Object Style Transfer with Gaussian Splatting
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
Human-like Episodic Memory for Infinite Context LLMs
MUSCLE: A Model Update Strategy for Compatible LLM Evolution
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
GAVEL: Generating Games Via Evolution and Language Models
Transformer Layers as Painters
H2O-Danube3 Technical Report
Context Embeddings for Efficient Answer Generation in RAG
Accuracy is Not All You Need
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
New Desiderata for Direct Preference Optimization
SpreadsheetLLM: Encoding Spreadsheets for Large Language Models
AUITestAgent: Automatic Requirements Oriented GUI Function Testing
TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models
Characterizing Prompt Compression Methods for Long Context Inference
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
MAVIS: Mathematical Visual Instruction Tuning
Video Diffusion Alignment via Reward Gradients
Real-Time Anomaly Detection and Reactive Planning with Large Language Models
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
Map It Anywhere (MIA): Empowering Bird's Eye View Mapping using Large-scale Public Data
GTA: A Benchmark for General Tool Agents
OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects
Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models
SEED-Story: Multimodal Long Story Generation with Large Language Model
Generalizable Implicit Motion Modeling for Video Frame Interpolation
Towards Building Specialized Generalist AI with System 1 and System 2 Fusion
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
Autoregressive Speech Synthesis without Vector Quantization
Converging Paradigms: The Synergy of Symbolic and Connectionist AI in LLM-Empowered Autonomous Agents
WildGaussians: 3D Gaussian Splatting in the Wild
Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
Gradient Boosting Reinforcement Learning
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Toto: Time Series Optimized Transformer for Observability
Controlling Space and Time with Diffusion Models
BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark
PaliGemma: A versatile 3B VLM for transfer
VEnhancer: Generative Space-Time Enhancement for Video Generation
On Leakage of Code Generation Evaluation Datasets
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
Video-to-Audio Generation with Hidden Alignment
CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging
Inference Performance Optimization for Large Language Models on CPUs
Scaling Up Personalized Aesthetic Assessment via Task Vector Customization
Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Self-Recognition in Language Models
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Vision language models are blind
RRM: Relightable assets using Radiance guided Material extraction
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions
VIMI: Grounding Video Generation through Multi-modal Instruction
A Survey on Mixture of Experts
Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Compositional Video Generation as Flow Equalization
On Speeding Up Language Model Evaluation
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty
PAS: Data-Efficient Plug-and-Play Prompt Augmentation System
Distilling System 2 into System 1
LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
Large Language Models Understand Layouts
Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition
InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct
Retrieved In-Context Principles from Previous Mistakes
An accurate detection is not all you need to combat label noise in web-noisy datasets
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions
Granular Privacy Control for Geolocation with Vision Language Models
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Associative Recurrent Memory Transformer
Revealing the Utilized Rank of Subspaces of Learning in Neural Networks
ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models
On scalable oversight with weak LLMs judging strong LLMs
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
PartCraft: Crafting Creative Objects by Parts
AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Mixture of A Million Experts
DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking
LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge
CRiM-GS: Continuous Rigid Motion-Aware Gaussian Splatting from Motion Blur Images
BM25S: Orders of magnitude faster lexical search via eager sparse scoring
AgentInstruct: Toward Generative Teaching with Agentic Flows
HEMM: Holistic Evaluation of Multimodal Foundation Models
Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents
Self-Evaluation as a Defense Against Adversarial Attacks on LLMs
How Does Quantization Affect Multilingual LLMs?
TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts
Investigating Decoder-only Large Language Models for Speech-to-text Translation
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models
Knowledge Composition using Task Vectors with Learned Anisotropic Scaling
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
No Training, No Problem: Rethinking Classifier-Free Guidance for Diffusion Models
Reasoning in Large Language Models: A Geometric Perspective
A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Magic Insert: Style-Aware Drag-and-Drop
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Consistency Flow Matching: Defining Straight Flows with Velocity Consistency
TokenPacker: Efficient Visual Projector for Multimodal LLM
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
μ-Bench: A Vision-Language Benchmark for Microscopy Understanding
xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart
DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
AI Agents That Matter
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
RegMix: Data Mixture as Regression for Language Model Pre-training
LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives
Agentless: Demystifying LLM-based Software Engineering Agents
DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
ColPali: Efficient Document Retrieval with Vision Language Models
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER
MIRAI: Evaluating LLM Agents for Event Forecasting
Searching for Best Practices in Retrieval-Augmented Generation
$\text{Memory}^3$: Language Modeling with Explicit Memory
Eliminating Position Bias of Language Models: A Mechanistic Approach
PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs
Towards Robust Speech Representation Learning for Thousands of Languages
InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix
LiteSearch: Efficacious Tree Search for LLM
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models
UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
2406
Meta Large Language Model Compiler: Foundation Models of Compiler Optimization
Gemma 2: Improving Open Language Models at a Practical Size
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Applying RLAIF for Code Generation with API-usage in Lightweight LLMs
Understanding and Mitigating Language Confusion in LLMs
ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
Wavelets Are All You Need for Autoregressive Image Generation
Direct Preference Knowledge Distillation for Large Language Models
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
What Matters in Detecting AI-Generated Videos like Sora?
Instance-Optimal Private Density Estimation in the Wasserstein Distance
Investigating How Large Language Models Leverage Internal Knowledge to Perform Complex Reasoning
Dataset Size Recovery from LoRA Weights
ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
The Remarkable Robustness of LLMs: Stages of Inference?
TabReD: A Benchmark of Tabular Machine Learning in-the-Wild
Efficient World Models with Context-Aware Tokenization
LiveBench: A Challenging, Contamination-Free LLM Benchmark
From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation
Revealing Fine-Grained Values and Opinions in Large Language Models
Aligning Teacher with Student Preferences for Tailored Training Data Generation
Simulating Classroom Education with LLM-Empowered Agents
T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation
RouteLLM: Learning to Route LLMs with Preference Data
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Symbolic Learning Enables Self-Evolving Agents
MatchTime: Towards Automatic Soccer Game Commentary Generation
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enhanced Quality
A Closer Look into Mixture-of-Experts in Large Language Models
ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models
Poisoned LangChain: Jailbreak LLMs by LangChain
ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs
Octo-planner: On-device Language Model for Planner-Action Agents
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Fast and Uncertainty-Aware SVBRDF Recovery from Multi-View Capture using Frequency Domain Analysis
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
DiffusionPDE: Generative PDE-Solving Under Partial Observation
MotionBooth: Motion-Aware Customized Text-to-Video Generation
Following Length Constraints in Instructions
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
Aligning Diffusion Models with Noise-Conditioned Perception
LongIns: A Challenging Long-context Instruction-based Exam for LLMs
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
Multi-property Steering of Large Language Models with Dynamic Activation Composition
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Benchmarking Mental State Representations in Language Models
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
D2LLM: Decomposed and Distilled Large Language Models for Semantic Search
Unlocking Continual Learning Abilities in Language Models
Large Language Models Assume People are More Rational than We Really are
Understanding and Diagnosing Deep Reinforcement Learning
FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Long Context Transfer from Language to Vision
RaTEScore: A Metric for Radiology Report Generation
ClotheDreamer: Text-Guided Garment Generation with 3D Gaussians
Adam-mini: Use Fewer Learning Rates To Gain More
OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?
WARP: On the Benefits of Weight Averaged Rewarded Policies
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
Scaling Laws for Linear Complexity Language Models
Repulsive Score Distillation for Diverse Sampling of Diffusion Models
Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
On the Transformations across Reward Model, Parameter Update, and In-Context Prompt
EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals
Video-Infinity: Distributed Long Video Generation
Confidence Regulation Neurons in Language Models
Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Evaluating D-MERIT of Partial-annotation on Information Retrieval
Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
What Matters in Transformers? Not All Attention is Needed
Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
Image Conductor: Precision Control for Interactive Video Synthesis
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
Cognitive Map for Language Models: Optimal Planning via Verbally Representing the World Model
MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
Reward Steering with Evolutionary Heuristics for Decoding-time Alignment
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems
Towards Retrieval Augmented Generation over Large Video Libraries
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
ToVo: Toxicity Taxonomy via Voting
Efficient Continual Pre-training by Mitigating the Stability Gap
How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions
Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework
RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation
Can LLMs Learn by Teaching? A Preliminary Study
Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
IRASim: Learning Interactive Real-Robot Action Simulators
Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
Instruction Pre-Training: Language Models are Supervised Multitask Learners
Jailbreaking as a Reward Misspecification Problem
$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
LiveMind: Low-latency Large Language Models with Simultaneous Inference
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning
Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task
ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning
Towards Event-oriented Long Video Understanding
How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics
Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets
Adaptable Logical Control for Large Language Models
StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation
Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations
Improving Visual Commonsense in Language Models via Multiple Image Generation
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
4K4DGen: Panoramic 4D Generation at 4K Resolution
EvTexture: Event-driven Texture Enhancement for Video Super-Resolution
Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View Images
VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models
Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models
DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion Models
GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks
Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation
VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing
From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries
Adversarial Attacks on Multimodal Agents
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
Measuring Psychological Depth in Language Models
Estimating Knowledge in Large Language Models Without Generating a Single Token
Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP
RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation
Low-Resource Machine Translation through the Lens of Personalized Federated Learning
HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors
PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers
Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models
Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment
JEN-1 DreamStyler: Customized Musical Concept Learning via Pivotal Parameters Tuning
VoCo-LLaMA: Towards Vision Compression with Large Language Models
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
TroL: Traversal of Layers for Large Language and Vision Models
Interface Design for Self-Supervised Speech Models
BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks
Learning Molecular Representation in a Cell
Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning
$τ$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models
Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
Large Scale Transfer Learning for Tabular Data via Language Modeling
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark
AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology
Mixture-of-Subspaces in Low-Rank Adaptation
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Unveiling Encoder-Free Vision-Language Models
WPO: Enhancing RLHF with Weighted Preference Optimization
Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
VideoLLM-online: Online Video Large Language Model for Streaming Video
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
DataComp-LM: In search of the next generation of training sets for language models
Task Me Anything
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Measuring memorization in RLHF for code completion
Nemotron-4 340B Technical Report
Tokenization Falling Short: The Curse of Tokenization
Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming
DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling
Long Code Arena: a Set of Benchmarks for Long-Context Code Models
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
HARE: HumAn pRiors, a key to small language model Efficiency
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report
Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis
A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models
Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion
Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing
From Pixels to Prose: A Large Dataset of Dense Image Captions
L4GM: Large 4D Gaussian Reconstruction Model
Beyond Words: On Large Language Models Actionability in Mission-Critical Risk Analysis
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Make It Count: Text-to-Image Generation with an Accurate Number of Objects
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs
Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering
MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
Training-free Camera Control for Video Generation
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
GaussianSR: 3D Gaussian Super-Resolution with 2D Diffusion Priors
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
GEB-1.3B: Open Lightweight Large Language Model
Bootstrapping Language Models with DPO Implicit Rewards
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Decoding the Diversity: A Review of the Indic AI Research Landscape
Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Depth Anything V2
Interpreting the Weight Space of Customized Diffusion Models
Explore the Limits of Omni-modal Pretraining at Scale
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
LRM-Zero: Training Large Reconstruction Models with Synthesized Data
Understanding Hallucinations in Diffusion Models through Mode Interpolation
CMC-Bench: Towards a New Paradigm of Visual Signal Compression
Transformers meet Neural Algorithmic Reasoners
Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
OpenVLA: An Open-Source Vision-Language-Action Model
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis
Cognitively Inspired Energy-Based World Models
Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
HelpSteer2: Open-source dataset for training top-performing reward models
Vivid-ZOO: Multi-View Video Generation with Diffusion Model
Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
DiTFastAttn: Attention Compression for Diffusion Transformer Models
RVT-2: Learning Precise Manipulation from Few Demonstrations
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Real3D: Scaling Up Large Reconstruction Models with Real-World Images
What If We Recaption Billions of Web Images with LLaMA-3?
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Discovering Preference Optimization Algorithms with and for Large Language Models
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation
Is Programming by Example solved by LLMs?
Can Large Language Models Analyze Software Failures in the News? An End-to-End Automated Pipeline with FAIL
Transformer-based Model for ASR N-Best Rescoring and Rewriting
Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio
Multimodal Table Understanding
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Large Language Model Unlearning via Embedding-Corrupted Prompts
Designing a Dashboard for Transparency and Control of Conversational AI
Hierarchical Patch Diffusion Models for High-Resolution Video Generation
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation
An Image is Worth 32 Tokens for Reconstruction and Generation
Zero-shot Image Editing with Reference Imitation
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?
Simple and Effective Masked Diffusion Language Models
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
Neural Gaffer: Relighting Any Object via Diffusion
Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement
TextGrad: Automatic "Differentiation" via Text
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models
Estimating the Hallucination Rate of Generative AI
McEval: Massively Multilingual Code Evaluation
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
World Models with Hints of Large Language Models for Goal Achieving
Needle In A Multimodal Haystack
Merging Improves Self-Critique Against Jailbreak Attacks
TernaryLLM: Ternarized Large Language Model
Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement
Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Synthetic Query Generation using Large Language Models for Virtual Assistants
SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound
The Prompt Report: A Systematic Survey of Prompting Techniques
Improve Mathematical Reasoning in Language Models by Automated Process Supervision
MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
IllumiNeRF: 3D Relighting without Inverse Rendering
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing
Towards a Personal Health Large Language Model
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
How Far Can Transformers Reason? The Locality Barrier and Inductive Scratchpad
VCR: Visual Caption Restoration
Margin-aware Preference Optimization for Aligning Diffusion Models without Reference
On the Minimal Degree Bias in Generalization on the Unseen for non-Boolean Functions
Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching
Tx-LLM: A Large Language Model for Therapeutics
PowerInfer-2: Fast Large Language Model Inference on a Smartphone
MaskLID: Code-Switching Language Identification through Iterative Masking
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis
ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models
Vript: A Video Is Worth Thousands of Words
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
Attention as a Hypernetwork
Unified Text-to-Image Generation and Retrieval
MLCM: Multistep Consistency Distillation of Latent Diffusion Model
GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
MotionClone: Training-Free Motion Cloning for Controllable Video Generation
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
Hibou: A Family of Foundational Vision Transformers for Pathology
SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
CRAG -- Comprehensive RAG Benchmark
Mixture-of-Agents Enhances Large Language Model Capabilities
Learning Task Decomposition to Assist Humans in Competitive Programming
Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
Proofread: Fixes All Errors with One Tap
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning
Time Sensitive Knowledge Editing through Efficient Finetuning
GenAI Arena: An Open Evaluation Platform for Generative Models
Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?
Large Language Model Confidence Estimation via Black-Box Access
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model
Simplified and Generalized Masked Diffusion for Discrete Data
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
SF-V: Single Forward Video Generation Model
Chimera: Effectively Modeling Multivariate Time Series with 2-Dimensional State Space Models
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
VideoTetris: Towards Compositional Text-to-Video Generation
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
Open-Endedness is Essential for Artificial Superhuman Intelligence
Hypernetworks for Personalizing ASR to Atypical Speech
Confabulation: The Surprising Value of Large Language Model Hallucinations
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments
Are We Done with MMLU?
Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation
Evaluating the World Model Implicit in a Generative Model
Enhancing CTC-based speech recognition with diverse modeling units
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM
Xmodel-LM Technical Report
Item-Language Model for Conversational Recommendation
RATT: A Thought Structure for Coherent and Correct LLM Reasoning
Block Transformer: Global-to-Local Language Modeling for Fast Inference
To Believe or Not to Believe Your LLM
Parrot: Multilingual Visual Instruction Tuning
Scalable MatMul-free Language Modeling
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation
Guiding a Diffusion Model with a Bad Version of Itself
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Improved Modelling of Federated Datasets using Mixtures-of-Dirichlet-Multinomials
Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning
I4VGen: Image as Stepping Stone for Text-to-Video Generation
OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models
Self-Improving Robust Preference Optimization
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Learning Temporally Consistent Video Depth from Video Diffusion Priors
pOps: Photo-Inspired Diffusion Operators
Towards Scalable Automated Alignment of LLMs: A Survey
Luna: An Evaluation Foundation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost
ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
Improving GFlowNets for Text-to-Image Diffusion Alignment
Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning
$μ$LO: Compute-Efficient Meta-Generalization of Learned Optimizers
2405
PaliGemma
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales
4Diffusion: Multi-view Video Diffusion Model for 4D Generation
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Xwin-LM: Strong and Scalable Alignment Practice for LLMs
GECO: Generative Image-to-3D within a SECOnd
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Grokfast: Accelerated Grokking by Amplifying Slow Gradients
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning
PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting
Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark
Why Larger Language Models Do In-context Learning Differently?
Contrasting Multiple Representations with the Multi-Marginal Matching Gap
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
NPGA: Neural Parametric Gaussian Avatars
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
Offline Regularised Reinforcement Learning for Large Language Models Alignment
EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture
LLMs achieve adult human performance on higher-order theory of mind tasks
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
Contextual Position Encoding: Learning to Count What's Important
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
GFlow: Recovering 4D World from Monocular Video
3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting
Phased Consistency Model
Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning
LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models
Faithful Logical Reasoning via Symbolic Chain-of-Thought
4-bit Shampoo for Memory-Efficient Network Training
2BP: 2-Stage Backpropagation
VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections
Yuan 2.0-M32: Mixture of Experts with Attention Router
Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation
Matryoshka Multimodal Models
Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control
Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer
THREAD: Thinking Deeper with Recursive Spawning
Transformers Can Do Arithmetic with the Right Embeddings
Trans-LoRA: towards data-free Transferable Parameter Efficient Finetuning
An Introduction to Vision-Language Modeling
Position: Foundation Agents as the Paradigm Shift for Decision Making
Part123: Part-aware 3D Reconstruction from a Single-view Image
Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
The Road Less Scheduled
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
Are Long-LLMs A Necessity For Long-Context Tasks?
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition
ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models
OptLLM: Optimal Assignment of Queries to Large Language Models
HDR-GS: Efficient High Dynamic Range Novel View Synthesis at 1000x Speed via Gaussian Splatting
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Aya 23: Open Weight Releases to Further Multilingual Progress
AGRaME: Any-Granularity Ranking with Multi-Vector Embeddings
CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner
Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining
AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct
NeRF-Casting: Improved View-Dependent Appearance with Consistent Reflections
Improved Distribution Matching Distillation for Fast Image Synthesis
Tele-Aloha: A Low-budget and High-authenticity Telepresence System Using Sparse RGB Cameras
Not All Language Model Features Are Linear
Semantica: An Adaptable Image-Conditioned Diffusion Model
Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling
Lessons from the Trenches on Reproducible Evaluation of Language Models
SimPO: Simple Preference Optimization with a Reference-Free Reward
RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis
Agent Planning with World Knowledge Model
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
Distributed Speculative Inference of Large Language Models
Attention as an RNN
Affine-based Deformable Attention and Selective Fusion for Semi-dense Matching
ReVideo: Remake a Video with Motion and Content Control
Thermodynamic Natural Gradient Descent
Dense Connector for MLLMs
CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers
KPConvX: Modernizing Kernel Point Convolution with Kernel Attention
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
OmniGlue: Generalizable Feature Matching with Foundation Model Guidance
Personalized Residuals for Concept-Driven Text-to-Image Generation
Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control
Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models
Retrieval-Augmented Language Model for Extreme Multi-Label Knowledge Graph Link Prediction
Quantifying Emergence in Large Language Models
Diffusion for World Modeling: Visual Details Matter in Atari
Your Transformer is Secretly Linear
Octo: An Open-Source Generalist Robot Policy
Training Data Attribution via Approximate Unrolled Differentiation
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
Imp: Highly Capable Large Multimodal Models for Mobile Devices
On Efficient and Statistical Quality Estimation for Data Annotation
Information Leakage from Embedding in Large Language Models
SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization
FIFO-Diffusion: Generating Infinite Videos from Text without Training
Dreamer XL: Towards High-Resolution Text-to-3D Generation via Trajectory Score Matching
Towards Modular LLMs by Building and Reusing a Library of LoRAs
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Observational Scaling Laws and the Predictability of Language Model Performance
Efficient Multimodal Large Language Models: A Survey
INDUS: Effective and Efficient Language Models for Scientific Applications
Layer-Condensed KV Cache for Efficient Inference of Large Language Models
Dynamic data sampler for cross-language transfer learning in large language models
Grounded 3D-LLM with Referent Tokens
Toon3D: Seeing Cartoons from a New Perspective
TRANSIC: Sim-to-Real Policy Transfer by Learning from Online Correction
CAT3D: Create Anything in 3D with Multi-View Diffusion Models
How Far Are We From AGI
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Many-Shot In-Context Learning in Multimodal Foundation Models
LoRA Learns Less and Forgets Less
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Naturalistic Music Decoding from EEG Data via Latent Diffusion Models
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
Risks and Opportunities of Open-Source Generative AI
Understanding the performance gap between online and offline alignment algorithms
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
SpeechVerse: A Large-scale Generalizable Audio Language Model
Compositional Text-to-Image Generation with Dense Blob Representations
Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning
A Survey of Large Language Models for Graphs
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
The Platonic Representation Hypothesis
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition
Zero-Shot Tokenizer Transfer
RLHF Workflow: From Reward Modeling to Online RLHF
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
PromptLink: Leveraging Large Language Models for Cross-Source Biomedical Concept Linking
LogoMotion: Visually Grounded Code Generation for Content-Aware Animation
Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
SUTRA: Scalable Multilingual Language Model Architecture
Large Language Models as Planning Domain Generators
Linearizing Large Language Models
Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
You Only Cache Once: Decoder-Decoder Architectures for Language Models
ChuXin: 1.6B Technical Report
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
xLSTM: Extended Long Short-Term Memory
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Granite Code Models: A Family of Open Foundation Models for Code Intelligence
ContextQ: Generated Questions to Support Meaningful Parent-Child Dialogue While Co-Reading
AlphaMath Almost Zero: Process Supervision without Process
MAmmoTH2: Scaling Instructions from the Web
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
Parameter-Efficient Fine-Tuning with Discrete Fourier Transform
Is Flash Attention Stable?
What matters when building vision-language models?
Optimization without Retraction on the Random Generalized Stiefel Manifold
Customizing Text-to-Image Models with a Single Image Pair
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
FLAME: Factuality-Aware Alignment for Large Language Models
NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
WildChat: 1M ChatGPT Interaction Logs in the Wild
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
LLM-AD: Large Language Model based Audio Description System
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
Spectrally Pruned Gaussian Fields with Neural Compensation
Self-Play Preference Optimization for Language Model Alignment
Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3
A Note on Large Sums of Divisor-Bounded Multiplicative Functions
BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
STT: Stateful Tracking with Transformers for Autonomous Driving
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
Constrained Decoding for Secure Code Generation
A Primer on the Inner Workings of Transformer-based Language Models
SPAFIT: Stratified Progressive Adaptation Fine-tuning for Pre-trained Large Language Models
In-Context Learning with Long-Context Models: An In-Depth Exploration
Automatic Creative Selection with Cross-Modal Matching
2404
OpenEQA: Embodied Question Answering in the Era of Foundation Models
CodeGemma: Open Code Models Based on Gemma
Lightplane: Highly-Scalable Components for Neural 3D Fields
MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model
Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting
KAN: Kolmogorov-Arnold Networks
DOCCI: Descriptions of Connected and Contrasting Images
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Better & Faster Large Language Models via Multi-token Prediction
Iterative Reasoning Preference Optimization
When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
Extending Llama-3's Context Ten-Fold Overnight
RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing
MicroDreamer: Zero-shot 3D Generation in ~20 Seconds by Score-based Iterative Reconstruction
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation
Octopus v4: Graph of language models
SAGS: Structure-Aware 3D Gaussian Splatting
In-Context Symbolic Regression: Leveraging Large Language Models for Function Discovery
Hallucination of Multimodal Large Language Models: A Survey
Stylus: Automatic Adapter Selection for Diffusion Models
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
PECC: Problem Extraction and Coding Challenges
ChatGPT as an inventor: Eliciting the strengths and weaknesses of current large language models against humans in engineering design
Capabilities of Gemini Models in Medicine
LEGENT: Open Platform for Embodied Agents
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
MaPa: Text-driven Photorealistic Material Painting for 3D Shapes
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations
Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM
Small Language Models Need Strong Verifiers to Self-Correct Reasoning
Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
Make Your LLM Fully Utilize the Context
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving
Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
Tele-FLM Technical Report
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
Interactive3D: Create What You Want by Interactive 3D Generation
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
NeRF-XL: Scaling NeRFs with Multiple GPUs
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
MaGGIe: Masked Guided Gradual Human Instance Matting
MoDE: CLIP Data Experts via Clustering
Editable Image Elements for Controllable Synthesis
PuLID: Pure and Lightning ID Customization via Contrastive Alignment
Leveraging Large Language Models for Multimodal Search
MotionMaster: Training-free Camera Motion Transfer For Video Generation
BASS: Batched Attention-optimized Speculative Sampling
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
Label-Efficient Sleep Staging Using Transformers Pre-trained with Position Prediction
Multi-Head Mixture-of-Experts
Transformers Can Represent n-gram Language Models
Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
FlashSpeech: Efficient Zero-Shot Speech Synthesis
Pegasus-1 Technical Report
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
Align Your Steps: Optimizing Sampling Schedules in Diffusion Models
SnapKV: LLM Knows What You are Looking for Before Generation
Learning H-Infinity Locomotion Control
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
A Multimodal Automated Interpretability Agent
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer
A Survey on Efficient Inference for Large Language Models
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
How Good Are Low-bit Quantized LLAMA3 Models? An Empirical Study
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
Music Consistency Models
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
FlowMind: Automatic Workflow Generation with LLMs
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
Stronger Random Baselines for In-Context Learning
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Towards Reliable Latent Knowledge Estimation in LLMs: In-Context Learning vs. Prompting Based Factual Knowledge Extraction
LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency
How Far Can We Go with Practical Function-Level Program Repair?
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
Does Gaussian Splatting need SFM Initialization?
HalluciBot: Is There No Such Thing as a Bad Question?
BLINK: Multimodal Large Language Models Can See but Not Perceive
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
MeshLRM: Large Reconstruction Model for High-Quality Mesh
From r to Q*: Your Language Model is Secretly a Q-Function
AniClipart: Clipart Animation with Text-to-Video Priors
Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Introducing v0.5 of the AI Safety Benchmark from MLCommons
OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
EdgeFusion: On-Device Text-to-Image Generation
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Dynamic Typography: Bringing Words to Life
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey
MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation
Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
Many-Shot In-Context Learning
A Survey on Retrieval-Augmented Text Generation for Large Language Models
HumMUSS: Human Motion Understanding using State Space Models
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs
Social Choice for AI Alignment: Dealing with Diverse Human Feedback
Private Vector Mean Estimation in the Shuffle Model: Optimal Rates Require Many Messages
How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs’ internal prior
Scaling Instructable Agents Across Many Simulated Worlds
Chinchilla Scaling: A replication attempt
Taming Latent Diffusion Model for Neural Radiance Field Inpainting
MMInA: Benchmarking Multihop Multimodal Internet Agents
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
CTRL-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization
Compression Represents Intelligence Linearly
Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video
Learn Your Reference Model for Real Good Alignment
State Space Model for New-Generation Network Alternative to Transformers: A Survey
CompGS: Efficient 3D Scene Representation via Compressed Gaussian Splatting
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
TransformerFAM: Feedback attention is working memory
LLM In-Context Recall is Prompt Dependent
On Speculative Decoding for Multimodal Large Language Models
The Illusion of State in State-Space Models
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
COCONut: Modernizing COCO Segmentation
Probing the 3D Awareness of Visual Foundation Models
Pre-training Small Base LMs with Fewer Tokens
On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
Dataset Reset Policy Optimization for RLHF
MonoPatchNeRF: Improving Neural Radiance Fields with Patch-based Monocular Guidance
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
Reducing hallucination in structured outputs via Retrieval-Augmented Generation
Conformal Prediction via Regression-as-Classification
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
LLoCO: Learning Long Contexts Offline
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
OSWORLD: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
RHO-1: Not All Tokens Are What You Need
HGRN2: Gated Linear RNNs with State Expansion
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Sparse Laneformer
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models
Audio Dialogues: Dialogues dataset for audio and music understanding
From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples
Best Practices and Lessons Learned on Synthetic Data for Language Models
Transferable and Principled Efficiency for Open-Vocabulary Segmentation
JetMoE: Reaching Llama2 Performance with 0.1M Dollars
BISCUIT: Scaffolding LLM-Generated Code with Ephemeral UIs in Computational Notebooks
BRAVE: Broadening the visual encoding of vision-language models
RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting
Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior
Adapting LLaMA Decoder to Vision Transformer
RULER: What's the Real Context Size of Your Long-Context Language Models?
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Reconstructing Hand-Held Objects in 3D
pfl-research: simulation framework for accelerating research in Private Federated Learning
Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
MuPT: A Generative Symbolic Music Pretrained Transformer
OmniFusion Technical Report
Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models
Revising Densification in Gaussian Splatting
Hash3D: Training-free Acceleration for 3D Generation
Privacy Preserving Prompt Engineering: A Survey
THOUGHTSCULPT: Reasoning with Intermediate Revision and Search
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
CodecLM: Aligning Language Models with Tailored Synthetic Data
SambaLingo: Teaching Large Language Models New Languages
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing
Evaluating Mathematical Reasoning Beyond Accuracy
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
YaART: Yet Another ART Rendering Technology
UniFL: Improve Stable Diffusion via Unified Feedback Learning
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers
ByteEdit: Boost, Comply and Accelerate Generative Image Editing
BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
DATENeRF: Depth-Aware Text-based Editing of NeRFs
Q-PEFT: Query-dependent Parameter Efficient Fine-tuning for Text Reranking with Large Language Models
Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models
Aligning Diffusion Models by Optimizing Human Utility
PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations
Koala: Key frame-conditioned long video-LLM
SpatialTracker: Tracking Any 2D Pixels in 3D Space
Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation
Robust Gaussian Splatting
Social Skill Training with Large Language Models
Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model
No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
BuDDIE: A Business Document Dataset for Multi-task Information Extraction
Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Stream of Search (SoS): Learning to Search in Language
RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
Training LLMs over Neurally Compressed Text
ReFT: Representation Finetuning for Language Models
PointInfinity: Resolution-Invariant Point Diffusion Models
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
Scaling Up Video Summarization Pretraining with Large Language Models
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models
Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline
On the Scalability of Diffusion-based Text-to-Image Generation
Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Advancing LLM Reasoning Generalists with Preference Trees
Long-context LLMs Struggle with Long In-context Learning
HyperCLOVA X Technical Report
Poro 34B and the Blessing of Multilinguality
Octopus v2: On-device language model for super agent
Entity Disambiguation via Fusion Entity Decoding
LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models
Are large language models superhuman chemists?
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
HairFastGAN: Realistic and Robust Hair Transfer with a Fast Encoder-Based Approach
2403
ReALM: Reference Resolution As Language Modeling
Gecko: Versatile Text Embeddings Distilled from Large Language Models
Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs
DiJiang: Efficient Large Language Models through Compact Kernelization
Jamba: A Hybrid Transformer-Mamba Language Model
Localizing Paragraph Memorization in Language Models
Model Stock: All we need is just a few fine-tuned models
sDPO: Don’t Use Your Data All at Once
Learning From Correctness Without Prompting Makes LLM Efficient Reasoner
Towards a World-English Language Model for On-Device Virtual Assistants
BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models
The Unreasonable Ineffectiveness of the Deeper Layers
Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Embedding Pose Graph, Enabling 3D Foundation Model Capabilities with a Compact Representation
Arcee’s MergeKit: A Toolkit for Merging Large Language Models
Evolutionary Optimization of Model Merging Recipes
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
LLM as a System Service on Mobile Devices
RAFT: Adapting Language Model to Domain Specific RAG
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
WavCraft: Audio Editing and Generation with Large Language Models
Gemma: Open Models Based on Gemini Research and Technology
A Direct Algorithm for Multi-Gyroscope Infield Calibration
Process Modeling With Large Language Models
VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Poly-View Contrastive Learning
Is Cosine-Similarity of Embeddings Really About Similarity?
How Far Are We from Intelligent Visual Deductive Reasoning?
Learning to Decode Collaboratively with Multiple Language Models
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
On a Neural Implementation of Brenier's Polar Factorization
LAB: Large-Scale Alignment for ChatBots
CLLMs: Consistency Large Language Models
2402
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
Training Neural Networks from Scratch with Parallel Low-Rank Adapters
FuseChat: Knowledge Fusion of Chat Models
OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining
Divide-or-Conquer? Which Part Should You Distill Your LLM?
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models
OmniPred: Language Models as Universal Regressors
Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance
KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge
A Survey on Knowledge Distillation of Large Language Models
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation
OneBit: Towards Extremely Low-bit Large Language Models
LaCo: Large Language Model Pruning via Layer Collapse
Speculative Streaming: Fast LLM Inference without Auxiliary Models
Masked Attention is All You Need for Graphs
TOAD: Task-Oriented Automatic Dialogs with Diverse Response Styles
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator
DoRA: Weight-Decomposed Low-Rank Adaptation
Higher Layers Need More LoRA Experts
On Computationally Efficient Multi-Class Calibration
X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models with Applications in Protein Mechanics and Molecular Design
Accurate LoRA-Finetuning Quantization of LLMs via Information Retention
More Agents Is All You Need
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
DISTILLM: Towards Streamlined Distillation for Large Language Models
ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs
Careful with that Scalpel: Improving Gradient Surgery with an EMA
DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Executable Code Actions Elicit Better LLM Agents
2401
DressCode: Autoregressively Sewing and Generating Garments from Text Guidance
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
SliceGPT: Compress Large Language Models by Deleting Rows and Columns
Omnipredictors for Regression and the Approximate Rank of Convex Functions
Demystifying Chains, Trees, and Graphs of Thoughts
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tuning Language Models by Proxy
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
Heterogeneous LoRA for Federated Fine-tuning of On-Device Foundation Models
Extreme Compression of Large Language Models via Additive Quantization
Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages
LLaMA Pro: Progressive LLaMA with Block Expansion
RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models
2312
SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling
Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases
How Smooth Is Attention?
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
KGLens: Towards Efficient and Effective Knowledge Probing of Large Language Models with Knowledge Graphs
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Retrieval-Augmented Generation for Large Language Models: A Survey
Conformer-Based Speech Recognition On Extreme Edge-Computing Devices
LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin
An LLM Compiler for Parallel Function Calling
Revisiting Non-separable Binary Classification and its Applications in Anomaly Detection
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices
2311
Diffusion Models Without Attention
Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models
Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications
Swallowing the Bitter Pill: Simplified Scalable Conformer Generation
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
RELIC: Investigating Large Language Model Responses using Self-Consistency
Direct2.5: Diverse 3D Content Creation via Multi-view 2.5D Diffusion
PaSS: Parallel Speculative Sampling
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
MultiLoRA: Democratizing LoRA for Better Multi-Task Learning
PINE: Efficient Norm-Bound Verification for Secret-Shared Vectors
Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying
PLUG: Leveraging Pivot Language in Cross-Lingual Instruction Tuning
Transfer Learning for Structured Pruning under Limited Task Data
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
Prompt Sketching for Large Language Models
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
Server-side Rescoring of Spoken Entity-centric Knowledge Queries for Virtual Assistants
FlashDecoding++: Faster Large Language Model Inference on GPUs
Efficient LLM Inference on CPUs
2310
EELBERT: Tiny Models through Dynamic Embeddings
Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
FP8-LM: Training FP8 Large Language Models
Large Language Models as Generalizable Policies for Embodied Tasks
LLM-FP4: 4-Bit Floating-Point Quantized Transformers
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Matryoshka Diffusion Models
We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields
CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
SPEED: Speculative Pipelined Execution for Efficient Decoding
VeRA: Vector-based Random Matrix Adaptation
BitNet: Scaling Transformers for Large Language Models
When Can Transformers Reason With Abstract Symbols?
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
Pseudo-Generalized Dynamic View Synthesis from a Video
QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
Generative Modeling with Phase Stochastic Bridges
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling
Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Improved Baselines with Visual Instruction Tuning
Large Language Models as Analogical Reasoners
Compressing LLMs: The Truth is Rarely Pure and Never Simple
Federated Learning with Differential Privacy for End-to-End Speech Recognition
Towards Automated Accessibility Report Generation for Mobile Apps
2309
Efficient Streaming Language Models with Attention Sinks
Guiding Instruction-based Image Editing via Multimodal Large Language Models
Vision Transformers Need Registers
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
Efficient Memory Management for Large Language Model Serving with PagedAttention
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning
From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
LLMCad: Fast and Scalable On-device Large Language Model Inference
2308
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library
Fast Feedforward Networks
EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
NimbRo wins ANA Avatar XPRIZE Immersive Telepresence Competition: Human-Centric Evaluation and Lessons Learned
Reinforced Self-Training (ReST) for Language Modeling
A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations
Accelerating LLM Inference with Staged Speculative Decoding
AgentBench: Evaluating LLMs as Agents
Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty
2307
Samplable Anonymous Aggregation for Private Federated Data Analysis
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
2306
MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators
MiniLLM: Knowledge Distillation of Large Language Models
MOFI: Learning Image Representation from Noisy Entity Annotated Images
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
TIES-Merging: Resolving Interference When Merging Models
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Bytes Are All You Need: Transformers Operating Directly On File Bytes
2305
LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Scaling Data-Constrained Language Models
Manifold Diffusion Fields
QLoRA: Efficient Finetuning of Quantized LLMs
Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
RWKV: Reinventing RNNs for the Transformer Era
Accurate Knowledge Distillation with n-best Reranking
LLM-Pruner: On the Structural Pruning of Large Language Models
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models
SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
Fast Distributed Inference Serving for Large Language Models
Shap-E: Generating Conditional 3D Implicit Functions
2304
Are Emergent Abilities of Large Language Models a Mirage?
Visual Instruction Tuning
2303
A Survey of Large Language Models
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Sigmoid Loss for Language Image Pre-Training
Sparks of Artificial General Intelligence: Early experiments with GPT-4
ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
2302
Full Stack Optimization of Transformer Inference: a Survey
Active Prompting with Chain-of-Thought for Large Language Models
RETVec: Resilient and Efficient Text Vectorizer
Offsite-Tuning: Transfer Learning without Full Model
Accelerating Large Language Model Decoding with Speculative Sampling
2301
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
Muse: Text-To-Image Generation via Masked Generative Transformers
2212
Large Language Models Are Reasoning Teachers
2211
Fast Inference from Transformers via Speculative Decoding
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production
2210
Deploying a Retrieval based Response Model for Task Oriented Dialogues
ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs
Less is More: Task-aware Layer-wise Distillation for Language Model Compression
2209
FP8 Formats for Deep Learning
2208
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
2207
Confident Adaptive Language Modeling
2206
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
NIPQ: Noise proxy-based Integrated Pseudo-Quantization
2205
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning
Towards Understanding Grokking: An Effective Theory of Representation Learning
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
2204
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
2203
A Survey of Multi-Tenant Deep Learning Inference on GPU
2202
cosFormer: Rethinking Softmax in Attention
2201
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
2112
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
2110
Scalable Smartphone Cluster for Deep Learning
Understanding Dimensional Collapse in Contrastive Self-supervised Learning
2106
LibShalom: Optimizing Small and Irregular-Shaped Matrix Multiplications on ARMv8 Multi-Cores
LoRA: Low-Rank Adaptation of Large Language Models
XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation
2105
A Survey of Data Augmentation Approaches for NLP
2104
RoFormer: Enhanced Transformer with Rotary Position Embedding
The Power of Scale for Parameter-Efficient Prompt Tuning
2101
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Prefix-Tuning: Optimizing Continuous Prompts for Generation
2010
Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks
TurboTransformers: An Efficient GPU Serving System For Transformer Models
2009
Flexible Performant GEMM Kernels on GPUs
2007
Soft Labeling Affects Out-of-Distribution Detection of Deep Neural Networks
2005
Language Models are Few-Shot Learners
BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs
2004
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
2003
Transformer++
2002
GLU Variants Improve Transformer
1910
Depth-Adaptive Transformer
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
1909
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TinyBERT: Distilling BERT for Natural Language Understanding
1908
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
1907
RoBERTa: A Robustly Optimized BERT Pretraining Approach
1906
How multilingual is Multilingual BERT?
1905
HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization
1810
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
1805
Online normalizer calculation for softmax
1803
NVIDIA Tensor Core Programmability, Performance & Precision
1706
Attention Is All You Need
1506
Pointer Networks