arxiv-daily
Showing new listings for Wednesday, 18 December 2024
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
Title:
Endangered Alert: A Field-Validated Self-Training Scheme for Detecting and Protecting Threatened Wildlife on Roads and Roadsides
- Authors: Kunming Li, Mao Shan, Stephany Berrio Perez, Katie Luo, Stewart Worrall
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Traffic accidents are a global safety concern, resulting in numerous fatalities each year. A considerable number of these deaths are caused by animal-vehicle collisions (AVCs), which not only endanger human lives but also present serious risks to animal populations. This paper presents an innovative self-training methodology aimed at detecting rare animals, such as the cassowary in Australia, whose survival is threatened by road accidents. The proposed method addresses critical real-world challenges, including acquiring and labelling sensor data for rare animal species in resource-limited environments. It achieves this by leveraging cloud and edge computing, and automatic data labelling to improve the detection performance of the field-deployed model iteratively. Our approach introduces Label-Augmentation Non-Maximum Suppression (LA-NMS), which incorporates a vision-language model (VLM) to enable automated data labelling. During a five-month deployment, we confirmed the method's robustness and effectiveness, resulting in improved object detection accuracy and increased prediction confidence. The source code is available: this https URL
Title:
Domain Generalization in Autonomous Driving: Evaluating YOLOv8s, RT-DETR, and YOLO-NAS with the ROAD-Almaty Dataset
- Authors: Madiyar Alimov, Temirlan Meiramkhanov
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract This study investigates the domain generalization capabilities of three state-of-the-art object detection models - YOLOv8s, RT-DETR, and YOLO-NAS - within the unique driving environment of Kazakhstan. Utilizing the newly constructed ROAD-Almaty dataset, which encompasses diverse weather, lighting, and traffic conditions, we evaluated the models' performance without any retraining. Quantitative analysis revealed that RT-DETR achieved an average F1-score of 0.672 at IoU=0.5, outperforming YOLOv8s (0.458) and YOLO-NAS (0.526) by approximately 46% and 27%, respectively. Additionally, all models exhibited significant performance declines at higher IoU thresholds (e.g., a drop of approximately 20% when increasing IoU from 0.5 to 0.75) and under challenging environmental conditions, such as heavy snowfall and low-light scenarios. These findings underscore the necessity for geographically diverse training datasets and the implementation of specialized domain adaptation techniques to enhance the reliability of autonomous vehicle detection systems globally. This research contributes to the understanding of domain generalization challenges in autonomous driving, particularly in underrepresented regions.
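The relative improvements quoted in this abstract can be checked from the reported F1-scores. The snippet below is a minimal sketch of that arithmetic; the helper name `relative_gain` is hypothetical and only the three F1 values come from the abstract.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# F1-scores at IoU=0.5 as reported in the abstract.
f1 = {"RT-DETR": 0.672, "YOLOv8s": 0.458, "YOLO-NAS": 0.526}

gain_vs_yolov8s = relative_gain(f1["RT-DETR"], f1["YOLOv8s"])
gain_vs_yolonas = relative_gain(f1["RT-DETR"], f1["YOLO-NAS"])

print(f"RT-DETR vs YOLOv8s:  {gain_vs_yolov8s:.1f}%")
print(f"RT-DETR vs YOLO-NAS: {gain_vs_yolonas:.1f}%")
```

The computed gains are about 46.7% and 27.8%, which agrees with the abstract's "approximately 46% and 27%" if those figures were truncated rather than rounded.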
Title:
PromptDet: A Lightweight 3D Object Detection Framework with LiDAR Prompts
- Authors: Kun Guo, Qiang Ling
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Multi-camera 3D object detection aims to detect and localize objects in 3D space using multiple cameras, which has attracted more attention due to its cost-effectiveness trade-off. However, these methods often struggle with the lack of accurate depth estimation caused by the natural weakness of the camera in ranging. Recently, multi-modal fusion and knowledge distillation methods for 3D object detection have been proposed to solve this problem, but they are time-consuming during the training phase and costly in memory. In light of this, we propose PromptDet, a lightweight yet effective 3D object detection framework motivated by the success of prompt learning in 2D foundation models. Our proposed framework, PromptDet, comprises two integral components: a general camera-based detection module, exemplified by models like BEVDet and BEVDepth, and a LiDAR-assisted prompter. The LiDAR-assisted prompter leverages the LiDAR points as a complementary signal, enriched with a minimal set of additional trainable parameters. Notably, our framework is flexible due to our prompt-like design, which can not only be used as a lightweight multi-modal fusion method but also as a camera-only method for 3D object detection during the inference phase. Extensive experiments on nuScenes validate the effectiveness of the proposed PromptDet. As a multi-modal detector, PromptDet improves the mAP and NDS by at most 22.8% and 21.1% with fewer than 2% extra parameters compared with the camera-only baseline. Without LiDAR points, PromptDet still achieves an improvement of at most 2.4% mAP and 4.0% NDS with almost no impact on camera detection inference time.
Title:
CREST: An Efficient Conjointly-trained Spike-driven Framework for Event-based Object Detection Exploiting Spatiotemporal Dynamics
- Authors: Ruixin Mao, Aoyu Shen, Lin Tang, Jun Zhou
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Event-based cameras feature high temporal resolution, wide dynamic range, and low power consumption, which is ideal for high-speed and low-light object detection. Spiking neural networks (SNNs) are promising for event-based object recognition and detection due to their spiking nature but lack efficient training methods, leading to gradient vanishing and high computational complexity, especially in deep SNNs. Additionally, existing SNN frameworks often fail to effectively handle multi-scale spatiotemporal features, leading to increased data redundancy and reduced accuracy. To address these issues, we propose CREST, a novel conjointly-trained spike-driven framework to exploit spatiotemporal dynamics in event-based object detection. We introduce the conjoint learning rule to accelerate SNN learning and alleviate gradient vanishing. It also supports dual operation modes for efficient and flexible implementation on different hardware types. Additionally, CREST features a fully spike-driven framework with a multi-scale spatiotemporal event integrator (MESTOR) and a spatiotemporal-IoU (ST-IoU) loss. Our approach achieves superior object recognition & detection performance and up to 100X energy efficiency compared with state-of-the-art SNN algorithms on three datasets, providing an efficient solution for event-based object detection algorithms suitable for SNN hardware implementation.
Title:
Efficient Oriented Object Detection with Enhanced Small Object Recognition in Aerial Images
- Authors: Zhifei Shi, Zongyao Yin, Sheng Chang, Xiao Yi, Xianchuan Yu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Achieving a balance between computational efficiency and detection accuracy in the realm of rotated bounding box object detection within aerial imagery is a significant challenge. While prior research has aimed at creating lightweight models that enhance computational performance and feature extraction, there remains a gap in the performance of these networks when it comes to the detection of small and multi-scale objects in remote sensing (RS) imagery. To address these challenges, we present a novel enhancement to the YOLOv8 model, tailored for oriented object detection tasks and optimized for environments with limited computational resources. Our model features a wavelet transform-based C2f module for capturing associative features and an Adaptive Scale Feature Pyramid (ASFP) module that leverages P2 layer details. Additionally, the incorporation of GhostDynamicConv significantly contributes to the model's lightweight nature, ensuring high efficiency in aerial imagery analysis. Featuring a parameter count of 21.6M, our approach provides a more efficient architectural design than DecoupleNet, which has 23.3M parameters, all while maintaining detection accuracy. On the DOTAv1.0 dataset, our model demonstrates a mean Average Precision (mAP) that is competitive with leading methods such as DecoupleNet. The model's efficiency, combined with its reduced parameter count, makes it a strong candidate for aerial object detection, particularly in resource-constrained environments.
Title:
RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
- Authors: Xiaomeng Chu, Jiajun Deng, Guoliang You, Yifan Duan, Houqiang Li, Yanyong Zhang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We propose the Radar-Camera fusion transformer (RaCFormer) to boost the accuracy of 3D object detection, based on the following insight. Radar-camera fusion in outdoor 3D scene perception is capped by the image-to-BEV transformation: if the depth of pixels is not accurately estimated, the naive combination of BEV features actually integrates unaligned visual content. To avoid this problem, we propose a query-based framework that enables adaptive sampling of instance-relevant features from both the BEV and the original image view. Furthermore, we enhance system performance by two key designs: optimizing query initialization and strengthening the representational capacity of BEV. For the former, we introduce an adaptive circular distribution in polar coordinates to refine the initialization of object queries, allowing for a distance-based adjustment of query density. For the latter, we initially incorporate a radar-guided depth head to refine the transformation from image view to BEV. Subsequently, we focus on leveraging the Doppler effect of radar and introduce an implicit dynamic catcher to capture the temporal elements within the BEV. Extensive experiments on nuScenes and View-of-Delft (VoD) datasets validate the merits of our design. Remarkably, our method achieves superior results of 64.9% mAP and 70.2% NDS on nuScenes, even outperforming several LiDAR-based detectors. RaCFormer also secures the 1st ranking on the VoD dataset. The code will be released.
Title:
RCTrans: Radar-Camera Transformer via Radar Densifier and Sequential Decoder for 3D Object Detection
- Authors: Yiheng Li, Yang Yang, Zhen Lei
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract In radar-camera 3D object detection, the radar point clouds are sparse and noisy, which causes difficulties in fusing camera and radar modalities. To solve this, we introduce a novel query-based detection method named Radar-Camera Transformer (RCTrans). Specifically, we first design a Radar Dense Encoder to enrich the sparse valid radar tokens, and then concatenate them with the image tokens. By doing this, we can fully explore the 3D information of each region of interest and reduce the interference of empty tokens during the fusing stage. We then design a Pruning Sequential Decoder to predict 3D boxes based on the obtained tokens and randomly initialized queries. To alleviate the effect of elevation ambiguity in radar point clouds, we gradually locate the position of the object via a sequential fusion structure. It helps to get more precise and flexible correspondences between tokens and queries. A pruning training strategy is adopted in the decoder, which can save much time during inference and inhibit queries from losing their distinctiveness. Extensive experiments on the large-scale nuScenes dataset prove the superiority of our method, and we also achieve new state-of-the-art radar-camera 3D detection results. Our implementation is available at this https URL.
Title:
Differential Alignment for Domain Adaptive Object Detection
- Authors: Xinyu He (1), Xinhui Li (1), Xiaojie Guo (1) ((1) College of Intelligence and Computing, Tianjin University, Tianjin, China)
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Domain adaptive object detection (DAOD) aims to generalize an object detector trained on labeled source-domain data to a target domain without annotations, the core principle of which is source-target feature alignment. Typically, existing approaches employ adversarial learning to align the distributions of the source and target domains as a whole, barely considering the varying significance of distinct regions, say instances under different circumstances and foreground vs. background areas, during feature alignment. To overcome this shortcoming, we investigate a differential feature alignment strategy. Specifically, a prediction-discrepancy feedback instance alignment module (dubbed PDFA) is designed to adaptively assign higher weights to instances of higher teacher-student detection discrepancy, effectively handling heavier domain-specific information. Additionally, an uncertainty-based foreground-oriented image alignment module (UFOA) is proposed to explicitly guide the model to focus more on regions of interest. Extensive experiments on widely-used DAOD datasets together with ablation studies are conducted to demonstrate the efficacy of our proposed method and reveal its superiority over other SOTA alternatives. Our code is available at this https URL.
Title:
What is YOLOv6? A Deep Insight into the Object Detection Model
- Authors: Athulya Sundaresan Geetha
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract This work explores the YOLOv6 object detection model in depth, concentrating on its design framework, optimization techniques, and detection capabilities. YOLOv6's core elements consist of the EfficientRep Backbone for robust feature extraction and the Rep-PAN Neck for seamless feature aggregation, ensuring high-performance object detection. Evaluated on the COCO dataset, YOLOv6-N achieves 37.5% AP at 1187 FPS on an NVIDIA Tesla T4 GPU. YOLOv6-S reaches 45.0% AP at 484 FPS, outperforming models like PPYOLOE-S, YOLOv5-S, YOLOX-S, and YOLOv8-S in the same class. Moreover, YOLOv6-M and YOLOv6-L also show better accuracy (50.0% and 52.8%) while maintaining comparable inference speeds to other detectors. With an upgraded backbone and neck structure, YOLOv6-L6 delivers cutting-edge accuracy in real-time.
Title:
A New Adversarial Perspective for LiDAR-based 3D Object Detection
- Authors: Shijun Zheng, Weiquan Liu, Yu Guo, Yu Zang, Siqi Shen, Cheng Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Autonomous vehicles (AVs) rely on LiDAR sensors for environmental perception and decision-making in driving scenarios. However, ensuring the safety and reliability of AVs in complex environments remains a pressing challenge. To address this issue, we introduce a real-world dataset (ROLiD) comprising LiDAR-scanned point clouds of two random objects: water mist and smoke. In this paper, we introduce a novel adversarial perspective by proposing an attack framework that utilizes water mist and smoke to simulate environmental interference. Specifically, we propose a point cloud sequence generation method using a motion and content decomposition generative adversarial network named PCS-GAN to simulate the distribution of random objects. Furthermore, leveraging the simulated LiDAR scanning characteristics implemented with Range Image, we examine the effects of introducing random object perturbations at various positions on the target vehicle. Extensive experiments demonstrate that adversarial perturbations based on random objects effectively deceive vehicle detection and reduce the recognition rate of 3D object detection models.
Title:
Continuous Patient Monitoring with AI: Real-Time Analysis of Video in Hospital Care Settings
- Authors: Paolo Gabriel, Peter Rehani, Tyler Troy, Tiffany Wyatt, Michael Choma, Narinder Singh
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract This study introduces an AI-driven platform for continuous and passive patient monitoring in hospital settings, developed by LookDeep Health. Leveraging advanced computer vision, the platform provides real-time insights into patient behavior and interactions through video analysis, securely storing inference results in the cloud for retrospective evaluation. The dataset, compiled in collaboration with 11 hospital partners, encompasses over 300 high-risk fall patients and over 1,000 days of inference, enabling applications such as fall detection and safety monitoring for vulnerable patient populations. To foster innovation and reproducibility, an anonymized subset of this dataset is publicly available. The AI system detects key components in hospital rooms, including individual presence and role, furniture location, motion magnitude, and boundary crossings. Performance evaluation demonstrates strong accuracy in object detection (macro F1-score = 0.92) and patient-role classification (F1-score = 0.98), as well as reliable trend analysis for the "patient alone" metric (mean logistic regression accuracy = 0.82 ± 0.15). These capabilities enable automated detection of patient isolation, wandering, or unsupervised movement, key indicators for fall risk and other adverse events. This work establishes benchmarks for validating AI-driven patient monitoring systems, highlighting the platform's potential to enhance patient safety and care by providing continuous, data-driven insights into patient behavior and interactions.
Keyword: transformer
Title:
Climate Aware Deep Neural Networks (CADNN) for Wind Power Simulation
- Authors: Ali Forootani, Danial Esmaeili Aliabadi, Daniela Thraen
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Wind power forecasting plays a critical role in modern energy systems, facilitating the integration of renewable energy sources into the power grid. Accurate prediction of wind energy output is essential for managing the inherent intermittency of wind power, optimizing energy dispatch, and ensuring grid stability. This paper proposes the use of Deep Neural Network (DNN)-based predictive models that leverage climate datasets, including wind speed, atmospheric pressure, temperature, and other meteorological variables, to improve the accuracy of wind power simulations. In particular, we focus on the Coupled Model Intercomparison Project (CMIP) datasets, which provide climate projections, as inputs for training the DNN models. These models aim to capture the complex nonlinear relationships between the CMIP-based climate data and actual wind power generation at wind farms located in Germany. Our study compares various DNN architectures, specifically Multilayer Perceptron (MLP), Long Short-Term Memory (LSTM) networks, and Transformer-enhanced LSTM models, to identify the best configuration among these architectures for climate-aware wind power simulation. The implementation of this framework involves the development of a Python package (CADNN) designed to support multiple tasks, including statistical analysis of the climate data, data visualization, preprocessing, DNN training, and performance evaluation. We demonstrate that the DNN models, when integrated with climate data, significantly enhance forecasting accuracy. This climate-aware approach offers a deeper understanding of the time-dependent climate patterns that influence wind power generation, providing more accurate predictions and making it adaptable to other geographical regions.
Title:
iMoT: Inertial Motion Transformer for Inertial Navigation
- Authors: Son Minh Nguyen, Linh Duy Tran, Duc Viet Le, Paul J.M Havinga
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We propose iMoT, an innovative Transformer-based inertial odometry method that retrieves cross-modal information from motion and rotation modalities for accurate positional estimation. Unlike prior work, during the encoding of the motion context, we introduce a Progressive Series Decoupler at the beginning of each encoder layer to highlight critical motion events inherent in acceleration and angular velocity signals. To better aggregate cross-modal interactions, we present Adaptive Positional Encoding, which dynamically modifies positional embeddings for temporal discrepancies between different modalities. During decoding, we introduce a small set of learnable query motion particles as priors to model motion uncertainties within velocity segments. Each query motion particle is intended to draw cross-modal features dedicated to a specific motion mode; taken together, they allow the model to refine its understanding of motion dynamics effectively. Lastly, we design a dynamic scoring mechanism to stabilize iMoT's optimization by considering all aligned motion particles at the final decoding step, ensuring robust and accurate velocity segment estimation. Extensive evaluations on various inertial datasets demonstrate that iMoT significantly outperforms state-of-the-art methods in delivering superior robustness and accuracy in trajectory reconstruction.
Title:
SEE: Sememe Entanglement Encoding for Transformer-bases Models Compression
- Authors: Jing Zhang, Shuzhen Sun, Peng Zhang, Guangxing Cao, Hui Gao, Xindian Ma, Nan Xu, Yuexian Hou
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Transformer-based large language models exhibit groundbreaking capabilities, but their storage and computational costs are prohibitively high, limiting their application in resource-constrained scenarios. An effective approach is to eliminate redundant model parameters and computational costs while incorporating efficient expert-derived knowledge structures to achieve a balance between compression and performance. Therefore, we propose the Sememe Entanglement Encoding (SEE) algorithm. Guided by expert prior knowledge, the model is compressed through the low-rank approximation idea. In Entanglement Embedding, basic semantic units such as sememes are represented as low-dimensional vectors, and then reconstructed into high-dimensional word embeddings through the combination of generalized quantum entanglement. We adapt the Sememe Entanglement Encoding algorithm to transformer-based models of different magnitudes. Experimental results indicate that our approach achieves stable performance while compressing model parameters and computational costs.
Title:
EDformer: Embedded Decomposition Transformer for Interpretable Multivariate Time Series Predictions
- Authors: Sanjay Chakraborty, Ibrahim Delibasoglu, Fredrik Heintz
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Time series forecasting is a crucial challenge with significant applications in areas such as weather prediction, stock market analysis, and scientific simulations. This paper introduces an embedded decomposed transformer, 'EDformer', for multivariate time series forecasting tasks. Without altering the fundamental elements, we reuse the Transformer architecture and consider the capable functions of its constituent parts in this work. EDformer first decomposes the input multivariate signal into seasonal and trend components. Next, the prominent multivariate seasonal component is reconstructed across the reverse dimensions, followed by applying the attention mechanism and feed-forward network in the encoder stage. In particular, the feed-forward network is used for each variable frame to learn nonlinear representations, while the attention mechanism uses the time points of individual seasonal series embedded within variate frames to capture multivariate correlations. The trend signal is then added back through a projection to produce the final forecast. The EDformer model obtains state-of-the-art prediction results in terms of accuracy and efficiency on complex real-world time series datasets. This paper also addresses model explainability techniques to provide insights into how the model makes its predictions and why specific features or time steps are important, enhancing the interpretability and trustworthiness of the forecasting results.
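The first step described in this abstract, splitting a multivariate signal into seasonal and trend components, can be illustrated with a moving-average decomposition. This is a minimal sketch under stated assumptions: the abstract does not say which decomposition EDformer uses, so the moving average, the function name `decompose`, and the `window` parameter are all hypothetical choices for illustration.

```python
import numpy as np


def decompose(series: np.ndarray, window: int = 5):
    """Split `series` of shape (time, variables) into (seasonal, trend).

    Assumption: the trend is a centered moving average and the seasonal
    part is the residual. This is a common choice, not necessarily the
    one used in the paper.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so the trend stays aligned")
    pad = window // 2
    # Repeat edge values so the trend has the same length as the input.
    padded = np.pad(series, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    trend = np.stack(
        [np.convolve(padded[:, j], kernel, mode="valid")
         for j in range(series.shape[1])],
        axis=1,
    )
    seasonal = series - trend
    return seasonal, trend


# Two synthetic variables: a linear trend plus a periodic component.
t = np.arange(48, dtype=float)
series = np.stack(
    [0.1 * t + np.sin(2 * np.pi * t / 12),
     0.05 * t + np.cos(2 * np.pi * t / 6)],
    axis=1,
)
seasonal, trend = decompose(series, window=13)
print(seasonal.shape, trend.shape)
```

By construction the two components sum back to the input, so a model can process the seasonal part separately and add a projected trend at the end, as the abstract describes.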
Title:
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
- Authors: Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which raises the question of how. In this paper, we propose a concept encoding-decoding mechanism to explain ICL by studying how transformers form and use internal abstractions in their representations. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of concept encoding and decoding. As the model learns to encode different latent concepts (e.g., "Finding the first noun in a sentence.") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate the existence of this mechanism across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B). Further, through mechanistic interventions and controlled finetuning, we demonstrate that the quality of concept encoding is causally related to and predictive of ICL performance. Our empirical insights shed light on the success and failure modes of large language models via their representations.
Title:
Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content
- Authors: Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and synthEtic videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model's tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets (in cross-data settings) featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.
Title:
Krony-PT: GPT2 compressed with Kronecker Products
- Authors: M. Ayoub Ben Ayad, Jelena Mitrovic, Michael Granitzer
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We introduce Krony-PT, a compression technique of GPT2 (Radford et al., 2019) based on Kronecker Products. We specifically target the MLP layers of each transformer layer, and systematically compress the feed forward layer matrices to various degrees. We introduce a modified Van Loan decomposition to initialize the new factors, and also introduce a new pruning-based initialization trick. Our method compresses the original 124M parameter GPT2 to various smaller models, with 80M being the smallest, and 96M being the largest compressed model. Our 81M model variant outperforms distilgpt2 on next-token prediction on all standard language modeling datasets, and shows competitive scores or performs on par with other Kronecker Products based compressed models of GPT2 that are significantly higher in size.
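The idea behind Kronecker-product compression can be illustrated on a single feed-forward matrix. The sketch below is not the Krony-PT method itself: the factor shapes are hypothetical and far smaller than anything that would preserve model quality, and the factors are random rather than initialized with a Van Loan decomposition. It only shows how two small factors stand in for a matrix of the 768 × 3072 shape used by the 124M-parameter GPT2 MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shape of one MLP up-projection in the 124M-parameter GPT2.
d_model, d_hidden = 768, 3072

# Hypothetical factor shapes: rows 32 * 24 = 768, columns 64 * 48 = 3072.
a = rng.standard_normal((32, 64))
b = rng.standard_normal((24, 48))

# The Kronecker product expands the two factors to the full matrix shape.
w = np.kron(a, b)

dense_params = d_model * d_hidden
factor_params = a.size + b.size

print(w.shape)
print(dense_params, factor_params)
```

Only the two factors need to be stored and trained, which is where the parameter savings come from. The abstract's compressed models (80M to 96M parameters) imply much larger factors than the toy shapes used here.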
Title:
Efficient Scaling of Diffusion Transformers for Text-to-Image Generation
- Authors: Hao Li, Shamit Lal, Zhiheng Li, Yusheng Xie, Ying Wang, Yang Zou, Orchid Majumder, R. Manmatha, Zhuowen Tu, Stefano Ermon, Stefano Soatto, Ashwin Swaminathan
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters on datasets up to 600M images. We find that U-ViT, a pure self-attention based DiT model, provides a simpler design and scales more effectively in comparison with cross-attention based DiT variants, which allows straightforward expansion for extra conditions and other modalities. We find that a 2.3B U-ViT model can achieve better performance than SDXL UNet and other DiT variants in a controlled setting. On the data scaling side, we investigate how increasing dataset size and enhanced long captions improve the text-image alignment performance and the learning efficiency.
Title:
Numerical Pruning for Efficient Autoregressive Models
- Authors: Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Jing Liu, Ruiyi Zhang, Ryan A. Rossi, Hao Tan, Tong Yu, Xiang Chen, Yufan Zhou, Tong Sun, Pu Zhao, Yanzhi Wang, Jiuxiang Gu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Transformers have emerged as the leading architecture in deep learning, proving to be versatile and highly effective across diverse domains beyond language and image processing. However, their impressive performance often incurs high computational costs due to their substantial model size. This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning to improve the model efficiency while preserving performance for both language and image generation tasks. Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and MLP modules, respectively. Besides, we further propose another compensation algorithm to recover the pruned model for better performance. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments. Our experiments show that our method achieves state-of-the-art performance with reduced memory usage and faster generation speeds on GPUs.
Title:
LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers
- Authors: Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Yanyu Li, Yifan Gong, Kai Zhang, Hao Tan, Jason Kuen, Henghui Ding, Zhihao Shu, Wei Niu, Pu Zhao, Yanzhi Wang, Jiuxiang Gu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks, demonstrating superior performance and efficacy across various applications. The promising results come at the cost of slow inference, as each denoising step requires running the whole transformer model with a large number of parameters. In this paper, we show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps. Furthermore, we show that the lower bound of similarity between outputs at consecutive steps is notably high, and this similarity can be linearly approximated using the inputs. To verify our demonstrations, we propose LazyDiT, a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations. Specifically, we incorporate lazy learning layers into the model, effectively trained to maximize laziness, enabling dynamic skipping of redundant computations. Experimental results show that LazyDiT outperforms the DDIM sampler across multiple diffusion transformer models at various resolutions. Furthermore, we implement our method on mobile devices, achieving better performance than DDIM with similar latency.
Title:
Core Context Aware Attention for Long Context Language Modeling
- Authors: Yaofo Chen, Zeng You, Shuhai Zhang, Haokun Li, Yirui Li, Yaowei Wang, Mingkui Tan
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Transformer-based Large Language Models (LLMs) have exhibited remarkable success in various natural language processing tasks, primarily attributed to the self-attention mechanism, which requires a token to consider all preceding tokens as its context to compute the attention score. However, when the context length L becomes very large (e.g., 32K), more redundant context information will be included w.r.t. any token, making self-attention suffer from two main limitations: 1) The computational and memory complexity scales quadratically w.r.t. L; 2) The presence of redundant context information may hamper the model's ability to capture dependencies among crucial tokens, which may degrade the representation performance. In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling, which consists of two components: 1) Globality-pooling attention that divides input tokens into groups and then dynamically merges tokens within each group into one core token based on their significance; 2) Locality-preserved attention that incorporates neighboring tokens into the attention calculation. The two complementary attentions will then be fused into the final attention, maintaining the comprehensive modeling ability of full self-attention. In this way, the core context information w.r.t. a given token will be automatically focused and strengthened, while the context information in redundant groups will be diminished during the learning process. As a result, the computational and memory complexity will be significantly reduced. More importantly, the CCA-Attention can improve the long-context modeling ability by diminishing the redundant context information. Extensive experimental results demonstrate that our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
Title:
Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training
- Authors: Mingjia Shi, Yuhao Zhou, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Tanmay Rajpurohit, Shanmukha Ramakrishna Vedantam, Wangbo Zhao, Kai Wang, Yang You
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Vision Mamba (e.g., Vim) has successfully been integrated into computer vision, and token reduction has yielded promising outcomes in Vision Transformers (ViTs). However, token reduction performs less effectively on Vision Mamba than on ViTs. Pruning informative tokens in Mamba leads to a high loss of key knowledge and degraded performance, which makes pruning a poor solution for enhancing efficiency in Mamba. Token merging, which preserves more token information than pruning, has demonstrated commendable performance in ViTs. Nevertheless, the performance of vanilla merging also decreases as the reduction ratio increases, failing to maintain the key knowledge in Mamba. Re-training the token-reduced model enhances the performance of Mamba by effectively rebuilding the key knowledge. Empirically, pruned Vims recovered by our proposed framework R-MeeTo drop only up to 0.9% accuracy on ImageNet-1K in our main evaluation. We show that fast recovery is simple and effective and can be achieved at the minute level; in particular, we observe a 35.9% accuracy spike over 3 epochs of training on Vim-Ti. Moreover, Vim-Ti/S/B are re-trained within 5/7/17 minutes, and Vim-S drops only 1.3% with a 1.2x (up to 1.5x) speed-up in inference.
Title:
ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
- Authors: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Chen Liang, Tong Shen, Han Zhang, Huanzhang Dou, Yu Liu, Jingren Zhou
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Recent research arXiv:2410.15027 arXiv:2410.23775 has highlighted the inherent in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self-attention tokens across multiple input and target images, combined with grouped and masked generation pipelines. Building upon this foundation, we present ChatDiT, a zero-shot, general-purpose, and interactive visual generation framework that leverages pretrained diffusion transformers in their original form, requiring no additional tuning, adapters, or modifications. Users can interact with ChatDiT to create interleaved text-image articles, multi-page picture books, edit images, design IP derivatives, or develop character design settings, all through free-form natural language across one or more conversational rounds. At its core, ChatDiT employs a multi-agent system comprising three key components: an Instruction-Parsing agent that interprets user-uploaded images and instructions, a Strategy-Planning agent that devises single-step or multi-step generation actions, and an Execution agent that performs these actions using an in-context toolkit of diffusion transformers. We thoroughly evaluate ChatDiT on IDEA-Bench arXiv:2412.11767, comprising 100 real-world design tasks and 275 cases with diverse instructions and varying numbers of input and target images. Despite its simplicity and training-free approach, ChatDiT surpasses all competitors, including those specifically designed and trained on extensive multi-task datasets. We further identify key limitations of pretrained DiTs in zero-shot adapting to tasks. We release all code, agents, results, and intermediate outputs to facilitate further research at this https URL
Title:
Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree
- Authors: Xiangxiang Gao, Weisheng Xie, Yiwei Xiang, Feng Ji
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Striking an optimal balance between minimal drafting latency and high speculation accuracy to enhance the inference speed of Large Language Models remains a significant challenge in speculative decoding. In this paper, we introduce Falcon, an innovative semi-autoregressive speculative decoding framework fashioned to augment both the drafter's parallelism and output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique, which fortifies inter-token dependencies within the same block, leading to increased speculation accuracy. We offer a comprehensive theoretical analysis to illuminate the underlying mechanisms. Additionally, we introduce a Custom-Designed Decoding Tree, which permits the drafter to generate multiple tokens in a single forward pass and accommodates multiple forward passes as needed, thereby boosting the number of drafted tokens and significantly improving the overall acceptance rate. Comprehensive evaluations on benchmark datasets such as MT-Bench, HumanEval, and GSM8K demonstrate Falcon's superior acceleration capabilities. The framework achieves a lossless speedup ratio ranging from 2.91x to 3.51x when tested on the Vicuna and LLaMA2-Chat model series. These results outstrip existing speculative decoding methods for LLMs, including Eagle, Medusa, Lookahead, SPS, and PLD, while maintaining a compact drafter architecture equivalent to merely two Transformer layers.
Title:
LLM-based Discriminative Reasoning for Knowledge Graph Question Answering
- Authors: Mufan Xu, Kehai Chen, Xuefeng Bai, Muyun Yang, Tiejun Zhao, Min Zhang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Large language models (LLMs) based on the generative pre-trained Transformer have achieved remarkable performance on knowledge graph question-answering (KGQA) tasks. However, LLMs often produce ungrounded subgraph planning or reasoning results in KGQA due to the hallucinatory behavior brought by the generative paradigm, which may hinder the advancement of LLM-based KGQA models. To deal with this issue, we propose a novel LLM-based Discriminative Reasoning (LDR) method to explicitly model the subgraph retrieval and answer inference process. By adopting discriminative strategies, the proposed LDR method not only enhances the capability of LLMs to retrieve question-related subgraphs but also alleviates the issue of ungrounded reasoning brought by the generative paradigm of LLMs. Experimental results show that the proposed approach outperforms multiple strong comparison methods and achieves state-of-the-art performance on two widely used benchmarks, WebQSP and CWQ.
Title:
Predicting User Behavior in Smart Spaces with LLM-Enhanced Logs and Personalized Prompts (Data Description)
- Authors: Yunpeng Song
- Subjects: Human-Computer Interaction (cs.HC)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Enhancing the intelligence of smart systems, such as smart homes, smart vehicles, and smart grids, critically depends on developing sophisticated planning capabilities that can anticipate the next desired function based on historical interactions. While existing methods view user behaviors as sequential data and apply models like RNNs and Transformers to predict future actions, they often fail to incorporate domain knowledge and capture personalized user preferences. In this paper, we propose a novel approach that incorporates LLM-enhanced logs and personalized prompts. Our approach first constructs a graph that captures individual behavior preferences derived from users' interaction histories. This graph is transformed into a soft continuous prompt that precedes the sequence of user behaviors. Then our approach leverages the vast general knowledge and robust reasoning capabilities of a pretrained LLM to enrich the oversimplified and incomplete log records. By enhancing these logs semantically, our approach better understands the user's actions and intentions, especially for rare events in the dataset. We evaluate the method across four real-world datasets from both smart vehicle and smart home settings. The findings validate the effectiveness of our LLM-enhanced description and personalized prompt, shedding light on potential ways to advance the intelligence of smart spaces. Note: While this manuscript provides a description of the data, we are not permitted to make these datasets publicly available due to restrictions imposed by the data provider.
Title:
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features
- Authors: Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly for longer texts and in texts that have been subsequently paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and DIPPER, a discourse paraphrasing tool, by extending artifacts from their original versions. To address the challenge of detecting highly similar paraphrased texts, we propose MhBART, an encoder-decoder model designed to emulate human writing style while incorporating a novel difference score mechanism. This model outperforms strong classifier baselines and identifies deceptive sentence patterns. To better capture the structure of longer texts at document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. It results in substantial performance gains across both datasets -- 15.5% absolute improvement on paraLFQA, 4% absolute improvement on paraWP, and 1.5% absolute improvement on M4 compared to SOTA approaches.
Title:
RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
- Authors: Xiaomeng Chu, Jiajun Deng, Guoliang You, Yifan Duan, Houqiang Li, Yanyong Zhang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We propose the Radar-Camera fusion transformer (RaCFormer) to boost the accuracy of 3D object detection based on the following insight. Radar-Camera fusion in outdoor 3D scene perception is capped by the image-to-BEV transformation--if the depth of pixels is not accurately estimated, the naive combination of BEV features actually integrates unaligned visual content. To avoid this problem, we propose a query-based framework that adaptively samples instance-relevant features from both the BEV and the original image view. Furthermore, we enhance system performance by two key designs: optimizing query initialization and strengthening the representational capacity of BEV. For the former, we introduce an adaptive circular distribution in polar coordinates to refine the initialization of object queries, allowing for a distance-based adjustment of query density. For the latter, we first incorporate a radar-guided depth head to refine the transformation from image view to BEV. Subsequently, we focus on leveraging the Doppler effect of radar and introduce an implicit dynamic catcher to capture the temporal elements within the BEV. Extensive experiments on nuScenes and View-of-Delft (VoD) datasets validate the merits of our design. Remarkably, our method achieves superior results of 64.9% mAP and 70.2% NDS on nuScenes, even outperforming several LiDAR-based detectors. RaCFormer also secures the 1st ranking on the VoD dataset. The code will be released.
Title:
RCTrans: Radar-Camera Transformer via Radar Densifier and Sequential Decoder for 3D Object Detection
- Authors: Yiheng Li, Yang Yang, Zhen Lei
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract In radar-camera 3D object detection, the radar point clouds are sparse and noisy, which causes difficulties in fusing camera and radar modalities. To solve this, we introduce a novel query-based detection method named Radar-Camera Transformer (RCTrans). Specifically, we first design a Radar Dense Encoder to enrich the sparse valid radar tokens, and then concatenate them with the image tokens. By doing this, we can fully explore the 3D information of each region of interest and reduce the interference of empty tokens during the fusing stage. We then design a Pruning Sequential Decoder to predict 3D boxes based on the obtained tokens and randomly initialized queries. To alleviate the effect of elevation ambiguity in radar point clouds, we gradually locate the position of the object via a sequential fusion structure. It helps to get more precise and flexible correspondences between tokens and queries. A pruning training strategy is adopted in the decoder, which can save much time during inference and inhibit queries from losing their distinctiveness. Extensive experiments on the large-scale nuScenes dataset prove the superiority of our method, and we also achieve new state-of-the-art radar-camera 3D detection results. Our implementation is available at this https URL.
Title:
2by2: Weakly-Supervised Learning for Global Action Segmentation
- Authors: Elena Bueno-Benito, Mariella Dimiccoli
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract This paper presents a simple yet effective approach for the poorly investigated task of global action segmentation, aiming at grouping frames capturing the same action across videos of different activities. Unlike the case of videos that all depict the same activity, the temporal order of actions is not roughly shared among all videos, making the task even more challenging. We propose to use activity labels to learn, in a weakly-supervised fashion, action representations suitable for global action segmentation. For this purpose, we introduce a triadic learning approach for video pairs, to ensure intra-video action discrimination, as well as inter-video and inter-activity action association. For the backbone architecture, we use a Siamese network based on sparse transformers that takes video pairs as input and determines whether they belong to the same activity. The proposed approach is validated on two challenging benchmark datasets, Breakfast and YouTube Instructions, outperforming state-of-the-art methods.
Title:
Benchmarking and Understanding Compositional Relational Reasoning of LLMs
- Authors: Ruikang Ni, Da Xiao, Qingye Meng, Xiangyu Li, Shihui Zheng, Hongliang Liang
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Compositional relational reasoning (CRR) is a hallmark of human intelligence, but we lack a clear understanding of whether and how existing transformer large language models (LLMs) can solve CRR tasks. To enable systematic exploration of the CRR capability of LLMs, we first propose a new synthetic benchmark called Generalized Associative Recall (GAR) by integrating and generalizing the essence of several tasks in mechanistic interpretability (MI) studies in a unified framework. Evaluation shows that GAR is challenging enough for existing LLMs, revealing their fundamental deficiency in CRR. Meanwhile, it is easy enough for systematic MI study. Then, to understand how LLMs solve GAR tasks, we use attribution patching to discover the core circuits reused by Vicuna-33B across different tasks and a set of vital attention heads. Intervention experiments show that the correct functioning of these heads significantly impacts task performance. In particular, we identify two classes of heads whose activations represent the abstract notion of true and false in GAR tasks, respectively. They play a fundamental role in CRR across various models and tasks. The dataset and code are available at this https URL.
Title:
Efficient Event-based Semantic Segmentation with Spike-driven Lightweight Transformer-based Networks
- Authors: Xiaxin Zhu, Fangming Guo, Xianlei Long, Qingyi Gu, Chao Chen, Fuqiang Gu
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Event-based semantic segmentation has great potential in autonomous driving and robotics due to the advantages of event cameras, such as high dynamic range, low latency, and low power cost. Unfortunately, current artificial neural network (ANN)-based segmentation methods suffer from high computational demands, the requirement for image frames, and massive energy consumption, limiting their efficiency and application on resource-constrained edge/mobile platforms. To address these problems, we introduce SLTNet, a spike-driven lightweight transformer-based network designed for event-based semantic segmentation. Specifically, SLTNet is built on efficient spike-driven convolution blocks (SCBs) to extract rich semantic features while reducing the model's parameters. Then, to enhance long-range contextual feature interaction, we propose novel spike-driven transformer blocks (STBs) with binary mask operations. Based on these basic blocks, SLTNet employs a high-efficiency single-branch architecture while maintaining the low energy consumption of the Spiking Neural Network (SNN). Finally, extensive experiments on DDD17 and DSEC-Semantic datasets demonstrate that SLTNet outperforms state-of-the-art (SOTA) SNN-based methods by at least 7.30% and 3.30% mIoU, respectively, with 5.48x lower energy consumption and 1.14x faster inference speed.
Title:
A Comparative Study of Pruning Methods in Transformer-based Time Series Forecasting
- Authors: Nicholas Kiefer, Arvid Weyrauch, Muhammed Öz, Achim Streit, Markus Götz, Charlotte Debus
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract The current landscape in time-series forecasting is dominated by Transformer-based models. Their high parameter count and corresponding demand in computational resources pose a challenge to real-world deployment, especially for commercial and scientific applications with low-power embedded devices. Pruning is an established approach to reduce neural network parameter count and save compute. However, the implications and benefits of pruning Transformer-based models for time series forecasting are largely unknown. To close this gap, we provide a comparative benchmark study by evaluating unstructured and structured pruning on various state-of-the-art multivariate time series models. We study the effects of these pruning strategies on model predictive performance and computational aspects like model size, operations, and inference time. Our results show that certain models can be pruned even up to high sparsity levels, outperforming their dense counterpart. However, fine-tuning pruned models is necessary. Furthermore, we demonstrate that even with corresponding hardware and software support, structured pruning is unable to provide significant time savings.
Title:
TimeCHEAT: A Channel Harmony Strategy for Irregularly Sampled Multivariate Time Series Analysis
- Authors: Jiexi Liu, Meng Cao, Songcan Chen
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Irregularly sampled multivariate time series (ISMTS) are prevalent in reality. Due to their non-uniform intervals between successive observations and varying sampling rates among series, the channel-independent (CI) strategy, which recent studies have shown to be more desirable for complete multivariate time series forecasting, has failed. This failure can be further attributed to the sampling sparsity, which provides insufficient information for effective CI learning, thereby reducing its capacity. When we resort to the channel-dependent (CD) strategy, even higher capacity cannot mitigate the potential loss of diversity in learning similar embedding patterns across different channels. We find that existing work considers CI and CD strategies to be mutually exclusive, primarily because they apply these strategies to the global channel. However, we hold the view that channel strategies do not necessarily have to be used globally. Instead, by appropriately applying them locally and globally, we can create an opportunity to take full advantage of both strategies. This leads us to introduce the Channel Harmony ISMTS Transformer (TimeCHEAT), which utilizes the CD strategy locally and the CI strategy globally. Specifically, we segment the ISMTS into sub-series level patches. Locally, the CD strategy aggregates information within each patch for time embedding learning, maximizing the use of relevant observations while reducing long-range irrelevant interference. Here, we enhance generality by transforming embedding learning into an edge weight prediction task using bipartite graphs, eliminating the need for special prior knowledge. Globally, the CI strategy is applied across patches, allowing the Transformer to learn individualized attention patterns for each channel. Experimental results indicate that our proposed TimeCHEAT demonstrates competitive SOTA performance across three mainstream tasks.
Title:
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
- Authors: Wonseok Roh, Hwanhee Jung, Jong Wook Kim, Seunggwan Lee, Innfarn Yoo, Andreas Lugmayr, Seunggeun Chi, Karthik Ramani, Sangpil Kim
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. However, unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from a single image. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under single-view settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.
Title:
PT: A Plain Transformer is Good Hospital Readmission Predictor
- Authors: Zhenyi Fan, Jiaqi Li, Dongyu Luo, Yuqi Yuan
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Hospital readmission prediction is critical for clinical decision support, aiming to identify patients at risk of returning within 30 days post-discharge. High readmission rates often indicate inadequate treatment or post-discharge care, making effective prediction models essential for optimizing resources and improving patient outcomes. We propose PT, a Transformer-based model that integrates Electronic Health Records (EHR), medical images, and clinical notes to predict 30-day all-cause hospital readmissions. PT extracts features from raw data and uses specialized Transformer blocks tailored to the data's complexity. Enhanced with Random Forest for EHR feature selection and test-time ensemble techniques, PT achieves superior accuracy, scalability, and robustness. It performs well even when temporal information is missing. Our main contributions are: (1) Simplicity: A powerful and efficient baseline model outperforming existing ones in prediction accuracy; (2) Scalability: Flexible handling of various features from different modalities, achieving high performance with just clinical notes or EHR data; (3) Robustness: Strong predictive performance even with missing or unclear temporal data.
Title:
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
- Authors: Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov
- Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models are becoming larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Therefore, continuing with the current architectures will present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at this https URL.
Title:
ArchesWeather & ArchesWeatherGen: a deterministic and generative model for efficient ML weather forecasting
- Authors: Guillaume Couairon, Renu Singh, Anastase Charantonis, Christian Lessig, Claire Monteleoni
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Weather forecasting plays a vital role in today's society, from agriculture and logistics to predicting the output of renewable energies, and preparing for extreme weather events. Deep learning weather forecasting models trained with the next state prediction objective on ERA5 have shown great success compared to numerical global circulation models. However, for a wide range of applications, being able to provide representative samples from the distribution of possible future weather states is critical. In this paper, we propose a methodology to leverage deterministic weather models in the design of probabilistic weather models, leading to improved performance and reduced computing costs. We first introduce ArchesWeather, a transformer-based deterministic model that improves upon Pangu-Weather by removing overrestrictive inductive priors. We then design a probabilistic weather model called ArchesWeatherGen based on flow matching, a modern variant of diffusion models, that is trained to project ArchesWeather's predictions to the distribution of ERA5 weather states. ArchesWeatherGen is a true stochastic emulator of ERA5 and surpasses IFS ENS and NeuralGCM on all WeatherBench headline variables (except for NeuralGCM's geopotential). Our work also aims to democratize the use of deterministic and generative machine learning models in weather forecasting research, with academic computing resources. All models are trained at 1.5° resolution, with a training budget of ~9 V100 days for ArchesWeather and ~45 V100 days for ArchesWeatherGen. For inference, ArchesWeatherGen generates 15-day weather trajectories at a rate of 1 minute per ensemble member on an A100 GPU card. To make our work fully reproducible, our code and models are open source, including the complete pipeline for data preparation, training, and evaluation, at this https URL.
Title:
Unlocking LLMs: Addressing Scarce Data and Bias Challenges in Mental Health
- Authors: Vivek Kumar, Eirini Ntoutsi, Pushpraj Singh Rajawat, Giacomo Medda, Diego Reforgiato Recupero
- Subjects: Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Large language models (LLMs) have shown promising capabilities in healthcare analysis but face several challenges like hallucinations, parroting, and bias manifestation. These challenges are exacerbated in complex, sensitive, and low-resource domains. Therefore, in this work we introduce IC-AnnoMI, an expert-annotated motivational interviewing (MI) dataset built upon AnnoMI by generating in-context conversational dialogues leveraging LLMs, particularly ChatGPT. IC-AnnoMI employs targeted prompts accurately engineered through cues and tailored information, taking into account therapy style (empathy, reflection), contextual relevance, and false semantic change. Subsequently, the dialogues are annotated by experts, strictly adhering to the Motivational Interviewing Skills Code (MISC), focusing on both the psychological and linguistic dimensions of MI dialogues. We comprehensively evaluate the IC-AnnoMI dataset and ChatGPT's emotional reasoning ability and understanding of domain intricacies by modeling novel classification tasks employing several classical machine learning and current state-of-the-art transformer approaches. Finally, we discuss the effects of progressive prompting strategies and the impact of augmented data in mitigating the biases manifested in IC-AnnoMI. Our contributions provide the MI community with not only a comprehensive dataset but also valuable insights for using LLMs in empathetic text generation for conversational therapy in supervised settings.
Title:
Relational Neurosymbolic Markov Models
- Authors: Lennert De Smet, Gabriele Venturato, Luc De Raedt, Giuseppe Marra
- Subjects: Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Sequential problems are ubiquitous in AI, such as in reinforcement learning or natural language processing. State-of-the-art deep sequential models, like transformers, excel in these settings but fail to guarantee the satisfaction of constraints necessary for trustworthy deployment. In contrast, neurosymbolic AI (NeSy) provides a sound formalism to enforce constraints in deep probabilistic models but scales exponentially on sequential problems. To overcome these limitations, we introduce relational neurosymbolic Markov models (NeSy-MMs), a new class of end-to-end differentiable sequential models that integrate and provably satisfy relational logical constraints. We propose a strategy for inference and learning that scales on sequential settings, and that combines approximate Bayesian inference, automated reasoning, and gradient estimation. Our experiments show that NeSy-MMs can solve problems beyond the current state-of-the-art in neurosymbolic AI and still provide strong guarantees with respect to desired properties. Moreover, we show that our models are more interpretable and that constraints can be adapted at test time to out-of-distribution scenarios.
Title:
Harnessing Event Sensory Data for Error Pattern Prediction in Vehicles: A Language Model Approach
- Authors: Hugo Math, Rainer Lienhart, Robin Schön
- Subjects: Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract In this paper, we draw an analogy between processing natural languages and processing multivariate event streams from vehicles in order to predict when and what error pattern is most likely to occur in the future for a given car. Our approach leverages the temporal dynamics and contextual relationships of our event data from a fleet of cars. Event data is composed of discrete values of error codes as well as continuous values such as time and mileage. Modelled by two causal Transformers, we can anticipate vehicle failures and malfunctions before they happen. Thus, we introduce CarFormer, a Transformer model trained via a new self-supervised learning strategy, and EPredictor, an autoregressive Transformer decoder model capable of predicting when and what error pattern will most likely occur after an error code appears. Despite the challenges of high cardinality of event types, their unbalanced frequency of appearance and limited labelled data, our experimental results demonstrate the excellent predictive ability of our novel model. Specifically, with sequences of 160 error codes on average, our model needs only half of the error codes to achieve an 80% F1 score for predicting what error pattern will occur, and it achieves an average absolute error of 58.4 ± 13.2 h when forecasting the time of occurrence, thus enabling confident predictive maintenance and enhancing vehicle safety.
Title:
Locate n' Rotate: Two-stage Openable Part Detection with Foundation Model Priors
- Authors: Siqi Li, Xiaoxue Chen, Haoyu Cheng, Guyue Zhou, Hao Zhao, Guanzhong Tian
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Detecting the openable parts of articulated objects is crucial for downstream applications in intelligent robotics, such as pulling a drawer. This task poses a multitasking challenge due to the necessity of understanding object categories and motion. Most existing methods are either category-specific or trained on specific datasets, lacking generalization to unseen environments and objects. In this paper, we propose a Transformer-based Openable Part Detection (OPD) framework named Multi-feature Openable Part Detection (MOPD) that incorporates perceptual grouping and geometric priors, outperforming previous methods. In the first stage of the framework, we introduce a perceptual grouping feature model that provides perceptual grouping feature priors for openable part detection, enhancing detection results through a cross-attention mechanism. In the second stage, a geometric understanding feature model offers geometric feature priors for predicting motion parameters. Compared to existing methods, our proposed approach shows better performance in both detection and motion parameter prediction. Code and models are publicly available at this https URL
Title:
ORFormer: Occlusion-Robust Transformer for Accurate Facial Landmark Detection
- Authors: Jui-Che Chiang, Hou-Ning Hu, Bo-Syuan Hou, Chia-Yu Tseng, Yu-Lun Liu, Min-Hung Chen, Yen-Yu Lin
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Although facial landmark detection (FLD) has made significant progress, existing FLD methods still suffer from performance drops on partially non-visible faces, such as faces with occlusions or under extreme lighting conditions or poses. To address this issue, we introduce ORFormer, a novel transformer-based method that can detect non-visible regions and recover their missing features from visible parts. Specifically, ORFormer associates each image patch token with one additional learnable token called the messenger token. The messenger token aggregates features from all but its patch. This way, the consensus between a patch and other patches can be assessed by referring to the similarity between its regular and messenger embeddings, enabling non-visible region identification. Our method then recovers occluded patches with features aggregated by the messenger tokens. Leveraging the recovered features, ORFormer compiles high-quality heatmaps for the downstream FLD task. Extensive experiments show that our method generates heatmaps resilient to partial occlusions. By integrating the resultant heatmaps into existing FLD methods, our method performs favorably against the state of the art on challenging datasets such as WFLW and COFW.
Title:
GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
- Authors: Haoyi Jiang, Liu Liu, Tianheng Cheng, Xinjie Wang, Tianwei Lin, Zhizhong Su, Wenyu Liu, Xinggang Wang
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract 3D Semantic Occupancy Prediction is fundamental for spatial understanding as it provides a comprehensive semantic cognition of surrounding environments. However, prevalent approaches primarily rely on extensive labeled data and computationally intensive voxel-based modeling, restricting the scalability and generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian Transformer that leverages alignment with foundation models to advance self-supervised 3D spatial understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that represent scenes in a feed-forward manner. Through aligning rendered Gaussian features with diverse knowledge from pre-trained foundation models, GaussTR facilitates the learning of versatile 3D representations and enables open-vocabulary occupancy prediction without explicit annotations. Empirical evaluations on the Occ3D-nuScenes dataset showcase GaussTR's state-of-the-art zero-shot performance, achieving 11.70 mIoU while reducing training duration by approximately 50%. These experimental results highlight the significant potential of GaussTR for scalable and holistic 3D spatial understanding, with promising implications for autonomous driving and embodied agents. Code is available at this https URL.
Keyword: scene understanding
Title:
Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration
- Authors: Ziheng Zhou, Jinxing Zhou, Wei Qian, Shengeng Tang, Xiaojun Chang, Dan Guo
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract In the field of audio-visual learning, most research tasks focus exclusively on short videos. This paper focuses on the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for longer, untrimmed videos. This task seeks to identify and temporally pinpoint all events simultaneously occurring in both audio and visual streams. Typically, each video encompasses dense events of multiple classes, which may overlap on the timeline, each exhibiting varied durations. Given these challenges, effectively exploiting the audio-visual relations and the temporal features encoded at various granularities becomes crucial. To address these challenges, we introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration (CMCC) and the Multi-Temporal Granularity Collaboration (MTGC). Specifically, the CMCC module contains two branches: a cross-modal interaction branch and a temporal consistency-gated branch. The former branch facilitates the aggregation of consistent event semantics across modalities through the encoding of audio-visual relations, while the latter branch guides one modality's focus to pivotal event-relevant temporal areas as discerned in the other modality. The MTGC module includes a coarse-to-fine collaboration block and a fine-to-coarse collaboration block, providing bidirectional support among coarse- and fine-grained temporal features. Extensive experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization. The code is available at this https URL.
Keyword: visual reasoning
There is no result