arxiv-daily
arxiv-daily copied to clipboard
Showing new listings for Tuesday, 8 October 2024
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
Title:
STONE: A Submodular Optimization Framework for Active 3D Object Detection
- Authors: Ruiyu Mao, Sarthak Kumar Maharana, Rishabh K Iyer, Yunhui Guo
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract 3D object detection is fundamentally important for various emerging applications, including autonomous driving and robotics. A key requirement for training an accurate 3D object detector is the availability of a large amount of LiDAR-based point cloud data. Unfortunately, labeling point cloud data is extremely challenging, as accurate 3D bounding boxes and semantic labels are required for each potential object. This paper proposes a unified active 3D object detection framework, for greatly reducing the labeling cost of training 3D object detector. Our framework is based on a novel formulation of submodular optimization, specifically tailored to the problem of active 3D object detection. In particular, we address two fundamental challenges associated with active 3D object detection: data imbalance and the need to cover the distribution of the data, including LiDAR-based point cloud data of varying difficulty levels. Extensive experiments demonstrate that our method achieves state-of-the-art performance with high computational efficiency compared to existing active learning methods.
Title:
Mamba Capsule Routing Towards Part-Whole Relational Camouflaged Object Detection
- Authors: Dingwen Zhang, Liangbo Cheng, Yi Liu, Xinggang Wang, Junwei Han
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract The part-whole relational property endowed by Capsule Networks (CapsNets) has been known successful for camouflaged object detection due to its segmentation integrity. However, the previous Expectation Maximization (EM) capsule routing algorithm with heavy computation and large parameters obstructs this trend. The primary attribution behind lies in the pixel-level capsule routing. Alternatively, in this paper, we propose a novel mamba capsule routing at the type level. Specifically, we first extract the implicit latent state in mamba as capsule vectors, which abstract type-level capsules from pixel-level versions. These type-level mamba capsules are fed into the EM routing algorithm to get the high-layer mamba capsules, which greatly reduce the computation and parameters caused by the pixel-level capsule routing for part-whole relationships exploration. On top of that, to retrieve the pixel-level capsule features for further camouflaged prediction, we achieve this on the basis of the low-layer pixel-level capsules with the guidance of the correlations from adjacent-layer type-level mamba capsules. Extensive experiments on three widely used COD benchmark datasets demonstrate that our method significantly outperforms state-of-the-arts. Code has been available on this https URL_capsule.
Title:
Cross Resolution Encoding-Decoding For Detection Transformers
- Authors: Ashish Kumar, Jaesik Park
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Detection Transformers (DETR) are renowned object detection pipelines, however computationally efficient multiscale detection using DETR is still challenging. In this paper, we propose a Cross-Resolution Encoding-Decoding (CRED) mechanism that allows DETR to achieve the accuracy of high-resolution detection while having the speed of low-resolution detection. CRED is based on two modules; Cross Resolution Attention Module (CRAM) and One Step Multiscale Attention (OSMA). CRAM is designed to transfer the knowledge of low-resolution encoder output to a high-resolution feature. While OSMA is designed to fuse multiscale features in a single step and produce a feature map of a desired resolution enriched with multiscale information. When used in prominent DETR methods, CRED delivers accuracy similar to the high-resolution DETR counterpart in roughly 50% fewer FLOPs. Specifically, state-of-the-art DN-DETR, when used with CRED (calling CRED-DETR), becomes 76% faster, with ~50% reduced FLOPs than its high-resolution counterpart with 202 G FLOPs on MS-COCO benchmark. We plan to release pretrained CRED-DETRs for use by the community. Code: this https URL
Title:
Robust Task-Oriented Communication Framework for Real-Time Collaborative Vision Perception
- Authors: Zhengru Fang, Jingjing Wang, Yanan Ma, Yihang Tao, Yiqin Deng, Xianhao Chen, Yuguang Fang
- Subjects: Subjects: Networking and Internet Architecture (cs.NI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Cooperative perception enhances sensing in multi-robot and vehicular networks by aggregating information from multiple agents, improving perception accuracy and range. However, mobility and non-rigid sensor mounts introduce extrinsic calibration errors, necessitating online calibration, which is complicated by limited overlap in sensing regions. Maintaining fresh information is crucial for timely and accurate sensing. To address calibration errors and ensure both perception accuracy and transmission timeliness, we propose a Robust Task-Oriented Communication framework (R-TOCOM) that optimizes calibration and feature transmission in both deployment and streaming phases. First, we formulate an Age of Perceived Targets (AoPT) minimization problem to capture information freshness. Then, in the deployment phase, we introduce a channel-aware self-calibration technique based on re-identification (Re-ID). This technique adaptively compresses key-point features according to channel capacities, effectively addressing calibration issues via spatial and temporal cross-camera correlations. In the streaming phase, we tackle the trade-off between bandwidth and inference accuracy by integrating an Information Bottleneck (IB)-based encoding method that adjusts video compression rates based on task relevance, thereby reducing communication overhead and latency. To mitigate performance degradation from packet loss, we introduce a priority network that filters corrupted features. Extensive studies demonstrate our framework outperforms five baselines, improving multiple object detection accuracy (MODA) by 25.49% and reducing communication costs by 51.36% under severe channel condition.
Title:
Fast Object Detection with a Machine Learning Edge Device
- Authors: Richard C. Rodriguez, Jonah Elijah P. Bardos
- Subjects: Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract This machine learning study investigates a lowcost edge device integrated with an embedded system having computer vision and resulting in an improved performance in inferencing time and precision of object detection and classification. A primary aim of this study focused on reducing inferencing time and low-power consumption and to enable an embedded device of a competition-ready autonomous humanoid robot and to support real-time object recognition, scene understanding, visual navigation, motion planning, and autonomous navigation of the robot. This study compares processors for inferencing time performance between a central processing unit (CPU), a graphical processing unit (GPU), and a tensor processing unit (TPU). CPUs, GPUs, and TPUs are all processors that can be used for machine learning tasks. Related to the aim of supporting an autonomous humanoid robot, there was an additional effort to observe whether or not there was a significant difference in using a camera having monocular vision versus stereo vision capability. TPU inference time results for this study reflect a 25% reduction in time over the GPU, and a whopping 87.5% reduction in inference time compared to the CPU. Much information in this paper is contributed to the final selection of Google's Coral brand, Edge TPU device. The Arduino Nano 33 BLE Sense Tiny ML Kit was also considered for comparison but due to initial incompatibilities and in the interest of time to complete this study, a decision was made to review the kit in a future experiment.
Title:
ETHcavation: A Dataset and Pipeline for Panoptic Scene Understanding and Object Tracking in Dynamic Construction Environments
- Authors: Lorenzo Terenzi, Julian Nubert, Pol Eyschen, Pascal Roth, Simin Fei, Edo Jelavic, Marco Hutter
- Subjects: Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Construction sites are challenging environments for autonomous systems due to their unstructured nature and the presence of dynamic actors, such as workers and machinery. This work presents a comprehensive panoptic scene understanding solution designed to handle the complexities of such environments by integrating 2D panoptic segmentation with 3D LiDAR mapping. Our system generates detailed environmental representations in real-time by combining semantic and geometric data, supported by Kalman Filter-based tracking for dynamic object detection. We introduce a fine-tuning method that adapts large pre-trained panoptic segmentation models for construction site applications using a limited number of domain-specific samples. For this use case, we release a first-of-its-kind dataset of 502 hand-labeled sample images with panoptic annotations from construction sites. In addition, we propose a dynamic panoptic mapping technique that enhances scene understanding in unstructured environments. As a case study, we demonstrate the system's application for autonomous navigation, utilizing real-time RRT* for reactive path planning in dynamic scenarios. The dataset (this https URL) and code (this https URL) for training and deployment are publicly available to support future research.
Title:
Learning De-Biased Representations for Remote-Sensing Imagery
- Authors: Zichen Tian, Zhaozheng Chen, Qianru Sun
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Remote sensing (RS) imagery, requiring specialized satellites to collect and being difficult to annotate, suffers from data scarcity and class imbalance in certain spectrums. Due to data scarcity, training any large-scale RS models from scratch is unrealistic, and the alternative is to transfer pre-trained models by fine-tuning or a more data-efficient method LoRA. Due to class imbalance, transferred models exhibit strong bias, where features of the major class dominate over those of the minor class. In this paper, we propose debLoRA, a generic training approach that works with any LoRA variants to yield debiased features. It is an unsupervised learning approach that can diversify minor class features based on the shared attributes with major classes, where the attributes are obtained by a simple step of clustering. To evaluate it, we conduct extensive experiments in two transfer learning scenarios in the RS domain: from natural to optical RS images, and from optical RS to multi-spectrum RS images. We perform object classification and oriented object detection tasks on the optical RS dataset DOTA and the SAR dataset FUSRS. Results show that our debLoRA consistently surpasses prior arts across these RS adaptation settings, yielding up to 3.3 and 4.7 percentage points gains on the tail classes for natural to optical RS and optical RS to multi-spectrum RS adaptations, respectively, while preserving the performance on head classes, substantiating its efficacy and adaptability.
Title:
Improving Object Detection via Local-global Contrastive Learning
- Authors: Danai Triantafyllidou, Sarah Parisot, Ales Leonardis, Steven McDonagh
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Visual domain gaps often impact object detection performance. Image-to-image translation can mitigate this effect, where contrastive approaches enable learning of the image-to-image mapping under unsupervised regimes. However, existing methods often fail to handle content-rich scenes with multiple object instances, which manifests in unsatisfactory detection performance. Sensitivity to such instance-level content is typically only gained through object annotations, which can be expensive to obtain. Towards addressing this issue, we present a novel image-to-image translation method that specifically targets cross-domain object detection. We formulate our approach as a contrastive learning framework with an inductive prior that optimises the appearance of object instances through spatial attention masks, implicitly delineating the scene into foreground regions associated with the target object instances and background non-object regions. Instead of relying on object annotations to explicitly account for object instances during translation, our approach learns to represent objects by contrasting local-global information. This affords investigation of an under-explored challenge: obtaining performant detection, under domain shifts, without relying on object annotations nor detector model fine-tuning. We experiment with multiple cross-domain object detection settings across three challenging benchmarks and report state-of-the-art performance. Project page: this https URL
Title:
Human-in-the-loop Reasoning For Traffic Sign Detection: Collaborative Approach Yolo With Video-llava
- Authors: Mehdi Azarafza, Fatima Idrees, Ali Ehteshami Bejnordi, Charles Steinmetz, Stefan Henkler, Achim Rettberg
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Traffic Sign Recognition (TSR) detection is a crucial component of autonomous vehicles. While You Only Look Once (YOLO) is a popular real-time object detection algorithm, factors like training data quality and adverse weather conditions (e.g., heavy rain) can lead to detection failures. These failures can be particularly dangerous when visual similarities between objects exist, such as mistaking a 30 km/h sign for a higher speed limit sign. This paper proposes a method that combines video analysis and reasoning, prompting with a human-in-the-loop guide large vision model to improve YOLOs accuracy in detecting road speed limit signs, especially in semi-real-world conditions. It is hypothesized that the guided prompting and reasoning abilities of Video-LLava can enhance YOLOs traffic sign detection capabilities. This hypothesis is supported by an evaluation based on human-annotated accuracy metrics within a dataset of recorded videos from the CARLA car simulator. The results demonstrate that a collaborative approach combining YOLO with Video-LLava and reasoning can effectively address challenging situations such as heavy rain and overcast conditions that hinder YOLOs detection capabilities.
Title:
Real-Time Truly-Coupled Lidar-Inertial Motion Correction and Spatiotemporal Dynamic Object Detection
- Authors: Cedric Le Gentil, Raphael Falque, Teresa Vidal-Calleja
- Subjects: Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Over the past decade, lidars have become a cornerstone of robotics state estimation and perception thanks to their ability to provide accurate geometric information about their surroundings in the form of 3D scans. Unfortunately, most of nowadays lidars do not take snapshots of the environment but sweep the environment over a period of time (typically around 100 ms). Such a rolling-shutter-like mechanism introduces motion distortion into the collected lidar scan, thus hindering downstream perception applications. In this paper, we present a novel method for motion distortion correction of lidar data by tightly coupling lidar with Inertial Measurement Unit (IMU) data. The motivation of this work is a map-free dynamic object detection based on lidar. The proposed lidar data undistortion method relies on continuous preintegrated of IMU measurements that allow parameterising the sensors' continuous 6-DoF trajectory using solely eleven discrete state variables (biases, initial velocity, and gravity direction). The undistortion consists of feature-based distance minimisation of point-to-line and point-to-plane residuals in a non-linear least-square formulation. Given undistorted geometric data over a short temporal window, the proposed pipeline computes the spatiotemporal normal vector of each of the lidar points. The temporal component of the normals is a proxy for the corresponding point's velocity, therefore allowing for learning-free dynamic object classification without the need for registration in a global reference frame. We demonstrate the soundness of the proposed method and its different components using public datasets and compare them with state-of-the-art lidar-inertial state estimation and dynamic object detection algorithms.
Keyword: transformer
Title:
Gradient Boosting Decision Trees on Medical Diagnosis over Tabular Data
- Authors: A. Yarkın Yıldız, Asli Kalayci
- Subjects: Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Medical diagnosis is a crucial task in the medical field, in terms of providing accurate classification and respective treatments. Having near-precise decisions based on correct diagnosis can affect a patient's life itself, and may extremely result in a catastrophe if not classified correctly. Several traditional machine learning (ML), such as support vector machines (SVMs) and logistic regression, and state-of-the-art tabular deep learning (DL) methods, including TabNet and TabTransformer, have been proposed and used over tabular medical datasets. Additionally, due to the superior performances, lower computational costs, and easier optimization over different tasks, ensemble methods have been used in the field more recently. They offer a powerful alternative in terms of providing successful medical decision-making processes in several diagnosis tasks. In this study, we investigated the benefits of ensemble methods, especially the Gradient Boosting Decision Tree (GBDT) algorithms in medical classification tasks over tabular data, focusing on XGBoost, CatBoost, and LightGBM. The experiments demonstrate that GBDT methods outperform traditional ML and deep neural network architectures and have the highest average rank over several benchmark tabular medical diagnosis datasets. Furthermore, they require much less computational power compared to DL models, creating the optimal methodology in terms of high performance and lower complexity.
Title:
Towards a Deeper Understanding of Transformer for Residential Non-intrusive Load Monitoring
- Authors: Minhajur Rahman, Yasir Arafat
- Subjects: Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Transformer models have demonstrated impressive performance in Non-Intrusive Load Monitoring (NILM) applications in recent years. Despite their success, existing studies have not thoroughly examined the impact of various hyper-parameters on model performance, which is crucial for advancing high-performing transformer models. In this work, a comprehensive series of experiments have been conducted to analyze the influence of these hyper-parameters in the context of residential NILM. This study delves into the effects of the number of hidden dimensions in the attention layer, the number of attention layers, the number of attention heads, and the dropout ratio on transformer performance. Furthermore, the role of the masking ratio has explored in BERT-style transformer training, providing a detailed investigation into its impact on NILM tasks. Based on these experiments, the optimal hyper-parameters have been selected and used them to train a transformer model, which surpasses the performance of existing models. The experimental findings offer valuable insights and guidelines for optimizing transformer architectures, aiming to enhance their effectiveness and efficiency in NILM applications. It is expected that this work will serve as a foundation for future research and development of more robust and capable transformer models for NILM.
Title:
Local Attention Mechanism: Boosting the Transformer Architecture for Long-Sequence Time Series Forecasting
- Authors: Ignacio Aguilera-Martos, Andrés Herrera-Poyatos, Julián Luengo, Francisco Herrera
- Subjects: Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Transformers have become the leading choice in natural language processing over other deep learning architectures. This trend has also permeated the field of time series analysis, especially for long-horizon forecasting, showcasing promising results both in performance and running time. In this paper, we introduce Local Attention Mechanism (LAM), an efficient attention mechanism tailored for time series analysis. This mechanism exploits the continuity properties of time series to reduce the number of attention scores computed. We present an algorithm for implementing LAM in tensor algebra that runs in time and memory O(nlogn), significantly improving upon the O(n^2) time and memory complexity of traditional attention mechanisms. We also note the lack of proper datasets to evaluate long-horizon forecast models. Thus, we propose a novel set of datasets to improve the evaluation of models addressing long-horizon forecasting challenges. Our experimental analysis demonstrates that the vanilla transformer architecture magnified with LAM surpasses state-of-the-art models, including the vanilla attention mechanism. These results confirm the effectiveness of our approach and highlight a range of future challenges in long-sequence time series forecasting.
Title:
Metadata Matters for Time Series: Informative Forecasting with Transformers
- Authors: Jiaxiang Dong, Haixu Wu, Yuxuan Wang, Li Zhang, Jianmin Wang, Mingsheng Long
- Subjects: Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Time series forecasting is prevalent in extensive real-world applications, such as financial analysis and energy planning. Previous studies primarily focus on time series modality, endeavoring to capture the intricate variations and dependencies inherent in time series. Beyond numerical time series data, we notice that metadata (e.g.~dataset and variate descriptions) also carries valuable information essential for forecasting, which can be used to identify the application scenario and provide more interpretable knowledge than digit sequences. Inspired by this observation, we propose a Metadata-informed Time Series Transformer (MetaTST), which incorporates multiple levels of context-specific metadata into Transformer forecasting models to enable informative time series forecasting. To tackle the unstructured nature of metadata, MetaTST formalizes them into natural languages by pre-designed templates and leverages large language models (LLMs) to encode these texts into metadata tokens as a supplement to classic series tokens, resulting in an informative embedding. Further, a Transformer encoder is employed to communicate series and metadata tokens, which can extend series representations by metadata information for more accurate forecasting. This design also allows the model to adaptively learn context-specific patterns across various scenarios, which is particularly effective in handling large-scale, diverse-scenario forecasting tasks. Experimentally, MetaTST achieves state-of-the-art compared to advanced time series models and LLM-based methods on widely acknowledged short- and long-term forecasting benchmarks, covering both single-dataset individual and multi-dataset joint training settings.
Title:
Can Mamba Always Enjoy the "Free Lunch"?
- Authors: Ruifeng Ren, Zhicong Li, Yong Liu
- Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Transformers have been the cornerstone of current Large Language Models (LLMs); however, its linear growth in overhead during inference with respect to sequence length poses challenges for modeling long sequences. In this context, Mamba has gradually attracted attention due to its constant-level size during inference and existing empirical results have shown that it can perform comparably to Transformers in sequence modeling while offering significant savings. However, one may ask that, can Mamba always enjoy the ``free lunch"? In this paper, we focus on analyzing the expressive ability of Mamba from a theoretical standpoint. First, inspired by the connection between Mamba and linear attention, we investigate potential shortcomings of the Mamba when performing the COPY operation. Our results indicate that Mamba with constant size may encounter bottlenecks when handling COPY, while it can achieve perfect performance when the size scales linearly with sequence length. Based on this observation, we analyze Mamba's ability to tackle DP problems when equipped with Chain of Thought (CoT). Our findings suggest that to solve arbitrary DP problems, the total cost of Mamba is comparable to standard and efficient Transformers. However, similar to efficient Transformers, when facing DP problems with favorable properties such as locality, Mamba can provide savings in overhead. Our results contribute to a deeper understanding of Mamba.
Title:
MDMP: Multi-modal Diffusion for supervised Motion Predictions with uncertainty
- Authors: Leo Bringer, Joey Wilson, Kira Barton, Maani Ghaffari
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract This paper introduces a Multi-modal Diffusion model for Motion Prediction (MDMP) that integrates and synchronizes skeletal data and textual descriptions of actions to generate refined long-term motion predictions with quantifiable uncertainty. Existing methods for motion forecasting or motion generation rely solely on either prior motions or text prompts, facing limitations with precision or control, particularly over extended durations. The multi-modal nature of our approach enhances the contextual understanding of human motion, while our graph-based transformer framework effectively capture both spatial and temporal motion dynamics. As a result, our model consistently outperforms existing generative techniques in accurately predicting long-term motions. Additionally, by leveraging diffusion models' ability to capture different modes of prediction, we estimate uncertainty, significantly improving spatial awareness in human-robot interactions by incorporating zones of presence with varying confidence levels for each body joint.
Title:
ForgeryTTT: Zero-Shot Image Manipulation Localization with Test-Time Training
- Authors: Weihuang Liu, Xi Shen, Chi-Man Pun, Xiaodong Cun
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Social media is increasingly plagued by realistic fake images, making it hard to trust content. Previous algorithms to detect these fakes often fail in new, real-world scenarios because they are trained on specific datasets. To address the problem, we introduce ForgeryTTT, the first method leveraging test-time training (TTT) to identify manipulated regions in images. The proposed approach fine-tunes the model for each individual test sample, improving its performance. ForgeryTTT first employs vision transformers as a shared image encoder to learn both classification and localization tasks simultaneously during the training-time training using a large synthetic dataset. Precisely, the localization head predicts a mask to highlight manipulated areas. Given such a mask, the input tokens can be divided into manipulated and genuine groups, which are then fed into the classification head to distinguish between manipulated and genuine parts. During test-time training, the predicted mask from the localization head is used for the classification head to update the image encoder for better adaptation. Additionally, using the classical dropout strategy in each token group significantly improves performance and efficiency. We test ForgeryTTT on five standard benchmarks. Despite its simplicity, ForgeryTTT achieves a 20.1% improvement in localization accuracy compared to other zero-shot methods and a 4.3% improvement over non-zero-shot techniques. Our code and data will be released upon publication.
Title:
Cross Resolution Encoding-Decoding For Detection Transformers
- Authors: Ashish Kumar, Jaesik Park
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Detection Transformers (DETR) are renowned object detection pipelines, however computationally efficient multiscale detection using DETR is still challenging. In this paper, we propose a Cross-Resolution Encoding-Decoding (CRED) mechanism that allows DETR to achieve the accuracy of high-resolution detection while having the speed of low-resolution detection. CRED is based on two modules; Cross Resolution Attention Module (CRAM) and One Step Multiscale Attention (OSMA). CRAM is designed to transfer the knowledge of low-resolution encoder output to a high-resolution feature. While OSMA is designed to fuse multiscale features in a single step and produce a feature map of a desired resolution enriched with multiscale information. When used in prominent DETR methods, CRED delivers accuracy similar to the high-resolution DETR counterpart in roughly 50% fewer FLOPs. Specifically, state-of-the-art DN-DETR, when used with CRED (calling CRED-DETR), becomes 76% faster, with ~50% reduced FLOPs than its high-resolution counterpart with 202 G FLOPs on MS-COCO benchmark. We plan to release pretrained CRED-DETRs for use by the community. Code: this https URL
Title:
Designing Concise ConvNets with Columnar Stages
- Authors: Ashish Kumar, Jaesik Park
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract In the era of vision Transformers, the recent success of VanillaNet shows the huge potential of simple and concise convolutional neural networks (ConvNets). Where such models mainly focus on runtime, it is also crucial to simultaneously focus on other aspects, e.g., FLOPs, parameters, etc, to strengthen their utility further. To this end, we introduce a refreshing ConvNet macro design called Columnar Stage Network (CoSNet). CoSNet has a systematically developed simple and concise structure, smaller depth, low parameter count, low FLOPs, and attention-less operations, well suited for resource-constrained deployment. The key novelty of CoSNet is deploying parallel convolutions with fewer kernels fed by input replication, using columnar stacking of these convolutions, and minimizing the use of 1x1 convolution layers. Our comprehensive evaluations show that CoSNet rivals many renowned ConvNets and Transformer designs under resource-constrained scenarios. Code: this https URL
Title:
Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach
- Authors: Allahdadi Fatemeh, Mahdian Toroghi Rahil, Zareian Hassan
- Subjects: Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Query-by-example spoken term detection (QbE-STD) is typically constrained by transcribed data scarcity and language specificity. This paper introduces a novel, language-agnostic QbE-STD model leveraging image processing techniques and transformer architecture. By employing a pre-trained XLSR-53 network for feature extraction and a Hough transform for detection, our model effectively searches for user-defined spoken terms within any audio file. Experimental results across four languages demonstrate significant performance gains (19-54%) over a CNN-based baseline. While processing time is improved compared to DTW, accuracy remains inferior. Notably, our model offers the advantage of accurately counting query term repetitions within the target audio.
Title:
Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher
- Authors: Yong Guo, Shulian Zhang, Haolin Pan, Jing Liu, Yulun Zhang, Jian Chen
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Knowledge distillation aims to transfer knowledge from a large teacher model to a compact student counterpart, often coming with a significant performance gap between them. We find that a too-large performance gap can hamper the training process, which is also verified in recent studies. To address this, we propose a Gap Preserving Distillation (GPD) method that trains an additional dynamic teacher model from scratch along with training the student to bridge this gap. In this way, it becomes possible to maintain a reasonable performance gap between teacher and student during the whole distillation process. To further strengthen distillation from the dynamic teacher to the student, we develop a hard strategy by enforcing them to share parameters and encouraging parameter inheritance. Besides hard strategy, we also build the soft bidirectional mappings between them which are built on an Inverse Reparameterization (IR) method and a Channel-Branch Reparameterization (CBR) strategy. We highlight that our IR is able to initialize a larger dynamic teacher with an arbitrary expansion ratio, while preserving exactly the same accuracy as the given student model. In this way, it guarantees that the dynamic teacher and student start from the same point and avoid a too large gap in early stage of training. As for our CBR, with parameter-sharing, it directly extracts an effective student model from the well-learned dynamic teacher without any post-training, making our method highly flexible for model deployment. In the experiments, GPD significantly outperforms existing distillation methods on top of both CNNs and transformers architectures, achieving up to 1.58% accuracy improvement. Interestingly, GPD also generalizes well to the scenarios without a pre-trained teacher, including training from scratch and fine-tuning, yielding a large improvement of 1.80% and 0.89% on ResNet18, respectively.
Title:
Can the Variation of Model Weights be used as a Criterion for Self-Paced Multilingual NMT?
- Authors: Àlex R. Atrio, Alexis Allemann, Ljiljana Dolamic, Andrei Popescu-Belis
- Subjects: Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Many-to-one neural machine translation systems improve over one-to-one systems when training data is scarce. In this paper, we design and test a novel algorithm for selecting the language of minibatches when training such systems. The algorithm changes the language of the minibatch when the weights of the model do not evolve significantly, as measured by the smoothed KL divergence between all layers of the Transformer network. This algorithm outperforms the use of alternating monolingual batches, but not the use of shuffled batches, in terms of translation quality (measured with BLEU and COMET) and convergence speed.
Title:
Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer
- Authors: Tomoki Honda, Shinsuke Sakai, Tatsuya Kawahara
- Subjects: Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Recently, Conformer has achieved state-of-the-art performance in many speech recognition tasks. However, the Transformer-based models show significant deterioration for long-form speech, such as lectures, because the self-attention mechanism becomes unreliable with the computation of the square order of the input length. To solve the problem, we incorporate a kind of state-space model, Hungry Hungry Hippos (H3), to replace or complement the multi-head self-attention (MHSA). H3 allows for efficient modeling of long-form sequences with a linear-order computation. In experiments using two datasets of CSJ and LibriSpeech, our proposed H3-Conformer model performs efficient and robust recognition of long-form speech. Moreover, we propose a hybrid of H3 and MHSA and show that using H3 in higher layers and MHSA in lower layers provides significant improvement in online recognition. We also investigate a parallel use of H3 and MHSA in all layers, resulting in the best performance.
Title:
Beyond Language: Applying MLX Transformers to Engineering Physics
- Authors: Stavros Kassinos, Alessio Alexiadis
- Subjects: Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Transformer Neural Networks are driving an explosion of activity and discovery in the field of Large Language Models (LLMs). In contrast, there have been only a few attempts to apply Transformers in engineering physics. Aiming to offer an easy entry point to physics-centric Transformers, we introduce a physics-informed Transformer model for solving the heat conduction problem in a 2D plate with Dirichlet boundary conditions. The model is implemented in the machine learning framework MLX and leverages the unified memory of Apple M-series processors. The use of MLX means that the models can be trained and perform predictions efficiently on personal machines with only modest memory requirements. To train, validate and test the Transformer model we solve the 2D heat conduction problem using central finite differences. Each finite difference solution in these sets is initialized with four random Dirichlet boundary conditions, a uniform but random internal temperature distribution and a randomly selected thermal diffusivity. Validation is performed in-line during training to monitor against over-fitting. The excellent performance of the trained model is demonstrated by predicting the evolution of the temperature field to steady state for the unseen test set of conditions.
Title:
Equivariant Neural Functional Networks for Transformers
- Authors: Viet-Hoang Tran, Thieu N. Vo, An Nguyen The, Tho Tran Huu, Minh-Khoi Nguyen-Nhat, Thanh Tran, Duy-Tung Pham, Tan Minh Nguyen
- Subjects: Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract This paper systematically explores neural functional networks (NFN) for transformer architectures. NFN are specialized neural networks that treat the weights, gradients, or sparsity patterns of a deep neural network (DNN) as input data and have proven valuable for tasks such as learnable optimizers, implicit data representations, and weight editing. While NFN have been extensively developed for MLP and CNN, no prior work has addressed their design for transformers, despite the importance of transformers in modern deep learning. This paper aims to address this gap by providing a systematic study of NFN for transformers. We first determine the maximal symmetric group of the weights in a multi-head attention module as well as a necessary and sufficient condition under which two sets of hyperparameters of the multi-head attention module define the same function. We then define the weight space of transformer architectures and its associated group action, which leads to the design principles for NFN in transformers. Based on these, we introduce Transformer-NFN, an NFN that is equivariant under this group action. Additionally, we release a dataset of more than 125,000 Transformers model checkpoints trained on two datasets with two different tasks, providing a benchmark for evaluating Transformer-NFN and encouraging further research on transformer training and performance.
Title:
Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning
- Authors: Gang Liu, Michael Sun, Wojciech Matusik, Meng Jiang, Jie Chen
- Subjects: Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract While large language models (LLMs) have integrated images, adapting them to graphs remains challenging, limiting their applications in materials and drug design. This difficulty stems from the need for coherent autoregressive generation across texts and graphs. To address this, we introduce Llamole, the first multimodal LLM capable of interleaved text and graph generation, enabling molecular inverse design with retrosynthetic planning. Llamole integrates a base LLM with the Graph Diffusion Transformer and Graph Neural Networks for multi-conditional molecular generation and reaction inference within texts, while the LLM, with enhanced molecular understanding, flexibly controls activation among the different graph modules. Additionally, Llamole integrates A* search with LLM-based cost functions for efficient retrosynthetic planning. We create benchmarking datasets and conduct extensive experiments to evaluate Llamole against in-context learning and supervised fine-tuning. Llamole significantly outperforms 14 adapted LLMs across 12 metrics for controllable molecular design and retrosynthetic planning.
Title:
Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels
- Authors: Maria Marrium, Arif Mahmood, Mohammed Bennamoun
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Automatic annotation of large-scale datasets can introduce noisy training data labels, which adversely affect the learning process of deep neural networks (DNNs). Consequently, Noisy Labels Learning (NLL) has become a critical research field for Convolutional Neural Networks (CNNs), though it remains less explored for Vision Transformers (ViTs). In this study, we evaluate the vulnerability of ViT fine-tuning to noisy labels and compare its robustness with CNNs. We also investigate whether NLL methods developed for CNNs are equally effective for ViTs. Using linear probing and MLP-K fine-tuning, we benchmark two ViT backbones (ViT-B/16 and ViT-L/16) using three commonly used classification losses: Cross Entropy (CE), Focal Loss (FL), and Mean Absolute Error (MAE), alongside six robust NLL methods: GCE, SCE, NLNL, APL, NCE+AGCE, and ANL-CE. The evaluation is conducted across six datasets including MNIST, CIFAR-10/100, WebVision, Clothing1M, and Food-101N. Furthermore, we explore whether implicit prediction entropy minimization contributes to ViT robustness against noisy labels, noting a general trend of prediction entropy reduction across most NLL methods. Building on this observation, we examine whether explicit entropy minimization could enhance ViT resilience to noisy labels. Our findings indicate that incorporating entropy regularization enhances the performance of established loss functions such as CE and FL, as well as the robustness of the six studied NLL methods across both ViT backbones.
Title:
DeFoG: Discrete Flow Matching for Graph Generation
- Authors: Yiming Qin, Manuel Madeira, Dorina Thanou, Pascal Frossard
- Subjects: Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Graph generation is fundamental in diverse scientific applications, due to its ability to reveal the underlying distribution of complex data, and eventually generate new, realistic data points. Despite the success of diffusion models in this domain, those face limitations in sampling efficiency and flexibility, stemming from the tight coupling between the training and sampling stages. To address this, we propose DeFoG, a novel framework using discrete flow matching for graph generation. DeFoG employs a flow-based approach that features an efficient linear interpolation noising process and a flexible denoising process based on a continuous-time Markov chain formulation. We leverage an expressive graph transformer and ensure desirable node permutation properties to respect graph symmetry. Crucially, our framework enables a disentangled design of the training and sampling stages, enabling more effective and efficient optimization of model performance. We navigate this design space by introducing several algorithmic improvements that boost the model performance, consistently surpassing existing diffusion models. We also theoretically demonstrate that, for general discrete data, discrete flow models can faithfully replicate the ground truth distribution - a result that naturally extends to graph data and reinforces DeFoG's foundations. Extensive experiments show that DeFoG achieves state-of-the-art results on synthetic and molecular datasets, improving both training and sampling efficiency over diffusion models, and excels in conditional generation on a digital pathology dataset.
Title:
Constructing Cloze Questions Generatively
- Authors: Yicheng Sun (1), Jie Wang (2)
- Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We present a generative method called CQG for constructing cloze questions from a given article using neural networks and WordNet, with an emphasis on generating multigram distractors. Built on sense disambiguation, text-to-text transformation, WordNet's synset taxonomies and lexical labels, CQG selects an answer key for a given sentence, segments it into a sequence of instances, generates instance-level distractor candidates (IDCs) using a transformer and sibling this http URL then removes inappropriate IDCs, ranks the remaining IDCs based on contextual embedding similarities, as well as synset and lexical relatedness, forms distractor candidates by combinatorially replacing instances with the corresponding top-ranked IDCs, and checks if they are legitimate phrases. Finally, it selects top-ranked distractor candidates based on contextual semantic similarities to the answer key. Experiments show that this method significantly outperforms SOTA results. Human judges also confirm the high qualities of the generated distractors.
Title:
Fundamental Limitations on Subquadratic Alternatives to Transformers
- Authors: Josh Alman, Hantao Yu
- Subjects: Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract The Transformer architecture is widely deployed in many popular and impactful Large Language Models. At its core is the attention mechanism for calculating correlations between pairs of tokens. Performing an attention computation takes quadratic time in the input size, and had become the time bottleneck for transformer operations. In order to circumvent this, researchers have used a variety of approaches, including designing heuristic algorithms for performing attention computations faster, and proposing alternatives to the attention mechanism which can be computed more quickly. For instance, state space models such as Mamba were designed to replace attention with an almost linear time alternative. In this paper, we prove that any such approach cannot perform important tasks that Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm. Thus, any model which can be evaluated in subquadratic time - whether because of subquadratic-time heuristics for attention, faster attention replacements like Mamba, or any other reason - cannot perform this task. In other words, in order to perform tasks that (implicitly or explicitly) involve document similarity, one may as well use Transformer and cannot avoid its quadratic running time.
Title:
Algorithmic Capabilities of Random Transformers
- Authors: Ziqian Zhong, Jacob Andreas
- Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Trained transformer models have been found to implement interpretable procedures for tasks like arithmetic and associative recall, but little is understood about how the circuits that implement these procedures originate during training. To what extent do they depend on the supervisory signal provided to models, and to what extent are they attributable to behavior already present in models at the beginning of training? To investigate these questions, we investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized, so that the only input--output mappings learnable from data are those already implemented (up to a choice of encoding scheme) by the randomly initialized model. We find that these random transformers can perform a wide range of meaningful algorithmic tasks, including modular arithmetic, in-weights and in-context associative recall, decimal addition, parenthesis balancing, and even some aspects of natural language text generation. Our results indicate that some algorithmic capabilities are present in transformers (and accessible via appropriately structured inputs) even before these models are trained. Code is available at this https URL.
Title:
Automated Detection of Defects on Metal Surfaces using Vision Transformers
- Authors: Toqa Alaa, Mostafa Kotb, Arwa Zakaria, Mariam Diab, Walid Gomaa
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Metal manufacturing often results in the production of defective products, leading to operational challenges. Since traditional manual inspection is time-consuming and resource-intensive, automatic solutions are needed. The study utilizes deep learning techniques to develop a model for detecting metal surface defects using Vision Transformers (ViTs). The proposed model focuses on the classification and localization of defects using a ViT for feature extraction. The architecture branches into two paths: classification and localization. The model must approach high classification accuracy while keeping the Mean Square Error (MSE) and Mean Absolute Error (MAE) as low as possible in the localization process. Experimental results show that it can be utilized in the process of automated defects detection, improve operational efficiency, and reduce errors in metal manufacturing.
Title:
MECFormer: Multi-task Whole Slide Image Classification with Expert Consultation Network
- Authors: Doanh C. Bui, Jin Tae Kwak
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Whole slide image (WSI) classification is a crucial problem for cancer diagnostics in clinics and hospitals. A WSI, acquired at gigapixel size, is commonly tiled into patches and processed by multiple-instance learning (MIL) models. Previous MIL-based models designed for this problem have only been evaluated on individual tasks for specific organs, and the ability to handle multiple tasks within a single model has not been investigated. In this study, we propose MECFormer, a generative Transformer-based model designed to handle multiple tasks within one model. To leverage the power of learning multiple tasks simultaneously and to enhance the model's effectiveness in focusing on each individual task, we introduce an Expert Consultation Network, a projection layer placed at the beginning of the Transformer-based model. Additionally, to enable flexible classification, autoregressive decoding is incorporated by a language decoder for WSI classification. Through extensive experiments on five datasets involving four different organs, one cancer classification task, and four cancer subtyping tasks, MECFormer demonstrates superior performance compared to individual state-of-the-art multiple-instance learning models.
Title:
DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination
- Authors: Xuan Gong, Tianshi Ming, Xinpeng Wang, Zhihua Wei
- Subjects: Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. As we know, both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, allowing the model to extract visual information and generate text outputs via attention mechanisms. We find that the attention distribution of LLM decoder on image tokens is highly consistent with the visual encoder and both distributions tend to focus on particular background tokens rather than the referred objects in the image. We attribute to the unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides LLMs to over emphasize the redundant information and generate object hallucination. To address the issue, we propose DAMRO, a novel training-free strategy that $D$ive into $A$ttention $M$echanism of LVLM to $R$educe $O$bject Hallucination. Specifically, our approach employs classification token (CLS) of ViT to filter out high-attention outlier tokens scattered in the background and then eliminate their influence during decoding stage. We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using various benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating the hallucination of LVLMs. The code of our method will be released soon.
Title:
UniMuMo: Unified Text, Music and Motion Generation
- Authors: Han Yang, Kun Su, Yutong Zhang, Jiaben Chen, Kaizhi Qian, Gaowen Liu, Chuang Gan
- Subjects: Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with a single training task of music-motion joint generation. Moreover, the model is designed by fine-tuning existing pre-trained single-modality models, significantly reducing computational demands. Extensive experiments demonstrate that UniMuMo achieves competitive results on all unidirectional generation benchmarks across music, motion, and text modalities. Quantitative results are available in the \href{this https URL}{project page}.
Title:
Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion
- Authors: Mehwish Ghafoor, Arif Mahmood, Muhammad Bilal
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract In the field of 3D Human Pose Estimation from monocular videos, the presence of diverse occlusion types presents a formidable challenge. Prior research has made progress by harnessing spatial and temporal cues to infer 3D poses from 2D joint observations. This paper introduces a Dual Transformer Fusion (DTF) algorithm, a novel approach to obtain a holistic 3D pose estimation, even in the presence of severe occlusions. Confronting the issue of occlusion-induced missing joint data, we propose a temporal interpolation-based occlusion guidance mechanism. To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views. Each intermediate-view undergoes spatial refinement through a self-refinement schema. Subsequently, these intermediate-views are fused to yield the final 3D human pose estimation. The entire system is end-to-end trainable. Through extensive experiments conducted on the Human3.6M and MPI-INF-3DHP datasets, our method's performance is rigorously evaluated. Notably, our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements. The code is available here: this https URL.
Title:
Low-Rank Continual Pyramid Vision Transformer: Incrementally Segment Whole-Body Organs in CT with Light-Weighted Adaptation
- Authors: Vince Zhu, Zhanghexuan Ji, Dazhou Guo, Puyang Wang, Yingda Xia, Le Lu, Xianghua Ye, Wei Zhu, Dakai Jin
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Deep segmentation networks achieve high performance when trained on specific datasets. However, in clinical practice, it is often desirable that pretrained segmentation models can be dynamically extended to enable segmenting new organs without access to previous training datasets or without training from scratch. This would ensure a much more efficient model development and deployment paradigm accounting for the patient privacy and data storage issues. This clinically preferred process can be viewed as a continual semantic segmentation (CSS) problem. Previous CSS works would either experience catastrophic forgetting or lead to unaffordable memory costs as models expand. In this work, we propose a new continual whole-body organ segmentation model with light-weighted low-rank adaptation (LoRA). We first train and freeze a pyramid vision transformer (PVT) base segmentation model on the initial task, then continually add light-weighted trainable LoRA parameters to the frozen model for each new learning task. Through a holistically exploration of the architecture modification, we identify three most important layers (i.e., patch-embedding, multi-head attention and feed forward layers) that are critical in adapting to the new segmentation tasks, while retaining the majority of the pretrained parameters fixed. Our proposed model continually segments new organs without catastrophic forgetting and meanwhile maintaining a low parameter increasing rate. Continually trained and tested on four datasets covering different body parts of a total of 121 organs, results show that our model achieves high segmentation accuracy, closely reaching the PVT and nnUNet upper bounds, and significantly outperforms other regularization-based CSS methods. When comparing to the leading architecture-based CSS method, our model has a substantial lower parameter increasing rate while achieving comparable performance.
Title:
Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models
- Authors: Xinyu Liu, Runsong Zhao, Pengcheng Huang, Chunyang Xiao, Bei Li, Jingang Wang, Tong Xiao, Jingbo Zhu
- Subjects: Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Numerous recent works target to extend effective context length for language models and various methods, tasks and benchmarks exist to measure model's effective memorization length. However, through thorough investigations, we find limitations for currently existing evaluations on model's memorization capability. We provide an extensive survey for limitations in this work and propose a new method called forgetting curve to measure the memorization capability of long-context models. We show that forgetting curve has the advantage of being robust to the tested corpus and the experimental settings, of not relying on prompts and can be applied to any model size. We apply our forgetting curve to a large variety of models involving both transformer and RNN/SSM based architectures. Our measurement provides empirical evidence for the effectiveness of transformer extension techniques while raises questions for the effective length of RNN/SSM based models. We also examine the difference between our measurement and existing benchmarks as well as popular metrics for various models. Our code and results can be found at this https URL.
Title:
Efficient transformer with reinforced position embedding for language models
- Authors: Yen-Che Hsiao, Abhishek Dutta
- Subjects: Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract In this paper, we propose an efficient transformer architecture that uses reinforced positional embedding to obtain superior performance with half the number of encoder decoder layers. We demonstrate that concatenating positional encoding with trainable token embeddings, normalizing columns in the token embedding matrix, and using the normalized token embedding matrix as the value of the attention layer improve the training and validation loss and the training time in an encoder-decoder Transformer model for a Portuguese-English translation task with 10 epochs or 12 hours of training across 10 trials. Our method, with roughly a threefold parameter reduction compared to the baseline model, yields a mean training loss of 1.21, a mean validation loss of 1.51, and an average training time of 1352.27 seconds per epoch, surpassing the baseline model with the same embedding dimension that employs addition of positional encoding and token embeddings, which achieves a mean training loss of 1.96, a validation loss of 2.18, and an average training time of 4297.79 seconds per epoch. Additionally, we evaluated our proposed architecture and the baseline across 14 diverse translation datasets from TensorFlow. The results indicate that our method consistently achieves lower or comparable training and validation losses, suggesting enhanced learning efficiency.
Title:
PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners
- Authors: Yujin Tang, Lu Qi, Fei Xie, Xiangtai Li, Chao Ma, Ming-Hsuan Yang
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Spatiotemporal predictive learning methods generally fall into two categories: recurrent-based approaches, which face challenges in parallelization and performance, and recurrent-free methods, which employ convolutional neural networks (CNNs) as encoder-decoder architectures. These methods benefit from strong inductive biases but often at the expense of scalability and generalization. This paper proposes PredFormer, a pure transformer-based framework for spatiotemporal predictive learning. Motivated by the Vision Transformers (ViT) design, PredFormer leverages carefully designed Gated Transformer blocks, following a comprehensive analysis of 3D attention mechanisms, including full-, factorized-, and interleaved- spatial-temporal attention. With its recurrent-free, transformer-based design, PredFormer is both simple and efficient, significantly outperforming previous methods by large margins. Extensive experiments on synthetic and real-world datasets demonstrate that PredFormer achieves state-of-the-art performance. On Moving MNIST, PredFormer achieves a 51.3% reduction in MSE relative to SimVP. For TaxiBJ, the model decreases MSE by 33.1% and boosts FPS from 533 to 2364. Additionally, on WeatherBench, it reduces MSE by 11.1% while enhancing FPS from 196 to 404. These performance gains in both accuracy and efficiency demonstrate PredFormer's potential for real-world applications. The source code will be released at this https URL.
Title:
LLaVA Needs More Knowledge: Retrieval Augmented Natural Language Generation with Knowledge Graph for Explaining Thoracic Pathologies
- Authors: Ameer Hamza, Abdullah, Yong Hyun Ahn, Sungyoung Lee, Seong Tae Kim
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Generating Natural Language Explanations (NLEs) for model predictions on medical images, particularly those depicting thoracic pathologies, remains a critical and challenging task. Existing methodologies often struggle due to general models' insufficient domain-specific medical knowledge and privacy concerns associated with retrieval-based augmentation techniques. To address these issues, we propose a novel Vision-Language framework augmented with a Knowledge Graph (KG)-based datastore, which enhances the model's understanding by incorporating additional domain-specific medical knowledge essential for generating accurate and informative NLEs. Our framework employs a KG-based retrieval mechanism that not only improves the precision of the generated explanations but also preserves data privacy by avoiding direct data retrieval. The KG datastore is designed as a plug-and-play module, allowing for seamless integration with various model architectures. We introduce and evaluate three distinct frameworks within this paradigm: KG-LLaVA, which integrates the pre-trained LLaVA model with KG-RAG; Med-XPT, a custom framework combining MedCLIP, a transformer-based projector, and GPT-2; and Bio-LLaVA, which adapts LLaVA by incorporating the Bio-ViT-L vision model. These frameworks are validated on the MIMIC-NLE dataset, where they achieve state-of-the-art results, underscoring the effectiveness of KG augmentation in generating high-quality NLEs for thoracic pathologies.
Title:
DAPE V2: Process Attention Score as Feature Map for Length Extrapolation
- Authors: Chuanyang Zheng, Yihang Gao, Han Shi, Jing Xiong, Jiankai Sun, Jingyao Li, Minbin Huang, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, Yu Li
- Subjects: Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens, in contrast to earlier feed-forward neural networks. In general, the attention scores are determined simply by the key-query products. However, this work's occasional trial (combining DAPE and NoPE) of including additional MLPs on attention scores without position encoding indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (for neighboring attention scores across different heads) to mimic the processing methods in computer vision. Specifically, the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query and key dot product, and we successfully translate the length extrapolation issue into a well-understood feature map processing problem. The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.
Title:
Transforming Color: A Novel Image Colorization Method
- Authors: Hamza Shafiq, Bumshik Lee
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract This paper introduces a novel method for image colorization that utilizes a color transformer and generative adversarial networks (GANs) to address the challenge of generating visually appealing colorized images. Conventional approaches often struggle with capturing long-range dependencies and producing realistic colorizations. The proposed method integrates a transformer architecture to capture global information and a GAN framework to improve visual quality. In this study, a color encoder that utilizes a random normal distribution to generate color features is applied. These features are then integrated with grayscale image features to enhance the overall representation of the images. Our method demonstrates superior performance compared with existing approaches by utilizing the capacity of the transformer, which can capture long-range dependencies and generate a realistic colorization of the GAN. Experimental results show that the proposed network significantly outperforms other state-of-the-art colorization techniques, highlighting its potential for image colorization. This research opens new possibilities for precise and visually compelling image colorization in domains such as digital restoration and historical image analysis.
Title:
Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering
- Authors: Kazumoto Nakamura, Yuji Nozawa, Yu-Chieh Lin, Kengo Nakata, Youyang Ng
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract The goal of this paper is to improve the performance of pretrained Vision Transformer (ViT) models, particularly DINOv2, in image clustering task without requiring re-training or fine-tuning. As model size increases, high-norm artifacts anomaly appears in the patches of multi-head attention. We observe that this anomaly leads to reduced accuracy in zero-shot image clustering. These artifacts are characterized by disproportionately large values in the attention map compared to other patch tokens. To address these artifacts, we propose an approach called Inference-Time Attention Engineering (ITAE), which manipulates attention function during inference. Specifically, we identify the artifacts by investigating one of the Query-Key-Value (QKV) patches in the multi-head attention and attenuate their corresponding attention values inside the pretrained models. ITAE shows improved clustering accuracy on multiple datasets by exhibiting more expressive features in latent space. Our findings highlight the potential of ITAE as a practical solution for reducing artifacts in pretrained ViT models and improving model performance in clustering tasks without the need for re-training or fine-tuning.
Title:
Timer-XL: Long-Context Transformers for Unified Time Series Forecasting
- Authors: Yong Liu, Guo Qin, Xiangdong Huang, Jianmin Wang, Mingsheng Long
- Subjects: Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We present Timer-XL, a generative Transformer for unified time series forecasting. To uniformly predict 1D and 2D time series, we generalize next token prediction, predominantly adopted for causal generation of 1D sequences, to multivariate next token prediction. The proposed paradigm uniformly formulates various forecasting scenarios as a long-context generation problem. We opt for the generative Transformer, which can capture global-range and causal dependencies while providing contextual flexibility, to implement unified forecasting on univariate series characterized by non-stationarity, multivariate time series with complicated dynamics and correlations, and covariate-informed contexts that include both endogenous and exogenous variables. Technically, we propose a universal TimeAttention to facilitate generative Transformers on time series, which can effectively capture fine-grained intra- and inter-series dependencies of flattened time series tokens (patches) and is further strengthened by position embeddings in both temporal and variable dimensions. Timer-XL achieves state-of-the-art performance across challenging forecasting benchmarks through a unified approach. As a large time series model, it demonstrates notable model transferability by large-scale pre-training, as well as contextual flexibility in token lengths, positioning it as a one-for-all forecaster.
Title:
An Optimized H5 Hysteresis Current Control with Clamped Diodes in Transformer-less Grid-PV Inverter
- Authors: Sushil Phuyal, Shashwot Shrestha, Swodesh Sharma, Rachana Subedi
- Subjects: Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract With the rise of renewable energy penetration in the grid, photovoltaic (PV) panels are connected to the grid via inverters to supply solar energy. Transformer-less grid-tied PV inverters are gaining popularity because of their improved efficiency, reduced size, and lower costs. However, they can induce a path for leakage currents between the PV and the grid part due to the absence of galvanic isolation between them. This leads to serious electromagnetic interference, loss in efficiency and safety concerns. The leakage current is primarily influenced by the nature of the common mode voltage (CMV), which is determined by the switching techniques of the inverter. In this paper, a novel inverter topology of Hysteresis Controlled H5 with Two Clamping Diodes (HCH5-D2) has been derived. The HCH5-D2 topology helps to decouple the AC part (Grid) and DC part (PV) during the freewheeling to make the CMV constant and in turn, reduces the leakage current. Also, the additional diodes help to reduce the voltage spikes generated during the freewheeling period and maintain the CMV at a constant value. Finally, a 2.2kW grid-connected single-phase HCH5-D2 PV inverter system's MATLAB simulation has been presented with better results when compared with a traditional H4 inverter.
Title:
A Simple Image Segmentation Framework via In-Context Examples
- Authors: Yang Liu, Chenchen Jing, Hengtao Li, Muzhi Zhu, Hao Chen, Xinlong Wang, Chunhua Shen
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Recently, there have been explorations of generalist segmentation models that can effectively tackle a variety of image segmentation tasks within a unified in-context learning framework. However, these methods still struggle with task ambiguity in in-context segmentation, as not all in-context examples can accurately convey the task information. In order to address this issue, we present SINE, a simple image Segmentation framework utilizing in-context examples. Our approach leverages a Transformer encoder-decoder structure, where the encoder provides high-quality image representations, and the decoder is designed to yield multiple task-specific output masks to effectively eliminate task ambiguity. Specifically, we introduce an In-context Interaction module to complement in-context information and produce correlations between the target image and the in-context example and a Matching Transformer that uses fixed matching and a Hungarian algorithm to eliminate differences between different tasks. In addition, we have further perfected the current evaluation system for in-context image segmentation, aiming to facilitate a holistic appraisal of these models. Experiments on various segmentation tasks show the effectiveness of the proposed method.
Title:
TimeCNN: Refining Cross-Variable Interaction on Time Point for Time Series Forecasting
- Authors: Ao Hu, Dongkai Wang, Yong Dai, Shiyi Qi, Liangjian Wen, Jun Wang, Zhi Chen, Xun Zhou, Zenglin Xu, Jiang Duan
- Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Time series forecasting is extensively applied across diverse domains. Transformer-based models demonstrate significant potential in modeling cross-time and cross-variable interaction. However, we notice that the cross-variable correlation of multivariate time series demonstrates multifaceted (positive and negative correlations) and dynamic progression over time, which is not well captured by existing Transformer-based models. To address this issue, we propose a TimeCNN model to refine cross-variable interactions to enhance time series forecasting. Its key innovation is timepoint-independent, where each time point has an independent convolution kernel, allowing each time point to have its independent model to capture relationships among variables. This approach effectively handles both positive and negative correlations and adapts to the evolving nature of variable relationships over time. Extensive experiments conducted on 12 real-world datasets demonstrate that TimeCNN consistently outperforms state-of-the-art models. Notably, our model achieves significant reductions in computational requirements (approximately 60.46%) and parameter count (about 57.50%), while delivering inference speeds 3 to 4 times faster than the benchmark iTransformer model
Title:
Mastering Chinese Chess AI (Xiangqi) Without Search
- Authors: Yu Chen, Juntong Lin, Zhichao Shu
- Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We have developed a high-performance Chinese Chess AI that operates without reliance on search algorithms. This AI has demonstrated the capability to compete at a level commensurate with the top 0.1% of human players. By eliminating the search process typically associated with such systems, this AI achieves a Queries Per Second (QPS) rate that exceeds those of systems based on the Monte Carlo Tree Search (MCTS) algorithm by over a thousandfold and surpasses those based on the AlphaBeta pruning algorithm by more than a hundredfold. The AI training system consists of two parts: supervised learning and reinforcement learning. Supervised learning provides an initial human-like Chinese chess AI, while reinforcement learning, based on supervised learning, elevates the strength of the entire AI to a new level. Based on this training system, we carried out enough ablation experiments and discovered that 1. The same parameter amount of Transformer architecture has a higher performance than CNN on Chinese chess; 2. Possible moves of both sides as features can greatly improve the training process; 3. Selective opponent pool, compared to pure self-play training, results in a faster improvement curve and a higher strength limit. 4. Value Estimation with Cutoff(VECT) improves the original PPO algorithm training process and we will give the explanation.
Title:
On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent
- Authors: Bingrui Li, Wei Huang, Andi Han, Zhanpeng Zhou, Taiji Suzuki, Jun Zhu, Jianfei Chen
- Subjects: Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract The Adam optimizer is widely used for transformer optimization in practice, which makes understanding the underlying optimization mechanisms an important problem. However, due to the Adam's complexity, theoretical analysis of how it optimizes transformers remains a challenging task. Fortunately, Sign Gradient Descent (SignGD) serves as an effective surrogate for Adam. Despite its simplicity, theoretical understanding of how SignGD optimizes transformers still lags behind. In this work, we study how SignGD optimizes a two-layer transformer -- consisting of a softmax attention layer with trainable query-key parameterization followed by a linear layer -- on a linearly separable noisy dataset. We identify four stages in the training dynamics, each exhibiting intriguing behaviors. Based on the training dynamics, we prove the fast convergence but poor generalization of the learned transformer on the noisy dataset. We also show that Adam behaves similarly to SignGD in terms of both optimization and generalization in this setting. Additionally, we find that the poor generalization of SignGD is not solely due to data noise, suggesting that both SignGD and Adam requires high-quality data for real-world tasks. Finally, experiments on synthetic and real-world datasets empirically support our theoretical results.
Title:
Leveraging Grammar Induction for Language Understanding and Generation
- Authors: Jushi Kai, Shengyuan Hou, Yusheng Huang, Zhouhan Lin
- Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Grammar induction has made significant progress in recent years. However, it is not clear how the application of induced grammar could enhance practical performance in downstream tasks. In this work, we introduce an unsupervised grammar induction method for language understanding and generation. We construct a grammar parser to induce constituency structures and dependency relations, which is simultaneously trained on downstream tasks without additional syntax annotations. The induced grammar features are subsequently incorporated into Transformer as a syntactic mask to guide self-attention. We evaluate and apply our method to multiple machine translation tasks and natural language understanding tasks. Our method demonstrates superior performance compared to the original Transformer and other models enhanced with external parsers. Experimental results indicate that our method is effective in both from-scratch and pre-trained scenarios. Additionally, our research highlights the contribution of explicitly modeling the grammatical structure of texts to neural network models.
Title:
D-PoSE: Depth as an Intermediate Representation for 3D Human Pose and Shape Estimation
- Authors: Nikolaos Vasilikopoulos, Drosakis Drosakis, Antonis Argyros
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We present D-PoSE (Depth as an Intermediate Representation for 3D Human Pose and Shape Estimation), a one-stage method that estimates human pose and SMPL-X shape parameters from a single RGB image. Recent works use larger models with transformer backbones and decoders to improve the accuracy in human pose and shape (HPS) benchmarks. D-PoSE proposes a vision based approach that uses the estimated human depth-maps as an intermediate representation for HPS and leverages training with synthetic data and the ground-truth depth-maps provided with them for depth supervision during training. Although trained on synthetic datasets, D-PoSE achieves state-of-the-art performance on the real-world benchmark datasets, EMDB and 3DPW. Despite its simple lightweight design and the CNN backbone, it outperforms ViT-based models that have a number of parameters that is larger by almost an order of magnitude. D-PoSE code is available at: this https URL
Title:
DEPT: Decoupled Embeddings for Pre-training Language Models
- Authors: Alex Iacob, Lorenzo Sani, Meghdad Kurmanji, William F. Shen, Xinchi Qiu, Dongqi Cai, Yan Gao, Nicholas D. Lane
- Subjects: Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Language Model pre-training benefits from a broader data mixture to enhance performance across domains and languages. However, training on such heterogeneous text corpora is complex, requiring extensive and cost-intensive efforts. Since these data sources vary in lexical, syntactic, and semantic aspects, they cause negative interference or the "curse of multilinguality". We propose a novel pre-training framework to alleviate this curse. Our method, DEPT, decouples the embedding layers from the transformer body while simultaneously training the latter in multiple contexts. DEPT enables the model to train without being bound to a shared global vocabulary. DEPT: (1) can train robustly and effectively under significant data heterogeneity, (2) reduces the parameter count of the token embeddings by up to 80% and the communication costs by 675x for billion-scale models (3) enhances model generalization and plasticity in adapting to new languages and domains, and (4) allows training with custom optimized vocabulary per data source. We prove DEPT's potential by performing the first vocabulary-agnostic federated multilingual pre-training of a 1.3 billion-parameter model across high and low-resource languages, reducing its parameter count by 409 million.
Title:
Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes
- Authors: Kosuke Nishida, Kyosuke Nishida, Kuniko Saito
- Subjects: Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Loss spikes, a phenomenon in which the loss value diverges suddenly, is a fundamental issue in the pre-training of large language models. This paper supposes that the non-uniformity of the norm of the parameters is one of the causes of loss spikes. Here, in training of neural networks, the scale of the gradients is required to be kept constant throughout the layers to avoid the vanishing and exploding gradients problem. However, to meet these requirements in the Transformer model, the norm of the model parameters must be non-uniform, and thus, parameters whose norm is smaller are more sensitive to the parameter update. To address this issue, we propose a novel technique, weight scaling as reparameterization (WeSaR). WeSaR introduces a gate parameter per parameter matrix and adjusts it to the value satisfying the requirements. Because of the gate parameter, WeSaR sets the norm of the original parameters uniformly, which results in stable training. Experimental results with the Transformer decoders consisting of 130 million, 1.3 billion, and 13 billion parameters showed that WeSaR stabilizes and accelerates training and that it outperformed compared methods including popular initialization methods.
Title:
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
- Authors: Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia
- Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention with the pre-selected tokens. This design enables TidalDecode to substantially reduce the overhead of token selection for sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1x.
Title:
Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data
- Authors: David Heurtel-Depeiges, Anian Ruoss, Joel Veness, Tim Genewein
- Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Foundation models have recently been shown to be strong data compressors. However, when accounting for their excessive parameter count, their compression ratios are actually inferior to standard compression algorithms. Moreover, naively reducing the number of parameters may not necessarily help as it leads to worse predictions and thus weaker compression. In this paper, we conduct a large-scale empirical study to investigate whether there is a sweet spot where competitive compression ratios with pre-trained vanilla transformers are possible. To this end, we train families of models on 165GB of raw byte sequences of either text, image, or audio data (and all possible combinations of the three) and then compress 1GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (i.e., millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG 2000, FLAC) - even when factoring in parameter count. We achieve, e.g., the lowest compression ratio of 0.49 on OOD audio data (vs. 0.54 for FLAC). To study the impact of model- and dataset scale, we conduct extensive ablations and hyperparameter sweeps, and we investigate the effect of unimodal versus multimodal training. We find that even small models can be trained to perform well on multiple modalities, but, in contrast to previously reported results with large-scale foundation models, transfer to unseen modalities is generally weak.
Title:
Presto! Distilling Steps and Layers for Accelerating Music Generation
- Authors: Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
- Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple, but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yield best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32 second mono/stereo 44.1kHz, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge. Sound examples can be found at this https URL.
Title:
DiffuseReg: Denoising Diffusion Model for Obtaining Deformation Fields in Unsupervised Deformable Image Registration
- Authors: Yongtai Zhuo, Yiqing Shen
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Deformable image registration aims to precisely align medical images from different modalities or times. Traditional deep learning methods, while effective, often lack interpretability, real-time observability and adjustment capacity during registration inference. Denoising diffusion models present an alternative by reformulating registration as iterative image denoising. However, existing diffusion registration approaches do not fully harness capabilities, neglecting the critical sampling phase that enables continuous observability during the inference. Hence, we introduce DiffuseReg, an innovative diffusion-based method that denoises deformation fields instead of images for improved transparency. We also propose a novel denoising network upon Swin Transformer, which better integrates moving and fixed images with diffusion time step throughout the denoising process. Furthermore, we enhance control over the denoising registration process with a novel similarity consistency regularization. Experiments on ACDC datasets demonstrate DiffuseReg outperforms existing diffusion registration methods by 1.32 in Dice score. The sampling process in DiffuseReg enables real-time output observability and adjustment unmatched by previous deep models.
Title:
Differential Transformer
- Authors: Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei
- Subjects: Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.
Keyword: scene understanding
Title:
SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
- Authors: Yue Zhang, Zhiyang Xu, Ying Shen, Parisa Kordjamshidi, Lifu Huang
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Integrating the 3D world into large language models (3D-based LLMs) has been a promising research direction for 3D scene understanding. However, current 3D-based LLMs fall short in situated understanding due to two key limitations: 1) existing 3D datasets are constructed from a global perspective of the 3D scenes and lack situated context. 2) the architectures of existing 3D-based LLMs lack explicit alignment between the spatial representations of 3D scenes and natural language, limiting their performance in tasks requiring precise spatial reasoning. We address these issues by introducing a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks. Furthermore, we propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module, aiming to enhance the alignment between 3D visual representations and their corresponding textual descriptions. Experimental results demonstrate that both our proposed dataset and alignment module significantly enhance the situated spatial understanding of 3D-based LLMs.
Title:
Fast Object Detection with a Machine Learning Edge Device
- Authors: Richard C. Rodriguez, Jonah Elijah P. Bardos
- Subjects: Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract This machine learning study investigates a lowcost edge device integrated with an embedded system having computer vision and resulting in an improved performance in inferencing time and precision of object detection and classification. A primary aim of this study focused on reducing inferencing time and low-power consumption and to enable an embedded device of a competition-ready autonomous humanoid robot and to support real-time object recognition, scene understanding, visual navigation, motion planning, and autonomous navigation of the robot. This study compares processors for inferencing time performance between a central processing unit (CPU), a graphical processing unit (GPU), and a tensor processing unit (TPU). CPUs, GPUs, and TPUs are all processors that can be used for machine learning tasks. Related to the aim of supporting an autonomous humanoid robot, there was an additional effort to observe whether or not there was a significant difference in using a camera having monocular vision versus stereo vision capability. TPU inference time results for this study reflect a 25% reduction in time over the GPU, and a whopping 87.5% reduction in inference time compared to the CPU. Much information in this paper is contributed to the final selection of Google's Coral brand, Edge TPU device. The Arduino Nano 33 BLE Sense Tiny ML Kit was also considered for comparison but due to initial incompatibilities and in the interest of time to complete this study, a decision was made to review the kit in a future experiment.
Title:
ETHcavation: A Dataset and Pipeline for Panoptic Scene Understanding and Object Tracking in Dynamic Construction Environments
- Authors: Lorenzo Terenzi, Julian Nubert, Pol Eyschen, Pascal Roth, Simin Fei, Edo Jelavic, Marco Hutter
- Subjects: Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Construction sites are challenging environments for autonomous systems due to their unstructured nature and the presence of dynamic actors, such as workers and machinery. This work presents a comprehensive panoptic scene understanding solution designed to handle the complexities of such environments by integrating 2D panoptic segmentation with 3D LiDAR mapping. Our system generates detailed environmental representations in real-time by combining semantic and geometric data, supported by Kalman Filter-based tracking for dynamic object detection. We introduce a fine-tuning method that adapts large pre-trained panoptic segmentation models for construction site applications using a limited number of domain-specific samples. For this use case, we release a first-of-its-kind dataset of 502 hand-labeled sample images with panoptic annotations from construction sites. In addition, we propose a dynamic panoptic mapping technique that enhances scene understanding in unstructured environments. As a case study, we demonstrate the system's application for autonomous navigation, utilizing real-time RRT* for reactive path planning in dynamic scenarios. The dataset (this https URL) and code (this https URL) for training and deployment are publicly available to support future research.
Title:
Diffusion Models in 3D Vision: A Survey
- Authors: Zhen Wang, Dongyuan Li, Renhe Jiang
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract In recent years, 3D vision has become a crucial field within computer vision, powering a wide range of applications such as autonomous driving, robotics, augmented reality (AR), and medical imaging. This field relies on the accurate perception, understanding, and reconstruction of 3D scenes from 2D data sources like images and videos. Diffusion models, originally designed for 2D generative tasks, offer the potential for more flexible, probabilistic approaches that can better capture the variability and uncertainty present in real-world 3D data. However, traditional methods often struggle with efficiency and scalability. In this paper, we review the state-of-the-art approaches that leverage diffusion models for 3D visual tasks, including but not limited to 3D object generation, shape completion, point cloud reconstruction, and scene understanding. We provide an in-depth discussion of the underlying mathematical principles of diffusion models, outlining their forward and reverse processes, as well as the various architectural advancements that enable these models to work with 3D datasets. We also discuss the key challenges in applying diffusion models to 3D vision, such as handling occlusions and varying point densities, and the computational demands of high-dimensional data. Finally, we discuss potential solutions, including improving computational efficiency, enhancing multimodal fusion, and exploring the use of large-scale pretraining for better generalization across 3D tasks. This paper serves as a foundation for future exploration and development in this rapidly evolving field.
Title:
Resource-Efficient Multiview Perception: Integrating Semantic Masking with Masked Autoencoders
- Authors: Kosta Dakic, Kanchana Thilakarathna, Rodrigo N. Calheiros, Teng Joon Lim
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Multiview systems have become a key technology in modern computer vision, offering advanced capabilities in scene understanding and analysis. However, these systems face critical challenges in bandwidth limitations and computational constraints, particularly for resource-limited camera nodes like drones. This paper presents a novel approach for communication-efficient distributed multiview detection and tracking using masked autoencoders (MAEs). We introduce a semantic-guided masking strategy that leverages pre-trained segmentation models and a tunable power function to prioritize informative image regions. This approach, combined with an MAE, reduces communication overhead while preserving essential visual information. We evaluate our method on both virtual and real-world multiview datasets, demonstrating comparable performance in terms of detection and tracking performance metrics compared to state-of-the-art techniques, even at high masking ratios. Our selective masking algorithm outperforms random masking, maintaining higher accuracy and precision as the masking ratio increases. Furthermore, our approach achieves a significant reduction in transmission data volume compared to baseline methods, thereby balancing multiview tracking performance with communication efficiency.
Keyword: visual reasoning
There is no result