arxiv-daily
arxiv-daily copied to clipboard
New submissions for Wed, 21 Feb 24
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
LangXAI: Integrating Large Vision Models for Generating Textual Explanations to Enhance Explainability in Visual Perception Tasks
- Authors: Truong Thanh Hung Nguyen, Tobias Clement, Phuc Truong Loc Nguyen, Nils Kemmerzell, Van Binh Truong, Vo Thanh Khang Nguyen, Mohamed Abdelaal, Hung Cao
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.12525
- Pdf link: https://arxiv.org/pdf/2402.12525
- Abstract LangXAI is a framework that integrates Explainable Artificial Intelligence (XAI) with advanced vision models to generate textual explanations for visual recognition tasks. Despite XAI advancements, an understanding gap persists for end-users with limited domain knowledge in artificial intelligence and computer vision. LangXAI addresses this by furnishing text-based explanations for classification, object detection, and semantic segmentation model outputs to end-users. Preliminary results demonstrate LangXAI's enhanced plausibility, with high BERTScore across tasks, fostering a more transparent and reliable AI framework on vision tasks for end-users.
Efficient Parameter Mining and Freezing for Continual Object Detection
- Authors: Angelo G. Menezes, Augusto J. Peterlevitz, Mateus A. Chinelatto, André C. P. L. F. de Carvalho
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.12624
- Pdf link: https://arxiv.org/pdf/2402.12624
- Abstract Continual Object Detection is essential for enabling intelligent agents to interact proactively with humans in real-world settings. While parameter-isolation strategies have been extensively explored in the context of continual learning for classification, they have yet to be fully harnessed for incremental object detection scenarios. Drawing inspiration from prior research that focused on mining individual neuron responses and integrating insights from recent developments in neural pruning, we proposed efficient ways to identify which layers are the most important for a network to maintain the performance of a detector across sequential updates. The presented findings highlight the substantial advantages of layer-level parameter isolation in facilitating incremental learning within object detection models, offering promising avenues for future research and application in real-world scenarios.
YOLO-Ant: A Lightweight Detector via Depthwise Separable Convolutional and Large Kernel Design for Antenna Interference Source Detection
- Authors: Xiaoyu Tang, Xingming Chen, Jintao Cheng, Jin Wu, Rui Fan, Chengxi Zhang, Zebo Zhou
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.12641
- Pdf link: https://arxiv.org/pdf/2402.12641
- Abstract In the era of 5G communication, removing interference sources that affect communication is a resource-intensive task. The rapid development of computer vision has enabled unmanned aerial vehicles to perform various high-altitude detection tasks. Because the field of object detection for antenna interference sources has not been fully explored, this industry lacks dedicated learning samples and detection models for this specific task. In this article, an antenna dataset is created to address important antenna interference source detection issues and serves as the basis for subsequent research. We introduce YOLO-Ant, a lightweight CNN and transformer hybrid detector specifically designed for antenna interference source detection. Specifically, we initially formulated a lightweight design for the network depth and width, ensuring that subsequent investigations were conducted within a lightweight framework. Then, we propose a DSLK-Block module based on depthwise separable convolution and large convolution kernels to enhance the network's feature extraction ability, effectively improving small object detection. To address challenges such as complex backgrounds and large interclass differences in antenna detection, we construct DSLKVit-Block, a powerful feature extraction module that combines DSLK-Block and transformer structures. Considering both its lightweight design and accuracy, our method not only achieves optimal performance on the antenna dataset but also yields competitive results on public datasets.
CST: Calibration Side-Tuning for Parameter and Memory Efficient Transfer Learning
- Authors: Feng Chen
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.12736
- Pdf link: https://arxiv.org/pdf/2402.12736
- Abstract Achieving a universally high accuracy in object detection is quite challenging, and the mainstream focus in the industry currently lies on detecting specific classes of objects. However, deploying one or multiple object detection networks requires a certain amount of GPU memory for training and storage capacity for inference. This presents challenges in terms of how to effectively coordinate multiple object detection tasks under resource-constrained conditions. This paper introduces a lightweight fine-tuning strategy called Calibration side tuning, which integrates aspects of adapter tuning and side tuning to adapt the successful techniques employed in transformers for use with ResNet. The Calibration side tuning architecture that incorporates maximal transition calibration, utilizing a small number of additional parameters to enhance network performance while maintaining a smooth training process. Furthermore, this paper has conducted an analysis on multiple fine-tuning strategies and have implemented their application within ResNet, thereby expanding the research on fine-tuning strategies for object detection networks. Besides, this paper carried out extensive experiments using five benchmark datasets. The experimental results demonstrated that this method outperforms other compared state-of-the-art techniques, and a better balance between the complexity and performance of the finetune schemes is achieved.
GOOD: Towards Domain Generalized Orientated Object Detection
- Authors: Qi Bi, Beichen Zhou, Jingjun Yi, Wei Ji, Haolan Zhan, Gui-Song Xia
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.12765
- Pdf link: https://arxiv.org/pdf/2402.12765
- Abstract Oriented object detection has been rapidly developed in the past few years, but most of these methods assume the training and testing images are under the same statistical distribution, which is far from reality. In this paper, we propose the task of domain generalized oriented object detection, which intends to explore the generalization of oriented object detectors on arbitrary unseen target domains. Learning domain generalized oriented object detectors is particularly challenging, as the cross-domain style variation not only negatively impacts the content representation, but also leads to unreliable orientation predictions. To address these challenges, we propose a generalized oriented object detector (GOOD). After style hallucination by the emerging contrastive language-image pre-training (CLIP), it consists of two key components, namely, rotation-aware content consistency learning (RAC) and style consistency learning (SEC). The proposed RAC allows the oriented object detector to learn stable orientation representation from style-diversified samples. The proposed SEC further stabilizes the generalization ability of content representation from different image styles. Extensive experiments on multiple cross-domain settings show the state-of-the-art performance of GOOD. Source code will be publicly available.
Keyword: transformer
Beyond Uniform Scaling: Exploring Depth Heterogeneity in Neural Architectures
- Authors: Akash Guna R.T, Arnav Chavan, Deepak Gupta
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
- Arxiv link: https://arxiv.org/abs/2402.12418
- Pdf link: https://arxiv.org/pdf/2402.12418
- Abstract Conventional scaling of neural networks typically involves designing a base network and growing different dimensions like width, depth, etc. of the same by some predefined scaling factors. We introduce an automated scaling approach leveraging second-order loss landscape information. Our method is flexible towards skip connections a mainstay in modern vision transformers. Our training-aware method jointly scales and trains transformers without additional training iterations. Motivated by the hypothesis that not all neurons need uniform depth complexity, our approach embraces depth heterogeneity. Extensive evaluations on DeiT-S with ImageNet100 show a 2.5% accuracy gain and 10% parameter efficiency improvement over conventional scaling. Scaled networks demonstrate superior performance upon training small scale datasets from scratch. We introduce the first intact scaling mechanism for vision transformers, a step towards efficient model scaling.
Induced Model Matching: How Restricted Models Can Help Larger Ones
- Authors: Usama Muneeb, Mesrob I. Ohannessian
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.12513
- Pdf link: https://arxiv.org/pdf/2402.12513
- Abstract We consider scenarios where a very accurate predictive model using restricted features is available at the time of training of a larger, full-featured, model. This restricted model may be thought of as "side-information", derived either from an auxiliary exhaustive dataset or on the same dataset, by forcing the restriction. How can the restricted model be useful to the full model? We propose an approach for transferring the knowledge of the restricted model to the full model, by aligning the full model's context-restricted performance with that of the restricted model's. We call this methodology Induced Model Matching (IMM) and first illustrate its general applicability by using logistic regression as a toy example. We then explore IMM's use in language modeling, the application that initially inspired it, and where it offers an explicit foundation in contrast to the implicit use of restricted models in techniques such as noising. We demonstrate the methodology on both LSTM and transformer full models, using $N$-grams as restricted models. To further illustrate the potential of the principle whenever it is much cheaper to collect restricted rather than full information, we conclude with a simple RL example where POMDP policies can improve learned MDP policies via IMM.
System Identification of Neural Systems: Going Beyond Images to Modelling Dynamics
- Authors: Mai Gamal, Mohamed Rashad, Eman Ehab, Seif Eldawlatly, Mennatullah Siam
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.12519
- Pdf link: https://arxiv.org/pdf/2402.12519
- Abstract Vast literature has compared the recordings of biological neurons in the brain to deep neural networks. The ultimate goal is to interpret deep networks or to better understand and encode biological neural systems. Recently, there has been a debate on whether system identification is possible and how much it can tell us about the brain computation. System identification recognizes whether one model is more valid to represent the brain computation over another. Nonetheless, previous work did not consider the time aspect and how video and dynamics (e.g., motion) modelling in deep networks relate to these biological neural systems within a large-scale comparison. Towards this end, we propose a system identification study focused on comparing single image vs. video understanding models with respect to the visual cortex recordings. Our study encompasses two sets of experiments; a real environment setup and a simulated environment setup. The study also encompasses more than 30 models and, unlike prior works, we focus on convolutional vs. transformer-based, single vs. two-stream, and fully vs. self-supervised video understanding models. The goal is to capture a greater variety of architectures that model dynamics. As such, this signifies the first large-scale study of video understanding models from a neuroscience perspective. Our results in the simulated experiments, show that system identification can be attained to a certain level in differentiating image vs. video understanding models. Moreover, we provide key insights on how video understanding models predict visual cortex responses; showing video understanding better than image understanding models, convolutional models are better in the early-mid regions than transformer based except for multiscale transformers that are still good in predicting these regions, and that two-stream models are better than single stream.
Locality-Sensitive Hashing-Based Efficient Point Transformer with Applications in High-Energy Physics
- Authors: Siqi Miao, Zhiyuan Lu, Mia Liu, Javier Duarte, Pan Li
- Subjects: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
- Arxiv link: https://arxiv.org/abs/2402.12535
- Pdf link: https://arxiv.org/pdf/2402.12535
- Abstract This study introduces a novel transformer model optimized for large-scale point cloud processing in scientific domains such as high-energy physics (HEP) and astrophysics. Addressing the limitations of graph neural networks and standard transformers, our model integrates local inductive bias and achieves near-linear complexity with hardware-friendly regular operations. One contribution of this work is the quantitative analysis of the error-complexity tradeoff of various sparsification techniques for building efficient transformers. Our findings highlight the superiority of using locality-sensitive hashing (LSH), especially OR & AND-construction LSH, in kernel approximation for large-scale point cloud data with local inductive bias. Based on this finding, we propose LSH-based Efficient Point Transformer (\textbf{HEPT}), which combines E$^2$LSH with OR & AND constructions and is built upon regular computations. HEPT demonstrates remarkable performance in two critical yet time-consuming HEP tasks, significantly outperforming existing GNNs and transformers in accuracy and computational speed, marking a significant advancement in geometric deep learning and large-scale scientific data processing. Our code is available at \url{https://github.com/Graph-COM/HEPT}.
YOLO-Ant: A Lightweight Detector via Depthwise Separable Convolutional and Large Kernel Design for Antenna Interference Source Detection
- Authors: Xiaoyu Tang, Xingming Chen, Jintao Cheng, Jin Wu, Rui Fan, Chengxi Zhang, Zebo Zhou
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.12641
- Pdf link: https://arxiv.org/pdf/2402.12641
- Abstract In the era of 5G communication, removing interference sources that affect communication is a resource-intensive task. The rapid development of computer vision has enabled unmanned aerial vehicles to perform various high-altitude detection tasks. Because the field of object detection for antenna interference sources has not been fully explored, this industry lacks dedicated learning samples and detection models for this specific task. In this article, an antenna dataset is created to address important antenna interference source detection issues and serves as the basis for subsequent research. We introduce YOLO-Ant, a lightweight CNN and transformer hybrid detector specifically designed for antenna interference source detection. Specifically, we initially formulated a lightweight design for the network depth and width, ensuring that subsequent investigations were conducted within a lightweight framework. Then, we propose a DSLK-Block module based on depthwise separable convolution and large convolution kernels to enhance the network's feature extraction ability, effectively improving small object detection. To address challenges such as complex backgrounds and large interclass differences in antenna detection, we construct DSLKVit-Block, a powerful feature extraction module that combines DSLK-Block and transformer structures. Considering both its lightweight design and accuracy, our method not only achieves optimal performance on the antenna dataset but also yields competitive results on public datasets.
PDEformer: Towards a Foundation Model for One-Dimensional Partial Differential Equations
- Authors: Zhanhong Ye, Xiang Huang, Leheng Chen, Hongsheng Liu, Zidong Wang, Bin Dong
- Subjects: Numerical Analysis (math.NA)
- Arxiv link: https://arxiv.org/abs/2402.12652
- Pdf link: https://arxiv.org/pdf/2402.12652
- Abstract This paper introduces PDEformer, a neural solver for partial differential equations (PDEs) capable of simultaneously addressing various types of PDEs. We advocate representing the PDE in the form of a computational graph, facilitating the seamless integration of both symbolic and numerical information inherent in a PDE. A graph Transformer and an implicit neural representation (INR) are employed to generate mesh-free predicted solutions. Following pretraining on data exhibiting a certain level of diversity, our model achieves zero-shot accuracies on benchmark datasets that surpass those of adequately trained expert models. Additionally, PDEformer demonstrates promising results in the inverse problem of PDE coefficient recovery.
Pre-trained Transformer-Enabled Strategies with Human-Guided Fine-Tuning for End-to-end Navigation of Autonomous Vehicles
- Authors: Dong Hu, Chao Huang, Jingda Wu, Hongbo Gao
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2402.12666
- Pdf link: https://arxiv.org/pdf/2402.12666
- Abstract Autonomous driving (AD) technology, leveraging artificial intelligence, strives for vehicle automation. End-toend strategies, emerging to simplify traditional driving systems by integrating perception, decision-making, and control, offer new avenues for advanced driving functionalities. Despite their potential, current challenges include data efficiency, training complexities, and poor generalization. This study addresses these issues with a novel end-to-end AD training model, enhancing system adaptability and intelligence. The model incorporates a Transformer module into the policy network, undergoing initial behavior cloning (BC) pre-training for update gradients. Subsequently, fine-tuning through reinforcement learning with human guidance (RLHG) adapts the model to specific driving environments, aiming to surpass the performance limits of imitation learning (IL). The fine-tuning process involves human interactions, guiding the model to acquire more efficient and safer driving behaviors through supervision, intervention, demonstration, and reward feedback. Simulation results demonstrate that this framework accelerates learning, achieving precise control and significantly enhancing safety and reliability. Compared to other advanced baseline methods, the proposed approach excels in challenging AD tasks. The introduction of the Transformer module and human-guided fine-tuning provides valuable insights and methods for research and applications in the AD field.
Tree-Planted Transformers: Large Language Models with Implicit Syntactic Supervision
- Authors: Ryo Yoshida, Taiga Someya, Yohei Oseki
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.12691
- Pdf link: https://arxiv.org/pdf/2402.12691
- Abstract Large Language Models (LLMs) have achieved remarkable success thanks to scalability on large text corpora, but have some drawback in training efficiency. In contrast, Syntactic Language Models (SLMs) can be trained efficiently to reach relatively high performance thanks to syntactic supervision, but have trouble with scalability. Thus, given these complementary advantages of LLMs and SLMs, it is necessary to develop an architecture that integrates the scalability of LLMs with the training efficiency of SLMs, namely Syntactic Large Language Models (SLLM). In this paper, we propose a novel method dubbed tree-planting: implicitly "plant" trees into attention weights of Transformer LMs to reflect syntactic structures of natural language. Specifically, Transformer LMs trained with tree-planting will be called Tree-Planted Transformers (TPT), which learn syntax on small treebanks via tree-planting and then scale on large text corpora via continual learning with syntactic scaffolding. Targeted syntactic evaluations on the SyntaxGym benchmark demonstrated that TPTs, despite the lack of explicit syntactic supervision, significantly outperformed various SLMs with explicit syntactic supervision that generate hundreds of syntactic structures in parallel, suggesting that tree-planting and TPTs are the promising foundation for SLLMs.
Equivariant Pretrained Transformer for Unified Geometric Learning on Multi-Domain 3D Molecules
- Authors: Rui Jiao, Xiangzhe Kong, Ziyang Yu, Wenbing Huang, Yang Liu
- Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
- Arxiv link: https://arxiv.org/abs/2402.12714
- Pdf link: https://arxiv.org/pdf/2402.12714
- Abstract Pretraining on a large number of unlabeled 3D molecules has showcased superiority in various scientific applications. However, prior efforts typically focus on pretraining models on a specific domain, either proteins or small molecules, missing the opportunity to leverage the cross-domain knowledge. To mitigate this gap, we introduce Equivariant Pretrained Transformer (EPT), a novel pretraining framework designed to harmonize the geometric learning of small molecules and proteins. To be specific, EPT unifies the geometric modeling of multi-domain molecules via the block-enhanced representation that can attend a broader context of each atom. Upon transformer framework, EPT is further enhanced with E(3) equivariance to facilitate the accurate representation of 3D structures. Another key innovation of EPT is its block-level pretraining task, which allows for joint pretraining on datasets comprising both small molecules and proteins. Experimental evaluations on a diverse group of benchmarks, including ligand binding affinity prediction, molecular property prediction, and protein property prediction, show that EPT significantly outperforms previous SOTA methods for affinity prediction, and achieves the best or comparable performance with existing domain-specific pretraining models for other tasks.
CST: Calibration Side-Tuning for Parameter and Memory Efficient Transfer Learning
- Authors: Feng Chen
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.12736
- Pdf link: https://arxiv.org/pdf/2402.12736
- Abstract Achieving a universally high accuracy in object detection is quite challenging, and the mainstream focus in the industry currently lies on detecting specific classes of objects. However, deploying one or multiple object detection networks requires a certain amount of GPU memory for training and storage capacity for inference. This presents challenges in terms of how to effectively coordinate multiple object detection tasks under resource-constrained conditions. This paper introduces a lightweight fine-tuning strategy called Calibration side tuning, which integrates aspects of adapter tuning and side tuning to adapt the successful techniques employed in transformers for use with ResNet. The Calibration side tuning architecture that incorporates maximal transition calibration, utilizing a small number of additional parameters to enhance network performance while maintaining a smooth training process. Furthermore, this paper has conducted an analysis on multiple fine-tuning strategies and have implemented their application within ResNet, thereby expanding the research on fine-tuning strategies for object detection networks. Besides, this paper carried out extensive experiments using five benchmark datasets. The experimental results demonstrated that this method outperforms other compared state-of-the-art techniques, and a better balance between the complexity and performance of the finetune schemes is achieved.
RhythmFormer: Extracting rPPG Signals Based on Hierarchical Temporal Periodic Transformer
- Authors: Bochao Zou, Zizheng Guo, Jiansheng Chen, Huimin Ma
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.12788
- Pdf link: https://arxiv.org/pdf/2402.12788
- Abstract Remote photoplethysmography (rPPG) is a non-contact method for detecting physiological signals based on facial videos, holding high potential in various applications such as healthcare, affective computing, anti-spoofing, etc. Due to the periodicity nature of rPPG, the long-range dependency capturing capacity of the Transformer was assumed to be advantageous for such signals. However, existing approaches have not conclusively demonstrated the superior performance of Transformer over traditional convolutional neural network methods, this gap may stem from a lack of thorough exploration of rPPG periodicity. In this paper, we propose RhythmFormer, a fully end-to-end transformer-based method for extracting rPPG signals by explicitly leveraging the quasi-periodic nature of rPPG. The core module, Hierarchical Temporal Periodic Transformer, hierarchically extracts periodic features from multiple temporal scales. It utilizes dynamic sparse attention based on periodicity in the temporal domain, allowing for fine-grained modeling of rPPG features. Furthermore, a fusion stem is proposed to guide self-attention to rPPG features effectively, and it can be easily transferred to existing methods to enhance their performance significantly. RhythmFormer achieves state-of-the-art performance with fewer parameters and reduced computational complexity in comprehensive experiments compared to previous approaches. The codes are available at https://github.com/zizheng-guo/RhythmFormer.
ASCEND: Accurate yet Efficient End-to-End Stochastic Computing Acceleration of Vision Transformer
- Authors: Tong Xie, Yixuan Hu, Renjie Wei, Meng Li, Yuan Wang, Runsheng Wang, Ru Huang
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2402.12820
- Pdf link: https://arxiv.org/pdf/2402.12820
- Abstract Stochastic computing (SC) has emerged as a promising computing paradigm for neural acceleration. However, how to accelerate the state-of-the-art Vision Transformer (ViT) with SC remains unclear. Unlike convolutional neural networks, ViTs introduce notable compatibility and efficiency challenges because of their nonlinear functions, e.g., softmax and Gaussian Error Linear Units (GELU). In this paper, for the first time, a ViT accelerator based on end-to-end SC, dubbed ASCEND, is proposed. ASCEND co-designs the SC circuits and ViT networks to enable accurate yet efficient acceleration. To overcome the compatibility challenges, ASCEND proposes a novel deterministic SC block for GELU and leverages an SC-friendly iterative approximate algorithm to design an accurate and efficient softmax circuit. To improve inference efficiency, ASCEND develops a two-stage training pipeline to produce accurate low-precision ViTs. With extensive experiments, we show the proposed GELU and softmax blocks achieve 56.3% and 22.6% error reduction compared to existing SC designs, respectively and reduce the area-delay product (ADP) by 5.29x and 12.6x, respectively. Moreover, compared to the baseline low-precision ViTs, ASCEND also achieves significant accuracy improvements on CIFAR10 and CIFAR100.
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
- Authors: Shahar Katz, Yonatan Belinkov, Mor Geva, Lior Wolf
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2402.12865
- Pdf link: https://arxiv.org/pdf/2402.12865
- Abstract Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the models' vocabularies, helping to uncover how information flows within LMs. In this work, we extend this methodology to LMs' backward pass and gradients. We first prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs. We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs' neurons.
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
- Authors: Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma
- Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2402.12875
- Pdf link: https://arxiv.org/pdf/2402.12875
- Abstract Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $ \mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
Autism Detection in Speech -- A Survey
- Authors: Nadine Probol, Margot Mieskes
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.12880
- Pdf link: https://arxiv.org/pdf/2402.12880
- Abstract There has been a range of studies of how autism is displayed in voice, speech, and language. We analyse studies from the biomedical, as well as the psychological domain, but also from the NLP domain in order to find linguistic, prosodic and acoustic cues that could indicate autism. Our survey looks at all three domains. We define autism and which comorbidities might influence the correct detection of the disorder. We especially look at observations such as verbal and semantic fluency, prosodic features, but also disfluencies and speaking rate. We also show word-based approaches and describe machine learning and transformer-based approaches both on the audio data as well as the transcripts. Lastly, we conclude, while there already is a lot of research, female patients seem to be severely under-researched. Also, most NLP research focuses on traditional machine learning methods instead of transformers which could be beneficial in this context. Additionally, we were unable to find research combining both features from audio and transcripts.
GRAFFORD: A Benchmark Dataset for Testing the Knowledge of Object Affordances of Language and Vision Models
- Authors: Sayantan Adak, Daivik Agrawal, Animesh Mukherjee, Somak Aditya
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.12881
- Pdf link: https://arxiv.org/pdf/2402.12881
- Abstract We investigate the knowledge of object affordances in pre-trained language models (LMs) and pre-trained Vision-Language models (VLMs). Transformers-based large pre-trained language models (PTLM) learn contextual representation from massive amounts of unlabeled text and are shown to perform impressively in downstream NLU tasks. In parallel, a growing body of literature shows that PTLMs fail inconsistently and non-intuitively, showing a lack of reasoning and grounding. To take a first step toward quantifying the effect of grounding (or lack thereof), we curate a novel and comprehensive dataset of object affordances -- GrAFFORD, characterized by 15 affordance classes. Unlike affordance datasets collected in vision and language domains, we annotate in-the-wild sentences with objects and affordances. Experimental results reveal that PTLMs exhibit limited reasoning abilities when it comes to uncommon object affordances. We also observe that pre-trained VLMs do not necessarily capture object affordances effectively. Through few-shot fine-tuning, we demonstrate improvement in affordance knowledge in PTLMs and VLMs. Our research contributes a novel dataset for language grounding tasks, and presents insights into LM capabilities, advancing the understanding of object affordances. Codes and data are available at https://github.com/sayantan11995/Affordance
Cell Graph Transformer for Nuclei Classification
- Authors: Wei Lou, Guanbin Li, Xiang Wan, Haofeng Li
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.12946
- Pdf link: https://arxiv.org/pdf/2402.12946
- Abstract Nuclei classification is a critical step in computer-aided diagnosis with histopathology images. In the past, various methods have employed graph neural networks (GNN) to analyze cell graphs that model inter-cell relationships by considering nuclei as vertices. However, they are limited by the GNN mechanism that only passes messages among local nodes via fixed edges. To address the issue, we develop a cell graph transformer (CGT) that treats nodes and edges as input tokens to enable learnable adjacency and information exchange among all nodes. Nevertheless, training the transformer with a cell graph presents another challenge. Poorly initialized features can lead to noisy self-attention scores and inferior convergence, particularly when processing the cell graphs with numerous connections. Thus, we further propose a novel topology-aware pretraining method that leverages a graph convolutional network (GCN) to learn a feature extractor. The pre-trained features may suppress unreasonable correlations and hence ease the finetuning of CGT. Experimental results suggest that the proposed cell graph transformer with topology-aware pretraining significantly improves the nuclei classification results, and achieves the state-of-the-art performance. Code and models are available at https://github.com/lhaof/CGT
Conditional Logical Message Passing Transformer for Complex Query Answering
- Authors: Chongzhi Zhang, Zhiping Peng, Junhao Zheng, Qianli Ma
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
- Arxiv link: https://arxiv.org/abs/2402.12954
- Pdf link: https://arxiv.org/pdf/2402.12954
- Abstract Complex Query Answering (CQA) over Knowledge Graphs (KGs) is a challenging task. Given that KGs are usually incomplete, neural models are proposed to solve CQA by performing multi-hop logical reasoning. However, most of them cannot perform well on both one-hop and multi-hop queries simultaneously. Recent work proposes a logical message passing mechanism based on the pre-trained neural link predictors. While effective on both one-hop and multi-hop queries, it ignores the difference between the constant and variable nodes in a query graph. In addition, during the node embedding update stage, this mechanism cannot dynamically measure the importance of different messages, and whether it can capture the implicit logical dependencies related to a node and received messages remains unclear. In this paper, we propose Conditional Logical Message Passing Transformer (CLMPT), which considers the difference between constants and variables in the case of using pre-trained neural link predictors and performs message passing conditionally on the node type. We empirically verified that this approach can reduce computational costs without affecting performance. Furthermore, CLMPT uses the transformer to aggregate received messages and update the corresponding node embedding. Through the self-attention mechanism, CLMPT can assign adaptive weights to elements in an input set consisting of received messages and the corresponding node and explicitly model logical dependencies between various elements. Experimental results show that CLMPT is a new state-of-the-art neural CQA model.
Investigating the Impact of Model Instability on Explanations and Uncertainty
- Authors: Sara Vera Marjanović, Isabelle Augenstein, Christina Lioma
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.13006
- Pdf link: https://arxiv.org/pdf/2402.13006
- Abstract Explainable AI methods facilitate the understanding of model behaviour, yet, small, imperceptible perturbations to inputs can vastly distort explanations. As these explanations are typically evaluated holistically, before model deployment, it is difficult to assess when a particular explanation is trustworthy. Some studies have tried to create confidence estimators for explanations, but none have investigated an existing link between uncertainty and explanation quality. We artificially simulate epistemic uncertainty in text input by introducing noise at inference time. In this large-scale empirical study, we insert different levels of noise perturbations and measure the effect on the output of pre-trained language models and different uncertainty metrics. Realistic perturbations have minimal effect on performance and explanations, yet masking has a drastic effect. We find that high uncertainty doesn't necessarily imply low explanation plausibility; the correlation between the two metrics can be moderately positive when noise is exposed during the training process. This suggests that noise-augmented models may be better at identifying salient tokens when uncertain. Furthermore, when predictive and epistemic uncertainty measures are over-confident, the robustness of a saliency map to perturbation can indicate model stability issues. Integrated Gradients shows the overall greatest robustness to perturbation, while still showing model-specific patterns in performance; however, this phenomenon is limited to smaller Transformer-based language models.
Understanding the effects of language-specific class imbalance in multilingual fine-tuning
- Authors: Vincent Jung, Lonneke van der Plas
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.13016
- Pdf link: https://arxiv.org/pdf/2402.13016
- Abstract We study the effect of one type of imbalance often present in real-life multilingual classification datasets: an uneven distribution of labels across languages. We show evidence that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, a more pronounced separation of languages in the latent space, and the promotion of uninformative features. We modify the traditional class weighing approach to imbalance by calculating class weights separately for each language and show that this helps mitigate those detrimental effects. These results create awareness of the negative effects of language-specific class imbalance in multilingual fine-tuning and the way in which the model learns to rely on the separation of languages to perform the task.
SiLLM: Large Language Models for Simultaneous Machine Translation
- Authors: Shoutao Guo, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.13036
- Pdf link: https://arxiv.org/pdf/2402.13036
- Abstract Simultaneous Machine Translation (SiMT) generates translations while reading the source sentence, necessitating a policy to determine the optimal timing for reading and generating words. Despite the remarkable performance achieved by Large Language Models (LLM) across various NLP tasks, existing SiMT methods predominantly focus on conventional transformers, employing a single model to concurrently determine the policy and generate the translations. However, given the complexity of SiMT, it is challenging to effectively address both tasks with a single model. Therefore, there is a need to decouple the SiMT task into policy-decision and translation sub-tasks. We propose SiLLM, which delegates the two sub-tasks to separate agents, thereby incorporating LLM into SiMT. The policy-decision agent is managed by a conventional SiMT model, responsible for determining the translation policy. The translation agent, leveraging the capabilities of LLM, generates translation using the partial source sentence. The two agents collaborate to accomplish SiMT. To facilitate the application of token-level policies determined by conventional SiMT models to LLM, we propose a word-level policy adapted for LLM. Experiments on two datasets demonstrate that, with a small amount of data for fine-tuning LLM, SiLLM attains state-of-the-art performance.
Identifying Semantic Induction Heads to Understand In-Context Learning
- Authors: Jie Ren, Qipeng Guo, Hang Yan, Dongrui Liu, Xipeng Qiu, Dahua Lin
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.13055
- Pdf link: https://arxiv.org/pdf/2402.13055
- Abstract Although large language models (LLMs) have demonstrated remarkable performance, the lack of transparency in their inference logic raises concerns about their trustworthiness. To gain a better understanding of LLMs, we conduct a detailed analysis of the operations of attention heads and aim to better understand the in-context learning of LLMs. Specifically, we investigate whether attention heads encode two types of relationships between tokens present in natural languages: the syntactic dependency parsed from sentences and the relation within knowledge graphs. We find that certain attention heads exhibit a pattern where, when attending to head tokens, they recall tail tokens and increase the output logits of those tail tokens. More crucially, the formulation of such semantic induction heads has a close correlation with the emergence of the in-context learning ability of language models. The study of semantic attention heads advances our understanding of the intricate operations of attention heads in transformers, and further provides new insights into the in-context learning of LLMs.
When Only Time Will Tell: Interpreting How Transformers Process Local Ambiguities Through the Lens of Restart-Incrementality
- Authors: Brielen Madureira, Patrick Kahardipraja, David Schlangen
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.13113
- Pdf link: https://arxiv.org/pdf/2402.13113
- Abstract Incremental models that process sentences one token at a time will sometimes encounter points where more than one interpretation is possible. Causal models are forced to output one interpretation and continue, whereas models that can revise may edit their previous output as the ambiguity is resolved. In this work, we look at how restart-incremental Transformers build and update internal states, in an effort to shed light on what processes cause revisions not viable in autoregressive models. We propose an interpretable way to analyse the incremental states, showing that their sequential structure encodes information on the garden path effect and its resolution. Our method brings insights on various bidirectional encoders for contextualised meaning representation and dependency parsing, contributing to show their advantage over causal models when it comes to revisions.
The Hidden Space of Transformer Language Adapters
- Authors: Jesujoba O. Alabi, Marius Mosbach, Matan Eyal, Dietrich Klakow, Mor Geva
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.13137
- Pdf link: https://arxiv.org/pdf/2402.13137
- Abstract We analyze the operation of transformer language adapters, which are small modules trained on top of a frozen language model to adapt its predictions to new target languages. We show that adapted predictions mostly evolve in the source language the model was trained on, while the target language becomes pronounced only in the very last layers of the model. Moreover, the adaptation process is gradual and distributed across layers, where it is possible to skip small groups of adapters without decreasing adaptation performance. Last, we show that adapters operate on top of the model's frozen representation space while largely preserving its structure, rather than on an 'isolated' subspace. Our findings provide a deeper view into the adaptation process of language models to new languages, showcasing the constraints imposed on it by the underlying model and introduces practical implications to enhance its efficiency.
OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog
- Authors: Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.13146
- Pdf link: https://arxiv.org/pdf/2402.13146
- Abstract We present the Object Language Video Transformer (OLViT) - a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns. In stark contrast to previous works, our approach is generic by nature and is therefore capable of learning continuous multi-modal dialog state representations of the most relevant objects and rounds. As a result, they can be seamlessly integrated into Large Language Models (LLMs) and offer high flexibility in dealing with different datasets and tasks. Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance across both datasets.
DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models
- Authors: Norman Di Palo, Edward Johns
- Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2402.13181
- Pdf link: https://arxiv.org/pdf/2402.13181
- Abstract We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://www.robot-learning.uk/dinobot.
Tiny Reinforcement Learning for Quadruped Locomotion using Decision Transformers
- Authors: Orhan Eren Akgün, Néstor Cuevas, Matheus Farias, Daniel Garces
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2402.13201
- Pdf link: https://arxiv.org/pdf/2402.13201
- Abstract Resource-constrained robotic platforms are particularly useful for tasks that require low-cost hardware alternatives due to the risk of losing the robot, like in search-and-rescue applications, or the need for a large number of devices, like in swarm robotics. For this reason, it is crucial to find mechanisms for adapting reinforcement learning techniques to the constraints imposed by lower computational power and smaller memory capacities of these ultra low-cost robotic platforms. We try to address this need by proposing a method for making imitation learning deployable onto resource-constrained robotic platforms. Here we cast the imitation learning problem as a conditional sequence modeling task and we train a decision transformer using expert demonstrations augmented with a custom reward. Then, we compress the resulting generative model using software optimization schemes, including quantization and pruning. We test our method in simulation using Isaac Gym, a realistic physics simulation environment designed for reinforcement learning. We empirically demonstrate that our method achieves natural looking gaits for Bittle, a resource-constrained quadruped robot. We also run multiple simulations to show the effects of pruning and quantization on the performance of the model. Our results show that quantization (down to 4 bits) and pruning reduce model size by around 30% while maintaining a competitive reward, making the model deployable in a resource-constrained system.
Keyword: scene understanding
There is no result
Keyword: visual reasoning
Visual Reasoning in Object-Centric Deep Neural Networks: A Comparative Cognition Approach
- Authors: Guillermo Puebla, Jeffrey S. Bowers
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.12675
- Pdf link: https://arxiv.org/pdf/2402.12675
- Abstract Achieving visual reasoning is a long-term goal of artificial intelligence. In the last decade, several studies have applied deep neural networks (DNNs) to the task of learning visual relations from images, with modest results in terms of generalization of the relations learned. However, in recent years, object-centric representation learning has been put forward as a way to achieve visual reasoning within the deep learning framework. Object-centric models attempt to model input scenes as compositions of objects and relations between them. To this end, these models use several kinds of attention mechanisms to segregate the individual objects in a scene from the background and from other objects. In this work we tested relation learning and generalization in several object-centric models, as well as a ResNet-50 baseline. In contrast to previous research, which has focused heavily in the same-different task in order to asses relational reasoning in DNNs, we use a set of tasks -- with varying degrees of difficulty -- derived from the comparative cognition literature. Our results show that object-centric models are able to segregate the different objects in a scene, even in many out-of-distribution cases. In our simpler tasks, this improves their capacity to learn and generalize visual relations in comparison to the ResNet-50 baseline. However, object-centric models still struggle in our more difficult tasks and conditions. We conclude that abstract visual reasoning remains an open challenge for DNNs, including object-centric models.