arxiv-daily New submissions for Thursday, 6 June 2024 (showing 442 of 442 entries )

New submissions for Thursday, 6 June 2024 (showing 442 of 442 entries )

Open DongZhouGu opened this issue 8 months ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Title:

      ShadowRefiner: Towards Mask-free Shadow Removal via Fast Fourier Transformer

Authors: Wei Dong, Han Zhou, Yuqiong Tian, Jingke Sun, Xiaohong Liu, Guangtao Zhai, Jun Chen
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Shadow-affected images often exhibit pronounced spatial discrepancies in color and illumination, consequently degrading various vision applications including object detection and segmentation systems. To effectively eliminate shadows in real-world images while preserving intricate details and producing visually compelling outcomes, we introduce a mask-free Shadow Removal and Refinement network (ShadowRefiner) via Fast Fourier Transformer. Specifically, the Shadow Removal module in our method aims to establish effective mappings between shadow-affected and shadow-free images via spatial and frequency representation learning. To mitigate the pixel misalignment and further improve the image quality, we propose a novel Fast-Fourier Attention based Transformer (FFAT) architecture, where an innovative attention mechanism is designed for meticulous refinement. Our method wins the championship in the Perceptual Track and achieves the second best performance in the Fidelity Track of NTIRE 2024 Image Shadow Removal Challenge. Besides, comprehensive experiment result also demonstrate the compelling effectiveness of our proposed method. The code is publicly available: this https URL.

Title:

      Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

Authors: Siavash Golkar, Alberto Bietti, Mariel Pettee, Michael Eickenberg, Miles Cranmer, Keiya Hirashima, Geraud Krawezik, Nicholas Lourie, Michael McCabe, Rudy Morel, Ruben Ohana, Liam Holden Parker, Bruno Régaldo-Saint Blancard, Kyunghyun Cho, Shirley Ho
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that no positional embeddings lead to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out of distribution performance is tightly linked to which tokens it uses as a bias term.

Title:

      Enhanced Automotive Object Detection via RGB-D Fusion in a DiffusionDet Framework

Authors: Eliraz Orfaig, Inna Stainvas, Igal Bilik
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Vision-based autonomous driving requires reliable and efficient object detection. This work proposes a DiffusionDet-based framework that exploits data fusion from the monocular camera and depth sensor to provide the RGB and depth (RGB-D) data. Within this framework, ground truth bounding boxes are randomly reshaped as part of the training phase, allowing the model to learn the reverse diffusion process of noise addition. The system methodically enhances a randomly generated set of boxes at the inference stage, guiding them toward accurate final detections. By integrating the textural and color features from RGB images with the spatial depth information from the LiDAR sensors, the proposed framework employs a feature fusion that substantially enhances object detection of automotive targets. The $2.3$ AP gain in detecting automotive targets is achieved through comprehensive experiments using the KITTI dataset. Specifically, the improved performance of the proposed approach in detecting small objects is demonstrated.

Title:

      Situation Monitor: Diversity-Driven Zero-Shot Out-of-Distribution Detection using Budding Ensemble Architecture for Object Detection

Authors: Qutub Syed, Michael Paulitsch, Korbinian Hagn, Neslihan Kose Cihangir, Kay-Ulrich Scholl, Fabian Oboril, Gereon Hinz, Alois Knoll
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We introduce Situation Monitor, a novel zero-shot Out-of-Distribution (OOD) detection approach for transformer-based object detection models to enhance reliability in safety-critical machine learning applications such as autonomous driving. The Situation Monitor utilizes the Diversity-based Budding Ensemble Architecture (DBEA) and increases the OOD performance by integrating a diversity loss into the training process on top of the budding ensemble architecture, detecting Far-OOD samples and minimizing false positives on Near-OOD samples. Moreover, utilizing the resulting DBEA increases the model's OOD performance and improves the calibration of confidence scores, particularly concerning the intersection over union of the detected objects. The DBEA model achieves these advancements with a 14% reduction in trainable parameters compared to the vanilla model. This signifies a substantial improvement in efficiency without compromising the model's ability to detect OOD instances and calibrate the confidence scores accurately.

Title:

      Global Clipper: Enhancing Safety and Reliability of Transformer-based Object Detection Models

Authors: Qutub Syed Sha, Michael Paulitsch, Karthik Pattabiraman, Korbinian Hagn, Fabian Oboril, Cornelius Buerkle, Kay-Ulrich Scholl, Gereon Hinz, Alois Knoll
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract As transformer-based object detection models progress, their impact in critical sectors like autonomous vehicles and aviation is expected to grow. Soft errors causing bit flips during inference have significantly impacted DNN performance, altering predictions. Traditional range restriction solutions for CNNs fall short for transformers. This study introduces the Global Clipper and Global Hybrid Clipper, effective mitigation strategies specifically designed for transformer-based models. It significantly enhances their resilience to soft errors and reduces faulty inferences to ~ 0%. We also detail extensive testing across over 64 scenarios involving two transformer models (DINO-DETR and Lite-DETR) and two CNN models (YOLOv3 and SSD) using three datasets, totalling approximately 3.3 million inferences, to assess model robustness comprehensively. Moreover, the paper explores unique aspects of attention blocks in transformers and their operational differences from CNNs.

Title:

      LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

Authors: Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for reducing the ViT encoder complexity. We improve the ViT encoder by aggregating multi-level feature maps, and the intermediate and final feature maps in the ViT encoder, forming richer feature maps, and introduce window-major feature map organization for improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior over existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at (this https URL).

Keyword: transformer

Title:

      ShadowRefiner: Towards Mask-free Shadow Removal via Fast Fourier Transformer

Authors: Wei Dong, Han Zhou, Yuqiong Tian, Jingke Sun, Xiaohong Liu, Guangtao Zhai, Jun Chen
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Shadow-affected images often exhibit pronounced spatial discrepancies in color and illumination, consequently degrading various vision applications including object detection and segmentation systems. To effectively eliminate shadows in real-world images while preserving intricate details and producing visually compelling outcomes, we introduce a mask-free Shadow Removal and Refinement network (ShadowRefiner) via Fast Fourier Transformer. Specifically, the Shadow Removal module in our method aims to establish effective mappings between shadow-affected and shadow-free images via spatial and frequency representation learning. To mitigate the pixel misalignment and further improve the image quality, we propose a novel Fast-Fourier Attention based Transformer (FFAT) architecture, where an innovative attention mechanism is designed for meticulous refinement. Our method wins the championship in the Perceptual Track and achieves the second best performance in the Fidelity Track of NTIRE 2024 Image Shadow Removal Challenge. Besides, comprehensive experiment result also demonstrate the compelling effectiveness of our proposed method. The code is publicly available: this https URL.

Title:

      Pretrained Mobility Transformer: A Foundation Model for Human Mobility

Authors: Xinhua Wu, Haoyu He, Yanchao Wang, Qi Wang
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Ubiquitous mobile devices are generating vast amounts of location-based service data that reveal how individuals navigate and utilize urban spaces in detail. In this study, we utilize these extensive, unlabeled sequences of user trajectories to develop a foundation model for understanding urban space and human mobility. We introduce the \textbf{P}retrained \textbf{M}obility \textbf{T}ransformer (PMT), which leverages the transformer architecture to process user trajectories in an autoregressive manner, converting geographical areas into tokens and embedding spatial and temporal information within these representations. Experiments conducted in three U.S. metropolitan areas over a two-month period demonstrate PMT's ability to capture underlying geographic and socio-demographic characteristics of regions. The proposed PMT excels across various downstream tasks, including next-location prediction, trajectory imputation, and trajectory generation. These results support PMT's capability and effectiveness in decoding complex patterns of human mobility, offering new insights into urban spatial functionality and individual mobility preferences.

Title:

      Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

Authors: Siavash Golkar, Alberto Bietti, Mariel Pettee, Michael Eickenberg, Miles Cranmer, Keiya Hirashima, Geraud Krawezik, Nicholas Lourie, Michael McCabe, Rudy Morel, Ruben Ohana, Liam Holden Parker, Bruno Régaldo-Saint Blancard, Kyunghyun Cho, Shirley Ho
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that no positional embeddings lead to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out of distribution performance is tightly linked to which tokens it uses as a bias term.

Title:

      LOLAMEME: Logic, Language, Memory, Mechanistic Framework

Authors: Jay Desai, Xiaobo Guo, Srinivasan H. Sengamedu
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The performance of Large Language Models has achieved superhuman breadth with unprecedented depth. At the same time, the language models are mostly black box models and the underlying mechanisms for performance have been evaluated using synthetic or mechanistic schemes. We extend current mechanistic schemes to incorporate Logic, memory, and nuances of Language such as latent structure. The proposed framework is called LOLAMEME and we provide two instantiations of LOLAMEME: LoLa and MeMe languages. We then consider two generative language model architectures: transformer-based GPT-2 and convolution-based Hyena. We propose the hybrid architecture T HEX and use LOLAMEME framework is used to compare three architectures. T HEX outperforms GPT-2 and Hyena on select tasks.

Title:

      Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

Authors: Andis Draguns, Andrew Gritsevskiy, Sumeet Ramesh Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in autoregressive transformer models, that, in contrast to prior art, are unelicitable in nature. Unelicitability prevents the defender from triggering the backdoor, making it impossible to evaluate or detect ahead of deployment even if given full white-box access and using automated techniques, such as red-teaming or certain formal verification methods. We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties. We confirm these properties in empirical investigations, and provide evidence that our backdoors can withstand state-of-the-art mitigation strategies. Additionally, we expand on previous work by showing that our universal backdoors, while not completely undetectable in white-box settings, can be harder to detect than some existing designs. By demonstrating the feasibility of seamlessly integrating backdoors into transformer models, this paper fundamentally questions the efficacy of pre-deployment detection strategies. This offers new insights into the offence-defence balance in AI safety and security.

Title:

      Progressive Inference: Explaining Decoder-Only Sequence Classification Models Using Intermediate Predictions

Authors: Sanjay Kariyappa, Freddy Lécué, Saumitra Mishra, Christopher Pond, Daniele Magazzeni, Manuela Veloso
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper proposes Progressive Inference - a framework to compute input attributions to explain the predictions of decoder-only sequence classification models. Our work is based on the insight that the classification head of a decoder-only Transformer model can be used to make intermediate predictions by evaluating them at different points in the input sequence. Due to the causal attention mechanism, these intermediate predictions only depend on the tokens seen before the inference point, allowing us to obtain the model's prediction on a masked input sub-sequence, with negligible computational overheads. We develop two methods to provide sub-sequence level attributions using this insight. First, we propose Single Pass-Progressive Inference (SP-PI), which computes attributions by taking the difference between consecutive intermediate predictions. Second, we exploit a connection with Kernel SHAP to develop Multi Pass-Progressive Inference (MP-PI). MP-PI uses intermediate predictions from multiple masked versions of the input to compute higher quality attributions. Our studies on a diverse set of models trained on text classification tasks show that SP-PI and MP-PI provide significantly better attributions compared to prior work.

Title:

      Block Transformer: Global-to-Local Language Modeling for Fast Inference

Authors: Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper presents the Block Transformer architecture which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks of self-attention. To apply self-attention, the key-value (KV) cache of all previous sequences must be retrieved from memory at every decoding step. Thereby, this KV cache IO becomes a significant bottleneck in batch inference. We notice that these costs stem from applying self-attention on the global context, therefore we isolate the expensive bottlenecks of global modeling to lower layers and apply fast local modeling in upper layers. To mitigate the remaining costs in the lower layers, we aggregate input tokens into fixed size blocks and then apply self-attention at this coarse level. Context information is aggregated into a single embedding to enable upper layers to decode the next block of tokens, without global attention. Free of global attention bottlenecks, the upper layers can fully utilize the compute hardware to maximize inference throughput. By leveraging global and local modules, the Block Transformer architecture demonstrates 10-20x gains in inference throughput compared to vanilla transformers with equivalent perplexity. Our work introduces a new approach to optimize language model inference through novel application of global-to-local modeling. Code is available at this https URL.

Title:

      Multi-layer Learnable Attention Mask for Multimodal Tasks

Authors: Wayner Barrios, SouYoung Jin
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract While the Self-Attention mechanism in the Transformer model has proven to be effective in many domains, we observe that it is less effective in more diverse settings (e.g. multimodality) due to the varying granularity of each token and the high computational demands of lengthy sequences. To address the challenges, we introduce the Learnable Attention Mask (LAM), strategically designed to globally regulate attention maps and prioritize critical tokens within the sequence. Leveraging the Self-Attention module in a BERT-like transformer network, our approach adeptly captures associations between tokens. The extension of the LAM to a multi-layer version accommodates the varied information aspects embedded at each layer of the Transformer network. Comprehensive experimental validation on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT, demonstrates the efficacy of the LAM, exemplifying its ability to enhance model performance while mitigating redundant computations. This pioneering approach presents a significant advancement in enhancing the understanding of complex scenarios, such as in movie understanding.

Title:

      Spatial and social situation-aware transformer-based trajectory prediction of autonomous systems

Authors: Kathrin Donandt, Dirk Söffker
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Autonomous transportation systems such as road vehicles or vessels require the consideration of the static and dynamic environment to dislocate without collision. Anticipating the behavior of an agent in a given situation is required to adequately react to it in time. Developing deep learning-based models has become the dominant approach to motion prediction recently. The social environment is often considered through a CNN-LSTM-based sub-module processing a $\textit{social tensor}$ that includes information of the past trajectory of surrounding agents. For the proposed transformer-based trajectory prediction model, an alternative, computationally more efficient social tensor definition and processing is suggested. It considers the interdependencies between target and surrounding agents at each time step directly instead of relying on information of last hidden LSTM states of individually processed agents. A transformer-based sub-module, the Social Tensor Transformer, is integrated into the overall prediction model. It is responsible for enriching the target agent's dislocation features with social interaction information obtained from the social tensor. For the awareness of spatial limitations, dislocation features are defined in relation to the navigable area. This replaces additional, computationally expensive map processing sub-modules. An ablation study shows, that for longer prediction horizons, the deviation of the predicted trajectory from the ground truth is lower compared to a spatially and socially agnostic model. Even if the performance gain from a spatial-only to a spatial and social context-sensitive model is small in terms of common error measures, by visualizing the results it can be shown that the proposed model in fact is able to predict reactions to surrounding agents and explicitely allows an interpretable behavior.

Title:

      Short-term Inland Vessel Trajectory Prediction with Encoder-Decoder Models

Authors: Kathrin Donandt, Karim Böttger, Dirk Söffker
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Accurate vessel trajectory prediction is necessary for save and efficient navigation. Deep learning-based prediction models, esp. encoder-decoders, are rarely applied to inland navigation specifically. Approaches from the maritime domain cannot directly be transferred to river navigation due to specific driving behavior influencing factors. Different encoder-decoder architectures, including a transformer encoder-decoder, are compared herein for predicting the next positions of inland vessels, given not only spatio-temporal information from AIS, but also river specific features. The results show that the reformulation of the regression task as classification problem and the inclusion of river specific features yield the lowest displacement errors. The standard LSTM encoder-decoder outperforms the transformer encoder-decoder for the data considered, but is computationally more expensive. In this study for the first time a transformer-based encoder-decoder model is applied to the problem of predicting the ship trajectory. Here, a feature vector using the river-specific context of navigation input parameters is established. Future studies can built on the proposed models, investigate the improvement of the computationally more efficient transformer, e.g. through further hyper-parameter optimization, and use additional river-specific information in the context representation to further increase prediction accuracy.

Title:

      Improved context-sensitive transformer model for inland vessel trajectory prediction

Authors: Kathrin Donandt, Karim Böttger, Dirk Söffker
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Physics-related and model-based vessel trajectory prediction is highly accurate but requires specific knowledge of the vessel under consideration which is not always practical. Machine learning-based trajectory prediction models do not require expert knowledge, but rely on the implicit knowledge extracted from massive amounts of data. Several deep learning (DL) methods for vessel trajectory prediction have recently been suggested. The DL models developed typically only process information about the (dis)location of vessels defined with respect to a global reference system. In the context of inland navigation, this can be problematic, since without knowledge of the limited navigable space, irrealistic trajectories are likely to be determined. If spatial constraintes are introduced, e.g., by implementing an additional submodule to process map data, however, overall complexity increases. Instead of processing the vessel displacement information on the one hand and the spatial information on the other hand, the paper proposes the merging of both information. Here, fairway-related and navigation-related displacement information are used directly. In this way, the previously proposed context-sensitive Classification Transformer (CSCT) shows an improved spatial awareness. Additionally, the CSCT is adapted to assess the model uncertainty by enabling dropout during inference. This approach is trained on different inland waterways to analyze its generalizability. As the improved CSCT obtains lower prediction errors and enables to estimate the trustworthiness of each prediction, it is more suitable for safety-critical applications in inland navigation than previously developed models.

Title:

      Too Big to Fail: Larger Language Models are Disproportionately Resilient to Induction of Dementia-Related Linguistic Anomalies

Authors: Changye Li, Zhecheng Sheng, Trevor Cohen, Serguei Pakhomov
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract As artificial neural networks grow in complexity, understanding their inner workings becomes increasingly challenging, which is particularly important in healthcare applications. The intrinsic evaluation metrics of autoregressive neural language models (NLMs), perplexity (PPL), can reflect how "surprised" an NLM model is at novel input. PPL has been widely used to understand the behavior of NLMs. Previous findings show that changes in PPL when masking attention layers in pre-trained transformer-based NLMs reflect linguistic anomalies associated with Alzheimer's disease dementia. Building upon this, we explore a novel bidirectional attention head ablation method that exhibits properties attributed to the concepts of cognitive and brain reserve in human brain studies, which postulate that people with more neurons in the brain and more efficient processing are more resilient to neurodegeneration. Our results show that larger GPT-2 models require a disproportionately larger share of attention heads to be masked/ablated to display degradation of similar magnitude to masking in smaller models. These results suggest that the attention mechanism in transformer models may present an analogue to the notions of cognitive and brain reserve and could potentially be used to model certain aspects of the progression of neurodegenerative disorders and aging.

Title:

      Exact Conversion of In-Context Learning to Model Weights

Authors: Brian K Chen, Tianyang Hu, Hui Jin, Hwee Kuan Lee, Kenji Kawaguchi
Subjects: Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In-Context Learning (ICL) has been a powerful emergent property of large language models that has attracted increasing attention in recent years. In contrast to regular gradient-based learning, ICL is highly interpretable and does not require parameter updates. In this paper, we show that, for linearized transformer networks, ICL can be made explicit and permanent through the inclusion of bias terms. We mathematically demonstrate the equivalence between a model with ICL demonstration prompts and the same model with the additional bias terms. Our algorithm (ICLCA) allows for exact conversion in an inexpensive manner. Existing methods are not exact and require expensive parameter updates. We demonstrate the efficacy of our approach through experiments that show the exact incorporation of ICL tokens into a linear transformer. We further suggest how our method can be adapted to achieve cheap approximate conversion of ICL tokens, even in regular transformer networks that are not linearized. Our experiments on GPT-2 show that, even though the conversion is only approximate, the model still gains valuable context from the included bias terms.

Title:

      Towards Interactive Autonomous Vehicle Testing: Vehicle-Under-Test-Centered Traffic Simulation

Authors: Yiru Liu, Xiaocong Zhao, Jian Sun
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The simulation-based testing is essential for safely implementing autonomous vehicles (AVs) on roads, necessitating simulated traffic environments that dynamically interact with the Vehicle Under Test (VUT). This study introduces a VUT-Centered environmental Dynamics Inference (VCDI) model for realistic, interactive, and diverse background traffic simulation. VCDI is built on a Transformer-based trajectory inference model to generate trajectories for background objects. Serving the purpose of AV testing, VCDI additionally considers VUT-centered interactivity and scenario diversity using a conditional inference framework. First, the VUT future motion is taken as an augmented model input to bridge the interaction between VUT and background objects. Second, to enrich the scenario diversity, a Bayesian-network-based cost function module is designed. The module, learned in a distributional form, captures the uncertainty of the VUT's strategy, triggering various scenario evolution. Experimental results validate VCDI's trajectory-level simulation precision which outperforms the state-of-the-art trajectory prediction work. The flexibility of the distributional cost function allows VCDI to provide diverse-yet-realistic scenarios for AV testing. We demonstrate such capability by modifying the anticipation to VUT's cost-based strategy and thus achieve multiple testing scenarios with explainable background traffic evolution.

Title:

      P2PFormer: A Primitive-to-polygon Method for Regular Building Contour Extraction from Remote Sensing Images

Authors: Tao Zhang, Shiqing Wei, Yikang Zhou, Muying Luo, Wenling You, Shunping Ji
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Extracting building contours from remote sensing imagery is a significant challenge due to buildings' complex and diverse shapes, occlusions, and noise. Existing methods often struggle with irregular contours, rounded corners, and redundancy points, necessitating extensive post-processing to produce regular polygonal building contours. To address these challenges, we introduce a novel, streamlined pipeline that generates regular building contours without post-processing. Our approach begins with the segmentation of generic geometric primitives (which can include vertices, lines, and corners), followed by the prediction of their sequence. This allows for the direct construction of regular building contours by sequentially connecting the segmented primitives. Building on this pipeline, we developed P2PFormer, which utilizes a transformer-based architecture to segment geometric primitives and predict their order. To enhance the segmentation of primitives, we introduce a unique representation called group queries. This representation comprises a set of queries and a singular query position, which improve the focus on multiple midpoints of primitives and their efficient linkage. Furthermore, we propose an innovative implicit update strategy for the query position embedding aimed at sharpening the focus of queries on the correct positions and, consequently, enhancing the quality of primitive segmentation. Our experiments demonstrate that P2PFormer achieves new state-of-the-art performance on the WHU, CrowdAI, and WHU-Mix datasets, surpassing the previous SOTA PolyWorld by a margin of 2.7 AP and 6.5 AP75 on the largest CrowdAI dataset. We intend to make the code and trained weights publicly available to promote their use and facilitate further research.

Title:

      Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment

Authors: Wenliang Zhong, Wenyi Wu, Qi Li, Rob Barton, Boxin Du, Shioulin Sam, Karim Bouyarmane, Ismail Tutar, Junzhou Huang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Multimodal Large Language Models (MLLMs) have achieved SOTA performance in various visual language tasks by fusing the visual representations with LLMs leveraging some visual adapters. In this paper, we first establish that adapters using query-based Transformers such as Q-former is a simplified Multi-instance Learning method without considering instance heterogeneity/correlation. We then propose a general component termed Multi-instance Visual Prompt Generator (MIVPG) to incorporate enriched visual representations into LLMs by taking advantage of instance correlation between images or patches for the same sample. Quantatitive evaluation on three public vision-language (VL) datasets from different scenarios shows that the proposed MIVPG improves Q-former in main VL tasks.

Title:

      Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification

Authors: Gexin Huang, Chenfei Wu, Mingjie Li, Xiaojun Chang, Ling Chen, Ying Sun, Shen Zhao, Xiaodan Liang, Liang Lin
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Predicting genetic mutations from whole slide images is indispensable for cancer diagnosis. However, existing work training multiple binary classification models faces two challenges: (a) Training multiple binary classifiers is inefficient and would inevitably lead to a class imbalance problem. (b) The biological relationships among genes are overlooked, which limits the prediction performance. To tackle these challenges, we innovatively design a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances. BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules: (a) A gene graph whose node features are the genes' linguistic descriptions and the cancer phenotype, with edges modeled by genes' pathway associations and mutation consistencies. (b) A knowledge association module that fuses linguistic and biomedical knowledge into gene priors by transformer-based graph representation learning, capturing the intrinsic relationships between different genes' mutations. BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules: (a) A modality fusion module that firstly fuses the gene priors with critical regions in WSIs and obtains gene-wise mutation logits. (b) A comparative multi-label loss that emphasizes the inherent comparisons among mutation status to enhance the discrimination capabilities. Sufficient experiments on The Cancer Genome Atlas benchmark demonstrate that BPGT outperforms the state-of-the-art.

Title:

      Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction

Authors: Pengjie Wang, Kaile Zhang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Oracle Bone Inscriptions is one of the oldest existing forms of writing in the world. However, due to the great antiquity of the era, a large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in the field of paleography today. This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction. We deconstruct OBI into foundational strokes and radicals, then employ a Transformer model to reconstruct them into their modern (conterpart)\textcolor{blue}{counterparts}, offering a groundbreaking solution to ancient script analysis. To further this endeavor, a new Ancient Chinese Character Puzzles (ACCP) dataset was developed, comprising an extensive collection of character images from seven key historical stages, annotated with detailed radical sequences. The experiments have showcased considerable promising insights, underscoring the potential and effectiveness of our approach in deciphering the intricacies of ancient Chinese scripts. Through this novel dataset and methodology, we aim to bridge the gap between traditional philology and modern document analysis techniques, offering new insights into the rich history of Chinese linguistic heritage.

Title:

      Population Transformer: Learning Population-level Representations of Intracranial Activity

Authors: Geeling Chau, Christopher Wang, Sabera Talukder, Vighnesh Subramaniam, Saraswati Soedarmadji, Yisong Yue, Boris Katz, Andrei Barbu
Subjects: Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We present a self-supervised framework that learns population-level codes for intracranial neural recordings at scale, unlocking the benefits of representation learning for a key neuroscience recording modality. The Population Transformer (PopT) lowers the amount of data required for decoding experiments, while increasing accuracy, even on never-before-seen subjects and tasks. We address two key challenges in developing PopT: sparse electrode distribution and varying electrode location across patients. PopT stacks on top of pretrained representations and enhances downstream tasks by enabling learned aggregation of multiple spatially-sparse data channels. Beyond decoding, we interpret the pretrained PopT and fine-tuned models to show how it can be used to provide neuroscience insights learned from massive amounts of data. We release a pretrained PopT to enable off-the-shelf improvements in multi-channel intracranial data decoding and interpretability, and code is available at this https URL.

Title:

      How Truncating Weights Improves Reasoning in Language Models

Authors: Lei Chen, Joan Bruna, Alberto Bietti
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In addition to the ability to generate fluent text in various languages, large language models have been successful at tasks that involve basic forms of logical "reasoning" over their context. Recent work found that selectively removing certain components from weight matrices in pre-trained models can improve such reasoning capabilities. We investigate this phenomenon further by carefully studying how certain global associations tend to be stored in specific weight components or Transformer blocks, in particular feed-forward layers. Such associations may hurt predictions in reasoning tasks, and removing the corresponding components may then improve performance. We analyze how this arises during training, both empirically and theoretically, on a two-layer Transformer trained on a basic reasoning task with noise, a toy associative memory model, and on the Pythia family of pre-trained models tested on simple reasoning tasks.

Title:

      Local to Global: Learning Dynamics and Effect of Initialization for Transformers

Authors: Ashok Vardhan Makkuva, Marco Bondaschi, Chanakya Ekbote, Adway Girish, Alliot Nagle, Hyeji Kim, Michael Gastpar
Subjects: Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In recent years, transformer-based models have revolutionized deep learning, particularly in sequence modeling. To better understand this phenomenon, there is a growing interest in using Markov input processes to study transformers. However, our current understanding in this regard remains limited with many fundamental questions about how transformers learn Markov chains still unanswered. In this paper, we address this by focusing on first-order Markov chains and single-layer transformers, providing a comprehensive characterization of the learning dynamics in this context. Specifically, we prove that transformer parameters trained on next-token prediction loss can either converge to global or local minima, contingent on the initialization and the Markovian data properties, and we characterize the precise conditions under which this occurs. To the best of our knowledge, this is the first result of its kind highlighting the role of initialization. We further demonstrate that our theoretical findings are corroborated by empirical evidence. Based on these insights, we provide guidelines for the initialization of transformer parameters and demonstrate their effectiveness. Finally, we outline several open problems in this arena. Code is available at: \url{https://anonymous.4open.science/r/Local-to-Global-C70B/}.

Title:

      Enhancing 3D Lane Detection and Topology Reasoning with 2D Lane Priors

Authors: Han Li, Zehao Huang, Zitian Wang, Wenge Rong, Naiyan Wang, Si Liu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract 3D lane detection and topology reasoning are essential tasks in autonomous driving scenarios, requiring not only detecting the accurate 3D coordinates on lane lines, but also reasoning the relationship between lanes and traffic elements. Current vision-based methods, whether explicitly constructing BEV features or not, all establish the lane anchors/queries in 3D space while ignoring the 2D lane priors. In this study, we propose Topo2D, a novel framework based on Transformer, leveraging 2D lane instances to initialize 3D queries and 3D positional embeddings. Furthermore, we explicitly incorporate 2D lane features into the recognition of topology relationships among lane centerlines and between lane centerlines and traffic elements. Topo2D achieves 44.5% OLS on multi-view topology reasoning benchmark OpenLane-V2 and 62.6% F-Socre on single-view 3D lane detection benchmark OpenLane, exceeding the performance of existing state-of-the-art methods.

Title:

      Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models

Authors: Jerry Yao-Chieh Hu, Maojiang Su, En-Jui Kuo, Zhao Song, Han Liu
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We study the computational limits of Low-Rank Adaptation (LoRA) update for finetuning transformer-based models using fine-grained complexity theory. Our key observation is that the existence of low-rank decompositions within the gradient computation of LoRA adaptation leads to possible algorithmic speedup. This allows us to (i) identify a phase transition behavior and (ii) prove the existence of nearly linear algorithms by controlling the LoRA update computation term by term, assuming the Strong Exponential Time Hypothesis (SETH). For the former, we identify a sharp transition in the efficiency of all possible rank-$r$ LoRA update algorithms for transformers, based on specific norms resulting from the multiplications of the input sequence $\mathbf{X}$, pretrained weights $\mathbf{W^\star}$, and adapter matrices $\alpha \mathbf{B} \mathbf{A} / r$. Specifically, we derive a shared upper bound threshold for such norms and show that efficient (sub-quadratic) approximation algorithms of LoRA exist only below this threshold. For the latter, we prove the existence of nearly linear approximation algorithms for LoRA adaptation by utilizing the hierarchical low-rank structures of LoRA gradients and approximating the gradients with a series of chained low-rank approximations. To showcase our theory, we consider two practical scenarios: partial (e.g., only $\mathbf{W}_V$ and $\mathbf{W}_Q$) and full adaptations (e.g., $\mathbf{W}_Q$, $\mathbf{W}_V$, and $\mathbf{W}_K$) of weights in attention heads.

Title:

      A Frame-based Attention Interpretation Method for Relevant Acoustic Feature Extraction in Long Speech Depression Detection

Authors: Qingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia
Subjects: Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Speech-based depression detection tools could help early screening of depression. Here, we address two issues that may hinder the clinical practicality of such tools: segment-level labelling noise and a lack of model interpretability. We propose a speech-level Audio Spectrogram Transformer to avoid segment-level labelling. We observe that the proposed model significantly outperforms a segment-level model, providing evidence for the presence of segment-level labelling noise in audio modality and the advantage of longer-duration speech analysis for depression detection. We introduce a frame-based attention interpretation method to extract acoustic features from prediction-relevant waveform signals for interpretation by clinicians. Through interpretation, we observe that the proposed model identifies reduced loudness and F0 as relevant signals of depression, which aligns with the speech characteristics of depressed patients documented in clinical studies.

Title:

      Aligning Transformers with Weisfeiler-Leman

Authors: Luis Müller, Christopher Morris
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Graph neural network architectures aligned with the $k$-dimensional Weisfeiler--Leman ($k$-WL) hierarchy offer theoretically well-understood expressive power. However, these architectures often fail to deliver state-of-the-art predictive performance on real-world graphs, limiting their practical utility. While recent works aligning graph transformer architectures with the $k$-WL hierarchy have shown promising empirical results, employing transformers for higher orders of $k$ remains challenging due to a prohibitive runtime and memory complexity of self-attention as well as impractical architectural assumptions, such as an infeasible number of attention heads. Here, we advance the alignment of transformers with the $k$-WL hierarchy, showing stronger expressivity results for each $k$, making them more feasible in practice. In addition, we develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE. We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state-of-the-art and demonstrating strong downstream performance when fine-tuning them on small-scale molecular datasets. Our code is available at this https URL.

Title:

      Situation Monitor: Diversity-Driven Zero-Shot Out-of-Distribution Detection using Budding Ensemble Architecture for Object Detection

Authors: Qutub Syed, Michael Paulitsch, Korbinian Hagn, Neslihan Kose Cihangir, Kay-Ulrich Scholl, Fabian Oboril, Gereon Hinz, Alois Knoll
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We introduce Situation Monitor, a novel zero-shot Out-of-Distribution (OOD) detection approach for transformer-based object detection models to enhance reliability in safety-critical machine learning applications such as autonomous driving. The Situation Monitor utilizes the Diversity-based Budding Ensemble Architecture (DBEA) and increases the OOD performance by integrating a diversity loss into the training process on top of the budding ensemble architecture, detecting Far-OOD samples and minimizing false positives on Near-OOD samples. Moreover, utilizing the resulting DBEA increases the model's OOD performance and improves the calibration of confidence scores, particularly concerning the intersection over union of the detected objects. The DBEA model achieves these advancements with a 14% reduction in trainable parameters compared to the vanilla model. This signifies a substantial improvement in efficiency without compromising the model's ability to detect OOD instances and calibrate the confidence scores accurately.

Title:

      Choice of PEFT Technique in Continual Learning: Prompt Tuning is Not All You Need

Authors: Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Recent Continual Learning (CL) methods have combined pretrained Transformers with prompt tuning, a parameter-efficient fine-tuning (PEFT) technique. We argue that the choice of prompt tuning in prior works was an undefended and unablated decision, which has been uncritically adopted by subsequent research, but warrants further research to understand its implications. In this paper, we conduct this research and find that the choice of prompt tuning as a PEFT method hurts the overall performance of the CL system. To illustrate this, we replace prompt tuning with LoRA in two state-of-the-art continual learning methods: Learning to Prompt and S-Prompts. These variants consistently achieve higher accuracy across a wide range of domain-incremental and class-incremental benchmarks, while being competitive in inference speed. Our work highlights a crucial argument: unexamined choices can hinder progress in the field, and rigorous ablations, such as the PEFT method, are required to drive meaningful adoption of CL techniques in real-world applications.

Title:

      Global Clipper: Enhancing Safety and Reliability of Transformer-based Object Detection Models

Authors: Qutub Syed Sha, Michael Paulitsch, Karthik Pattabiraman, Korbinian Hagn, Fabian Oboril, Cornelius Buerkle, Kay-Ulrich Scholl, Gereon Hinz, Alois Knoll
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract As transformer-based object detection models progress, their impact in critical sectors like autonomous vehicles and aviation is expected to grow. Soft errors causing bit flips during inference have significantly impacted DNN performance, altering predictions. Traditional range restriction solutions for CNNs fall short for transformers. This study introduces the Global Clipper and Global Hybrid Clipper, effective mitigation strategies specifically designed for transformer-based models. It significantly enhances their resilience to soft errors and reduces faulty inferences to ~ 0%. We also detail extensive testing across over 64 scenarios involving two transformer models (DINO-DETR and Lite-DETR) and two CNN models (YOLOv3 and SSD) using three datasets, totalling approximately 3.3 million inferences, to assess model robustness comprehensively. Moreover, the paper explores unique aspects of attention blocks in transformers and their operational differences from CNNs.

Title:

      Defending Large Language Models Against Attacks With Residual Stream Activation Analysis

Authors: Amelia Kawasaki, Andrew Davis, Houssam Abbas
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The widespread adoption of Large Language Models (LLMs), exemplified by OpenAI's ChatGPT, brings to the forefront the imperative to defend against adversarial threats on these models. These attacks, which manipulate an LLM's output by introducing malicious inputs, undermine the model's integrity and the trust users place in its outputs. In response to this challenge, our paper presents an innovative defensive strategy, given white box access to an LLM, that harnesses residual activation analysis between transformer layers of the LLM. We apply an established methodology for analyzing distinctive activation patterns in the residual streams for a novel result of attack prompt classification. We curate multiple datasets to demonstrate how this method of classification has high accuracy across multiple types of attack scenarios, including our newly-created attack dataset. Furthermore, we enhance the model's resilience by integrating safety fine-tuning techniques for LLMs in order to measure its effect on our capability to detect attacks. The results underscore the effectiveness of our approach in enhancing the detection and mitigation of adversarial inputs, advancing the security framework within which LLMs operate.

Title:

      Learning Visual Prompts for Guiding the Attention of Vision Transformers

Authors: Razieh Rezaei, Masoud Jalili Sabet, Jindong Gu, Daniel Rueckert, Philip Torr, Ashkan Khakzar
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Visual prompting infuses visual information into the input image to adapt models toward specific predictions and tasks. Recently, manually crafted markers such as red circles are shown to guide the model to attend to a target region on the image. However, these markers only work on models trained with data containing those markers. Moreover, finding these prompts requires guesswork or prior knowledge of the domain on which the model is trained. This work circumvents manual design constraints by proposing to learn the visual prompts for guiding the attention of vision transformers. The learned visual prompt, added to any input image would redirect the attention of the pre-trained vision transformer to its spatial location on the image. Specifically, the prompt is learned in a self-supervised manner without requiring annotations and without fine-tuning the vision transformer. Our experiments demonstrate the effectiveness of the proposed optimization-based visual prompting strategy across various pre-trained vision encoders.

Title:

      The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches

Authors: Bhashithe Abeysinghe, Ruhan Circi
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Chatbots have been an interesting application of natural language generation since its inception. With novel transformer based Generative AI methods, building chatbots have become trivial. Chatbots which are targeted at specific domains such as medicine, psychology, and general information retrieval are implemented rapidly. This, however, should not distract from the need to evaluate the chatbot responses. Especially because the natural language generation community does not entirely agree upon how to effectively evaluate such applications. With this work we discuss the issue further with the increasingly popular LLM based evaluations and how they correlate with human evaluations. Additionally, we introduce a comprehensive factored evaluation mechanism that can be utilized in conjunction with both human and LLM-based evaluations. We present the results of an experimental evaluation conducted using this scheme in one of our chatbot implementations, and subsequently compare automated, traditional human evaluation, factored human evaluation, and factored LLM evaluation. Results show that factor based evaluation produces better insights on which aspects need to be improved in LLM applications and further strengthens the argument to use human evaluation in critical spaces where main functionality is not direct retrieval.

Title:

      Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

Authors: Mehmet Hamza Erol, Arda Senocak, Jiu Feng, Joon Son Chung
Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Transformers have rapidly become the preferred choice for audio classification, surpassing methods based on CNNs. However, Audio Spectrogram Transformers (ASTs) exhibit quadratic scaling due to self-attention. The removal of this quadratic self-attention cost presents an appealing direction. Recently, state space models (SSMs), such as Mamba, have demonstrated potential in language and vision tasks in this regard. In this study, we explore whether reliance on self-attention is necessary for audio classification tasks. By introducing Audio Mamba (AuM), the first self-attention-free, purely SSM-based model for audio classification, we aim to address this question. We evaluate AuM on various audio datasets - comprising six different benchmarks - where it achieves comparable or better performance compared to well-established AST model.

Title:

      Learning Long Range Dependencies on Graphs via Random Walks

Authors: Dexiong Chen, Till Hendrik Schulz, Karsten Borgwardt
Subjects: Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Message-passing graph neural networks (GNNs), while excelling at capturing local relationships, often struggle with long-range dependencies on graphs. Conversely, graph transformers (GTs) enable information exchange between all nodes but oversimplify the graph structure by treating them as a set of fixed-length vectors. This work proposes a novel architecture, NeuralWalker, that overcomes the limitations of both methods by combining random walks with message passing. NeuralWalker achieves this by treating random walks as sequences, allowing for the application of recent advances in sequence models in order to capture long-range dependencies within these walks. Based on this concept, we propose a framework that offers (1) more expressive graph representations through random walk sequences, (2) the ability to utilize any sequence model for capturing long-range dependencies, and (3) the flexibility by integrating various GNN and GT architectures. Our experimental evaluations demonstrate that NeuralWalker achieves significant performance improvements on 19 graph and node benchmark datasets, notably outperforming existing methods by up to 13% on the PascalVoc-SP and COCO-SP datasets. Code is available at this https URL.

Title:

      CSI-GPT: Integrating Generative Pre-Trained Transformer with Federated-Tuning to Acquire Downlink Massive MIMO Channels

Authors: Ye Zeng, Li Qiao, Zhen Gao, Tong Qin, Zhonghuai Wu, Sheng Chen, Mohsen Guizani
Subjects: Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In massive multiple-input multiple-output (MIMO) systems, how to reliably acquire downlink channel state information (CSI) with low overhead is challenging. In this work, by integrating the generative pre-trained Transformer (GPT) with federated-tuning, we propose a CSI-GPT approach to realize efficient downlink CSI acquisition. Specifically, we first propose a Swin Transformer-based channel acquisition network (SWTCAN) to acquire downlink CSI, where pilot signals, downlink channel estimation, and uplink CSI feedback are jointly designed. Furthermore, to solve the problem of insufficient training data, we propose a variational auto-encoder-based channel sample generator (VAE-CSG), which can generate sufficient CSI samples based on a limited number of high-quality CSI data obtained from the current cell. The CSI dataset generated from VAE-CSG will be used for pre-training SWTCAN. To fine-tune the pre-trained SWTCAN for improved performance, we propose an online federated-tuning method, where only a small amount of SWTCAN parameters are unfrozen and updated using over-the-air computation, avoiding the high communication overhead caused by aggregating the complete CSI samples from user equipment (UEs) to the BS for centralized fine-tuning. Simulation results verify the advantages of the proposed SWTCAN and the communication efficiency of the proposed federated-tuning method.

Title:

      Pre-trained Large Language Models Use Fourier Features to Compute Addition

Authors: Tianyi Zhou, Deqing Fu, Vatsal Sharan, Robin Jia
Subjects: Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier features -- dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features. Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy. Introducing pre-trained token embeddings to a randomly initialized model rescues its performance. Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.

Title:

      LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

Authors: Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for reducing the ViT encoder complexity. We improve the ViT encoder by aggregating multi-level feature maps, and the intermediate and final feature maps in the ViT encoder, forming richer feature maps, and introduce window-major feature map organization for improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior over existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at (this https URL).

Title:

      SpikeZIP-TF: Conversion is All You Need for Transformer-based SNN

Authors: Kang You, Zekai Xu, Chen Nie, Zhijie Deng, Qinghai Guo, Xiang Wang, Zhezhi He
Subjects: Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Spiking neural network (SNN) has attracted great attention due to its characteristic of high efficiency and accuracy. Currently, the ANN-to-SNN conversion methods can obtain ANN on-par accuracy SNN with ultra-low latency (8 time-steps) in CNN structure on computer vision (CV) tasks. However, as Transformer-based networks have achieved prevailing precision on both CV and natural language processing (NLP), the Transformer-based SNNs are still encounting the lower accuracy w.r.t the ANN counterparts. In this work, we introduce a novel ANN-to-SNN conversion method called SpikeZIP-TF, where ANN and SNN are exactly equivalent, thus incurring no accuracy degradation. SpikeZIP-TF achieves 83.82% accuracy on CV dataset (ImageNet) and 93.79% accuracy on NLP dataset (SST-2), which are higher than SOTA Transformer-based SNNs. The code is available in GitHub: this https URL

Title:

      Convolutional Neural Networks and Vision Transformers for Fashion MNIST Classification: A Literature Review

Authors: Sonia Bbouzidi, Ghazala Hcini, Imen Jdey, Fadoua Drira
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Our review explores the comparative analysis between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the domain of image classification, with a particular focus on clothing classification within the e-commerce sector. Utilizing the Fashion MNIST dataset, we delve into the unique attributes of CNNs and ViTs. While CNNs have long been the cornerstone of image classification, ViTs introduce an innovative self-attention mechanism enabling nuanced weighting of different input data components. Historically, transformers have primarily been associated with Natural Language Processing (NLP) tasks. Through a comprehensive examination of existing literature, our aim is to unveil the distinctions between ViTs and CNNs in the context of image classification. Our analysis meticulously scrutinizes state-of-the-art methodologies employing both architectures, striving to identify the factors influencing their performance. These factors encompass dataset characteristics, image dimensions, the number of target classes, hardware infrastructure, and the specific architectures along with their respective top results. Our key goal is to determine the most appropriate architecture between ViT and CNN for classifying images in the Fashion MNIST dataset within the e-commerce industry, while taking into account specific conditions and needs. We highlight the importance of combining these two architectures with different forms to enhance overall performance. By uniting these architectures, we can take advantage of their unique strengths, which may lead to more precise and reliable models for e-commerce applications. CNNs are skilled at recognizing local patterns, while ViTs are effective at grasping overall context, making their combination a promising strategy for boosting image classification performance.

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

Jun 06 '24 02:06 DongZhouGu

arxiv-daily arxiv-daily copied to clipboard

New submissions for Thursday, 6 June 2024 (showing 442 of 442 entries )

Keyword: human object interaction

Keyword: visual relation detection

Keyword: object detection

Title:

Title:

Title:

Title:

Title:

Title:

Keyword: transformer

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Keyword: scene understanding

Keyword: visual reasoning

arxiv-daily
arxiv-daily copied to clipboard