arxiv-daily icon indicating copy to clipboard operation
arxiv-daily copied to clipboard

New submissions for Mon, 26 Feb 24

Open DongZhouGu opened this issue 1 year ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

EMIFF: Enhanced Multi-scale Image Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

  • Authors: Zhe Wang, Siqi Fan, Xiaoliang Huo, Tongda Xu, Yan Wang, Jingjing Liu, Yilun Chen, Ya-Qin Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2402.15272
  • Pdf link: https://arxiv.org/pdf/2402.15272
  • Abstract In autonomous driving, cooperative perception makes use of multi-view cameras from both vehicles and infrastructure, providing a global vantage point with rich semantic context of road conditions beyond a single vehicle viewpoint. Currently, two major challenges persist in vehicle-infrastructure cooperative 3D (VIC3D) object detection: $1)$ inherent pose errors when fusing multi-view images, caused by time asynchrony across cameras; $2)$ information loss in transmission process resulted from limited communication bandwidth. To address these issues, we propose a novel camera-based 3D detection framework for VIC3D task, Enhanced Multi-scale Image Feature Fusion (EMIFF). To fully exploit holistic perspectives from both vehicles and infrastructure, we propose Multi-scale Cross Attention (MCA) and Camera-aware Channel Masking (CCM) modules to enhance infrastructure and vehicle features at scale, spatial, and channel levels to correct the pose error introduced by camera asynchrony. We also introduce a Feature Compression (FC) module with channel and spatial compression blocks for transmission efficiency. Experiments show that EMIFF achieves SOTA on DAIR-V2X-C datasets, significantly outperforming previous early-fusion and late-fusion methods with comparable transmission costs.

A Comprehensive Survey of Convolutions in Deep Learning: Applications, Challenges, and Future Trends

  • Authors: Abolfazl Younesi, Mohsen Ansari, MohammadAmin Fazli, Alireza Ejlali, Muhammad Shafique, Jörg Henkel
  • Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
  • Arxiv link: https://arxiv.org/abs/2402.15490
  • Pdf link: https://arxiv.org/pdf/2402.15490
  • Abstract In today's digital age, Convolutional Neural Networks (CNNs), a subset of Deep Learning (DL), are widely used for various computer vision tasks such as image classification, object detection, and image segmentation. There are numerous types of CNNs designed to meet specific needs and requirements, including 1D, 2D, and 3D CNNs, as well as dilated, grouped, attention, depthwise convolutions, and NAS, among others. Each type of CNN has its unique structure and characteristics, making it suitable for specific tasks. It's crucial to gain a thorough understanding and perform a comparative analysis of these different CNN types to understand their strengths and weaknesses. Furthermore, studying the performance, limitations, and practical applications of each type of CNN can aid in the development of new and improved architectures in the future. We also dive into the platforms and frameworks that researchers utilize for their research or development from various perspectives. Additionally, we explore the main research fields of CNN like 6D vision, generative models, and meta-learning. This survey paper provides a comprehensive examination and comparison of various CNN architectures, highlighting their architectural differences and emphasizing their respective advantages, disadvantages, applications, challenges, and future trends.

Keyword: transformer

MSynFD: Multi-hop Syntax aware Fake News Detection

  • Authors: Liang Xiao, Qi Zhang, Chongyang Shi, Shoujin Wang, Usman Naseem, Liang Hu
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/2402.14834
  • Pdf link: https://arxiv.org/pdf/2402.14834
  • Abstract The proliferation of social media platforms has fueled the rapid dissemination of fake news, posing threats to our real-life society. Existing methods use multimodal data or contextual information to enhance the detection of fake news by analyzing news content and/or its social context. However, these methods often overlook essential textual news content (articles) and heavily rely on sequential modeling and global attention to extract semantic information. These existing methods fail to handle the complex, subtle twists in news articles, such as syntax-semantics mismatches and prior biases, leading to lower performance and potential failure when modalities or social context are missing. To bridge these significant gaps, we propose a novel multi-hop syntax aware fake news detection (MSynFD) method, which incorporates complementary syntax information to deal with subtle twists in fake news. Specifically, we introduce a syntactical dependency graph and design a multi-hop subgraph aggregation mechanism to capture multi-hop syntax. It extends the effect of word perception, leading to effective noise filtering and adjacent relation enhancement. Subsequently, a sequential relative position-aware Transformer is designed to capture the sequential information, together with an elaborate keyword debiasing module to mitigate the prior bias. Extensive experimental results on two public benchmark datasets verify the effectiveness and superior performance of our proposed MSynFD over state-of-the-art detection models.

Asynchronous and Segmented Bidirectional Encoding for NMT

  • Authors: Jingpu Yang, Zehua Han, Mengyu Xiang, Helin Wang, Yuxiao Huang, Miao Fang
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.14849
  • Pdf link: https://arxiv.org/pdf/2402.14849
  • Abstract With the rapid advancement of Neural Machine Translation (NMT), enhancing translation efficiency and quality has become a focal point of research. Despite the commendable performance of general models such as the Transformer in various aspects, they still fall short in processing long sentences and fully leveraging bidirectional contextual information. This paper introduces an improved model based on the Transformer, implementing an asynchronous and segmented bidirectional decoding strategy aimed at elevating translation efficiency and accuracy. Compared to traditional unidirectional translations from left-to-right or right-to-left, our method demonstrates heightened efficiency and improved translation quality, particularly in handling long sentences. Experimental results on the IWSLT2017 dataset confirm the effectiveness of our approach in accelerating translation and increasing accuracy, especially surpassing traditional unidirectional strategies in long sentence translation. Furthermore, this study analyzes the impact of sentence length on decoding outcomes and explores the model's performance in various scenarios. The findings of this research not only provide an effective encoding strategy for the NMT field but also pave new avenues and directions for future studies.

Technical Report on the Checkfor.ai AI-Generated Text Classifier

  • Authors: Bradley Emi, Max Spero
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2402.14873
  • Pdf link: https://arxiv.org/pdf/2402.14873
  • Abstract We present the Checkfor.ai text classifier, a transformer-based neural network trained to distinguish text written by large language models from text written by humans. Checkfor.ai outperforms zero-shot methods such as DetectGPT as well as leading commercial AI detection tools with over 9 times lower error rates on a comprehensive benchmark comprised of ten text domains (student writing, creative writing, scientific writing, books, encyclopedias, news, email, scientific papers, short-form Q&A) and 8 open- and closed-source large language models. We propose a training algorithm, hard negative mining with synthetic mirrors, that enables our classifier to achieve orders of magnitude lower false positive rates on high-data domains such as reviews. Finally, we show that Checkfor.ai is not biased against nonnative English speakers and generalizes to domains and models unseen during training.

Path Planning based on 2D Object Bounding-box

  • Authors: Yanliang Huang, Liguo Zhou, Chang Liu, Alois Knoll
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2402.14933
  • Pdf link: https://arxiv.org/pdf/2402.14933
  • Abstract The implementation of Autonomous Driving (AD) technologies within urban environments presents significant challenges. These challenges necessitate the development of advanced perception systems and motion planning algorithms capable of managing situations of considerable complexity. Although the end-to-end AD method utilizing LiDAR sensors has achieved significant success in this scenario, we argue that its drawbacks may hinder its practical application. Instead, we propose the vision-centric AD as a promising alternative offering a streamlined model without compromising performance. In this study, we present a path planning method that utilizes 2D bounding boxes of objects, developed through imitation learning in urban driving scenarios. This is achieved by integrating high-definition (HD) map data with images captured by surrounding cameras. Subsequent perception tasks involve bounding-box detection and tracking, while the planning phase employs both local embeddings via Graph Neural Network (GNN) and global embeddings via Transformer for temporal-spatial feature aggregation, ultimately producing optimal path planning information. We evaluated our model on the nuPlan planning task and observed that it performs competitively in comparison to existing vision-centric methods.

Enhancing Power Quality Event Classification with AI Transformer Models

  • Authors: Ahmad Mohammad Saber, Amr Youssef, Davor Svetinovic, Hatem Zeineldin, Deepa Kundur, Ehab El-Saadany
  • Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2402.14949
  • Pdf link: https://arxiv.org/pdf/2402.14949
  • Abstract Recently, there has been a growing interest in utilizing machine learning for accurate classification of power quality events (PQEs). However, most of these studies are performed assuming an ideal situation, while in reality, we can have measurement noise, DC offset, and variations in the voltage signal's amplitude and frequency. Building on the prior PQE classification works using deep learning, this paper proposes a deep-learning framework that leverages attention-enabled Transformers as a tool to accurately classify PQEs under the aforementioned considerations. The proposed framework can operate directly on the voltage signals with no need for a separate feature extraction or calculation phase. Our results show that the proposed framework outperforms recently proposed learning-based techniques. It can accurately classify PQEs under the aforementioned conditions with an accuracy varying between 99.81%$-$91.43% depending on the signal-to-noise ratio, DC offsets, and variations in the signal amplitude and frequency.

MultiLS: A Multi-task Lexical Simplification Framework

  • Authors: Kai North, Tharindu Ranasinghe, Matthew Shardlow, Marcos Zampieri
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2402.14972
  • Pdf link: https://arxiv.org/pdf/2402.14972
  • Abstract Lexical Simplification (LS) automatically replaces difficult to read words for easier alternatives while preserving a sentence's original meaning. LS is a precursor to Text Simplification with the aim of improving text accessibility to various target demographics, including children, second language learners, individuals with reading disabilities or low literacy. Several datasets exist for LS. These LS datasets specialize on one or two sub-tasks within the LS pipeline. However, as of this moment, no single LS dataset has been developed that covers all LS sub-tasks. We present MultiLS, the first LS framework that allows for the creation of a multi-task LS dataset. We also present MultiLS-PT, the first dataset to be created using the MultiLS framework. We demonstrate the potential of MultiLS-PT by carrying out all LS sub-tasks of (1). lexical complexity prediction (LCP), (2). substitute generation, and (3). substitute ranking for Portuguese. Model performances are reported, ranging from transformer-based models to more recent large language models (LLMs).

Unsupervised Domain Adaptation within Deep Foundation Latent Spaces

  • Authors: Dmitry Kangin, Plamen Angelov
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.14976
  • Pdf link: https://arxiv.org/pdf/2402.14976
  • Abstract The vision transformer-based foundation models, such as ViT or Dino-V2, are aimed at solving problems with little or no finetuning of features. Using a setting of prototypical networks, we analyse to what extent such foundation models can solve unsupervised domain adaptation without finetuning over the source or target domain. Through quantitative analysis, as well as qualitative interpretations of decision making, we demonstrate that the suggested method can improve upon existing baselines, as well as showcase the limitations of such approach yet to be solved.

Descripción automática de secciones delgadas de rocas: una aplicación Web

  • Authors: Stalyn Paucar, Christian Mejía-Escobar y Víctor Collaguazo
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.15039
  • Pdf link: https://arxiv.org/pdf/2402.15039
  • Abstract The identification and characterization of various rock types is one of the fundamental activities for geology and related areas such as mining, petroleum, environment, industry and construction. Traditionally, a human specialist is responsible for analyzing and explaining details about the type, composition, texture, shape and other properties using rock samples collected in-situ or prepared in a laboratory. The results become subjective based on experience, in addition to consuming a large investment of time and effort. The present proposal uses artificial intelligence techniques combining computer vision and natural language processing to generate a textual and verbal description from a thin section image of rock. We build a dataset of images and their respective textual descriptions for the training of a model that associates the relevant features of the image extracted by EfficientNetB7 with the textual description generated by a Transformer network, reaching an accuracy value of 0.892 and a BLEU value of 0.71. This model can be a useful resource for research, professional and academic work, so it has been deployed through a Web application for public use.

Fiducial Focus Augmentation for Facial Landmark Detection

  • Authors: Purbayan Kar, Vishal Chudasama, Naoyuki Onoe, Pankaj Wasnik, Vineeth Balasubramanian
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.15044
  • Pdf link: https://arxiv.org/pdf/2402.15044
  • Abstract Deep learning methods have led to significant improvements in the performance on the facial landmark detection (FLD) task. However, detecting landmarks in challenging settings, such as head pose changes, exaggerated expressions, or uneven illumination, continue to remain a challenge due to high variability and insufficient samples. This inadequacy can be attributed to the model's inability to effectively acquire appropriate facial structure information from the input images. To address this, we propose a novel image augmentation technique specifically designed for the FLD task to enhance the model's understanding of facial structures. To effectively utilize the newly proposed augmentation technique, we employ a Siamese architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss to achieve collective learning of high-level feature representations from two different views of the input images. Furthermore, we employ a Transformer + CNN-based network with a custom hourglass module as the robust backbone for the Siamese framework. Extensive experiments show that our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.

Multimodal Transformer With a Low-Computational-Cost Guarantee

  • Authors: Sungjin Park, Edward Choi
  • Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2402.15096
  • Pdf link: https://arxiv.org/pdf/2402.15096
  • Abstract Transformer-based models have significantly improved performance across a range of multimodal understanding tasks, such as visual question answering and action recognition. However, multimodal Transformers significantly suffer from a quadratic complexity of the multi-head attention with the input sequence length, especially as the number of modalities increases. To address this, we introduce Low-Cost Multimodal Transformer (LoCoMT), a novel multimodal attention mechanism that aims to reduce computational cost during training and inference with minimal performance loss. Specifically, by assigning different multimodal attention patterns to each attention head, LoCoMT can flexibly control multimodal signals and theoretically ensures a reduced computational cost compared to existing multimodal Transformer variants. Experimental results on two multimodal datasets, namely Audioset and MedVidCL demonstrate that LoCoMT not only reduces GFLOPs but also matches or even outperforms established models.

Spatially-Aware Transformer Memory for Embodied Agents

  • Authors: Junmo Cho, Jaesik Yoon, Sungjin Ahn
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2402.15160
  • Pdf link: https://arxiv.org/pdf/2402.15160
  • Abstract Episodic memory plays a crucial role in various cognitive processes, such as the ability to mentally recall past events. While cognitive science emphasizes the significance of spatial context in the formation and retrieval of episodic memory, the current primary approach to implementing episodic memory in AI systems is through transformers that store temporally ordered experiences, which overlooks the spatial dimension. As a result, it is unclear how the underlying structure could be extended to incorporate the spatial axis beyond temporal order alone and thereby what benefits can be obtained. To address this, this paper explores the use of Spatially-Aware Transformer models that incorporate spatial information. These models enable the creation of place-centric episodic memory that considers both temporal and spatial dimensions. Adopting this approach, we demonstrate that memory utilization efficiency can be improved, leading to enhanced accuracy in various place-centric downstream tasks. Additionally, we propose the Adaptive Memory Allocator, a memory management method based on reinforcement learning that aims to optimize efficiency of memory utilization. Our experiments demonstrate the advantages of our proposed model in various environments and across multiple downstream tasks, including prediction, generation, reasoning, and reinforcement learning. The source code for our models and experiments will be available at https://github.com/junmokane/spatially-aware-transformer.

MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models

  • Authors: Nathanaël Carraz Rakotonirina, Marco Baroni
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.15268
  • Pdf link: https://arxiv.org/pdf/2402.15268
  • Abstract Transformer-based language models (LMs) track contextual information through large, hard-coded input windows. We introduce MemoryPrompt, a leaner approach in which the LM is complemented by a small auxiliary recurrent network that passes information to the LM by prefixing its regular input with a sequence of vectors, akin to soft prompts, without requiring LM finetuning. Tested on a task designed to probe a LM's ability to keep track of multiple fact updates, a MemoryPrompt-augmented LM outperforms much larger LMs that have access to the full input history. We also test MemoryPrompt on a long-distance dialogue dataset, where its performance is comparable to that of a model conditioned on the entire conversation history. In both experiments we also observe that, unlike full-finetuning approaches, MemoryPrompt does not suffer from catastrophic forgetting when adapted to new tasks, thus not disrupting the generalist capabilities of the underlying LM.

A Survey of Music Generation in the Context of Interaction

  • Authors: Ismael Agchar, Ilja Baumann, Franziska Braun, Paula Andrea Perez-Toro, Korbinian Riedhammer, Sebastian Trump, Martin Ullrich
  • Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2402.15294
  • Pdf link: https://arxiv.org/pdf/2402.15294
  • Abstract In recent years, machine learning, and in particular generative adversarial neural networks (GANs) and attention-based neural networks (transformers), have been successfully used to compose and generate music, both melodies and polyphonic pieces. Current research focuses foremost on style replication (eg. generating a Bach-style chorale) or style transfer (eg. classical to jazz) based on large amounts of recorded or transcribed music, which in turn also allows for fairly straight-forward "performance" evaluation. However, most of these models are not suitable for human-machine co-creation through live interaction, neither is clear, how such models and resulting creations would be evaluated. This article presents a thorough review of music representation, feature analysis, heuristic algorithms, statistical and parametric modelling, and human and automatic evaluation measures, along with a discussion of which approaches and models seem most suitable for live interaction.

Semi-supervised Counting via Pixel-by-pixel Density Distribution Modelling

  • Authors: Hui Lin, Zhiheng Ma, Rongrong Ji, Yaowei Wang, Zhou Su, Xiaopeng Hong, Deyu Meng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.15297
  • Pdf link: https://arxiv.org/pdf/2402.15297
  • Abstract This paper focuses on semi-supervised crowd counting, where only a small portion of the training data are labeled. We formulate the pixel-wise density value to regress as a probability distribution, instead of a single deterministic value. On this basis, we propose a semi-supervised crowd-counting model. Firstly, we design a pixel-wise distribution matching loss to measure the differences in the pixel-wise density distributions between the prediction and the ground truth; Secondly, we enhance the transformer decoder by using density tokens to specialize the forwards of decoders w.r.t. different density intervals; Thirdly, we design the interleaving consistency self-supervised learning mechanism to learn from unlabeled data efficiently. Extensive experiments on four datasets are performed to show that our method clearly outperforms the competitors by a large margin under various labeled ratio settings. Code will be released at https://github.com/LoraLinH/Semi-supervised-Counting-via-Pixel-by-pixel-Density-Distribution-Modelling.

ArabianGPT: Native Arabic GPT-based Large Language

  • Authors: Anis Koubaa, Adel Ammar, Lahouari Ghouti, Omar Najar, Serry Sibaee
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.15313
  • Pdf link: https://arxiv.org/pdf/2402.15313
  • Abstract The predominance of English and Latin-based large language models (LLMs) has led to a notable deficit in native Arabic LLMs. This discrepancy is accentuated by the prevalent inclusion of English tokens in existing Arabic models, detracting from their efficacy in processing native Arabic's intricate morphology and syntax. Consequently, there is a theoretical and practical imperative for developing LLMs predominantly focused on Arabic linguistic elements. To address this gap, this paper proposes ArabianGPT, a series of transformer-based models within the ArabianLLM suite designed explicitly for Arabic. These models, including ArabianGPT-0.1B and ArabianGPT-0.3B, vary in size and complexity, aligning with the nuanced linguistic characteristics of Arabic. The AraNizer tokenizer, integral to these models, addresses the unique morphological aspects of Arabic script, ensuring more accurate text processing. Empirical results from fine-tuning the models on tasks like sentiment analysis and summarization demonstrate significant improvements. For sentiment analysis, the fine-tuned ArabianGPT-0.1B model achieved a remarkable accuracy of 95%, a substantial increase from the base model's 56%. Similarly, in summarization tasks, fine-tuned models showed enhanced F1 scores, indicating improved precision and recall in generating concise summaries. Comparative analysis of fine-tuned ArabianGPT models against their base versions across various benchmarks reveals nuanced differences in performance, with fine-tuning positively impacting specific tasks like question answering and summarization. These findings underscore the efficacy of fine-tuning in aligning ArabianGPT models more closely with specific NLP tasks, highlighting the potential of tailored transformer architectures in advancing Arabic NLP.

TransFlower: An Explainable Transformer-Based Model with Flow-to-Flow Attention for Commuting Flow Prediction

  • Authors: Yan Luo, Zhuoyue Wan, Yuzhong Chen, Gengchen Mai, Fu-lai Chung, Kent Larson
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
  • Arxiv link: https://arxiv.org/abs/2402.15398
  • Pdf link: https://arxiv.org/pdf/2402.15398
  • Abstract Understanding the link between urban planning and commuting flows is crucial for guiding urban development and policymaking. This research, bridging computer science and urban studies, addresses the challenge of integrating these fields with their distinct focuses. Traditional urban studies methods, like the gravity and radiation models, often underperform in complex scenarios due to their limited handling of multiple variables and reliance on overly simplistic and unrealistic assumptions, such as spatial isotropy. While deep learning models offer improved accuracy, their black-box nature poses a trade-off between performance and explainability -- both vital for analyzing complex societal phenomena like commuting flows. To address this, we introduce TransFlower, an explainable, transformer-based model employing flow-to-flow attention to predict urban commuting patterns. It features a geospatial encoder with an anisotropy-aware relative location encoder for nuanced flow representation. Following this, the transformer-based flow predictor enhances this by leveraging attention mechanisms to efficiently capture flow interactions. Our model outperforms existing methods by up to 30.8% Common Part of Commuters, offering insights into mobility dynamics crucial for urban planning and policy decisions.

The Impact of LoRA on the Emergence of Clusters in Transformers

  • Authors: Hugo Koubbi, Matthieu Boussard, Louis Hernandez
  • Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2402.15415
  • Pdf link: https://arxiv.org/pdf/2402.15415
  • Abstract In this paper, we employ the mathematical framework on Transformers developed by \citet{sander2022sinkformers,geshkovski2023emergence,geshkovski2023mathematical} to explore how variations in attention parameters and initial token values impact the structural dynamics of token clusters. Our analysis demonstrates that while the clusters within a modified attention matrix dynamics can exhibit significant divergence from the original over extended periods, they maintain close similarities over shorter intervals, depending on the parameter differences. This work contributes to the fine-tuning field through practical applications to the LoRA algorithm \cite{hu2021lora,peft}, enhancing our understanding of the behavior of LoRA-enhanced Transformer models.

Benchmarking the Robustness of Panoptic Segmentation for Automated Driving

  • Authors: Yiting Wang, Haonan Zhao, Daniel Gummadi, Mehrdad Dianati, Kurt Debattista, Valentina Donzella
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2402.15469
  • Pdf link: https://arxiv.org/pdf/2402.15469
  • Abstract Precise situational awareness is required for the safe decision-making of assisted and automated driving (AAD) functions. Panoptic segmentation is a promising perception technique to identify and categorise objects, impending hazards, and driveable space at a pixel level. While segmentation quality is generally associated with the quality of the camera data, a comprehensive understanding and modelling of this relationship are paramount for AAD system designers. Motivated by such a need, this work proposes a unifying pipeline to assess the robustness of panoptic segmentation models for AAD, correlating it with traditional image quality. The first step of the proposed pipeline involves generating degraded camera data that reflects real-world noise factors. To this end, 19 noise factors have been identified and implemented with 3 severity levels. Of these factors, this work proposes novel models for unfavourable light and snow. After applying the degradation models, three state-of-the-art CNN- and vision transformers (ViT)-based panoptic segmentation networks are used to analyse their robustness. The variations of the segmentation performance are then correlated to 8 selected image quality metrics. This research reveals that: 1) certain specific noise factors produce the highest impact on panoptic segmentation, i.e. droplets on lens and Gaussian noise; 2) the ViT-based panoptic segmentation backbones show better robustness to the considered noise factors; 3) some image quality metrics (i.e. LPIPS and CW-SSIM) correlate strongly with panoptic segmentation performance and therefore they can be used as predictive metrics for network performance.

Transformers are Expressive, But Are They Expressive Enough for Regression?

  • Authors: Swaroop Nath, Harshad Khadilkar, Pushpak Bhattacharyya
  • Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2402.15478
  • Pdf link: https://arxiv.org/pdf/2402.15478
  • Abstract Transformers have become pivotal in Natural Language Processing, demonstrating remarkable success in applications like Machine Translation and Summarization. Given their widespread adoption, several works have attempted to analyze the expressivity of Transformers. Expressivity of a neural network is the class of functions it can approximate. A neural network is fully expressive if it can act as a universal function approximator. We attempt to analyze the same for Transformers. Contrary to existing claims, our findings reveal that Transformers struggle to reliably approximate continuous functions, relying on piecewise constant approximations with sizable intervals. The central question emerges as: "\textit{Are Transformers truly Universal Function Approximators}?" To address this, we conduct a thorough investigation, providing theoretical insights and supporting evidence through experiments. Our contributions include a theoretical analysis pinpointing the root of Transformers' limitation in function approximation and extensive experiments to verify the limitation. By shedding light on these challenges, we advocate a refined understanding of Transformers' capabilities.

Keyword: scene understanding

OpenSUN3D: 1st Workshop Challenge on Open-Vocabulary 3D Scene Understanding

  • Authors: Francis Engelmann, Ayca Takmaz, Jonas Schult, Elisabetta Fedele, Johanna Wald, Songyou Peng, Xi Wang, Or Litany, Siyu Tang, Federico Tombari, Marc Pollefeys, Leonidas Guibas, Hongbo Tian, Chunjie Wang, Xiaosheng Yan, Bingwen Wang, Xuanyang Zhang, Xiao Liu, Phuc Nguyen, Khoi Nguyen, Anh Tran, Cuong Pham, Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, Joan Lasenby
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.15321
  • Pdf link: https://arxiv.org/pdf/2402.15321
  • Abstract This report provides an overview of the challenge hosted at the OpenSUN3D Workshop on Open-Vocabulary 3D Scene Understanding held in conjunction with ICCV 2023. The goal of this workshop series is to provide a platform for exploration and discussion of open-vocabulary 3D scene understanding tasks, including but not limited to segmentation, detection and mapping. We provide an overview of the challenge hosted at the workshop, present the challenge dataset, the evaluation methodology, and brief descriptions of the winning methods. For additional details, please see https://opensun3d.github.io/index_iccv23.html.

Keyword: visual reasoning

There is no result

DongZhouGu avatar Feb 26 '24 02:02 DongZhouGu