arxiv-daily
arxiv-daily copied to clipboard
New submissions for Wed, 31 Jan 24
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design
- Authors: Seokju Yun, Youngmin Ro
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.16456
- Pdf link: https://arxiv.org/pdf/2401.16456
- Abstract Recently, efficient Vision Transformers have shown great performance with low latency on resource-constrained devices. Conventionally, they use 4x4 patch embeddings and a 4-stage structure at the macro level, while utilizing sophisticated attention with multi-head configuration at the micro level. This paper aims to address computational redundancy at all design levels in a memory-efficient manner. We discover that using larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance by leveraging token representations with reduced spatial redundancy from the early stages. Furthermore, our preliminary analyses suggest that attention layers in the early stages can be substituted with convolutions, and several attention heads in the latter stages are computationally redundant. To handle this, we introduce a single-head attention module that inherently prevents head redundancy and simultaneously boosts accuracy by parallelly combining global and local information. Building upon our solutions, we introduce SHViT, a Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy tradeoff. For example, on ImageNet-1k, our SHViT-S4 is 3.3x, 8.1x, and 2.4x faster than MobileViTv2 x1.0 on GPU, CPU, and iPhone12 mobile device, respectively, while being 1.3% more accurate. For object detection and instance segmentation on MS COCO using Mask-RCNN head, our model achieves performance comparable to FastViT-SA12 while exhibiting 3.8x and 2.0x lower backbone latency on GPU and mobile device, respectively.
The Why, When, and How to Use Active Learning in Large-Data-Driven 3D Object Detection for Safe Autonomous Driving: An Empirical Exploration
- Authors: Ross Greer, Bjørk Antoniussen, Mathias V. Andersen, Andreas Møgelmose, Mohan M. Trivedi
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.16634
- Pdf link: https://arxiv.org/pdf/2401.16634
- Abstract Active learning strategies for 3D object detection in autonomous driving datasets may help to address challenges of data imbalance, redundancy, and high-dimensional data. We demonstrate the effectiveness of entropy querying to select informative samples, aiming to reduce annotation costs and improve model performance. We experiment using the BEVFusion model for 3D object detection on the nuScenes dataset, comparing active learning to random sampling and demonstrating that entropy querying outperforms in most cases. The method is particularly effective in reducing the performance gap between majority and minority classes. Class-specific analysis reveals efficient allocation of annotated resources for limited data budgets, emphasizing the importance of selecting diverse and informative data for model training. Our findings suggest that entropy querying is a promising strategy for selecting data that enhances model learning in resource-constrained environments.
Characterization of Magnetic Labyrinthine Structures through Junctions and Terminals Detection using Template Matching and CNN
- Authors: Vinícius Yu Okubo, Kotaro Shimizu, B. S. Shivaram, Hae Yong Kim
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.16688
- Pdf link: https://arxiv.org/pdf/2401.16688
- Abstract In material sciences, characterizing faults in periodic structures is vital for understanding material properties. To characterize magnetic labyrinthine patterns, it is necessary to accurately identify junctions and terminals, often featuring over a thousand closely packed defects per image. This study introduces a new technique called TM-CNN (Template Matching - Convolutional Neural Network) designed to detect a multitude of small objects in images, such as defects in magnetic labyrinthine patterns. TM-CNN was used to identify these structures in 444 experimental images, and the results were explored to deepen the understanding of magnetic materials. It employs a two-stage detection approach combining template matching, used in initial detection, with a convolutional neural network, used to eliminate incorrect identifications. To train a CNN classifier, it is necessary to create a large number of training images. This difficulty prevents the use of CNN in many practical applications. TM-CNN significantly reduces the manual workload for creating training images by automatically making most of the annotations and leaving only a small number of corrections to human reviewers. In testing, TM-CNN achieved an impressive F1 score of 0.988, far outperforming traditional template matching and CNN-based object detection algorithms.
LF Tracy: A Unified Single-Pipeline Approach for Salient Object Detection in Light Field Cameras
- Authors: Fei Teng, Jiaming Zhang, Jiawei Liu, Kunyu Peng, Xina Cheng, Zhiyong Li, Kailun Yang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2401.16712
- Pdf link: https://arxiv.org/pdf/2401.16712
- Abstract Leveraging the rich information extracted from light field (LF) cameras is instrumental for dense prediction tasks. However, adapting light field data to enhance Salient Object Detection (SOD) still follows the traditional RGB methods and remains under-explored in the community. Previous approaches predominantly employ a custom two-stream design to discover the implicit angular feature within light field cameras, leading to significant information isolation between different LF representations. In this study, we propose an efficient paradigm (LF Tracy) to address this limitation. We eschew the conventional specialized fusion and decoder architecture for a dual-stream backbone in favor of a unified, single-pipeline approach. This comprises firstly a simple yet effective data augmentation strategy called MixLD to bridge the connection of spatial, depth, and implicit angular information under different LF representations. A highly efficient information aggregation (IA) module is then introduced to boost asymmetric feature-wise information fusion. Owing to this innovative approach, our model surpasses the existing state-of-the-art methods, particularly demonstrating a 23% improvement over previous results on the latest large-scale PKU dataset. By utilizing only 28.9M parameters, the model achieves a 10% increase in accuracy with 3M additional parameters compared to its backbone using RGB images and an 86% rise to its backbone using LF images. The source code will be made publicly available at https://github.com/FeiBryantkit/LF-Tracy.
A Bearing-Angle Approach for Unknown Target Motion Analysis Based on Visual Measurements
- Authors: Zian Ning, Yin Zhang, Jianan Li, Zhang Chen, Shiyu Zhao
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2401.17117
- Pdf link: https://arxiv.org/pdf/2401.17117
- Abstract Vision-based estimation of the motion of a moving target is usually formulated as a bearing-only estimation problem where the visual measurement is modeled as a bearing vector. Although the bearing-only approach has been studied for decades, a fundamental limitation of this approach is that it requires extra lateral motion of the observer to enhance the target's observability. Unfortunately, the extra lateral motion conflicts with the desired motion of the observer in many tasks. It is well-known that, once a target has been detected in an image, a bounding box that surrounds the target can be obtained. Surprisingly, this common visual measurement especially its size information has not been well explored up to now. In this paper, we propose a new bearing-angle approach to estimate the motion of a target by modeling its image bounding box as bearing-angle measurements. Both theoretical analysis and experimental results show that this approach can significantly enhance the observability without relying on additional lateral motion of the observer. The benefit of the bearing-angle approach comes with no additional cost because a bounding box is a standard output of object detection algorithms. The approach simply exploits the information that has not been fully exploited in the past. No additional sensing devices or special detection algorithms are required.
YOLO-World: Real-Time Open-Vocabulary Object Detection
- Authors: Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.17270
- Pdf link: https://arxiv.org/pdf/2401.17270
- Abstract The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
Keyword: transformer
Do deep neural networks utilize the weight space efficiently?
- Authors: Onur Can Koyun, Behçet Uğur Töreyin
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.16438
- Pdf link: https://arxiv.org/pdf/2401.16438
- Abstract Deep learning models like Transformers and Convolutional Neural Networks (CNNs) have revolutionized various domains, but their parameter-intensive nature hampers deployment in resource-constrained settings. In this paper, we introduce a novel concept utilizes column space and row space of weight matrices, which allows for a substantial reduction in model parameters without compromising performance. Leveraging this paradigm, we achieve parameter-efficient deep learning models.. Our approach applies to both Bottleneck and Attention layers, effectively halving the parameters while incurring only minor performance degradation. Extensive experiments conducted on the ImageNet dataset with ViT and ResNet50 demonstrate the effectiveness of our method, showcasing competitive performance when compared to traditional models. This approach not only addresses the pressing demand for parameter efficient deep learning solutions but also holds great promise for practical deployment in real-world scenarios.
Context-Former: Stitching via Latent Conditioned Sequence Modeling
- Authors: Ziqi Zhang, Jingzehua Xu, Zifeng Zhuang, Jinxin Liu, Donglin wang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.16452
- Pdf link: https://arxiv.org/pdf/2401.16452
- Abstract Offline reinforcement learning (RL) algorithms can improve the decision making via stitching sub-optimal trajectories to obtain more optimal ones. This capability is a crucial factor in enabling RL to learn policies that are superior to the behavioral policy. On the other hand, Decision Transformer (DT) abstracts the decision-making as sequence modeling, showcasing competitive performance on offline RL benchmarks, however, recent studies demonstrate that DT lacks of stitching capability, thus exploit stitching capability for DT is vital to further improve its performance. In order to endow stitching capability to DT, we abstract trajectory stitching as expert matching and introduce our approach, ContextFormer, which integrates contextual information-based imitation learning (IL) and sequence modeling to stitch sub-optimal trajectory fragments by emulating the representations of a limited number of expert trajectories. To validate our claim, we conduct experiments from two perspectives: 1) We conduct extensive experiments on D4RL benchmarks under the settings of IL, and experimental results demonstrate ContextFormer can achieve competitive performance in multi-IL settings. 2) More importantly, we conduct a comparison of ContextFormer with diverse competitive DT variants using identical training datasets. The experimental results unveiled ContextFormer's superiority, as it outperformed all other variants, showcasing its remarkable performance.
Hybrid Transformer and Spatial-Temporal Self-Supervised Learning for Long-term Traffic Prediction
- Authors: Wang Zhu, Doudou Zhang, Baichao Long, Jianli Xiao
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.16453
- Pdf link: https://arxiv.org/pdf/2401.16453
- Abstract Long-term traffic prediction has always been a challenging task due to its dynamic temporal dependencies and complex spatial dependencies. In this paper, we propose a model that combines hybrid Transformer and spatio-temporal self-supervised learning. The model enhances its robustness by applying adaptive data augmentation techniques at the sequence-level and graph-level of the traffic data. It utilizes Transformer to overcome the limitations of recurrent neural networks in capturing long-term sequences, and employs Chebyshev polynomial graph convolution to capture complex spatial dependencies. Furthermore, considering the impact of spatio-temporal heterogeneity on traffic speed, we design two self-supervised learning tasks to model the temporal and spatial heterogeneity, thereby improving the accuracy and generalization ability of the model. Experimental evaluations are conducted on two real-world datasets, PeMS04 and PeMS08, and the results are visualized and analyzed, demonstrating the superior performance of the proposed model.
SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design
- Authors: Seokju Yun, Youngmin Ro
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.16456
- Pdf link: https://arxiv.org/pdf/2401.16456
- Abstract Recently, efficient Vision Transformers have shown great performance with low latency on resource-constrained devices. Conventionally, they use 4x4 patch embeddings and a 4-stage structure at the macro level, while utilizing sophisticated attention with multi-head configuration at the micro level. This paper aims to address computational redundancy at all design levels in a memory-efficient manner. We discover that using larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance by leveraging token representations with reduced spatial redundancy from the early stages. Furthermore, our preliminary analyses suggest that attention layers in the early stages can be substituted with convolutions, and several attention heads in the latter stages are computationally redundant. To handle this, we introduce a single-head attention module that inherently prevents head redundancy and simultaneously boosts accuracy by parallelly combining global and local information. Building upon our solutions, we introduce SHViT, a Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy tradeoff. For example, on ImageNet-1k, our SHViT-S4 is 3.3x, 8.1x, and 2.4x faster than MobileViTv2 x1.0 on GPU, CPU, and iPhone12 mobile device, respectively, while being 1.3% more accurate. For object detection and instance segmentation on MS COCO using Mask-RCNN head, our model achieves performance comparable to FastViT-SA12 while exhibiting 3.8x and 2.0x lower backbone latency on GPU and mobile device, respectively.
Validation, Robustness, and Accuracy of Perturbation-Based Sensitivity Analysis Methods for Time-Series Deep Learning Models
- Authors: Zhengguang Wang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.16521
- Pdf link: https://arxiv.org/pdf/2401.16521
- Abstract This work undertakes studies to evaluate Interpretability Methods for Time-Series Deep Learning. Sensitivity analysis assesses how input changes affect the output, constituting a key component of interpretation. Among the post-hoc interpretation methods such as back-propagation, perturbation, and approximation, my work will investigate perturbation-based sensitivity Analysis methods on modern Transformer models to benchmark their performances. Specifically, my work answers three research questions: 1) Do different sensitivity analysis (SA) methods yield comparable outputs and attribute importance rankings? 2) Using the same sensitivity analysis method, do different Deep Learning (DL) models impact the output of the sensitivity analysis? 3) How well do the results from sensitivity analysis methods align with the ground truth?
GuReT: Distinguishing Guilt and Regret related Text
- Authors: Sabur Butt, Fazlourrahman Balouchzahi, Abdul Gafar Manuel Meque, Maaz Amjad, Hector G. Ceballos Cancino, Grigori Sidorov, Alexander Gelbukh
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.16541
- Pdf link: https://arxiv.org/pdf/2401.16541
- Abstract The intricate relationship between human decision-making and emotions, particularly guilt and regret, has significant implications on behavior and well-being. Yet, these emotions subtle distinctions and interplay are often overlooked in computational models. This paper introduces a dataset tailored to dissect the relationship between guilt and regret and their unique textual markers, filling a notable gap in affective computing research. Our approach treats guilt and regret recognition as a binary classification task and employs three machine learning and six transformer-based deep learning techniques to benchmark the newly created dataset. The study further implements innovative reasoning methods like chain-of-thought and tree-of-thought to assess the models interpretive logic. The results indicate a clear performance edge for transformer-based models, achieving a 90.4% macro F1 score compared to the 85.3% scored by the best machine learning classifier, demonstrating their superior capability in distinguishing complex emotional states.
Deep Learning for Multi-Label Learning: A Comprehensive Survey
- Authors: Adane Nega Tarekegn, Mohib Ullah, Faouzi Alaya Cheikh
- Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.16549
- Pdf link: https://arxiv.org/pdf/2401.16549
- Abstract Multi-label learning is a rapidly growing research area that aims to predict multiple labels from a single input data point. In the era of big data, tasks involving multi-label classification (MLC) or ranking present significant and intricate challenges, capturing considerable attention in diverse domains. Inherent difficulties in MLC include dealing with high-dimensional data, addressing label correlations, and handling partial labels, for which conventional methods prove ineffective. Recent years have witnessed a notable increase in adopting deep learning (DL) techniques to address these challenges more effectively in MLC. Notably, there is a burgeoning effort to harness the robust learning capabilities of DL for improved modelling of label dependencies and other challenges in MLC. However, it is noteworthy that comprehensive studies specifically dedicated to DL for multi-label learning are limited. Thus, this survey aims to thoroughly review recent progress in DL for multi-label learning, along with a summary of open research problems in MLC. The review consolidates existing research efforts in DL for MLC,including deep neural networks, transformers, autoencoders, and convolutional and recurrent architectures. Finally, the study presents a comparative analysis of the existing methods to provide insightful observations and stimulate future research directions in this domain.
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
- Authors: Ivana Beňová, Jana Košecká, Michal Gregor, Martin Tamajka, Marcel Veselý, Marián Šimko
- Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.16575
- Pdf link: https://arxiv.org/pdf/2401.16575
- Abstract The dominant probing approaches rely on the zero-shot performance of image-text matching tasks to gain a finer-grained understanding of the representations learned by recent multimodal image-language transformer models. The evaluation is carried out on carefully curated datasets focusing on counting, relations, attributes, and others. This work introduces an alternative probing strategy called guided masking. The proposed approach ablates different modalities using masking and assesses the model's ability to predict the masked word with high accuracy. We focus on studying multimodal models that consider regions of interest (ROI) features obtained by object detectors as input tokens. We probe the understanding of verbs using guided masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models can predict the correct verb with high accuracy. This contrasts with previous conclusions drawn from image-text matching probing techniques that frequently fail in situations requiring verb understanding. The code for all experiments will be publicly available https://github.com/ivana-13/guided_masking.
Attention-based Reinforcement Learning for Combinatorial Optimization: Application to Job Shop Scheduling Problem
- Authors: Jaejin Lee, Seho Kee, Mani Janakiram, George Runger
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.16580
- Pdf link: https://arxiv.org/pdf/2401.16580
- Abstract Job shop scheduling problems are one of the most important and challenging combinatorial optimization problems that have been tackled mainly by exact or approximate solution approaches. However, finding an exact solution can be infeasible for real-world problems, and even with an approximate solution approach, it can require a prohibitive amount of time to find a near-optimal solution, and the found solutions are not applicable to new problems in general. To address these challenges, we propose an attention-based reinforcement learning method for the class of job shop scheduling problems by integrating policy gradient reinforcement learning with a modified transformer architecture. An important result is that our trained learners in the proposed method can be reused to solve large-scale problems not used in training and demonstrate that our approach outperforms the results of recent studies and widely adopted heuristic rules.
Breaking Free Transformer Models: Task-specific Context Attribution Promises Improved Generalizability Without Fine-tuning Pre-trained LLMs
- Authors: Stepan Tytarenko, Mohammad Ruhul Amin
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.16638
- Pdf link: https://arxiv.org/pdf/2401.16638
- Abstract Fine-tuning large pre-trained language models (LLMs) on particular datasets is a commonly employed strategy in Natural Language Processing (NLP) classification tasks. However, this approach usually results in a loss of models generalizability. In this paper, we present a framework that allows for maintaining generalizability, and enhances the performance on the downstream task by utilizing task-specific context attribution. We show that a linear transformation of the text representation from any transformer model using the task-specific concept operator results in a projection onto the latent concept space, referred to as context attribution in this paper. The specific concept operator is optimized during the supervised learning stage via novel loss functions. The proposed framework demonstrates that context attribution of the text representation for each task objective can improve the capacity of the discriminator function and thus achieve better performance for the classification task. Experimental results on three datasets, namely HateXplain, IMDB reviews, and Social Media Attributions, illustrate that the proposed model attains superior accuracy and generalizability. Specifically, for the non-fine-tuned BERT on the HateXplain dataset, we observe 8% improvement in accuracy and 10% improvement in F1-score. Whereas for the IMDB dataset, fine-tuned state-of-the-art XLNet is outperformed by 1% for both accuracy and F1-score. Furthermore, in an out-of-domain cross-dataset test, DistilBERT fine-tuned on the IMDB dataset in conjunction with the proposed model improves the F1-score on the HateXplain dataset by 7%. For the Social Media Attributions dataset of YouTube comments, we observe 5.2% increase in F1-metric. The proposed framework is implemented with PyTorch and provided open-source on GitHub.
Using Motion Forecasting for Behavior-Based Virtual Reality (VR) Authentication
- Authors: Mingjun Li, Natasha Kholgade Banerjee, Sean Banerjee
- Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2401.16649
- Pdf link: https://arxiv.org/pdf/2401.16649
- Abstract Task-based behavioral biometric authentication of users interacting in virtual reality (VR) environments enables seamless continuous authentication by using only the motion trajectories of the person's body as a unique signature. Deep learning-based approaches for behavioral biometrics show high accuracy when using complete or near complete portions of the user trajectory, but show lower performance when using smaller segments from the start of the task. Thus, any systems designed with existing techniques are vulnerable while waiting for future segments of motion trajectories to become available. In this work, we present the first approach that predicts future user behavior using Transformer-based forecasting and using the forecasted trajectory to perform user authentication. Our work leverages the notion that given the current trajectory of a user in a task-based environment we can predict the future trajectory of the user as they are unlikely to dramatically shift their behavior since it would preclude the user from successfully completing their task goal. Using the publicly available 41-subject ball throwing dataset of Miller et al. we show improvement in user authentication when using forecasted data. When compared to no forecasting, our approach reduces the authentication equal error rate (EER) by an average of 23.85% and a maximum reduction of 36.14%.
ILBiT: Imitation Learning for Robot Using Position and Torque Information based on Bilateral Control with Transformer
- Authors: Masato Kobayashi, Thanpimon Buamanee, Yuki Uranishi, Haruo Takemura
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2401.16653
- Pdf link: https://arxiv.org/pdf/2401.16653
- Abstract Autonomous manipulation in robot arms is a complex and evolving field of study in robotics. This paper introduces an innovative approach to this challenge by focusing on imitation learning (IL). Unlike traditional imitation methods, our approach uses IL based on bilateral control, allowing for more precise and adaptable robot movements. The conventional IL based on bilateral control method have relied on Long Short-Term Memory (LSTM) networks. In this paper, we present the IL for robot using position and torque information based on Bilateral control with Transformer (ILBiT). This proposed method employs the Transformer model, known for its robust performance in handling diverse datasets and its capability to surpass LSTM's limitations, especially in tasks requiring detailed force adjustments. A standout feature of ILBiT is its high-frequency operation at 100 Hz, which significantly improves the system's adaptability and response to varying environments and objects of different hardness levels. The effectiveness of the Transformer-based ILBiT method can be seen through comprehensive real-world experiments.
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
- Authors: Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe
- Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2401.16658
- Pdf link: https://arxiv.org/pdf/2401.16658
- Abstract Recent studies have advocated for fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data and open-source toolkits. With the aim of reproducing Whisper, the previous OWSM v1 through v3 models were still based on Transformer, which might lead to inferior performance compared to other state-of-the-art speech encoders. In this work, we aim to improve the performance and efficiency of OWSM without extra training data. We present E-Branchformer based OWSM v3.1 models at two scales, i.e., 100M and 1B. The 1B model is the largest E-Branchformer based speech model that has been made publicly available. It outperforms the previous OWSM v3 in a vast majority of evaluation benchmarks, while demonstrating up to 25% faster inference speed. We publicly release the data preparation scripts, pre-trained models and training logs.
T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
- Authors: Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair
- Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.16677
- Pdf link: https://arxiv.org/pdf/2401.16677
- Abstract Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus, hide this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3 which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires minor software changes. At the hardware level, T3 adds a lightweight track and trigger mechanism to orchestrate the producer's compute, and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention, and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sublayers in $\sim$500-billion parameter models, PALM and MT-NLG.
Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers
- Authors: Jianbin Jiao, Xina Cheng, Weijie Chen, Xiaoting Yin, Hao Shi, Kailun Yang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2401.16700
- Pdf link: https://arxiv.org/pdf/2401.16700
- Abstract 3D human pose estimation captures the human joint points in three-dimensional space while keeping the depth information and physical structure. That is essential for applications that require precise pose information, such as human-computer interaction, scene understanding, and rehabilitation training. Due to the challenges in data collection, mainstream datasets of 3D human pose estimation are primarily composed of multi-view video data collected in laboratory environments, which contains rich spatial-temporal correlation information besides the image frame content. Given the remarkable self-attention mechanism of transformers, capable of capturing the spatial-temporal correlation from multi-view video datasets, we propose a multi-stage framework for 3D sequence-to-sequence (seq2seq) human pose detection. Firstly, the spatial module represents the human pose feature by intra-image content, while the frame-image relation module extracts temporal relationships and 3D spatial positional relationship features between the multi-perspective images. Secondly, the self-attention mechanism is adopted to eliminate the interference from non-human body parts and reduce computing resources. Our method is evaluated on Human3.6M, a popular 3D human pose detection dataset. Experimental results demonstrate that our approach achieves state-of-the-art performance on this dataset.
OptiState: State Estimation of Legged Robots using Gated Networks with Transformer-based Vision and Kalman Filtering
- Authors: Alexander Schperberg, Yusuke Tanaka, Saviz Mowlavi, Feng Xu, Bharathan Balaji, Dennis Hong
- Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2401.16719
- Pdf link: https://arxiv.org/pdf/2401.16719
- Abstract State estimation for legged robots is challenging due to their highly dynamic motion and limitations imposed by sensor accuracy. By integrating Kalman filtering, optimization, and learning-based modalities, we propose a hybrid solution that combines proprioception and exteroceptive information for estimating the state of the robot's trunk. Leveraging joint encoder and IMU measurements, our Kalman filter is enhanced through a single-rigid body model that incorporates ground reaction force control outputs from convex Model Predictive Control optimization. The estimation is further refined through Gated Recurrent Units, which also considers semantic insights and robot height from a Vision Transformer autoencoder applied on depth images. This framework not only furnishes accurate robot state estimates, including uncertainty evaluations, but can minimize the nonlinear errors that arise from sensor measurements and model simplifications through learning. The proposed methodology is evaluated in hardware using a quadruped robot on various terrains, yielding a 65% improvement on the Root Mean Squared Error compared to our VIO SLAM baseline. Code example: https://github.com/AlexS28/OptiState
Towards Generating Informative Textual Description for Neurons in Language Models
- Authors: Shrayani Mondal, Rishabh Garodia, Arbaaz Qureshi, Taesung Lee, Youngja Park
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.16731
- Pdf link: https://arxiv.org/pdf/2401.16731
- Abstract Recent developments in transformer-based language models have allowed them to capture a wide variety of world knowledge that can be adapted to downstream tasks with limited resources. However, what pieces of information are understood in these models is unclear, and neuron-level contributions in identifying them are largely unknown. Conventional approaches in neuron explainability either depend on a finite set of pre-defined descriptors or require manual annotations for training a secondary model that can then explain the neurons of the primary model. In this paper, we take BERT as an example and we try to remove these constraints and propose a novel and scalable framework that ties textual descriptions to neurons. We leverage the potential of generative language models to discover human-interpretable descriptors present in a dataset and use an unsupervised approach to explain neurons with these descriptors. Through various qualitative and quantitative analyses, we demonstrate the effectiveness of this framework in generating useful data-specific descriptors with little human involvement in identifying the neurons that encode these descriptors. In particular, our experiment shows that the proposed approach achieves 75% precision@2, and 50% recall@2
Engineering A Large Language Model From Scratch
- Authors: Abiodun Finbarrs Oketunji
- Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Software Engineering (cs.SE)
- Arxiv link: https://arxiv.org/abs/2401.16736
- Pdf link: https://arxiv.org/pdf/2401.16736
- Abstract The proliferation of deep learning in natural language processing (NLP) has led to the development and release of innovative technologies capable of understanding and generating human language with remarkable proficiency. Atinuke, a Transformer-based neural network, optimises performance across various language tasks by utilising a unique configuration. The architecture interweaves layers for processing sequential data with attention mechanisms to draw meaningful affinities between inputs and outputs. Due to the configuration of its topology and hyperparameter tuning, it can emulate human-like language by extracting features and learning complex mappings. Atinuke is modular, extensible, and integrates seamlessly with existing machine learning pipelines. Advanced matrix operations like softmax, embeddings, and multi-head attention enable nuanced handling of textual, acoustic, and visual signals. By unifying modern deep learning techniques with software design principles and mathematical theory, the system achieves state-of-the-art results on natural language tasks whilst remaining interpretable and robust.
LATENTPATCH: A Non-Parametric Approach for Face Generation and Editing
- Authors: Benjamin Samuth (UNICAEN, ENSICAEN), Julien Rabin (ENSICAEN, UNICAEN), David Tschumperlé (ENSICAEN, UNICAEN), Frédéric Jurie (ENSICAEN, UNICAEN)
- Subjects: Multimedia (cs.MM); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
- Arxiv link: https://arxiv.org/abs/2401.16830
- Pdf link: https://arxiv.org/pdf/2401.16830
- Abstract This paper presents LatentPatch, a new method for generating realistic images from a small dataset of only a few images. We use a lightweight model with only a few thousand parameters. Unlike traditional few-shot generation methods that finetune pre-trained large-scale generative models, our approach is computed directly on the latent distribution by sequential feature matching, and is explainable by design. Avoiding large models based on transformers, recursive networks, or self-attention, which are not suitable for small datasets, our method is inspired by non-parametric texture synthesis and style transfer models, and ensures that generated image features are sampled from the source distribution. We extend previous single-image models to work with a few images and demonstrate that our method can generate realistic images, as well as enable conditional sampling and image editing. We conduct experiments on face datasets and show that our simplistic model is effective and versatile.
Node Flux-Linkage Synchronizing Control of Power Systems with 100% Wind Power Generation Based on Capacitor Voltage Balancing Scheme
- Authors: Yang Liu, Yanshan Chen, Yuexi Yang, Xiangyu Pei, Feng Ji
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2401.16871
- Pdf link: https://arxiv.org/pdf/2401.16871
- Abstract This paper proposes a node flux-linkage synchronizing control method (NFSCM) for power systems with 100% wind power generation based on a capacitor voltage balancing scheme (CVBS). Different from the conventional grid-forming controllers, NFSCM is designed to regulate inverters as virtual flux-linkage sources. Auto-synchronization of flux-linkage vectors is achieved through the CVBS-based NFSCM. The mismatch among the angular frequencies of flux-linkage vectors is eliminated by regulating the tracking errors of DC-link voltages, which establishes a negative feedback between the output frequency and active power of the inverter. NFSCM is adaptive to weak and strong grids. It avoids the excitation inrush currents in the step-up transformer of wind power generators. It also eliminates the DC components of the three-phase currents, and avoids low-frequency oscillations in active power. In order to limit the short-circuit current of inverters, a logic-based bang-bang funnel control (LBFC) is designed to control the switches of inverter bridges when over-current is detected. LBFC is able to restrict various fault currents within an acceptable range within the shortest time. LBFC and NFSCM are designed to operate in a switched manner according to a state-dependent switching strategy. Time-domain simulations were conducted on a 100% wind power generation test system, and the performance of NFSCM and LBFC were investigated.
CAFCT: Contextual and Attentional Feature Fusions of Convolutional Neural Networks and Transformer for Liver Tumor Segmentation
- Authors: Ming Kang, Chee-Ming Ting, Fung Fung Ting, Raphaël Phan
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Applications (stat.AP)
- Arxiv link: https://arxiv.org/abs/2401.16886
- Pdf link: https://arxiv.org/pdf/2401.16886
- Abstract Medical image semantic segmentation techniques can help identify tumors automatically from computed tomography (CT) scans. In this paper, we propose a Contextual and Attentional feature Fusions enhanced Convolutional Neural Network (CNN) and Transformer hybrid network (CAFCT) model for liver tumor segmentation. In the proposed model, three other modules are introduced in the network architecture: Attentional Feature Fusion (AFF), Atrous Spatial Pyramid Pooling (ASPP) of DeepLabv3, and Attention Gates (AGs) to improve contextual information related to tumor boundaries for accurate segmentation. Experimental results show that the proposed CAFCT achieves a mean Intersection over Union (IoU) of 90.38% and Dice score of 86.78%, respectively, on the Liver Tumor Segmentation Benchmark (LiTS) dataset, outperforming pure CNN or Transformer methods, e.g., Attention U-Net, and PVTFormer.
Deep 3D World Models for Multi-Image Super-Resolution Beyond Optical Flow
- Authors: Luca Savant Aira, Diego Valsesia, Andrea Bordone Molini, Giulia Fracastoro, Enrico Magli, Andrea Mirabile
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2401.16972
- Pdf link: https://arxiv.org/pdf/2401.16972
- Abstract Multi-image super-resolution (MISR) allows to increase the spatial resolution of a low-resolution (LR) acquisition by combining multiple images carrying complementary information in the form of sub-pixel offsets in the scene sampling, and can be significantly more effective than its single-image counterpart. Its main difficulty lies in accurately registering and fusing the multi-image information. Currently studied settings, such as burst photography, typically involve assumptions of small geometric disparity between the LR images and rely on optical flow for image registration. We study a MISR method that can increase the resolution of sets of images acquired with arbitrary, and potentially wildly different, camera positions and orientations, generalizing the currently studied MISR settings. Our proposed model, called EpiMISR, moves away from optical flow and explicitly uses the epipolar geometry of the acquisition process, together with transformer-based processing of radiance feature fields to substantially improve over state-of-the-art MISR methods in presence of large disparities in the LR images.
SAL-PIM: A Subarray-level Processing-in-Memory Architecture with LUT-based Linear Interpolation for Transformer-based Text Generation
- Authors: Wontak Han, Hyunjun Cho, Donghyuk Kim, Joo-Young Kim
- Subjects: Hardware Architecture (cs.AR)
- Arxiv link: https://arxiv.org/abs/2401.17005
- Pdf link: https://arxiv.org/pdf/2401.17005
- Abstract Text generation is a compelling sub-field of natural language processing, aiming to generate human-readable text from input words. In particular, the decoder-only generative models, such as generative pre-trained transformer (GPT), are widely used for text generation, with two major computational stages: summarization and generation. Unlike the summarization stage, which can process the input tokens in parallel, the generation stage is difficult to accelerate due to its sequential generation of output tokens through iteration. Moreover, each iteration requires reading a whole model with little data reuse opportunity. Therefore, the workload of transformer-based text generation is severely memory-bound, making the external memory bandwidth system bottleneck. In this paper, we proposed a subarray-level processing-in-memory architecture named SAL-PIM, HBM-based PIM architecture for the end-to-end acceleration of transformer-based text generation. The SAL-PIM architecture includes three architectural features. First, the SAL-PIM architecture utilizes higher internal bandwidth by integrating multiple subarray-level arithmetic logic units with optimized data mapping schemes. Second, the SAL-PIM architecture adopts LUT-based linear interpolation to perform complex non-linear functions in PIM. Third, the SAL-PIM architecture accelerates end-to-end inference on PIM in text generation. Furthermore, to validate the SAL-PIM architecture, we built cycle-accurate simulator and implemented the SAL-PIM's logic units in 28-nm CMOS technology. As a result, when the input size is from 32 to 128 and the output size is from 1 to 256, SAL-PIM achieves a maximum of 4.72 times speedup and an average of 1.83 times speedup for the text generation based on the GPT-2 medium model compared to the server-level GPU.
Taking Action Towards Graceful Interaction: The Effects of Performing Actions on Modelling Policies for Instruction Clarification Requests
- Authors: Brielen Madureira, David Schlangen
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2401.17039
- Pdf link: https://arxiv.org/pdf/2401.17039
- Abstract Clarification requests are a mechanism to help solve communication problems, e.g. due to ambiguity or underspecification, in instruction-following interactions. Despite their importance, even skilful models struggle with producing or interpreting such repair acts. In this work, we test three hypotheses concerning the effects of action taking as an auxiliary task in modelling iCR policies. Contrary to initial expectations, we conclude that its contribution to learning an iCR policy is limited, but some information can still be extracted from prediction uncertainty. We present further evidence that even well-motivated, Transformer-based models fail to learn good policies for when to ask Instruction CRs (iCRs), while the task of determining what to ask about can be more successfully modelled. Considering the implications of these findings, we further discuss the shortcomings of the data-driven paradigm for learning meta-communication acts.
Forecasting VIX using Bayesian Deep Learning
- Authors: Héctor J. Hortúa, Andrés Mora-Valencia
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.17042
- Pdf link: https://arxiv.org/pdf/2401.17042
- Abstract Recently, deep learning techniques are gradually replacing traditional statistical and machine learning models as the first choice for price forecasting tasks. In this paper, we leverage probabilistic deep learning for inferring the volatility index VIX. We employ the probabilistic counterpart of WaveNet, Temporal Convolutional Network (TCN), and Transformers. We show that TCN outperforms all models with an RMSE around 0.189. In addition, it has been well known that modern neural networks provide inaccurate uncertainty estimates. For solving this problem, we use the standard deviation scaling to calibrate the networks. Furthermore, we found out that MNF with Gaussian prior outperforms Reparameterization Trick and Flipout models in terms of precision and uncertainty predictions. Finally, we claim that MNF with Cauchy and LogUniform prior distributions yield well calibrated TCN and WaveNet networks being the former that best infer the VIX values.
ViTree: Single-path Neural Tree for Step-wise Interpretable Fine-grained Visual Categorization
- Authors: Danning Lao, Qi Liu, Jiazi Bu, Junchi Yan, Wei Shen
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.17050
- Pdf link: https://arxiv.org/pdf/2401.17050
- Abstract As computer vision continues to advance and finds widespread applications across various domains, the need for interpretability in deep learning models becomes paramount. Existing methods often resort to post-hoc techniques or prototypes to explain the decision-making process, which can be indirect and lack intrinsic illustration. In this research, we introduce ViTree, a novel approach for fine-grained visual categorization that combines the popular vision transformer as a feature extraction backbone with neural decision trees. By traversing the tree paths, ViTree effectively selects patches from transformer-processed features to highlight informative local regions, thereby refining representations in a step-wise manner. Unlike previous tree-based models that rely on soft distributions or ensembles of paths, ViTree selects a single tree path, offering a clearer and simpler decision-making process. This patch and path selectivity enhances model interpretability of ViTree, enabling better insights into the model's inner workings. Remarkably, extensive experimentation validates that this streamlined approach surpasses various strong competitors and achieves state-of-the-art performance while maintaining exceptional interpretability which is proved by multi-perspective methods. Code can be found at https://github.com/SJTU-DeepVisionLab/ViTree.
Making Parametric Anomaly Detection on Tabular Data Non-Parametric Again
- Authors: Hugo Thimonier, Fabrice Popineau, Arpad Rimmel, Bich-Liên Doan
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.17052
- Pdf link: https://arxiv.org/pdf/2401.17052
- Abstract Deep learning for tabular data has garnered increasing attention in recent years, yet employing deep models for structured data remains challenging. While these models excel with unstructured data, their efficacy with structured data has been limited. Recent research has introduced retrieval-augmented models to address this gap, demonstrating promising results in supervised tasks such as classification and regression. In this work, we investigate using retrieval-augmented models for anomaly detection on tabular data. We propose a reconstruction-based approach in which a transformer model learns to reconstruct masked features of \textit{normal} samples. We test the effectiveness of KNN-based and attention-based modules to select relevant samples to help in the reconstruction process of the target sample. Our experiments on a benchmark of 31 tabular datasets reveal that augmenting this reconstruction-based anomaly detection (AD) method with non-parametric relationships via retrieval modules may significantly boost performance.
Nested Construction of Polar Codes via Transformers
- Authors: Sravan Kumar Ankireddy, S Ashwin Hebbar, Heping Wan, Joonyoung Cho, Charlie Zhang
- Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.17188
- Pdf link: https://arxiv.org/pdf/2401.17188
- Abstract Tailoring polar code construction for decoding algorithms beyond successive cancellation has remained a topic of significant interest in the field. However, despite the inherent nested structure of polar codes, the use of sequence models in polar code construction is understudied. In this work, we propose using a sequence modeling framework to iteratively construct a polar code for any given length and rate under various channel conditions. Simulations show that polar codes designed via sequential modeling using transformers outperform both 5G-NR sequence and Density Evolution based approaches for both AWGN and Rayleigh fading channels.
Keyword: scene understanding
Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers
- Authors: Jianbin Jiao, Xina Cheng, Weijie Chen, Xiaoting Yin, Hao Shi, Kailun Yang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2401.16700
- Pdf link: https://arxiv.org/pdf/2401.16700
- Abstract 3D human pose estimation captures the human joint points in three-dimensional space while keeping the depth information and physical structure. That is essential for applications that require precise pose information, such as human-computer interaction, scene understanding, and rehabilitation training. Due to the challenges in data collection, mainstream datasets of 3D human pose estimation are primarily composed of multi-view video data collected in laboratory environments, which contains rich spatial-temporal correlation information besides the image frame content. Given the remarkable self-attention mechanism of transformers, capable of capturing the spatial-temporal correlation from multi-view video datasets, we propose a multi-stage framework for 3D sequence-to-sequence (seq2seq) human pose detection. Firstly, the spatial module represents the human pose feature by intra-image content, while the frame-image relation module extracts temporal relationships and 3D spatial positional relationship features between the multi-perspective images. Secondly, the self-attention mechanism is adopted to eliminate the interference from non-human body parts and reduce computing resources. Our method is evaluated on Human3.6M, a popular 3D human pose detection dataset. Experimental results demonstrate that our approach achieves state-of-the-art performance on this dataset.
Non-central panorama indoor dataset
- Authors: Bruno Berenguel-Baeta, Jesus Bermudez-Cameo, Jose J. Guerrero
- Subjects: Databases (cs.DB); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.17075
- Pdf link: https://arxiv.org/pdf/2401.17075
- Abstract Omnidirectional images are one of the main sources of information for learning based scene understanding algorithms. However, annotated datasets of omnidirectional images cannot keep the pace of these learning based algorithms development. Among the different panoramas and in contrast to standard central ones, non-central panoramas provide geometrical information in the distortion of the image from which we can retrieve 3D information of the environment [2]. However, due to the lack of commercial non-central devices, up until now there was no dataset of these kinds of panoramas. In this data paper, we present the first dataset of non-central panoramas for indoor scene understanding. The dataset is composed by {\bf 2574} RGB non-central panoramas taken in around 650 different rooms. Each panorama has associated a depth map and annotations to obtain the layout of the room from the image as a structural edge map, list of corners in the image, the 3D corners of the room and the camera pose. The images are taken from photorealistic virtual environments and pixel-wise automatically annotated.
Keyword: visual reasoning
There is no result