arxiv-daily
arxiv-daily copied to clipboard
New submissions for Fri, 2 Feb 24
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
Real-time Traffic Object Detection for Autonomous Driving
- Authors: Abdul Hannan Khan, Syed Tahseen Raza Rizvi, Andreas Dengel
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.00128
- Pdf link: https://arxiv.org/pdf/2402.00128
- Abstract With recent advances in computer vision, it appears that autonomous driving will be part of modern society sooner rather than later. However, there are still a significant number of concerns to address. Although modern computer vision techniques demonstrate superior performance, they tend to prioritize accuracy over efficiency, which is a crucial aspect of real-time applications. Large object detection models typically require higher computational power, which is achieved by using more sophisticated onboard hardware. For autonomous driving, these requirements translate to increased fuel costs and, ultimately, a reduction in mileage. Further, despite their computational demands, the existing object detectors are far from being real-time. In this research, we assess the robustness of our previously proposed, highly efficient pedestrian detector LSFM on well-established autonomous driving benchmarks, including diverse weather conditions and nighttime scenes. Moreover, we extend our LSFM model for general object detection to achieve real-time object detection in traffic scenes. We evaluate its performance, low latency, and generalizability on traffic object detection datasets. Furthermore, we discuss the inadequacy of the current key performance indicator employed by object detection systems in the context of autonomous driving and propose a more suitable alternative that incorporates real-time requirements.
Improving Object Detection Quality in Football Through Super-Resolution Techniques
- Authors: Karolina Seweryn, Gabriel Chęć, Szymon Łukasik, Anna Wróblewska
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.00163
- Pdf link: https://arxiv.org/pdf/2402.00163
- Abstract This study explores the potential of super-resolution techniques in enhancing object detection accuracy in football. Given the sport's fast-paced nature and the critical importance of precise object (e.g. ball, player) tracking for both analysis and broadcasting, super-resolution could offer significant improvements. We investigate how advanced image processing through super-resolution impacts the accuracy and reliability of object detection algorithms in processing football match footage. Our methodology involved applying state-of-the-art super-resolution techniques to a diverse set of football match videos from SoccerNet, followed by object detection using Faster R-CNN. The performance of these algorithms, both with and without super-resolution enhancement, was rigorously evaluated in terms of detection accuracy. The results indicate a marked improvement in object detection accuracy when super-resolution preprocessing is applied. The improvement of object detection through the integration of super-resolution techniques yields significant benefits, especially for low-resolution scenarios, with a notable 12% increase in mean Average Precision (mAP) at an IoU (Intersection over Union) range of 0.50:0.95 for 320x240 size images when increasing the resolution fourfold using RLFN. As the dimensions increase, the magnitude of improvement becomes more subdued; however, a discernible improvement in the quality of detection is consistently evident. Additionally, we discuss the implications of these findings for real-time sports analytics, player tracking, and the overall viewing experience. The study contributes to the growing field of sports technology by demonstrating the practical benefits and limitations of integrating super-resolution techniques in football analytics and broadcasting.
Capacity Constraint Analysis Using Object Detection for Smart Manufacturing
- Authors: Hafiz Mughees Ahmad, Afshin Rahimi, Khizer Hayat
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.00243
- Pdf link: https://arxiv.org/pdf/2402.00243
- Abstract The increasing popularity of Deep Learning (DL) based Object Detection (OD) methods and their real-world applications have opened new venues in smart manufacturing. Traditional industries struck by capacity constraints after Coronavirus Disease (COVID-19) require non-invasive methods for in-depth operations' analysis to optimize and increase their revenue. In this study, we have initially developed a Convolutional Neural Network (CNN) based OD model to tackle this issue. This model is trained to accurately identify the presence of chairs and individuals on the production floor. The identified objects are then passed to the CNN based tracker, which tracks them throughout their life cycle in the workstation. The extracted meta-data is further processed through a novel framework for the capacity constraint analysis. We identified that the Station C is only 70.6% productive through 6 months. Additionally, the time spent at each station is recorded and aggregated for each object. This data proves helpful in conducting annual audits and effectively managing labor and material over time.
FineBio: A Fine-Grained Video Dataset of Biological Experiments with Hierarchical Annotation
- Authors: Takuma Yagi, Misaki Ohashi, Yifei Huang, Ryosuke Furuta, Shungo Adachi, Toutai Mitsuyama, Yoichi Sato
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.00293
- Pdf link: https://arxiv.org/pdf/2402.00293
- Abstract In the development of science, accurate and reproducible documentation of the experimental process is crucial. Automatic recognition of the actions in experiments from videos would help experimenters by complementing the recording of experiments. Towards this goal, we propose FineBio, a new fine-grained video dataset of people performing biological experiments. The dataset consists of multi-view videos of 32 participants performing mock biological experiments with a total duration of 14.5 hours. One experiment forms a hierarchical structure, where a protocol consists of several steps, each further decomposed into a set of atomic operations. The uniqueness of biological experiments is that while they require strict adherence to steps described in each protocol, there is freedom in the order of atomic operations. We provide hierarchical annotation on protocols, steps, atomic operations, object locations, and their manipulation states, providing new challenges for structured activity understanding and hand-object interaction recognition. To find out challenges on activity understanding in biological experiments, we introduce baseline models and results on four different tasks, including (i) step segmentation, (ii) atomic operation detection (iii) object detection, and (iv) manipulated/affected object detection. Dataset and code are available from https://github.com/aistairc/FineBio.
Night-Rider: Nocturnal Vision-aided Localization in Streetlight Maps Using Invariant Extended Kalman Filtering
- Authors: Tianxiao Gao, Mingle Zhao, Chengzhong Xu, Hui Kong
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2402.00330
- Pdf link: https://arxiv.org/pdf/2402.00330
- Abstract Vision-aided localization for low-cost mobile robots in diverse environments has attracted widespread attention recently. Although many current systems are applicable in daytime environments, nocturnal visual localization is still an open problem owing to the lack of stable visual information. An insight from most nocturnal scenes is that the static and bright streetlights are reliable visual information for localization. Hence we propose a nocturnal vision-aided localization system in streetlight maps with a novel data association and matching scheme using object detection methods. We leverage the Invariant Extended Kalman Filter (InEKF) to fuse IMU, odometer, and camera measurements for consistent state estimation at night. Furthermore, a tracking recovery module is also designed for tracking failures. Experiments on multiple real nighttime scenes validate that the system can achieve remarkably accurate and robust localization in nocturnal environments.
A Manifold Representation of the Key in Vision Transformers
- Authors: Li Meng, Morten Goodwin, Anis Yazidi, Paal Engelstad
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2402.00534
- Pdf link: https://arxiv.org/pdf/2402.00534
- Abstract Vision Transformers implement multi-head self-attention (MSA) via stacking multiple attention blocks. The query, key, and value are often intertwined and generated within those blocks via a single, shared linear transformation. This paper explores the concept of disentangling the key from the query and value, and adopting a manifold representation for the key. Our experiments reveal that decoupling and endowing the key with a manifold structure can enhance the model performance. Specifically, ViT-B exhibits a 0.87% increase in top-1 accuracy, while Swin-T sees a boost of 0.52% in top-1 accuracy on the ImageNet-1K dataset, with eight charts in the manifold key. Our approach also yields positive results in object detection and instance segmentation tasks on the COCO dataset. Through detailed ablation studies, we establish that these performance gains are not merely due to the simplicity of adding more parameters and computations. Future research may investigate strategies for cutting the budget of such representations and aim for further performance improvements based on our findings.
Vehicle Perception from Satellite
- Authors: Bin Zhao, Pengfei Han, Xuelong Li
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.00703
- Pdf link: https://arxiv.org/pdf/2402.00703
- Abstract Satellites are capable of capturing high-resolution videos. It makes vehicle perception from satellite become possible. Compared to street surveillance, drive recorder or other equipments, satellite videos provide a much broader city-scale view, so that the global dynamic scene of the traffic are captured and displayed. Traffic monitoring from satellite is a new task with great potential applications, including traffic jams prediction, path planning, vehicle dispatching, \emph{etc.}. Practically, limited by the resolution and view, the captured vehicles are very tiny (a few pixels) and move slowly. Worse still, these satellites are in Low Earth Orbit (LEO) to capture such high-resolution videos, so the background is also moving. Under this circumstance, traffic monitoring from the satellite view is an extremely challenging task. To attract more researchers into this field, we build a large-scale benchmark for traffic monitoring from satellite. It supports several tasks, including tiny object detection, counting and density estimation. The dataset is constructed based on 12 satellite videos and 14 synthetic videos recorded from GTA-V. They are separated into 408 video clips, which contain 7,336 real satellite images and 1,960 synthetic images. 128,801 vehicles are annotated totally, and the number of vehicles in each image varies from 0 to 101. Several classic and state-of-the-art approaches in traditional computer vision are evaluated on the datasets, so as to compare the performance of different approaches, analyze the challenges in this task, and discuss the future prospects. The dataset is available at: https://github.com/Chenxi1510/Vehicle-Perception-from-Satellite-Videos.
Keyword: transformer
LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition
- Authors: Youbing Hu, Yun Cheng, Anqi Lu, Zhiqiang Cao, Dawei Wei, Jie Liu, Zhijun Li
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.00033
- Pdf link: https://arxiv.org/pdf/2402.00033
- Abstract The Vision Transformer (ViT) excels in accuracy when handling high-resolution images, yet it confronts the challenge of significant spatial redundancy, leading to increased computational and memory requirements. To address this, we present the Localization and Focus Vision Transformer (LF-ViT). This model operates by strategically curtailing computational demands without impinging on performance. In the Localization phase, a reduced-resolution image is processed; if a definitive prediction remains elusive, our pioneering Neighborhood Global Class Attention (NGCA) mechanism is triggered, effectively identifying and spotlighting class-discriminative regions based on initial findings. Subsequently, in the Focus phase, this designated region is used from the original image to enhance recognition. Uniquely, LF-ViT employs consistent parameters across both phases, ensuring seamless end-to-end optimization. Our empirical tests affirm LF-ViT's prowess: it remarkably decreases Deit-S's FLOPs by 63% and concurrently amplifies throughput twofold. Code of this project is at https://github.com/edgeai1/LF-ViT.git.
TrackGPT -- A generative pre-trained transformer for cross-domain entity trajectory forecasting
- Authors: Nicholas Stroh
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.00066
- Pdf link: https://arxiv.org/pdf/2402.00066
- Abstract The forecasting of entity trajectories at future points in time is a critical capability gap in applications across both Commercial and Defense sectors. Transformers, and specifically Generative Pre-trained Transformer (GPT) networks have recently revolutionized several fields of Artificial Intelligence, most notably Natural Language Processing (NLP) with the advent of Large Language Models (LLM) like OpenAI's ChatGPT. In this research paper, we introduce TrackGPT, a GPT-based model for entity trajectory forecasting that has shown utility across both maritime and air domains, and we expect to perform well in others. TrackGPT stands as a pioneering GPT model capable of producing accurate predictions across diverse entity time series datasets, demonstrating proficiency in generating both long-term forecasts with sustained accuracy and short-term forecasts with high precision. We present benchmarks against state-of-the-art deep learning techniques, showing that TrackGPT's forecasting capability excels in terms of accuracy, reliability, and modularity. Importantly, TrackGPT achieves these results while remaining domain-agnostic and requiring minimal data features (only location and time) compared to models achieving similar performance. In conclusion, our findings underscore the immense potential of applying GPT architectures to the task of entity trajectory forecasting, exemplified by the innovative TrackGPT model.
D-Nikud: Enhancing Hebrew Diacritization with LSTM and Pretrained Models
- Authors: Adi Rosenthal, Nadav Shaked
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.00075
- Pdf link: https://arxiv.org/pdf/2402.00075
- Abstract D-Nikud, a novel approach to Hebrew diacritization that integrates the strengths of LSTM networks and BERT-based (transformer) pre-trained model. Inspired by the methodologies employed in Nakdimon, we integrate it with the TavBERT pre-trained model, our system incorporates advanced architectural choices and diverse training data. Our experiments showcase state-of-the-art results on several benchmark datasets, with a particular emphasis on modern texts and more specified diacritization like gender.
Common Sense Reasoning for Deep Fake Detection
- Authors: Yue Zhang, Ben Colman, Ali Shahriyari, Gaurav Bharaj
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.00126
- Pdf link: https://arxiv.org/pdf/2402.00126
- Abstract State-of-the-art approaches rely on image-based features extracted via neural networks for the deepfake detection binary classification. While these approaches trained in the supervised sense extract likely fake features, they may fall short in representing unnatural `non-physical' semantic facial attributes -- blurry hairlines, double eyebrows, rigid eye pupils, or unnatural skin shading. However, such facial attributes are generally easily perceived by humans via common sense reasoning. Furthermore, image-based feature extraction methods that provide visual explanation via saliency maps can be hard to be interpreted by humans. To address these challenges, we propose the use of common sense reasoning to model deepfake detection, and extend it to the Deepfake Detection VQA (DD-VQA) task with the aim to model human intuition in explaining the reason behind labeling an image as either real or fake. To this end, we introduce a new dataset that provides answers to the questions related to the authenticity of an image, along with its corresponding explanations. We also propose a Vision and Language Transformer-based framework for the DD-VQA task, incorporating text and image aware feature alignment formulations. Finally, we evaluate our method on both the performance of deepfake detection and the quality of the generated explanations. We hope that this task inspires researchers to explore new avenues for enhancing language-based interpretability and cross-modality applications in the realm of deepfake detection.
Exploring the limits of decoder-only models trained on public speech recognition corpora
- Authors: Ankit Gupta, George Saon, Brian Kingsbury
- Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2402.00235
- Pdf link: https://arxiv.org/pdf/2402.00235
- Abstract The emergence of industrial-scale speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio only proprietary data respectively, has led to a stronger need for large scale public ASR corpora and competitive open source pipelines. Unlike the said models, large language models are typically based on Transformer decoders, and it remains unclear if decoder-only models trained on public data alone can deliver competitive performance. In this work, we investigate factors such as choice of training datasets and modeling components necessary for obtaining the best performance using public English ASR corpora alone. Our Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the encoder-decoder open source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets. We release our codebase and model checkpoints under permissive license.
Positional Encoding Helps Recurrent Neural Networks Handle a Large Vocabulary
- Authors: Takashi Morita
- Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
- Arxiv link: https://arxiv.org/abs/2402.00236
- Pdf link: https://arxiv.org/pdf/2402.00236
- Abstract This study discusses the effects of positional encoding on recurrent neural networks (RNNs) utilizing synthetic benchmarks. Positional encoding "time-stamps" data points in time series and complements the capabilities of Transformer neural networks, which lack an inherent mechanism for representing the data order. By contrast, RNNs can encode the temporal information of data points on their own, rendering their use of positional encoding seemingly "redundant". Nonetheless, empirical investigations reveal the effectiveness of positional encoding even when coupled with RNNs, specifically for handling a large vocabulary that yields diverse observations. These findings pave the way for a new line of research on RNNs, concerning the combination of input-driven and autonomous time representation. Additionally, biological implications of the computational/simulational results are discussed, in the light of the affinity between the sinusoidal implementation of positional encoding and neural oscillations in biological brains.
LRDif: Diffusion Models for Under-Display Camera Emotion Recognition
- Authors: Zhifeng Wang, Kaihao Zhang, Ramesh Sankaranarayana
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.00250
- Pdf link: https://arxiv.org/pdf/2402.00250
- Abstract This study introduces LRDif, a novel diffusion-based framework designed specifically for facial expression recognition (FER) within the context of under-display cameras (UDC). To address the inherent challenges posed by UDC's image degradation, such as reduced sharpness and increased noise, LRDif employs a two-stage training strategy that integrates a condensed preliminary extraction network (FPEN) and an agile transformer network (UDCformer) to effectively identify emotion labels from UDC images. By harnessing the robust distribution mapping capabilities of Diffusion Models (DMs) and the spatial dependency modeling strength of transformers, LRDif effectively overcomes the obstacles of noise and distortion inherent in UDC environments. Comprehensive experiments on standard FER datasets including RAF-DB, KDEF, and FERPlus, LRDif demonstrate state-of-the-art performance, underscoring its potential in advancing FER applications. This work not only addresses a significant gap in the literature by tackling the UDC challenge in FER but also sets a new benchmark for future research in the field.
Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues
- Authors: Soufiane Belharbi, Marco Pedersoli, Alessandro Lameiras Koerich, Simon Bacon, Eric Granger
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.00281
- Pdf link: https://arxiv.org/pdf/2402.00281
- Abstract While state-of-the-art facial expression recognition (FER) classifiers achieve a high level of accuracy, they lack interpretability, an important aspect for end-users. To recognize basic facial expressions, experts resort to a codebook associating a set of spatial action units to a facial expression. In this paper, we follow the same expert footsteps, and propose a learning strategy that allows us to explicitly incorporate spatial action units (aus) cues into the classifier's training to build a deep interpretable model. In particular, using this aus codebook, input image expression label, and facial landmarks, a single action units heatmap is built to indicate the most discriminative regions of interest in the image w.r.t the facial expression. We leverage this valuable spatial cue to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with \aus map. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with aus maps, simulating the experts' decision process. This is achieved using only the image class expression as supervision and without any extra manual annotations. Moreover, our method is generic. It can be applied to any CNN- or transformer-based deep classifier without the need for architectural change or adding significant training time. Our extensive evaluation on two public benchmarks RAFDB, and AFFECTNET datasets shows that our proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifiers that rely on Class-Activation Mapping methods (CAMs), and we show that our training technique improves the CAM interpretability.
Merging Multi-Task Models via Weight-Ensembling Mixture of Experts
- Authors: Anke Tang, Li Shen, Yong Luo, Nan Yin, Lefei Zhang, Dacheng Tao
- Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.00433
- Pdf link: https://arxiv.org/pdf/2402.00433
- Abstract Merging various task-specific Transformer-based models trained on different tasks into a single unified model can execute all the tasks concurrently. Previous methods, exemplified by task arithmetic, have been proven to be both effective and scalable. Existing methods have primarily focused on seeking a static optimal solution within the original model parameter space. A notable challenge is mitigating the interference between parameters of different models, which can substantially deteriorate performance. In this paper, we propose to merge most of the parameters while upscaling the MLP of the Transformer layers to a weight-ensembling mixture of experts (MoE) module, which can dynamically integrate shared and task-specific knowledge based on the input, thereby providing a more flexible solution that can adapt to the specific needs of each instance. Our key insight is that by identifying and separating shared knowledge and task-specific knowledge, and then dynamically integrating them, we can mitigate the parameter interference problem to a great extent. We conduct the conventional multi-task model merging experiments and evaluate the generalization and robustness of our method. The results demonstrate the effectiveness of our method and provide a comprehensive understanding of our method. The code is available at https://anonymous.4open.science/r/weight-ensembling_MoE-67C9/
Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling
- Authors: Mingze Wang, Weinan E
- Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2402.00522
- Pdf link: https://arxiv.org/pdf/2402.00522
- Abstract We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects through establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads, and these insights also provide natural suggestions for alternative architectures.
A Manifold Representation of the Key in Vision Transformers
- Authors: Li Meng, Morten Goodwin, Anis Yazidi, Paal Engelstad
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2402.00534
- Pdf link: https://arxiv.org/pdf/2402.00534
- Abstract Vision Transformers implement multi-head self-attention (MSA) via stacking multiple attention blocks. The query, key, and value are often intertwined and generated within those blocks via a single, shared linear transformation. This paper explores the concept of disentangling the key from the query and value, and adopting a manifold representation for the key. Our experiments reveal that decoupling and endowing the key with a manifold structure can enhance the model performance. Specifically, ViT-B exhibits a 0.87% increase in top-1 accuracy, while Swin-T sees a boost of 0.52% in top-1 accuracy on the ImageNet-1K dataset, with eight charts in the manifold key. Our approach also yields positive results in object detection and instance segmentation tasks on the COCO dataset. Through detailed ablation studies, we establish that these performance gains are not merely due to the simplicity of adding more parameters and computations. Future research may investigate strategies for cutting the budget of such representations and aim for further performance improvements based on our findings.
Dynamic Texture Transfer using PatchMatch and Transformers
- Authors: Guo Pu, Shiyao Xu, Xixin Cao, Zhouhui Lian
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.00606
- Pdf link: https://arxiv.org/pdf/2402.00606
- Abstract How to automatically transfer the dynamic texture of a given video to the target still image is a challenging and ongoing problem. In this paper, we propose to handle this task via a simple yet effective model that utilizes both PatchMatch and Transformers. The key idea is to decompose the task of dynamic texture transfer into two stages, where the start frame of the target video with the desired dynamic texture is synthesized in the first stage via a distance map guided texture transfer module based on the PatchMatch algorithm. Then, in the second stage, the synthesized image is decomposed into structure-agnostic patches, according to which their corresponding subsequent patches can be predicted by exploiting the powerful capability of Transformers equipped with VQ-VAE for processing long discrete sequences. After getting all those patches, we apply a Gaussian weighted average merging strategy to smoothly assemble them into each frame of the target stylized video. Experimental results demonstrate the effectiveness and superiority of the proposed method in dynamic texture transfer compared to the state of the art.
Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks
- Authors: Zhongxin Liu, Zhijie Tang, Junwei Zhang, Xin Xia, Xiaohu Yang
- Subjects: Software Engineering (cs.SE)
- Arxiv link: https://arxiv.org/abs/2402.00657
- Pdf link: https://arxiv.org/pdf/2402.00657
- Abstract Vulnerability analysis is crucial for software security. This work focuses on using pre-training techniques to enhance the understanding of vulnerable code and boost vulnerability analysis. The code understanding ability of a pre-trained model is highly related to its pre-training objectives. The semantic structure, e.g., control and data dependencies, of code is important for vulnerability analysis. However, existing pre-training objectives either ignore such structure or focus on learning to use it. The feasibility and benefits of learning the knowledge of analyzing semantic structure have not been investigated. To this end, this work proposes two novel pre-training objectives, namely Control Dependency Prediction (CDP) and Data Dependency Prediction (DDP), which aim to predict the statement-level control dependencies and token-level data dependencies, respectively, in a code snippet only based on its source code. During pre-training, CDP and DDP can guide the model to learn the knowledge required for analyzing fine-grained dependencies in code. After pre-training, the pre-trained model can boost the understanding of vulnerable code during fine-tuning and can directly be used to perform dependence analysis for both partial and complete functions. To demonstrate the benefits of our pre-training objectives, we pre-train a Transformer model named PDBERT with CDP and DDP, fine-tune it on three vulnerability analysis tasks, i.e., vulnerability detection, vulnerability classification, and vulnerability assessment, and also evaluate it on program dependence analysis. Experimental results show that PDBERT benefits from CDP and DDP, leading to state-of-the-art performance on the three downstream tasks. Also, PDBERT achieves F1-scores of over 99% and 94% for predicting control and data dependencies, respectively, in partial and complete functions.
Uncertainty-Aware Guidance for Target Tracking subject to Intermittent Measurements using Motion Model Learning
- Authors: Andres Pulido, Kyle Volle, Kristy Waters, Zachary I. Bell, Prashant Ganesh, Jane Shin
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2402.00671
- Pdf link: https://arxiv.org/pdf/2402.00671
- Abstract This letter presents a novel guidance law for target tracking applications where the target motion model is unknown and sensor measurements are intermittent due to unknown environmental conditions and low measurement update rate. In this work, the target motion model is represented by a transformer-based neural network and trained by previous target position measurements. This neural network (NN)-based motion model serves as the prediction step in a particle filter for target state estimation and uncertainty quantification. Then this estimation uncertainty is utilized in the information-driven guidance law to compute a path for the mobile agent to travel to a position with maximum expected entropy reduction (EER). The computation of EER is performed in real-time by approximating the probability distribution of the state using the particle representation from particle filter. Simulation and hardware experiments are performed with a quadcopter agent and TurtleBot target to demonstrate that the presented guidance law outperforms two other baseline guidance methods.
Comparative Study of Large Language Model Architectures on Frontier
- Authors: Junqi Yin, Avishek Bose, Guojing Cong, Isaac Lyngaas, Quentin Anthony
- Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2402.00691
- Pdf link: https://arxiv.org/pdf/2402.00691
- Abstract Large language models (LLMs) have garnered significant attention in both the AI community and beyond. Among these, the Generative Pre-trained Transformer (GPT) has emerged as the dominant architecture, spawning numerous variants. However, these variants have undergone pre-training under diverse conditions, including variations in input data, data preprocessing, and training methodologies, resulting in a lack of controlled comparative studies. Here we meticulously examine two prominent open-sourced GPT architectures, GPT-NeoX and LLaMA, leveraging the computational power of Frontier, the world's first Exascale supercomputer. Employing the same materials science text corpus and a comprehensive end-to-end pipeline, we conduct a comparative analysis of their training and downstream performance. Our efforts culminate in achieving state-of-the-art performance on a challenging materials science benchmark. Furthermore, we investigate the computation and energy efficiency, and propose a computationally efficient method for architecture design. To our knowledge, these pre-trained models represent the largest available for materials science. Our findings provide practical guidance for building LLMs on HPC platforms.
Improving Semantic Control in Discrete Latent Spaces with Transformer Quantized Variational Autoencoders
- Authors: Yingji Zhang, Danilo S. Carvalho, Marco Valentino, Ian Pratt-Hartmann, Andre Freitas
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2402.00723
- Pdf link: https://arxiv.org/pdf/2402.00723
- Abstract Achieving precise semantic control over the latent spaces of Variational AutoEncoders (VAEs) holds significant value for downstream tasks in NLP as the underlying generative mechanisms could be better localised, explained and improved upon. Recent research, however, has struggled to achieve consistent results, primarily due to the inevitable loss of semantic information in the variational bottleneck and limited control over the decoding mechanism. To overcome these challenges, we investigate discrete latent spaces in Vector Quantized Variational AutoEncoders (VQVAEs) to improve semantic control and generation in Transformer-based VAEs. In particular, We propose T5VQVAE, a novel model that leverages the controllability of VQVAEs to guide the self-attention mechanism in T5 at the token-level, exploiting its full generalization capabilities. Experimental results indicate that T5VQVAE outperforms existing state-of-the-art VAE models, including Optimus, in terms of controllability and preservation of semantic information across different tasks such as auto-encoding of sentences and mathematical expressions, text transfer, and inference. Moreover, T5VQVAE exhibits improved inference capabilities, suggesting potential applications for downstream natural language and symbolic reasoning tasks.
Benefits of Transformer: In-Context Learning in Linear Regression Tasks with Unstructured Data
- Authors: Yue Xing, Xiaofeng Lin, Namjoon Suh, Qifan Song, Guang Cheng
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2402.00743
- Pdf link: https://arxiv.org/pdf/2402.00743
- Abstract In practice, it is observed that transformer-based models can learn concepts in context in the inference stage. While existing literature, e.g., \citet{zhang2023trained,huang2023context}, provide theoretical explanations on this in-context learning ability, they assume the input $x_i$ and the output $y_i$ for each sample are embedded in the same token (i.e., structured data). However, in reality, they are presented in two tokens (i.e., unstructured data \cite{wibisono2023role}). In this case, this paper conducts experiments in linear regression tasks to study the benefits of the architecture of transformers and provides some corresponding theoretical intuitions to explain why the transformer can learn from unstructured data. We study the exact components in a transformer that facilitate the in-context learning. In particular, we observe that (1) a transformer with two layers of softmax (self-)attentions with look-ahead attention mask can learn from the prompt if $y_i$ is in the token next to $x_i$ for each example; (2) positional encoding can further improve the performance; and (3) multi-head attention with a high input embedding dimension has a better prediction performance than single-head attention.
Dense Reward for Free in Reinforcement Learning from Human Feedback
- Authors: Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2402.00782
- Pdf link: https://arxiv.org/pdf/2402.00782
- Abstract Reinforcement Learning from Human Feedback (RLHF) has been credited as the key advance that has allowed Large Language Models (LLMs) to effectively follow instructions and produce useful assistance. Classically, this involves generating completions from the LLM in response to a query before using a separate reward model to assign a score to the full completion. As an auto-regressive process, the LLM has to take many "actions" (selecting individual tokens) and only receives a single, sparse reward at the end of an episode, a setup that is known to be difficult to optimise in traditional reinforcement learning. In this work we leverage the fact that the reward model contains more information than just its scalar output, in particular, it calculates an attention map over tokens as part of the transformer architecture. We use these attention weights to redistribute the reward along the whole completion, effectively densifying the signal and highlighting the most important tokens, all without incurring extra computational cost or requiring any additional modelling. We demonstrate that, theoretically, this approach is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces
- Authors: Chloe Wang, Oleksii Tsepa, Jun Ma, Bo Wang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.00789
- Pdf link: https://arxiv.org/pdf/2402.00789
- Abstract Attention mechanisms have been widely used to capture long-range dependencies among nodes in Graph Transformers. Bottlenecked by the quadratic computational cost, attention mechanisms fail to scale in large graphs. Recent improvements in computational efficiency are mainly achieved by attention sparsification with random or heuristic-based graph subsampling, which falls short in data-dependent context reasoning. State space models (SSMs), such as Mamba, have gained prominence for their effectiveness and efficiency in modeling long-range dependencies in sequential data. However, adapting SSMs to non-sequential graph data presents a notable challenge. In this work, we introduce Graph-Mamba, the first attempt to enhance long-range context modeling in graph networks by integrating a Mamba block with the input-dependent node selection mechanism. Specifically, we formulate graph-centric node prioritization and permutation strategies to enhance context-aware reasoning, leading to a substantial improvement in predictive performance. Extensive experiments on ten benchmark datasets demonstrate that Graph-Mamba outperforms state-of-the-art methods in long-range graph prediction tasks, with a fraction of the computational cost in both FLOPs and GPU memory consumption. The code and models are publicly available at https://github.com/bowang-lab/Graph-Mamba.
Common errors in Generative AI systems used for knowledge extraction in the climate action domain
- Authors: Denis Havlik, Marcelo Pias
- Subjects: Computers and Society (cs.CY)
- Arxiv link: https://arxiv.org/abs/2402.00830
- Pdf link: https://arxiv.org/pdf/2402.00830
- Abstract Large Language Models (LLMs) and, more specifically, the Generative Pre-Trained Transformers (GPT) can help stakeholders in climate action explore digital knowledge bases and extract and utilize climate action knowledge in a sustainable manner. However, LLMs are "probabilistic models of knowledge bases" that excel at generating convincing texts but cannot be entirely relied upon due to the probabilistic nature of the information produced. This brief report illustrates the problem space with examples of LLM responses to some of the questions of relevance to climate action.
ALISON: Fast and Effective Stylometric Authorship Obfuscation
- Authors: Eric Xing, Saranya Venkatraman, Thai Le, Dongwon Lee
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2402.00835
- Pdf link: https://arxiv.org/pdf/2402.00835
- Abstract Authorship Attribution (AA) and Authorship Obfuscation (AO) are two competing tasks of increasing importance in privacy research. Modern AA leverages an author's consistent writing style to match a text to its author using an AA classifier. AO is the corresponding adversarial task, aiming to modify a text in such a way that its semantics are preserved, yet an AA model cannot correctly infer its authorship. To address privacy concerns raised by state-of-the-art (SOTA) AA methods, new AO methods have been proposed but remain largely impractical to use due to their prohibitively slow training and obfuscation speed, often taking hours. To this challenge, we propose a practical AO method, ALISON, that (1) dramatically reduces training/obfuscation time, demonstrating more than 10x faster obfuscation than SOTA AO methods, (2) achieves better obfuscation success through attacking three transformer-based AA methods on two benchmark datasets, typically performing 15% better than competing methods, (3) does not require direct signals from a target AA classifier during obfuscation, and (4) utilizes unique stylometric features, allowing sound model interpretation for explainable obfuscation. We also demonstrate that ALISON can effectively prevent four SOTA AA methods from accurately determining the authorship of ChatGPT-generated texts, all while minimally changing the original text semantics. To ensure the reproducibility of our findings, our code and data are available at: https://github.com/EricX003/ALISON.
Keyword: scene understanding
There is no result
Keyword: visual reasoning
There is no result