arxiv-daily
arxiv-daily copied to clipboard
New submissions for Wed, 20 Dec 23
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
Anomaly detection for automated inspection of power line insulators
- Authors: Laya Das, Blazhe Gjorgiev, Giovanni Sansavini
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2312.11470
- Pdf link: https://arxiv.org/pdf/2312.11470
- Abstract Inspection of insulators is important to ensure reliable operation of the power system. Deep learning has recently been explored to automate the inspection process by leveraging aerial images captured by drones along with powerful object detection models. However, a purely object detection-based approach exhibits class imbalance-induced poor detection accuracy for faulty insulators, especially for incipient faults. In order to address this issue in a data-efficient manner, this article proposes a two-stage approach that leverages object detection in conjunction with anomaly detection to reliably detect faults in insulators. The article adopts an explainable deep neural network-based one-class classifier for anomaly detection, that reduces the reliance on plentifully available images of faulty insulators, that might be difficult to obtain in real-life applications. The anomaly detection model is trained with two datasets -- representing data abundant and data scarce scenarios -- in unsupervised and semi-supervised manner. The results suggest that including as few as six real anomalies in the training dataset significantly improves the performance of the model, and enables reliable detection of rarely occurring faults in insulators. An analysis of the explanations provided by the anomaly detection model reveals that the model is able to accurately identify faulty regions on the insulator disks, while also exhibiting some false predictions.
Diffusion-Based Particle-DETR for BEV Perception
- Authors: Asen Nachkov, Martin Danelljan, Danda Pani Paudel, Luc Van Gool
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.11578
- Pdf link: https://arxiv.org/pdf/2312.11578
- Abstract The Bird-Eye-View (BEV) is one of the most widely-used scene representations for visual perception in Autonomous Vehicles (AVs) due to its well suited compatibility to downstream tasks. For the enhanced safety of AVs, modeling perception uncertainty in BEV is crucial. Recent diffusion-based methods offer a promising approach to uncertainty modeling for visual perception but fail to effectively detect small objects in the large coverage of the BEV. Such degradation of performance can be attributed primarily to the specific network architectures and the matching strategy used when training. Here, we address this problem by combining the diffusion paradigm with current state-of-the-art 3D object detectors in BEV. We analyze the unique challenges of this approach, which do not exist with deterministic detectors, and present a simple technique based on object query interpolation that allows the model to learn positional dependencies even in the presence of the diffusion noise. Based on this, we present a diffusion-based DETR model for object detection that bears similarities to particle methods. Abundant experimentation on the NuScenes dataset shows equal or better performance for our generative approach, compared to deterministic state-of-the-art methods. Our source code will be made publicly available.
Squeezed Edge YOLO: Onboard Object Detection on Edge Devices
- Authors: Edward Humes, Mozhgan Navardi, Tinoosh Mohsenin
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.11716
- Pdf link: https://arxiv.org/pdf/2312.11716
- Abstract Demand for efficient onboard object detection is increasing due to its key role in autonomous navigation. However, deploying object detection models such as YOLO on resource constrained edge devices is challenging due to the high computational requirements of such models. In this paper, an compressed object detection model named Squeezed Edge YOLO is examined. This model is compressed and optimized to kilobytes of parameters in order to fit onboard such edge devices. To evaluate Squeezed Edge YOLO, two use cases - human and shape detection - are used to show the model accuracy and performance. Moreover, the model is deployed onboard a GAP8 processor with 8 RISC-V cores and an NVIDIA Jetson Nano with 4GB of memory. Experimental results show Squeezed Edge YOLO model size is optimized by a factor of 8x which leads to 76% improvements in energy efficiency and 3.3x faster throughout.
Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment
- Authors: Hamza Mukhtar, Muhammad Usman Ghani Khan
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.11929
- Pdf link: https://arxiv.org/pdf/2312.11929
- Abstract Multi-object tracking (MOT) has profound applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics. Despite considerable advancements, existing MOT methodologies tend to falter when faced with non-uniform movements, occlusions, and appearance-reappearance scenarios of the objects. Recognizing this inadequacy, we put forward an integrated MOT method that not only marries object detection and identity linkage within a singular, end-to-end trainable framework but also equips the model with the ability to maintain object identity links over long periods of time. Our proposed model, named STMMOT, is built around four key modules: 1) candidate proposal generation, which generates object proposals via a vision-transformer encoder-decoder architecture that detects the object from each frame in the video; 2) scale variant pyramid, a progressive pyramid structure to learn the self-scale and cross-scale similarities in multi-scale feature maps; 3) spatio-temporal memory encoder, extracting the essential information from the memory associated with each object under tracking; and 4) spatio-temporal memory decoder, simultaneously resolving the tasks of object detection and identity association for MOT. Our system leverages a robust spatio-temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator. The uniqueness of STMMOT lies in representing objects as dynamic query embeddings that are updated continuously, which enables the prediction of object states with attention mechanisms and eradicates the need for post-processing.
Diffusing More Objects for Semi-Supervised Domain Adaptation with Less Labeling
- Authors: Leander van den Heuvel, Gertjan Burghouts, David W. Zhang, Gwenn Englebienne, Sabina B. van Rooij
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2312.12000
- Pdf link: https://arxiv.org/pdf/2312.12000
- Abstract For object detection, it is possible to view the prediction of bounding boxes as a reverse diffusion process. Using a diffusion model, the random bounding boxes are iteratively refined in a denoising step, conditioned on the image. We propose a stochastic accumulator function that starts each run with random bounding boxes and combines the slightly different predictions. We empirically verify that this improves detection performance. The improved detections are leveraged on unlabelled images as weighted pseudo-labels for semi-supervised learning. We evaluate the method on a challenging out-of-domain test set. Our method brings significant improvements and is on par with human-selected pseudo-labels, while not requiring any human involvement.
Object-Aware Domain Generalization for Object Detection
- Authors: Wooju Lee, Dasol Hong, Hyungtae Lim, Hyun Myung
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.12133
- Pdf link: https://arxiv.org/pdf/2312.12133
- Abstract Single-domain generalization (S-DG) aims to generalize a model to unseen environments with a single-source domain. However, most S-DG approaches have been conducted in the field of classification. When these approaches are applied to object detection, the semantic features of some objects can be damaged, which can lead to imprecise object localization and misclassification. To address these problems, we propose an object-aware domain generalization (OA-DG) method for single-domain generalization in object detection. Our method consists of data augmentation and training strategy, which are called OA-Mix and OA-Loss, respectively. OA-Mix generates multi-domain data with multi-level transformation and object-aware mixing strategy. OA-Loss enables models to learn domain-invariant representations for objects and backgrounds from the original and OA-Mixed images. Our proposed method outperforms state-of-the-art works on standard benchmarks. Our code is available at https://github.com/WoojuLee24/OA-DG.
First qualitative observations on deep learning vision model YOLO and DETR for automated driving in Austria
- Authors: Stefan Schoder
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.12314
- Pdf link: https://arxiv.org/pdf/2312.12314
- Abstract This study investigates the application of single and two-stage 2D-object detection algorithms like You Only Look Once (YOLO), Real-Time DEtection TRansformer (RT-DETR) algorithm for automated object detection to enhance road safety for autonomous driving on Austrian roads. The YOLO algorithm is a state-of-the-art real-time object detection system known for its efficiency and accuracy. In the context of driving, its potential to rapidly identify and track objects is crucial for advanced driver assistance systems (ADAS) and autonomous vehicles. The research focuses on the unique challenges posed by the road conditions and traffic scenarios in Austria. The country's diverse landscape, varying weather conditions, and specific traffic regulations necessitate a tailored approach for reliable object detection. The study utilizes a selective dataset comprising images and videos captured on Austrian roads, encompassing urban, rural, and alpine environments.
LASA: Instance Reconstruction from Real Scans using A Large-scale Aligned Shape Annotation Dataset
- Authors: Haolin Liu, Chongjie Ye, Yinyu Nie, Yingfan He, Xiaoguang Han
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.12418
- Pdf link: https://arxiv.org/pdf/2312.12418
- Abstract Instance shape reconstruction from a 3D scene involves recovering the full geometries of multiple objects at the semantic instance level. Many methods leverage data-driven learning due to the intricacies of scene complexity and significant indoor occlusions. Training these methods often requires a large-scale, high-quality dataset with aligned and paired shape annotations with real-world scans. Existing datasets are either synthetic or misaligned, restricting the performance of data-driven methods on real data. To this end, we introduce LASA, a Large-scale Aligned Shape Annotation Dataset comprising 10,412 high-quality CAD annotations aligned with 920 real-world scene scans from ArkitScenes, created manually by professional artists. On this top, we propose a novel Diffusion-based Cross-Modal Shape Reconstruction (DisCo) method. It is empowered by a hybrid feature aggregation design to fuse multi-modal inputs and recover high-fidelity object geometries. Besides, we present an Occupancy-Guided 3D Object Detection (OccGOD) method and demonstrate that our shape annotations provide scene occupancy clues that can further improve 3D object detection. Supported by LASA, extensive experiments show that our methods achieve state-of-the-art performance in both instance-level scene reconstruction and 3D object detection tasks.
The Endoscapes Dataset for Surgical Scene Segmentation, Object Detection, and Critical View of Safety Assessment: Official Splits and Benchmark
- Authors: Aditya Murali, Deepak Alapatt, Pietro Mascagni, Armine Vardazaryan, Alain Garcia, Nariaki Okamoto, Guido Costamagna, Didier Mutter, Jacques Marescaux, Bernard Dallemagne, Nicolas Padoy
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.12429
- Pdf link: https://arxiv.org/pdf/2312.12429
- Abstract This technical report provides a detailed overview of Endoscapes, a dataset of laparoscopic cholecystectomy (LC) videos with highly intricate annotations targeted at automated assessment of the Critical View of Safety (CVS). Endoscapes comprises 201 LC videos with frames annotated sparsely but regularly with segmentation masks, bounding boxes, and CVS assessment by three different clinical experts. Altogether, there are 11090 frames annotated with CVS and 1933 frames annotated with tool and anatomy bounding boxes from the 201 videos, as well as an additional 422 frames from 50 of the 201 videos annotated with tool and anatomy segmentation masks. In this report, we provide detailed dataset statistics (size, class distribution, dataset splits, etc.) and a comprehensive performance benchmark for instance segmentation, object detection, and CVS prediction. The dataset and model checkpoints are publically available at https://github.com/CAMMA-public/Endoscapes.
Weakly Supervised Open-Vocabulary Object Detection
- Authors: Jianghang Lin, Yunhang Shen, Bingquan Wang, Shaohui Lin, Ke Li, Liujuan Cao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.12437
- Pdf link: https://arxiv.org/pdf/2312.12437
- Abstract Despite weakly supervised object detection (WSOD) being a promising step toward evading strong instance-level annotations, its capability is confined to closed-set categories within a single training dataset. In this paper, we propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize diverse datasets with only image-level annotations. To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment. First, we perform data-aware feature extraction to produce an input-conditional coefficient, which is leveraged into dataset attribute prototypes to identify dataset bias and help achieve cross-dataset generalization. Second, a customized location-oriented weakly supervised region proposal network is proposed to utilize high-level semantic layouts from the category-agnostic segment anything model to distinguish object boundaries. Lastly, we introduce a proposal-concept synchronized multiple-instance network, i.e., object mining and refinement with visual-semantic alignment, to discover objects matched to the text embeddings of concepts. Extensive experiments on Pascal VOC and MS COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art compared with previous WSOD methods in both close-set object localization and detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary learning to achieve on-par or even better performance than well-established fully-supervised open-vocabulary object detection (FSOVOD).
Keyword: transformer
Extracting Interpretable Local and Global Representations from Attention on Time Series
- Authors: Leonid Schwenke, Martin Atzmueller
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.11466
- Pdf link: https://arxiv.org/pdf/2312.11466
- Abstract This paper targets two transformer attention based interpretability methods working with local abstraction and global representation, in the context of time series data. We distinguish local and global contexts, and provide a comprehensive framework for both general interpretation options. We discuss their specific instantiation via different methods in detail, also outlining their respective computational implementation and abstraction variants. Furthermore, we provide extensive experimentation demonstrating the efficacy of the presented approaches. In particular, we perform our experiments using a selection of univariate datasets from the UCR UEA time series repository where we both assess the performance of the proposed approaches, as well as their impact on explainability and interpretability/complexity. Here, with an extensive analysis of hyperparameters, the presented approaches demonstrate an significant improvement in interpretability/complexity, while capturing many core decisions of and maintaining a similar performance to the baseline model. Finally, we draw general conclusions outlining and guiding the application of the presented methods.
Labrador: Exploring the Limits of Masked Language Modeling for Laboratory Data
- Authors: David R. Bellamy, Bhawesh Kumar, Cindy Wang, Andrew Beam
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.11502
- Pdf link: https://arxiv.org/pdf/2312.11502
- Abstract In this work we introduce Labrador, a pre-trained Transformer model for laboratory data. Labrador and BERT were pre-trained on a corpus of 100 million lab test results from electronic health records (EHRs) and evaluated on various downstream outcome prediction tasks. Both models demonstrate mastery of the pre-training task but neither consistently outperform XGBoost on downstream supervised tasks. Our ablation studies reveal that transfer learning shows limited effectiveness for BERT and achieves marginal success with Labrador. We explore the reasons for the failure of transfer learning and suggest that the data generating process underlying each patient cannot be characterized sufficiently using labs alone, among other factors. We encourage future work to focus on joint modeling of multiple EHR data categories and to include tree-based baselines in their evaluations.
Explain To Decide: A Human-Centric Review on the Role of Explainable Artificial Intelligence in AI-assisted Decision Making
- Authors: Milad Rogha
- Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.11507
- Pdf link: https://arxiv.org/pdf/2312.11507
- Abstract The unprecedented performance of machine learning models in recent years, particularly Deep Learning and transformer models, has resulted in their application in various domains such as finance, healthcare, and education. However, the models are error-prone and cannot be used autonomously, especially in decision-making scenarios where, technically or ethically, the cost of error is high. Moreover, because of the black-box nature of these models, it is frequently difficult for the end user to comprehend the models' outcomes and underlying processes to trust and use the model outcome to make a decision. Explainable Artificial Intelligence (XAI) aids end-user understanding of the model by utilizing approaches, including visualization techniques, to explain and interpret the inner workings of the model and how it arrives at a result. Although numerous research studies have been conducted recently focusing on the performance of models and the XAI approaches, less work has been done on the impact of explanations on human-AI team performance. This paper surveyed the recent empirical studies on XAI's impact on human-AI decision-making, identified the challenges, and proposed future research directions.
QuadAttack: A Quadratic Programming Approach to Ordered Top-K Attacks
- Authors: Thomas Paniagua, Ryan Grainger, Tianfu Wu
- Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.11510
- Pdf link: https://arxiv.org/pdf/2312.11510
- Abstract The adversarial vulnerability of Deep Neural Networks (DNNs) has been well-known and widely concerned, often under the context of learning top-$1$ attacks (e.g., fooling a DNN to classify a cat image as dog). This paper shows that the concern is much more serious by learning significantly more aggressive ordered top-$K$ clear-box~\footnote{ This is often referred to as white/black-box attacks in the literature. We choose to adopt neutral terminology, clear/opaque-box attacks in this paper, and omit the prefix clear-box for simplicity.} targeted attacks proposed in Adversarial Distillation. We propose a novel and rigorous quadratic programming (QP) method of learning ordered top-$K$ attacks with low computing cost, dubbed as \textbf{QuadAttac$K$}. Our QuadAttac$K$ directly solves the QP to satisfy the attack constraint in the feature embedding space (i.e., the input space to the final linear classifier), which thus exploits the semantics of the feature embedding space (i.e., the principle of class coherence). With the optimized feature embedding vector perturbation, it then computes the adversarial perturbation in the data space via the vanilla one-step back-propagation. In experiments, the proposed QuadAttac$K$ is tested in the ImageNet-1k classification using ResNet-50, DenseNet-121, and Vision Transformers (ViT-B and DEiT-S). It successfully pushes the boundary of successful ordered top-$K$ attacks from $K=10$ up to $K=20$ at a cheap budget ($1\times 60$) and further improves attack success rates for $K=5$ for all tested models, while retaining the performance for $K=1$.
Unlocking Musculoskeletal Disorder Risk Factors: NLP-Based Classification and Mode-Based Ranking
- Authors: Md Abrar Jahin, Subrata Talapatra
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.11517
- Pdf link: https://arxiv.org/pdf/2312.11517
- Abstract This research delves into the intricate landscape of Musculoskeletal Disorder (MSD) risk factors, employing a novel fusion of Natural Language Processing (NLP) techniques and mode-based ranking methodologies. The primary objective is to advance the comprehension of MSD risk factors, their classification, and their relative severity, facilitating more targeted preventive and management interventions. The study utilizes eight diverse models, integrating pre-trained transformers, cosine similarity, and various distance metrics to classify risk factors into personal, biomechanical, workplace, psychological, and organizational classes. Key findings reveal that the BERT model with cosine similarity attains an overall accuracy of 28%, while the sentence transformer, coupled with Euclidean, Bray-Curtis, and Minkowski distances, achieves a flawless accuracy score of 100%. In tandem with the classification efforts, the research employs a mode-based ranking approach on survey data to discern the severity hierarchy of MSD risk factors. Intriguingly, the rankings align precisely with the previous literature, reaffirming the consistency and reliability of the approach. ``Working posture" emerges as the most severe risk factor, emphasizing the critical role of proper posture in preventing MSDs. The collective perceptions of survey participants underscore the significance of factors like "Job insecurity," "Effort reward imbalance," and "Poor employee facility" in contributing to MSD risks. The convergence of rankings provides actionable insights for organizations aiming to reduce the prevalence of MSDs. The study concludes with implications for targeted interventions, recommendations for improving workplace conditions, and avenues for future research.
Large Language Models are Complex Table Parsers
- Authors: Bowen Zhao, Changkai Ji, Yuejie Zhang, Wen He, Yingwen Wang, Qing Wang, Rui Feng, Xiaobo Zhang
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2312.11521
- Pdf link: https://arxiv.org/pdf/2312.11521
- Abstract With the Generative Pre-trained Transformer 3.5 (GPT-3.5) exhibiting remarkable reasoning and comprehension abilities in Natural Language Processing (NLP), most Question Answering (QA) research has primarily centered around general QA tasks based on GPT, neglecting the specific challenges posed by Complex Table QA. In this paper, we propose to incorporate GPT-3.5 to address such challenges, in which complex tables are reconstructed into tuples and specific prompt designs are employed for dialogues. Specifically, we encode each cell's hierarchical structure, position information, and content as a tuple. By enhancing the prompt template with an explanatory description of the meaning of each tuple and the logical reasoning process of the task, we effectively improve the hierarchical structure awareness capability of GPT-3.5 to better parse the complex tables. Extensive experiments and results on Complex Table QA datasets, i.e., the open-domain dataset HiTAB and the aviation domain dataset AIT-QA show that our approach significantly outperforms previous work on both datasets, leading to state-of-the-art (SOTA) performance.
Time-Transformer: Integrating Local and Global Features for Better Time Series Generation
- Authors: Yuansan Liu, Sudanthi Wijewickrema, Ang Li, Christofer Bester, Stephen O'Leary, James Bailey
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2312.11714
- Pdf link: https://arxiv.org/pdf/2312.11714
- Abstract Generating time series data is a promising approach to address data deficiency problems. However, it is also challenging due to the complex temporal properties of time series data, including local correlations as well as global dependencies. Most existing generative models have failed to effectively learn both the local and global properties of time series data. To address this open problem, we propose a novel time series generative model named 'Time-Transformer AAE', which consists of an adversarial autoencoder (AAE) and a newly designed architecture named 'Time-Transformer' within the decoder. The Time-Transformer first simultaneously learns local and global features in a layer-wise parallel design, combining the abilities of Temporal Convolutional Networks and Transformer in extracting local features and global dependencies respectively. Second, a bidirectional cross attention is proposed to provide complementary guidance across the two branches and achieve proper fusion between local and global features. Experimental results demonstrate that our model can outperform existing state-of-the-art models in 5 out of 6 datasets, specifically on those with data containing both global and local properties. Furthermore, we highlight our model's advantage on handling this kind of data via an artificial dataset. Finally, we show our model's ability to address a real-world problem: data augmentation to support learning with small datasets and imbalanced datasets.
Assessing Logical Reasoning Capabilities of Encoder-Only Transformer Models
- Authors: Paulo Pirozelli, Marcos M. José, Paulo de Tarso P. Filho, Anarosa A. F. Brandão, Fabio G. Cozman
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2312.11720
- Pdf link: https://arxiv.org/pdf/2312.11720
- Abstract Logical reasoning is central to complex human activities, such as thinking, debating, and planning; it is also a central component of many AI systems as well. In this paper, we investigate the extent to which encoder-only transformer language models (LMs) can reason according to logical rules. We ask whether those LMs can deduce theorems in propositional calculus and first-order logic; if their relative success in these problems reflects general logical capabilities; and which layers contribute the most to the task. First, we show for several encoder-only LMs that they can be trained, to a reasonable degree, to determine logical validity on various datasets. Next, by cross-probing fine-tuned models on these datasets, we show that LMs have difficulty in transferring their putative logical reasoning ability, which suggests that they may have learned dataset-specific features, instead of a general capability. Finally, we conduct a layerwise probing experiment, which shows that the hypothesis classification task is mostly solved through higher layers.
Stronger Graph Transformer with Regularized Attention Scores
- Authors: Eugene Ku, Swetha Arunraj
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.11730
- Pdf link: https://arxiv.org/pdf/2312.11730
- Abstract Graph Neural Networks are notorious for its memory consumption. A recent Transformer based GNN called Graph Transformer are shown to obtain superior performances when long range dependencies exist. However, combining graph data and Transformer architecture led to a combinationally worse memory issue. We propose a novel version of "edge regularization technique" that alleviates the need for Positional Encoding and ultimately alleviate GT's out of memory issue. We observe that it is not clear whether having an edge regularization on top of positional encoding is helpful. However, it seems evident when no positional encoding is applied, edge regularization technique indeed stably improves GT's performance.
TPTO: A Transformer-PPO based Task Offloading Solution for Edge Computing Environments
- Authors: Niloofar Gholipour, Marcos Dias de Assuncao, Pranav Agarwal, julien gascon-samson, Rajkumar Buyya
- Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2312.11739
- Pdf link: https://arxiv.org/pdf/2312.11739
- Abstract Emerging applications in healthcare, autonomous vehicles, and wearable assistance require interactive and low-latency data analysis services. Unfortunately, cloud-centric architectures cannot fulfill the low-latency demands of these applications, as user devices are often distant from cloud data centers. Edge computing aims to reduce the latency by enabling processing tasks to be offloaded to resources located at the network's edge. However, determining which tasks must be offloaded to edge servers to reduce the latency of application requests is not trivial, especially if the tasks present dependencies. This paper proposes a DRL approach called TPTO, which leverages Transformer Networks and PPO to offload dependent tasks of IoT applications in edge computing. We consider users with various preferences, where devices can offload computation to an edge server via wireless channels. Performance evaluation results demonstrate that under fat application graphs, TPTO is more effective than state-of-the-art methods, such as Greedy, HEFT, and MRLCO, by reducing latency by 30.24%, 29.61%, and 12.41%, respectively. In addition, TPTO presents a training time approximately 2.5 times faster than an existing DRL approach.
A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models
- Authors: Harsh Sharma, Pratyush Dhingra, Janardhan Rao Doppa, Umit Ogras, Partha Pratim Pande
- Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2312.11750
- Pdf link: https://arxiv.org/pdf/2312.11750
- Abstract Transformers have revolutionized deep learning and generative modeling, enabling unprecedented advancements in natural language processing tasks. However, the size of transformer models is increasing continuously, driven by enhanced capabilities across various deep-learning tasks. This trend of ever-increasing model size has given rise to new challenges in terms of memory and computing requirements. Conventional computing platforms, including GPUs, suffer from suboptimal performance due to the memory demands imposed by models with millions/billions of parameters. The emerging chiplet-based platforms provide a new avenue for compute- and data-intensive machine learning (ML) applications enabled by a Network-on-Interposer (NoI). However, designing suitable hardware accelerators for executing Transformer inference workloads is challenging due to a wide variety of complex computing kernels in the Transformer architecture. In this paper, we leverage chiplet-based heterogeneous integration (HI) to design a high-performance and energy-efficient multi-chiplet platform to accelerate transformer workloads. We demonstrate that the proposed NoI architecture caters to the data access patterns inherent in a transformer model. The optimized placement of the chiplets and the associated NoI links and routers enable superior performance compared to the state-of-the-art hardware accelerators. The proposed NoI-based architecture demonstrates scalability across varying transformer models and improves latency and energy efficiency by up to 22.8x and 5.36x respectively.
Predicting Line-Level Defects by Capturing Code Contexts with Hierarchical Transformers
- Authors: Parvez Mahbub, Mohammad Masudur Rahman
- Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2312.11889
- Pdf link: https://arxiv.org/pdf/2312.11889
- Abstract Software defects consume 40% of the total budget in software development and cost the global economy billions of dollars every year. Unfortunately, despite the use of many software quality assurance (SQA) practices in software development (e.g., code review, continuous integration), defects may still exist in the official release of a software product. Therefore, prioritizing SQA efforts for the vulnerable areas of the codebase is essential to ensure the high quality of a software release. Predicting software defects at the line level could help prioritize the SQA effort but is a highly challenging task given that only ~3% of lines of a codebase could be defective. Existing works on line-level defect prediction often fall short and cannot fully leverage the line-level defect information. In this paper, we propose Bugsplorer, a novel deep-learning technique for line-level defect prediction. It leverages a hierarchical structure of transformer models to represent two types of code elements: code tokens and code lines. Unlike the existing techniques that are optimized for file-level defect prediction, Bugsplorer is optimized for a line-level defect prediction objective. Our evaluation with five performance metrics shows that Bugsplorer has a promising capability of predicting defective lines with 26-72% better accuracy than that of the state-of-the-art technique. It can rank the first 20% defective lines within the top 1-3% suspicious lines. Thus, Bugsplorer has the potential to significantly reduce SQA costs by ranking defective lines higher.
3D-LFM: Lifting Foundation Model
- Authors: Mosam Dabhi, Laszlo A. Jeni, Simon Lucey
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.11894
- Pdf link: https://arxiv.org/pdf/2312.11894
- Abstract The lifting of 3D structure and camera from 2D landmarks is at the cornerstone of the entire discipline of computer vision. Traditional methods have been confined to specific rigid objects, such as those in Perspective-n-Point (PnP) problems, but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g. C3PDO and PAUL) with resilience to noise, occlusions, and perspective distortions. All these techniques, however, have been limited by the fundamental need to establish correspondences across the 3D training data -- significantly limiting their utility to applications where one has an abundance of "in-correspondence" 3D data. Our approach harnesses the inherent permutation equivariance of transformers to manage varying number of points per 3D data instance, withstands occlusions, and generalizes to unseen categories. We demonstrate state of the art performance across 2D-3D lifting task benchmarks. Since our approach can be trained across such a broad class of structures we refer to it simply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind.
Text-Conditioned Resampler For Long Form Video Understanding
- Authors: Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.11897
- Pdf link: https://arxiv.org/pdf/2312.11897
- Abstract Videos are highly redundant data source and it is often enough to identify a few key moments to solve any given task. In this paper, we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to a LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time allowing the model to use much longer chunks of video than earlier works. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we empirically validate its efficacy on a wide variety of evaluation tasks, and set a new state-of-the-art on NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we determine tasks which require longer video contexts and that can thus be used effectively for further evaluation of long-range video models.
External Knowledge Augmented Polyphone Disambiguation Using Large Language Model
- Authors: Chen Li
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2312.11920
- Pdf link: https://arxiv.org/pdf/2312.11920
- Abstract One of the key issues in Mandarin Chinese text-to-speech (TTS) systems is polyphone disambiguation when doing grapheme-to-phoneme (G2P) conversion. In this paper, we introduce a novel method to solve the problem as a generation task. Following the trending research of large language models (LLM) and prompt learning, the proposed method consists of three modules. Retrieval module incorporates external knowledge which is a multi-level semantic dictionary of Chinese polyphonic characters to format the sentence into a prompt. Generation module adopts the decoder-only Transformer architecture to induce the target text. Postprocess module corrects the generated text into a valid result if needed. Experimental results show that our method outperforms the existing methods on a public dataset called CPP. We also empirically study the impacts of different templates of the prompt, different sizes of training data, and whether to incorporate external knowledge.
Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment
- Authors: Hamza Mukhtar, Muhammad Usman Ghani Khan
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.11929
- Pdf link: https://arxiv.org/pdf/2312.11929
- Abstract Multi-object tracking (MOT) has profound applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics. Despite considerable advancements, existing MOT methodologies tend to falter when faced with non-uniform movements, occlusions, and appearance-reappearance scenarios of the objects. Recognizing this inadequacy, we put forward an integrated MOT method that not only marries object detection and identity linkage within a singular, end-to-end trainable framework but also equips the model with the ability to maintain object identity links over long periods of time. Our proposed model, named STMMOT, is built around four key modules: 1) candidate proposal generation, which generates object proposals via a vision-transformer encoder-decoder architecture that detects the object from each frame in the video; 2) scale variant pyramid, a progressive pyramid structure to learn the self-scale and cross-scale similarities in multi-scale feature maps; 3) spatio-temporal memory encoder, extracting the essential information from the memory associated with each object under tracking; and 4) spatio-temporal memory decoder, simultaneously resolving the tasks of object detection and identity association for MOT. Our system leverages a robust spatio-temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator. The uniqueness of STMMOT lies in representing objects as dynamic query embeddings that are updated continuously, which enables the prediction of object states with attention mechanisms and eradicates the need for post-processing.
Context Disentangling and Prototype Inheriting for Robust Visual Grounding
- Authors: Wei Tang, Liang Li, Xuejing Liu, Lu Jin, Jinhui Tang, Zechao Li
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.11967
- Pdf link: https://arxiv.org/pdf/2312.11967
- Abstract Visual grounding (VG) aims to locate a specific target in an image based on a given language query. The discriminative information from context is important for distinguishing the target from other objects, particularly for the targets that have the same category as others. However, most previous methods underestimate such information. Moreover, they are usually designed for the standard scene (without any novel object), which limits their generalization to the open-vocabulary scene. In this paper, we propose a novel framework with context disentangling and prototype inheriting for robust visual grounding to handle both scenes. Specifically, the context disentangling disentangles the referent and context features, which achieves better discrimination between them. The prototype inheriting inherits the prototypes discovered from the disentangled visual features by a prototype bank to fully utilize the seen data, especially for the open-vocabulary scene. The fused features, obtained by leveraging Hadamard product on disentangled linguistic and visual features of prototypes to avoid sharp adjusting the importance between the two types of features, are then attached with a special token and feed to a vision Transformer encoder for bounding box regression. Extensive experiments are conducted on both standard and open-vocabulary scenes. The performance comparisons indicate that our method outperforms the state-of-the-art methods in both scenarios. {The code is available at https://github.com/WayneTomas/TransCP.
Designing and Evaluating General-Purpose User Representations Based on Behavioral Logs from a Measurement Process Perspective: A Case Study with Snapchat
- Authors: Qixiang Fang, Zhihan Zhou, Francesco Barbieri, Yozen Liu, Leonardo Neves, Dong Nguyen, Daniel L. Oberski, Maarten W. Bos, Ron Dotsch
- Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2312.12111
- Pdf link: https://arxiv.org/pdf/2312.12111
- Abstract In human-computer interaction, understanding user behaviors and tailoring systems accordingly is pivotal. To this end, general-purpose user representation learning based on behavior logs is emerging as a powerful tool in user modeling, offering adaptability to various downstream tasks such as item recommendations and ad conversion prediction, without the need to fine-tune the upstream user model. While this methodology has shown promise in contexts like search engines and e-commerce platforms, its fit for instant messaging apps, a cornerstone of modern digital communication, remains largely uncharted. These apps, with their distinct interaction patterns, data structures, and user expectations, necessitate specialized attention. We explore this user modeling approach with Snapchat data as a case study. Furthermore, we introduce a novel design and evaluation framework rooted in the principles of the Measurement Process Framework from social science research methodology. Using this new framework, we design a Transformer-based user model that can produce high-quality general-purpose user representations for instant messaging platforms like Snapchat.
Exploring the Residual Stream of Transformers
- Authors: Zeping Yu, Kailai Yang, Zhiwei Liu, Sophia Ananiadou
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.12141
- Pdf link: https://arxiv.org/pdf/2312.12141
- Abstract Transformer-based models have achieved great breakthroughs in recent years. However, there are many significant questions that have not been answered in the field of explaining the reason why the models have powerful outputs. We do not know how to locate the models' important parameters storing the knowledge for predicting the next word, and whether these parameters are stored on the same layer/module or different ones. Moreover, we do not understand the mechanism to merge the knowledge into the final embedding for next word prediction. In this paper, we explore the residual stream of transformers to increase the interpretability. We find the mechanism behind residual connection is a direct addition function on before-softmax values, so the probabilities of tokens with larger before-softmax values will increase. Moreover, we prove that using log probability increase as contribution scores is reasonable, and based on this we can locate important parameters. Besides, we propose a method to analyze how previous layers affect upper layers by comparing the inner products. The experimental results and case study show that our research can increase the interpretability of transformer-based models. We will release our code on https://github.com/zepingyu0512/residualstream.
Integrating Human Vision Perception in Vision Transformers for Classifying Waste Items
- Authors: Akshat Kishore Shrivastava, Tapan Kumar Gandhi
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2312.12143
- Pdf link: https://arxiv.org/pdf/2312.12143
- Abstract In this paper, we propose an novel methodology aimed at simulating the learning phenomenon of nystagmus through the application of differential blurring on datasets. Nystagmus is a biological phenomenon that influences human vision throughout life, notably by diminishing head shake from infancy to adulthood. Leveraging this concept, we address the issue of waste classification, a pressing global concern. The proposed framework comprises two modules, with the second module closely resembling the original Vision Transformer, a state-of-the-art model model in classification tasks. The primary motivation behind our approach is to enhance the model's precision and adaptability, mirroring the real-world conditions that the human visual system undergoes. This novel methodology surpasses the standard Vision Transformer model in waste classification tasks, exhibiting an improvement with a margin of 2%. This improvement underscores the potential of our methodology in improving model precision by drawing inspiration from human vision perception. Further research in the proposed methodology could yield greater performance results, and can be extrapolated to other global issues.
Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment
- Authors: Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, Fu Lee Wang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2312.12148
- Pdf link: https://arxiv.org/pdf/2312.12148
- Abstract With the continuous growth in the number of parameters of transformer-based pretrained language models (PLMs), particularly the emergence of large language models (LLMs) with billions of parameters, many natural language processing (NLP) tasks have demonstrated remarkable success. However, the enormous size and computational demands of these models pose significant challenges for adapting them to specific downstream tasks, especially in environments with limited computational resources. Parameter Efficient Fine-Tuning (PEFT) offers an effective solution by reducing the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning. The demands for fine-tuning PLMs, especially LLMs, have led to a surge in the development of PEFT methods, as depicted in Fig. 1. In this paper, we present a comprehensive and systematic review of PEFT methods for PLMs. We summarize these PEFT methods, discuss their applications, and outline future directions. Furthermore, we conduct experiments using several representative PEFT methods to better understand their effectiveness in parameter efficiency and memory efficiency. By offering insights into the latest advancements and practical applications, this survey serves as an invaluable resource for researchers and practitioners seeking to navigate the challenges and opportunities presented by PEFT in the context of PLMs.
Geo-located Aspect Based Sentiment Analysis (ABSA) for Crowdsourced Evaluation of Urban Environments
- Authors: Demircan Tas, Rohit Priyadarshi Sanatani
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2312.12253
- Pdf link: https://arxiv.org/pdf/2312.12253
- Abstract Sentiment analysis methods are rapidly being adopted by the field of Urban Design and Planning, for the crowdsourced evaluation of urban environments. However, most models used within this domain are able to identify positive or negative sentiment associated with a textual appraisal as a whole, without inferring information about specific urban aspects contained within it, or the sentiment associated with them. While Aspect Based Sentiment Analysis (ABSA) is becoming increasingly popular, most existing ABSA models are trained on non-urban themes such as restaurants, electronics, consumer goods and the like. This body of research develops an ABSA model capable of extracting urban aspects contained within geo-located textual urban appraisals, along with corresponding aspect sentiment classification. We annotate a dataset of 2500 crowdsourced reviews of public parks, and train a Bidirectional Encoder Representations from Transformers (BERT) model with Local Context Focus (LCF) on this data. Our model achieves significant improvement in prediction accuracy on urban reviews, for both Aspect Term Extraction (ATE) and Aspect Sentiment Classification (ASC) tasks. For demonstrative analysis, positive and negative urban aspects across Boston are spatially visualized. We hope that this model is useful for designers and planners for fine-grained urban sentiment evaluation.
First qualitative observations on deep learning vision model YOLO and DETR for automated driving in Austria
- Authors: Stefan Schoder
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.12314
- Pdf link: https://arxiv.org/pdf/2312.12314
- Abstract This study investigates the application of single and two-stage 2D-object detection algorithms like You Only Look Once (YOLO), Real-Time DEtection TRansformer (RT-DETR) algorithm for automated object detection to enhance road safety for autonomous driving on Austrian roads. The YOLO algorithm is a state-of-the-art real-time object detection system known for its efficiency and accuracy. In the context of driving, its potential to rapidly identify and track objects is crucial for advanced driver assistance systems (ADAS) and autonomous vehicles. The research focuses on the unique challenges posed by the road conditions and traffic scenarios in Austria. The country's diverse landscape, varying weather conditions, and specific traffic regulations necessitate a tailored approach for reliable object detection. The study utilizes a selective dataset comprising images and videos captured on Austrian roads, encompassing urban, rural, and alpine environments.
pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
- Authors: David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.12337
- Pdf link: https://arxiv.org/pdf/2312.12337
- Abstract We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D radiance fields parameterized by 3D Gaussian primitives from pairs of images. Our model features real-time and memory-efficient rendering for scalable training as well as fast 3D reconstruction at inference time. To overcome local minima inherent to sparse and locally supported representations, we predict a dense probability distribution over 3D and sample Gaussian means from that probability distribution. We make this sampling operation differentiable via a reparameterization trick, allowing us to back-propagate gradients through the Gaussian splatting representation. We benchmark our method on wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets, where we outperform state-of-the-art light field transformers and accelerate rendering by 2.5 orders of magnitude while reconstructing an interpretable and editable 3D radiance field.
Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks
- Authors: Amrit Nagarajan, Anand Raghunathan
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.12385
- Pdf link: https://arxiv.org/pdf/2312.12385
- Abstract Transformers have rapidly increased in popularity in recent years, achieving state-of-the-art performance in processing text, images, audio and video. However, Transformers present large computational requirements for both training and inference, and are prone to overfitting during training. To address these challenges, we present Input Compression with Positional Consistency (ICPC), a new data augmentation method that, unlike prior augmentation techniques, simultaneously improves both generalization and training efficiency. ICPC applies varying levels of compression to each training sample in each epoch. This leads to smaller input sequences being processed by the Transformer, and hence faster training, while also alleviating overfitting by presenting each input with different compression levels. We introduce a consistency-aware position selection method in ICPC that enables accurate processing of compressed inputs without any changes to the underlying Transformer architecture. We detail compression-based augmentation methods for four different modalities -- insignificant word pruning for text, resolution modulation for images, spatio-temporal resolution modulation for videos, and spectogram size modulation for audio. ICPC also enables efficient variable-effort inference, where samples are first inferred at high compression levels, and progressively re-evaluated with lower compression for more challenging inputs. On 9 diverse tasks spanning 4 different modalities, ICPC improves accuracy by up to 1%, while also accelerating training and inference by up to 2.9X and 2.6X, respectively. Code is available at https://github.com/amrnag/ICPC.
Keyword: scene understanding
There is no result
Keyword: visual reasoning
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
- Authors: Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hongsheng Li, Xing Sun
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
- Arxiv link: https://arxiv.org/abs/2312.12436
- Pdf link: https://arxiv.org/pdf/2312.12436
- Abstract The surge of interest towards Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both academia and industry. They endow Large Language Models (LLMs) with powerful capabilities in visual understanding, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM built from the ground up for multi-modality. In light of the superior reasoning capabilities, can Gemini challenge GPT-4V's leading position in multi-modal learning? In this paper, we present a preliminary exploration of Gemini Pro's visual understanding proficiency, which comprehensively covers four domains: fundamental perception, advanced cognition, challenging vision tasks, and various expert capacities. We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, along with the latest open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and black-box systems. The qualitative samples indicate that, while GPT-4V and Gemini showcase different answering styles and preferences, they can exhibit comparable visual reasoning capabilities, and Sphinx still trails behind them concerning domain generalizability. Specifically, GPT-4V tends to elaborate detailed explanations and intermediate steps, and Gemini prefers to output a direct and concise answer. The quantitative evaluation on the popular MME benchmark also demonstrates the potential of Gemini to be a strong challenger to GPT-4V. Our early investigation of Gemini also observes some common issues of MLLMs, indicating that there still remains a considerable distance towards artificial general intelligence. Our project for tracking the progress of MLLM is released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.