arxiv-daily
arxiv-daily copied to clipboard
New submissions for Wed, 7 Sep 22
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
TogetherNet: Bridging Image Restoration and Object Detection Together via Dynamic Enhancement Learning
- Authors: Yongzhen Wang, Xuefeng Yan, Kaiwen Zhang, Lina Gong, Haoran Xie, Fu Lee Wang, Mingqiang Wei
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.01373
- Pdf link: https://arxiv.org/pdf/2209.01373
- Abstract Adverse weather conditions such as haze, rain, and snow often impair the quality of captured images, causing detection networks trained on normal images to generalize poorly in these scenarios. In this paper, we raise an intriguing question - if the combination of image restoration and object detection, can boost the performance of cutting-edge detectors in adverse weather conditions. To answer it, we propose an effective yet unified detection paradigm that bridges these two subtasks together via dynamic enhancement learning to discern objects in adverse weather conditions, called TogetherNet. Different from existing efforts that intuitively apply image dehazing/deraining as a pre-processing step, TogetherNet considers a multi-task joint learning problem. Following the joint learning scheme, clean features produced by the restoration network can be shared to learn better object detection in the detection network, thus helping TogetherNet enhance the detection capacity in adverse weather conditions. Besides the joint learning architecture, we design a new Dynamic Transformer Feature Enhancement module to improve the feature extraction and representation capabilities of TogetherNet. Extensive experiments on both synthetic and real-world datasets demonstrate that our TogetherNet outperforms the state-of-the-art detection approaches by a large margin both quantitatively and qualitatively. Source code is available at https://github.com/yz-wang/TogetherNet.
Recurrent Bilinear Optimization for Binary Neural Networks
- Authors: Sheng Xu, Yanjing Li, Tiancheng Wang, Teli Ma, Baochang Zhang, Peng Gao, Yu Qiao, Jinhu Lv, Guodong Guo
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.01542
- Pdf link: https://arxiv.org/pdf/2209.01542
- Abstract Binary Neural Networks (BNNs) show great promise for real-world embedded devices. As one of the critical steps to achieve a powerful BNN, the scale factor calculation plays an essential role in reducing the performance gap to their real-valued counterparts. However, existing BNNs neglect the intrinsic bilinear relationship of real-valued weights and scale factors, resulting in a sub-optimal model caused by an insufficient training process. To address this issue, Recurrent Bilinear Optimization is proposed to improve the learning process of BNNs (RBONNs) by associating the intrinsic bilinear variables in the back propagation process. Our work is the first attempt to optimize BNNs from the bilinear perspective. Specifically, we employ a recurrent optimization and Density-ReLU to sequentially backtrack the sparse real-valued weight filters, which will be sufficiently trained and reach their performance limits based on a controllable learning process. We obtain robust RBONNs, which show impressive performance over state-of-the-art BNNs on various models and datasets. Particularly, on the task of object detection, RBONNs have great generalization performance. Our code is open-sourced on https://github.com/SteveTsui/RBONN .
Rice Leaf Disease Classification and Detection Using YOLOv5
- Authors: Md Ershadul Haque, Ashikur Rahman, Iftekhar Junaeid, Samiul Ul Hoque, Manoranjan Paul
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2209.01579
- Pdf link: https://arxiv.org/pdf/2209.01579
- Abstract A staple food in more than a hundred nations worldwide is rice (Oryza sativa). The cultivation of rice is vital to global economic growth. However, the main issue facing the agricultural industry is rice leaf disease. The quality and quantity of the crops have declined, and this is the main cause. As farmers in any country do not have much knowledge about rice leaf disease, they cannot diagnose rice leaf disease properly. That's why they cannot take proper care of rice leaves. As a result, the production is decreasing. From literature survey, it has seen that YOLOv5 exhibit the better result compare to others deep learning method. As a result of the continual advancement of object detection technology, YOLO family algorithms, which have extraordinarily high precision and better speed have been used in various scene recognition tasks to build rice leaf disease monitoring systems. We have annotate 1500 collected data sets and propose a rice leaf disease classification and detection method based on YOLOv5 deep learning. We then trained and evaluated the YOLOv5 model. The simulation outcomes show improved object detection result for the augmented YOLOv5 network proposed in this article. The required levels of recognition precision, recall, mAP value, and F1 score are 90%, 67%, 76%, and 81% respectively are considered as performance metrics.
Consistent Teacher Provides Better Supervision in Semi-supervised Object Detection
- Authors: Xinjiang Wang, Xingyi Yang, Shilong Zhang, Yijiang Li, Litong Feng, Shijie Fang, Chengqi Lyu, Kai Chen, Wayne Zhang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.01589
- Pdf link: https://arxiv.org/pdf/2209.01589
- Abstract In this study, we dive deep into the unique challenges in semi-supervised object detection~(SSOD). We observe that current detectors generally suffer from 3 inconsistency problems. 1) Assignment inconsistency, that the conventional assignment policy is sensitive to labeling noise. 2) Subtasks inconsistency, where the classification and regression predictions are misaligned at the same feature point. 3) Temporal inconsistency, that the pseudo bboxes vary dramatically at different training steps. These issues lead to inconsistent optimization objectives of the student network, thus deteriorating performance and slowing down the model convergence. We, therefore, propose a systematic solution, termed Consistent Teacher, to remedy the above-mentioned challenges. First, adaptive anchor assignment substitutes the static IoU-based strategy, which enables the student network to be resistant to noisy psudo bboxes; Then we calibrate the subtask predictions by designing a feature alignment module; Lastly, We adopt a Gaussian Mixture Model (GMM) to dynamically adjust the pseudo-boxes threshold. Consistent Teacher provides a new strong baseline on a large range of SSOD evaluations. It achieves 40.0 mAP with ResNet-50 backbone given only 10% of annotated MS-COCO data, which surpasses previous baselines using pseudo labels by around 4 mAP. When trained on fully annotated MS-COCO with additional unlabeled data, the performance further increases to 49.1 mAP. Our code will be open-sourced soon.
MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition
- Authors: Yunhao Wang, Huixin Sun, Xiaodi Wang, Bin Zhang, Chao Li, Ying Xin, Baochang Zhang, Errui Ding, Shumin Han
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.01620
- Pdf link: https://arxiv.org/pdf/2209.01620
- Abstract Vision Transformer and its variants have demonstrated great potential in various computer vision tasks. But conventional vision transformers often focus on global dependency at a coarse level, which suffer from a learning challenge on global relationships and fine-grained representation at a token level. In this paper, we introduce Multi-scale Attention Fusion into transformer (MAFormer), which explores local aggregation and global feature extraction in a dual-stream framework for visual recognition. We develop a simple but effective module to explore the full potential of transformers for visual representation by learning fine-grained and coarse-grained features at a token level and dynamically fusing them. Our Multi-scale Attention Fusion (MAF) block consists of: i) a local window attention branch that learns short-range interactions within windows, aggregating fine-grained local features; ii) global feature extraction through a novel Global Learning with Down-sampling (GLD) operation to efficiently capture long-range context information within the whole image; iii) a fusion module that self-explores the integration of both features via attention. Our MAFormer achieves state-of-the-art performance on common vision tasks. In particular, MAFormer-L achieves 85.9$%$ Top-1 accuracy on ImageNet, surpassing CSWin-B and LV-ViT-L by 1.7$%$ and 0.6$%$ respectively. On MSCOCO, MAFormer outperforms the prior art CSWin by 1.7$%$ mAPs on object detection and 1.4$%$ on instance segmentation with similar-sized parameters, demonstrating the potential to be a general backbone network.
SEFormer: Structure Embedding Transformer for 3D Object Detection
- Authors: Xiaoyu Feng, Heming Du, Yueqi Duan, Yongpan Liu, Hehe Fan
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.01745
- Pdf link: https://arxiv.org/pdf/2209.01745
- Abstract Effectively preserving and encoding structure features from objects in irregular and sparse LiDAR points is a key challenge to 3D object detection on point cloud. Recently, Transformer has demonstrated promising performance on many 2D and even 3D vision tasks. Compared with the fixed and rigid convolution kernels, the self-attention mechanism in Transformer can adaptively exclude the unrelated or noisy points and thus suitable for preserving the local spatial structure in irregular LiDAR point cloud. However, Transformer only performs a simple sum on the point features, based on the self-attention mechanism, and all the points share the same transformation for value. Such isotropic operation lacks the ability to capture the direction-distance-oriented local structure which is important for 3D object detection. In this work, we propose a Structure-Embedding transFormer (SEFormer), which can not only preserve local structure as traditional Transformer but also have the ability to encode the local structure. Compared to the self-attention mechanism in traditional Transformer, SEFormer learns different feature transformations for value points based on the relative directions and distances to the query point. Then we propose a SEFormer based network for high-performance 3D object detection. Extensive experiments show that the proposed architecture can achieve SOTA results on Waymo Open Dataset, the largest 3D detection benchmark for autonomous driving. Specifically, SEFormer achieves 79.02% mAP, which is 1.2% higher than existing works. We will release the codes.
Adversarial Detection: Attacking Object Detection in Real Time
- Authors: Han Wu, Syed Yunas, Sareh Rowlands, Wenjie Ruan, Johan Wahlstrom
- Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2209.01962
- Pdf link: https://arxiv.org/pdf/2209.01962
- Abstract Intelligent robots hinge on accurate object detection models to perceive the environment. Advances in deep learning security unveil that object detection models are vulnerable to adversarial attacks. However, prior research primarily focuses on attacking static images or offline videos. It is still unclear if such attacks could jeopardize real-world robotic applications in dynamic environments. There is still a gap between theoretical discoveries and real-world applications. We bridge the gap by proposing the first real-time online attack against object detection models. We devised three attacks that fabricate bounding boxes for nonexistent objects at desired locations.
Impact of Scaled Image on Robustness of Deep Neural Networks
- Authors: Chengyin Hu, Weiwen Shi
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.02132
- Pdf link: https://arxiv.org/pdf/2209.02132
- Abstract Deep neural networks (DNNs) have been widely used in computer vision tasks like image classification, object detection and segmentation. Whereas recent studies have shown their vulnerability to manual digital perturbations or distortion in the input images. The accuracy of the networks is remarkably influenced by the data distribution of their training dataset. Scaling the raw images creates out-of-distribution data, which makes it a possible adversarial attack to fool the networks. In this work, we propose a Scaling-distortion dataset ImageNet-CS by Scaling a subset of the ImageNet Challenge dataset by different multiples. The aim of our work is to study the impact of scaled images on the performance of advanced DNNs. We perform experiments on several state-of-the-art deep neural network architectures on the proposed ImageNet-CS, and the results show a significant positive correlation between scaling size and accuracy decline. Moreover, based on ResNet50 architecture, we demonstrate some tests on the performance of recent proposed robust training techniques and strategies like Augmix, Revisiting and Normalizer Free on our proposed ImageNet-CS. Experiment results have shown that these robust training techniques can improve networks' robustness to scaling transformation.
Task-wise Sampling Convolutions for Arbitrary-Oriented Object Detection in Aerial Images
- Authors: Zhanchao Huang, Wei Li, Xiang-Gen Xia, Hao Wang, Ran Tao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.02200
- Pdf link: https://arxiv.org/pdf/2209.02200
- Abstract Arbitrary-oriented object detection (AOOD) has been widely applied to locate and classify objects with diverse orientations in remote sensing images. However, the inconsistent features for the localization and classification tasks in AOOD models may lead to ambiguity and low-quality object predictions, which constrains the detection performance. In this paper, an AOOD method called task-wise sampling convolutions (TS-Conv) is proposed. TS-Conv adaptively samples task-wise features from respective sensitive regions and maps these features together in alignment to guide a dynamic label assignment for better predictions. Specifically, sampling positions of the localization convolution in TS-Conv is supervised by the oriented bounding box (OBB) prediction associated with spatial coordinates. While sampling positions and convolutional kernel of the classification convolution are designed to be adaptively adjusted according to different orientations for improving the orientation robustness of features. Furthermore, a dynamic task-aware label assignment (DTLA) strategy is developed to select optimal candidate positions and assign labels dynamicly according to ranked task-aware scores obtained from TS-Conv. Extensive experiments on several public datasets covering multiple scenes, multimodal images, and multiple categories of objects demonstrate the effectiveness, scalability and superior performance of the proposed TS-Conv.
PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection
- Authors: Han Wang, Jun Tang, Xiaodong Liu, Shanyan Guan, Rong Xie, Li Song
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2209.02242
- Pdf link: https://arxiv.org/pdf/2209.02242
- Abstract Recent years have witnessed a trend of applying context frames to boost the performance of object detection as video object detection. Existing methods usually aggregate features at one stroke to enhance the feature. These methods, however, usually lack spatial information from neighboring frames and suffer from insufficient feature aggregation. To address the issues, we perform a progressive way to introduce both temporal information and spatial information for an integrated enhancement. The temporal information is introduced by the temporal feature aggregation model (TFAM), by conducting an attention mechanism between the context frames and the target frame (i.e., the frame to be detected). Meanwhile, we employ a Spatial Transition Awareness Model (STAM) to convey the location transition information between each context frame and target frame. Built upon a transformer-based detector DETR, our PTSEFormer also follows an end-to-end fashion to avoid heavy post-processing procedures while achieving 88.1% mAP on the ImageNet VID dataset. Codes are available at https://github.com/Hon-Wong/PTSEFormer.
MACAB: Model-Agnostic Clean-Annotation Backdoor to Object Detection with Natural Trigger in Real-World
- Authors: Hua Ma, Yinshan Li, Yansong Gao, Zhi Zhang, Alsharif Abuadbba, Anmin Fu, Said F. Al-Sarawi, Nepal Surya, Derek Abbott
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2209.02339
- Pdf link: https://arxiv.org/pdf/2209.02339
- Abstract Object detection is the foundation of various critical computer-vision tasks such as segmentation, object tracking, and event detection. To train an object detector with satisfactory accuracy, a large amount of data is required. However, due to the intensive workforce involved with annotating large datasets, such a data curation task is often outsourced to a third party or relied on volunteers. This work reveals severe vulnerabilities of such data curation pipeline. We propose MACAB that crafts clean-annotated images to stealthily implant the backdoor into the object detectors trained on them even when the data curator can manually audit the images. We observe that the backdoor effect of both misclassification and the cloaking are robustly achieved in the wild when the backdoor is activated with inconspicuously natural physical triggers. Backdooring non-classification object detection with clean-annotation is challenging compared to backdooring existing image classification tasks with clean-label, owing to the complexity of having multiple objects within each frame, including victim and non-victim objects. The efficacy of the MACAB is ensured by constructively i abusing the image-scaling function used by the deep learning framework, ii incorporating the proposed adversarial clean image replica technique, and iii combining poison data selection criteria given constrained attacking budget. Extensive experiments demonstrate that MACAB exhibits more than 90% attack success rate under various real-world scenes. This includes both cloaking and misclassification backdoor effect even restricted with a small attack budget. The poisoned samples cannot be effectively identified by state-of-the-art detection techniques.The comprehensive video demo is at https://youtu.be/MA7L_LpXkp4, which is based on a poison rate of 0.14% for YOLOv4 cloaking backdoor and Faster R-CNN misclassification backdoor.
Self-Calibrating Anomaly and Change Detection for Autonomous Inspection Robots
- Authors: Sahar Salimpour, Jorge Peña Queralta, Tomi Westerlund
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.02379
- Pdf link: https://arxiv.org/pdf/2209.02379
- Abstract Automatic detection of visual anomalies and changes in the environment has been a topic of recurrent attention in the fields of machine learning and computer vision over the past decades. A visual anomaly or change detection algorithm identifies regions of an image that differ from a reference image or dataset. The majority of existing approaches focus on anomaly or fault detection in a specific class of images or environments, while general purpose visual anomaly detection algorithms are more scarce in the literature. In this paper, we propose a comprehensive deep learning framework for detecting anomalies and changes in a priori unknown environments after a reference dataset is gathered, and without need for retraining the model. We use the SuperPoint and SuperGlue feature extraction and matching methods to detect anomalies based on reference images taken from a similar location and with partial overlapping of the field of view. We also introduce a self-calibrating method for the proposed model in order to address the problem of sensitivity to feature matching thresholds and environmental conditions. To evaluate the proposed framework, we have used a ground robot system for the purpose of reference and query data collection. We show that high accuracy can be obtained using the proposed method. We also show that the calibration process enhances changes and foreign object detection performance
Identification of Small Objects in Satellite Image Benchmarks
- Authors: Debojyoti Biswas, Jelena Tešić
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
- Arxiv link: https://arxiv.org/abs/2209.02564
- Pdf link: https://arxiv.org/pdf/2209.02564
- Abstract Recent increases in aerial image access and volume, increases in computational power, and interest in applications have opened the door to scaling up object detection and domain adaptation research to production. Aerial data sets are very large in size, and each frame of the data set contains a huge number of dense and small objects. Deep learning applications for aerial imagery are behind due to a lack of training data, and researchers have recently turned to domain adaptation (DA) from a labeled data set to an unlabeled data set to alleviate the issue. These factors create two major challenges: the high variety between datasets (e.g. object sizes, class distributions, object feature uniformity, image acquisition, distance, weather conditions), and the size of objects in satellite imagery and subsequent failure of state-of-the-art to capture small objects, local features, and region proposals for densely overlapped objects in satellite image. In this paper, we propose two solutions to these problems: a domain discriminator to better align the local feature space between domains; and a novel pipeline that improves the back-end by spatial pyramid pooling, cross-stage partial network, region proposal network via heatmap-based region proposals, and object localization and identification through a novel image difficulty score that adapts the overall focal loss measure based on the image difficulty. Our proposed model outperformed the state-of-the-art method by 7.4%.
Automatic counting of mounds on UAV images: combining instance segmentation and patch-level correction
- Authors: Majid Nikougoftar Nategh, Ahmed Zgaren, Wassim Bouachir, Nizar Bouguila
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.02608
- Pdf link: https://arxiv.org/pdf/2209.02608
- Abstract Site preparation by mounding is a commonly used silvicultural treatment that improves tree growth conditions by mechanically creating planting microsites called mounds. Following site preparation, the next critical step is to count the number of mounds, which provides forest managers with a precise estimate of the number of seedlings required for a given plantation block. Counting the number of mounds is generally conducted through manual field surveys by forestry workers, which is costly and prone to errors, especially for large areas. To address this issue, we present a novel framework exploiting advances in Unmanned Aerial Vehicle (UAV) imaging and computer vision to accurately estimate the number of mounds on a planting block. The proposed framework comprises two main components. First, we exploit a visual recognition method based on a deep learning algorithm for multiple object detection by pixel-based segmentation. This enables a preliminary count of visible mounds, as well as other frequently seen objects (e.g. trees, debris, accumulation of water), to be used to characterize the planting block. Second, since visual recognition could limited by several perturbation factors (e.g. mound erosion, occlusion), we employ a machine learning estimation function that predicts the final number of mounds based on the local block properties extracted in the first stage. We evaluate the proposed framework on a new UAV dataset representing numerous planting blocks with varying features. The proposed method outperformed manual counting methods in terms of relative counting precision, indicating that it has the potential to be advantageous and efficient in difficult situations.
Keyword: transformer
vieCap4H-VLSP 2021: Vietnamese Image Captioning for Healthcare Domain using Swin Transformer and Attention-based LSTM
- Authors: Thanh Tin Nguyen, Long H. Nguyen, Nhat Truong Pham, Liu Tai Nguyen, Van Huong Do, Hai Nguyen, Ngoc Duy Nguyen
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.01304
- Pdf link: https://arxiv.org/pdf/2209.01304
- Abstract This study presents our approach on the automatic Vietnamese image captioning for healthcare domain in text processing tasks of Vietnamese Language and Speech Processing (VLSP) Challenge 2021, as shown in Figure 1. In recent years, image captioning often employs a convolutional neural network-based architecture as an encoder and a long short-term memory (LSTM) as a decoder to generate sentences. These models perform remarkably well in different datasets. Our proposed model also has an encoder and a decoder, but we instead use a Swin Transformer in the encoder, and a LSTM combined with an attention module in the decoder. The study presents our training experiments and techniques used during the competition. Our model achieves a BLEU4 score of 0.293 on the vietCap4H dataset, and the score is ranked the 3$^{rd}$ place on the private leaderboard. Our code can be found at \url{https://git.io/JDdJm}.
TransPolymer: a Transformer-based Language Model for Polymer Property Predictions
- Authors: Changwen Xu, Yuyang Wang, Amir Barati Farimani
- Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
- Arxiv link: https://arxiv.org/abs/2209.01307
- Pdf link: https://arxiv.org/pdf/2209.01307
- Abstract Accurate and efficient prediction of polymer properties is of great significance in polymer development and design. Conventionally, expensive and time-consuming experiments or simulations are required to evaluate the function of polymers. Recently, Transformer models, equipped with attention mechanisms, have exhibited superior performance in various natural language processing tasks. However, such methods have not been investigated in polymer sciences. Herein, we report TransPolymer, a Transformer-based language model for polymer property prediction. Owing to our proposed polymer tokenizer with chemical awareness, TransPolymer can learn representations directly from polymer sequences. The model learns expressive representations by pretraining on a large unlabeled dataset, followed by finetuning the model on downstream datasets concerning various polymer properties. TransPolymer achieves superior performance in all eight datasets and surpasses other baselines significantly on most downstream tasks. Moreover, the improvement by the pretrained TransPolymer over supervised TransPolymer and other language models strengthens the significant benefits of pretraining on large unlabeled data in representation learning. Experiment results further demonstrate the important role of the attention mechanism in understanding polymer sequences. We highlight this model as a promising computational tool for promoting rational polymer design and understanding structure-property relationships in a data science view.
TogetherNet: Bridging Image Restoration and Object Detection Together via Dynamic Enhancement Learning
- Authors: Yongzhen Wang, Xuefeng Yan, Kaiwen Zhang, Lina Gong, Haoran Xie, Fu Lee Wang, Mingqiang Wei
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.01373
- Pdf link: https://arxiv.org/pdf/2209.01373
- Abstract Adverse weather conditions such as haze, rain, and snow often impair the quality of captured images, causing detection networks trained on normal images to generalize poorly in these scenarios. In this paper, we raise an intriguing question - if the combination of image restoration and object detection, can boost the performance of cutting-edge detectors in adverse weather conditions. To answer it, we propose an effective yet unified detection paradigm that bridges these two subtasks together via dynamic enhancement learning to discern objects in adverse weather conditions, called TogetherNet. Different from existing efforts that intuitively apply image dehazing/deraining as a pre-processing step, TogetherNet considers a multi-task joint learning problem. Following the joint learning scheme, clean features produced by the restoration network can be shared to learn better object detection in the detection network, thus helping TogetherNet enhance the detection capacity in adverse weather conditions. Besides the joint learning architecture, we design a new Dynamic Transformer Feature Enhancement module to improve the feature extraction and representation capabilities of TogetherNet. Extensive experiments on both synthetic and real-world datasets demonstrate that our TogetherNet outperforms the state-of-the-art detection approaches by a large margin both quantitatively and qualitatively. Source code is available at https://github.com/yz-wang/TogetherNet.
Vision Transformers and YoloV5 based Driver Drowsiness Detection Framework
- Authors: Ghanta Sai Krishna, Kundrapu Supriya, Jai Vardhan, Mallikharjuna Rao K
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.01401
- Pdf link: https://arxiv.org/pdf/2209.01401
- Abstract Human drivers have distinct driving techniques, knowledge, and sentiments due to unique driving traits. Driver drowsiness has been a serious issue endangering road safety; therefore, it is essential to design an effective drowsiness detection algorithm to bypass road accidents. Miscellaneous research efforts have been approached the problem of detecting anomalous human driver behaviour to examine the frontal face of the driver and automobile dynamics via computer vision techniques. Still, the conventional methods cannot capture complicated driver behaviour features. However, with the origin of deep learning architectures, a substantial amount of research has also been executed to analyze and recognize driver's drowsiness using neural network algorithms. This paper introduces a novel framework based on vision transformers and YoloV5 architectures for driver drowsiness recognition. A custom YoloV5 pre-trained architecture is proposed for face extraction with the aim of extracting Region of Interest (ROI). Owing to the limitations of previous architectures, this paper introduces vision transformers for binary image classification which is trained and validated on a public dataset UTA-RLDD. The model had achieved 96.2% and 97.4% as it's training and validation accuracies respectively. For the further evaluation, proposed framework is tested on a custom dataset of 39 participants in various light circumstances and achieved 95.5% accuracy. The conducted experimentations revealed the significant potential of our framework for practical applications in smart transportation systems.
Quantitative Stopword Generation for Sentiment Analysis via Recursive and Iterative Deletion
- Authors: Daniel M. DiPietro
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.01519
- Pdf link: https://arxiv.org/pdf/2209.01519
- Abstract Stopwords carry little semantic information and are often removed from text data to reduce dataset size and improve machine learning model performance. Consequently, researchers have sought to develop techniques for generating effective stopword sets. Previous approaches have ranged from qualitative techniques relying upon linguistic experts, to statistical approaches that extract word importance using correlations or frequency-dependent metrics computed on a corpus. We present a novel quantitative approach that employs iterative and recursive feature deletion algorithms to see which words can be deleted from a pre-trained transformer's vocabulary with the least degradation to its performance, specifically for the task of sentiment analysis. Empirically, stopword lists generated via this approach drastically reduce dataset size while negligibly impacting model performance, in one such example shrinking the corpus by 28.4% while improving the accuracy of a trained logistic regression model by 0.25%. In another instance, the corpus was shrunk by 63.7% with a 2.8% decrease in accuracy. These promising results indicate that our approach can generate highly effective stopword sets for specific NLP tasks.
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
- Authors: Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.01540
- Pdf link: https://arxiv.org/pdf/2209.01540
- Abstract Masked visual modeling (MVM) has been recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, the pre-extracted video features in previous studies cannot be refined through MVM during pre-training, and thus leading to unsatisfactory downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), which mitigates the disconnection between fixed video representations and MVM training. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens and latent visual features. We conduct comprehensive experiments and provide insights on the factors leading to effective MVM training. Empirically, we show VIOLET pre-trained with MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering, video captioning, to text-to-video retrieval.
Hierarchical Transformer with Spatio-Temporal Context Aggregation for Next Point-of-Interest Recommendation
- Authors: Jiayi Xie, Zhenzhong Chen
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2209.01559
- Pdf link: https://arxiv.org/pdf/2209.01559
- Abstract Next point-of-interest (POI) recommendation is a critical task in location-based social networks, yet remains challenging due to a high degree of variation and personalization exhibited in user movements. In this work, we explore the latent hierarchical structure composed of multi-granularity short-term structural patterns in user check-in sequences. We propose a Spatio-Temporal context AggRegated Hierarchical Transformer (STAR-HiT) for next POI recommendation, which employs stacked hierarchical encoders to recursively encode the spatio-temporal context and explicitly locate subsequences of different granularities. More specifically, in each encoder, the global attention layer captures the spatio-temporal context of the sequence, while the local attention layer performed within each subsequence enhances subsequence modeling using the local context. The sequence partition layer infers positions and lengths of subsequences from the global context adaptively, such that semantics in subsequences can be well preserved. Finally, the subsequence aggregation layer fuses representations within each subsequence to form the corresponding subsequence representation, thereby generating a new sequence of higher-level granularity. The stacking of encoders captures the latent hierarchical structure of the check-in sequence, which is used to predict the next visiting POI. Extensive experiments on three public datasets demonstrate that the proposed model achieves superior performance whilst providing explanations for recommendations. Codes are available at https://github.com/JennyXieJiayi/STAR-HiT.
Spatial-Temporal Transformer for Video Snapshot Compressive Imaging
- Authors: Lishun Wang, Miao Cao, Yong Zhong, Xin Yuan
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2209.01578
- Pdf link: https://arxiv.org/pdf/2209.01578
- Abstract Video snapshot compressive imaging (SCI) captures multiple sequential video frames by a single measurement using the idea of computational imaging. The underlying principle is to modulate high-speed frames through different masks and these modulated frames are summed to a single measurement captured by a low-speed 2D sensor (dubbed optical encoder); following this, algorithms are employed to reconstruct the desired high-speed frames (dubbed software decoder) if needed. In this paper, we consider the reconstruction algorithm in video SCI, i.e., recovering a series of video frames from a compressed measurement. Specifically, we propose a Spatial-Temporal transFormer (STFormer) to exploit the correlation in both spatial and temporal domains. STFormer network is composed of a token generation block, a video reconstruction block, and these two blocks are connected by a series of STFormer blocks. Each STFormer block consists of a spatial self-attention branch, a temporal self-attention branch and the outputs of these two branches are integrated by a fusion network. Extensive results on both simulated and real data demonstrate the state-of-the-art performance of STFormer. The code and models are publicly available at https://github.com/ucaswangls/STFormer.git
MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition
- Authors: Yunhao Wang, Huixin Sun, Xiaodi Wang, Bin Zhang, Chao Li, Ying Xin, Baochang Zhang, Errui Ding, Shumin Han
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.01620
- Pdf link: https://arxiv.org/pdf/2209.01620
- Abstract Vision Transformer and its variants have demonstrated great potential in various computer vision tasks. But conventional vision transformers often focus on global dependency at a coarse level, which suffer from a learning challenge on global relationships and fine-grained representation at a token level. In this paper, we introduce Multi-scale Attention Fusion into transformer (MAFormer), which explores local aggregation and global feature extraction in a dual-stream framework for visual recognition. We develop a simple but effective module to explore the full potential of transformers for visual representation by learning fine-grained and coarse-grained features at a token level and dynamically fusing them. Our Multi-scale Attention Fusion (MAF) block consists of: i) a local window attention branch that learns short-range interactions within windows, aggregating fine-grained local features; ii) global feature extraction through a novel Global Learning with Down-sampling (GLD) operation to efficiently capture long-range context information within the whole image; iii) a fusion module that self-explores the integration of both features via attention. Our MAFormer achieves state-of-the-art performance on common vision tasks. In particular, MAFormer-L achieves 85.9$%$ Top-1 accuracy on ImageNet, surpassing CSWin-B and LV-ViT-L by 1.7$%$ and 0.6$%$ respectively. On MSCOCO, MAFormer outperforms the prior art CSWin by 1.7$%$ mAPs on object detection and 1.4$%$ on instance segmentation with similar-sized parameters, demonstrating the potential to be a general backbone network.
A Review of Sparse Expert Models in Deep Learning
- Authors: William Fedus, Jeff Dean, Barret Zoph
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.01667
- Pdf link: https://arxiv.org/pdf/2209.01667
- Abstract Sparse expert models are a thirty-year old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. By doing so, the degree of sparsity decouples the parameter count from the compute per example allowing for extremely large, but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.
Features Fusion Framework for Multimodal Irregular Time-series Events
- Authors: Peiwang Tang, Xianchao Zhang
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2209.01728
- Pdf link: https://arxiv.org/pdf/2209.01728
- Abstract Some data from multiple sources can be modeled as multimodal time-series events which have different sampling frequencies, data compositions, temporal relations and characteristics. Different types of events have complex nonlinear relationships, and the time of each event is irregular. Neither the classical Recurrent Neural Network (RNN) model nor the current state-of-the-art Transformer model can deal with these features well. In this paper, a features fusion framework for multimodal irregular time-series events is proposed based on the Long Short-Term Memory networks (LSTM). Firstly, the complex features are extracted according to the irregular patterns of different events. Secondly, the nonlinear correlation and complex temporal dependencies relationship between complex features are captured and fused into a tensor. Finally, a feature gate are used to control the access frequency of different tensors. Extensive experiments on MIMIC-III dataset demonstrate that the proposed framework significantly outperforms to the existing methods in terms of AUC (the area under Receiver Operating Characteristic curve) and AP (Average Precision).
SEFormer: Structure Embedding Transformer for 3D Object Detection
- Authors: Xiaoyu Feng, Heming Du, Yueqi Duan, Yongpan Liu, Hehe Fan
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.01745
- Pdf link: https://arxiv.org/pdf/2209.01745
- Abstract Effectively preserving and encoding structure features from objects in irregular and sparse LiDAR points is a key challenge to 3D object detection on point cloud. Recently, Transformer has demonstrated promising performance on many 2D and even 3D vision tasks. Compared with the fixed and rigid convolution kernels, the self-attention mechanism in Transformer can adaptively exclude the unrelated or noisy points and thus suitable for preserving the local spatial structure in irregular LiDAR point cloud. However, Transformer only performs a simple sum on the point features, based on the self-attention mechanism, and all the points share the same transformation for value. Such isotropic operation lacks the ability to capture the direction-distance-oriented local structure which is important for 3D object detection. In this work, we propose a Structure-Embedding transFormer (SEFormer), which can not only preserve local structure as traditional Transformer but also have the ability to encode the local structure. Compared to the self-attention mechanism in traditional Transformer, SEFormer learns different feature transformations for value points based on the relative directions and distances to the query point. Then we propose a SEFormer based network for high-performance 3D object detection. Extensive experiments show that the proposed architecture can achieve SOTA results on Waymo Open Dataset, the largest 3D detection benchmark for autonomous driving. Specifically, SEFormer achieves 79.02% mAP, which is 1.2% higher than existing works. We will release the codes.
Continuous Decomposition of Granularity for Neural Paraphrase Generation
- Authors: Xiaodong Gu, Zhaowei Zhang, Sang-Woo Lee, Kang Min Yoo, Jung-Woo Ha
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.01765
- Pdf link: https://arxiv.org/pdf/2209.01765
- Abstract While Transformers have had significant success in paragraph generation, they treat sentences as linear sequences of tokens and often neglect their hierarchical information. Prior work has shown that decomposing the levels of granularity~(e.g., word, phrase, or sentence) for input tokens has produced substantial improvements, suggesting the possibility of enhancing Transformers via more fine-grained modeling of granularity. In this work, we propose a continuous decomposition of granularity for neural paraphrase generation (C-DNPG). In order to efficiently incorporate granularity into sentence encoding, C-DNPG introduces a granularity-aware attention (GA-Attention) mechanism which extends the multi-head self-attention with: 1) a granularity head that automatically infers the hierarchical structure of a sentence by neurally estimating the granularity level of each input token; and 2) two novel attention masks, namely, granularity resonance and granularity scope, to efficiently encode granularity into attention. Experiments on two benchmarks, including Quora question pairs and Twitter URLs have shown that C-DNPG outperforms baseline models by a remarkable margin and achieves state-of-the-art results in terms of many metrics. Qualitative analysis reveals that C-DNPG indeed captures fine-grained levels of granularity with effectiveness.
LKD-Net: Large Kernel Convolution Network for Single Image Dehazing
- Authors: Pinjun Luo, Guoqiang Xiao, Xinbo Gao, Song Wu
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.01788
- Pdf link: https://arxiv.org/pdf/2209.01788
- Abstract The deep convolutional neural networks (CNNs)-based single image dehazing methods have achieved significant success. The previous methods are devoted to improving the network's performance by increasing the network's depth and width. The current methods focus on increasing the convolutional kernel size to enhance its performance by benefiting from the larger receptive field. However, directly increasing the size of the convolutional kernel introduces a massive amount of computational overhead and parameters. Thus, a novel Large Kernel Convolution Dehaze Block (LKD Block) consisting of the Decomposition deep-wise Large Kernel Convolution Block (DLKCB) and the Channel Enhanced Feed-forward Network (CEFN) is devised in this paper. The designed DLKCB can split the deep-wise large kernel convolution into a smaller depth-wise convolution and a depth-wise dilated convolution without introducing massive parameters and computational overhead. Meanwhile, the designed CEFN incorporates a channel attention mechanism into Feed-forward Network to exploit significant channels and enhance robustness. By combining multiple LKD Blocks and Up-Down sampling modules, the Large Kernel Convolution Dehaze Network (LKD-Net) is conducted. The evaluation results demonstrate the effectiveness of the designed DLKCB and CEFN, and our LKD-Net outperforms the state-of-the-art. On the SOTS indoor dataset, our LKD-Net dramatically outperforms the Transformer-based method Dehamer with only 1.79% #Param and 48.9% FLOPs. The source code of our LKD-Net is available at https://github.com/SWU-CS-MediaLab/LKD-Net.
ADTR: Anomaly Detection Transformer with Feature Reconstruction
- Authors: Zhiyuan You, Kai Yang, Wenhan Luo, Lei Cui, Yu Zheng, Xinyi Le
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.01816
- Pdf link: https://arxiv.org/pdf/2209.01816
- Abstract Anomaly detection with only prior knowledge from normal samples attracts more attention because of the lack of anomaly samples. Existing CNN-based pixel reconstruction approaches suffer from two concerns. First, the reconstruction source and target are raw pixel values that contain indistinguishable semantic information. Second, CNN tends to reconstruct both normal samples and anomalies well, making them still hard to distinguish. In this paper, we propose Anomaly Detection TRansformer (ADTR) to apply a transformer to reconstruct pre-trained features. The pre-trained features contain distinguishable semantic information. Also, the adoption of transformer limits to reconstruct anomalies well such that anomalies could be detected easily once the reconstruction fails. Moreover, we propose novel loss functions to make our approach compatible with the normal-sample-only case and the anomaly-available case with both image-level and pixel-level labeled anomalies. The performance could be further improved by adding simple synthetic or external irrelevant anomalies. Extensive experiments are conducted on anomaly detection datasets including MVTec-AD and CIFAR-10. Our method achieves superior performance compared with all baselines.
Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples
- Authors: Hezekiah J. Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, Ramesh Darwishi
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.02128
- Pdf link: https://arxiv.org/pdf/2209.02128
- Abstract Recent advances in the development of large language models have resulted in public access to state-of-the-art pre-trained language models (PLMs), including Generative Pre-trained Transformer 3 (GPT-3) and Bidirectional Encoder Representations from Transformers (BERT). However, evaluations of PLMs, in practice, have shown their susceptibility to adversarial attacks during the training and fine-tuning stages of development. Such attacks can result in erroneous outputs, model-generated hate speech, and the exposure of users' sensitive information. While existing research has focused on adversarial attacks during either the training or the fine-tuning of PLMs, there is a deficit of information on attacks made between these two development phases. In this work, we highlight a major security vulnerability in the public release of GPT-3 and further investigate this vulnerability in other state-of-the-art PLMs. We restrict our work to pre-trained models that have not undergone fine-tuning. Further, we underscore token distance-minimized perturbations as an effective adversarial approach, bypassing both supervised and unsupervised quality measures. Following this approach, we observe a significant decrease in text classification quality when evaluating for semantic similarity.
CNSNet: A Cleanness-Navigated-Shadow Network for Shadow Removal
- Authors: Qianhao Yu, Naishan Zheng, Jie Huang, Feng Zhao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.02174
- Pdf link: https://arxiv.org/pdf/2209.02174
- Abstract The key to shadow removal is recovering the contents of the shadow regions with the guidance of the non-shadow regions. Due to the inadequate long-range modeling, the CNN-based approaches cannot thoroughly investigate the information from the non-shadow regions. To solve this problem, we propose a novel cleanness-navigated-shadow network (CNSNet), with a shadow-oriented adaptive normalization (SOAN) module and a shadow-aware aggregation with transformer (SAAT) module based on the shadow mask. Under the guidance of the shadow mask, the SOAN module formulates the statistics from the non-shadow region and adaptively applies them to the shadow region for region-wise restoration. The SAAT module utilizes the shadow mask to precisely guide the restoration of each shadowed pixel by considering the highly relevant pixels from the shadow-free regions for global pixel-wise restoration. Extensive experiments on three benchmark datasets (ISTD, ISTD+, and SRD) show that our method achieves superior de-shadowing performance.
Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students
- Authors: Xu Zheng, Yunhao Luo, Hao Wang, Chong Fu, Lin Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.02178
- Pdf link: https://arxiv.org/pdf/2209.02178
- Abstract The popular methods for semi-supervised semantic segmentation mostly adopt a unitary network model using convolutional neural networks (CNNs) and enforce consistency of the model predictions over small perturbations applied to the inputs or model. However, such a learning paradigm suffers from a) limited learning capability of the CNN-based model; b) limited capacity of learning the discriminative features for the unlabeled data; c) limited learning for both global and local information from the whole image. In this paper, we propose a novel Semi-supervised Learning approach, called Transformer-CNN Cohort (TCC), that consists of two students with one based on the vision transformer (ViT) and the other based on the CNN. Our method subtly incorporates the multi-level consistency regularization on the predictions and the heterogeneous feature spaces via pseudo labeling for the unlabeled data. First, as the inputs of the ViT student are image patches, the feature maps extracted encode crucial class-wise statistics. To this end, we propose class-aware feature consistency distillation (CFCD) that first leverages the outputs of each student as the pseudo labels and generates class-aware feature (CF) maps. It then transfers knowledge via the CF maps between the students. Second, as the ViT student has more uniform representations for all layers, we propose consistency-aware cross distillation to transfer knowledge between the pixel-wise predictions from the cohort. We validate the TCC framework on Cityscapes and Pascal VOC 2012 datasets, which significantly outperforms existing semi-supervised methods by a large margin.
LRT: An Efficient Low-Light Restoration Transformer for Dark Light Field Images
- Authors: Shansi Zhang, Nan Meng, Edmund Y. Lam
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2209.02197
- Pdf link: https://arxiv.org/pdf/2209.02197
- Abstract Light field (LF) images with the multi-view property have many applications, which can be severely affected by the low-light imaging. Recent learning-based methods for low-light enhancement have their own disadvantages, such as no noise suppression, complex training process and poor performance in extremely low-light conditions. Targeted on solving these deficiencies while fully utilizing the multi-view information, we propose an efficient Low-light Restoration Transformer (LRT) for LF images, with multiple heads to perform specific intermediate tasks, including denoising, luminance adjustment, refinement and detail enhancement, within a single network, achieving progressive restoration from small scale to full scale. We design an angular transformer block with a view-token scheme to model the global angular relationship efficiently, and a multi-scale window-based transformer block to encode the multi-scale local and global spatial information. To solve the problem of insufficient training data, we formulate a synthesis pipeline by simulating the major noise with the estimated noise parameters of LF camera. Experimental results demonstrate that our method can achieve superior performance on the restoration of extremely low-light and noisy LF images with high efficiency.
Reference Resolution and Context Change in Multimodal Situated Dialogue for Exploring Data Visualizations
- Authors: Abhinav Kumar, Barbara Di Eugenio, Abari Bhattacharya, Jillian Aurisano, Andrew Johnson
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.02215
- Pdf link: https://arxiv.org/pdf/2209.02215
- Abstract Reference resolution, which aims to identify entities being referred to by a speaker, is more complex in real world settings: new referents may be created by processes the agents engage in and/or be salient only because they belong to the shared physical setting. Our focus is on resolving references to visualizations on a large screen display in multimodal dialogue; crucially, reference resolution is directly involved in the process of creating new visualizations. We describe our annotations for user references to visualizations appearing on a large screen via language and hand gesture and also new entity establishment, which results from executing the user request to create a new visualization. We also describe our reference resolution pipeline which relies on an information-state architecture to maintain dialogue context. We report results on detecting and resolving references, effectiveness of contextual information on the model, and under-specified requests for creating visualizations. We also experiment with conventional CRF and deep learning / transformer models (BiLSTM-CRF and BERT-CRF) for tagging references in user utterance text. Our results show that transfer learning significantly boost performance of the deep learning methods, although CRF still out-performs them, suggesting that conventional methods may generalize better for low resource data.
PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection
- Authors: Han Wang, Jun Tang, Xiaodong Liu, Shanyan Guan, Rong Xie, Li Song
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2209.02242
- Pdf link: https://arxiv.org/pdf/2209.02242
- Abstract Recent years have witnessed a trend of applying context frames to boost the performance of object detection as video object detection. Existing methods usually aggregate features at one stroke to enhance the feature. These methods, however, usually lack spatial information from neighboring frames and suffer from insufficient feature aggregation. To address the issues, we perform a progressive way to introduce both temporal information and spatial information for an integrated enhancement. The temporal information is introduced by the temporal feature aggregation model (TFAM), by conducting an attention mechanism between the context frames and the target frame (i.e., the frame to be detected). Meanwhile, we employ a Spatial Transition Awareness Model (STAM) to convey the location transition information between each context frame and target frame. Built upon a transformer-based detector DETR, our PTSEFormer also follows an end-to-end fashion to avoid heavy post-processing procedures while achieving 88.1% mAP on the ImageNet VID dataset. Codes are available at https://github.com/Hon-Wong/PTSEFormer.
EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models
- Authors: Jiangsu Du, Ziming Liu, Jiarui Fang, Shenggui Li, Yongbin Li, Yutong Lu, Yang You
- Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2209.02341
- Pdf link: https://arxiv.org/pdf/2209.02341
- Abstract Large transformer models display promising performance on a wide range of natural language processing (NLP) tasks. Although the AI community has expanded the model scale to the trillion parameter level, the practical deployment of 10-100 billion parameter models is still uncertain due to the latency, throughput, and memory constraints. In this paper, we proposed EnergonAI to solve the challenges of the efficient deployment of 10-100 billion parameter transformer models on single- or multi-GPU systems. EnergonAI adopts a hierarchy-controller system architecture to coordinate multiple devices and efficiently support different parallel patterns. It delegates the execution of sub-models to multiple workers in the single-controller style and applies tensor parallelism and pipeline parallelism among the workers in a multi-controller style. Upon the novel architecture, we propose three techniques, i.e. non-blocking pipeline parallelism, distributed redundant computation elimination, and peer memory pooling. EnergonAI enables the users to program complex parallel code the same as a serial one. Compared with the FasterTransformer, we have proven that EnergonAI has superior performance on latency and throughput. In our experiments, EnergonAI can achieve 37% latency reduction in tensor parallelism, 10% scalability improvement in pipeline parallelism, and it improves the model scale inferred on a single GPU by using a larger heterogeneous memory space at cost of limited performance reduction.
SkeletonMAE: Spatial-Temporal Masked Autoencoders for Self-supervised Skeleton Action Recognition
- Authors: Wenhan Wu, Yilei Hua, Ce zheng, Shiqian Wu, Chen Chen, Aidong Lu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.02399
- Pdf link: https://arxiv.org/pdf/2209.02399
- Abstract Fully supervised skeleton-based action recognition has achieved great progress with the blooming of deep learning techniques. However, these methods require sufficient labeled data which is not easy to obtain. In contrast, self-supervised skeleton-based action recognition has attracted more attention. With utilizing the unlabeled data, more generalizable features can be learned to alleviate the overfitting problem and reduce the demand of massive labeled training data. Inspired by the MAE, we propose a spatial-temporal masked autoencoder framework for self-supervised 3D skeleton-based action recognition (SkeletonMAE). Following MAE's masking and reconstruction pipeline, we utilize a skeleton based encoder-decoder transformer architecture to reconstruct the masked skeleton sequences. A novel masking strategy, named Spatial-Temporal Masking, is introduced in terms of both joint-level and frame-level for the skeleton sequence. This pre-training strategy makes the encoder output generalizable skeleton features with spatial and temporal dependencies. Given the unmasked skeleton sequence, the encoder is fine-tuned for the action recognition task. Extensive experiments show that our SkeletonMAE achieves remarkable performance and outperforms the state-of-the-art methods on both NTU RGB+D and NTU RGB+D 120 datasets.
DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation
- Authors: Shuaitao Zhao, Kun Liu, Yuhang Huang, Qian Bao, Dan Zeng, Wu Liu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.02431
- Pdf link: https://arxiv.org/pdf/2209.02431
- Abstract Human pose estimation aims to figure out the keypoints of all people in different scenes. Current approaches still face some challenges despite promising results. Existing top-down methods deal with a single person individually, without the interaction between different people and the scene they are situated in. Consequently, the performance of human detection degrades when serious occlusion happens. On the other hand, existing bottom-up methods consider all people at the same time and capture the global knowledge of the entire image. However, they are less accurate than the top-down methods due to the scale variation. To address these problems, we propose a novel Dual-Pipeline Integrated Transformer (DPIT) by integrating top-down and bottom-up pipelines to explore the visual clues of different receptive fields and achieve their complementarity. Specifically, DPIT consists of two branches, the bottom-up branch deals with the whole image to capture the global visual information, while the top-down branch extracts the feature representation of local vision from the single-human bounding box. Then, the extracted feature representations from bottom-up and top-down branches are fed into the transformer encoder to fuse the global and local knowledge interactively. Moreover, we define the keypoint queries to explore both full-scene and single-human posture visual clues to realize the mutual complementarity of the two pipelines. To the best of our knowledge, this is one of the first works to integrate the bottom-up and top-down pipelines with transformers for human pose estimation. Extensive experiments on COCO and MPII datasets demonstrate that our DPIT achieves comparable performance to the state-of-the-art methods.
ViTKD: Practical Guidelines for ViT feature knowledge distillation
- Authors: Zhendong Yang, Zhe Li, Ailing Zeng, Zexian Li, Chun Yuan, Yu Li
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.02432
- Pdf link: https://arxiv.org/pdf/2209.02432
- Abstract Knowledge Distillation (KD) for Convolutional Neural Network (CNN) is extensively studied as a way to boost the performance of a small model. Recently, Vision Transformer (ViT) has achieved great success on many computer vision tasks and KD for ViT is also desired. However, besides the output logit-based KD, other feature-based KD methods for CNNs cannot be directly applied to ViT due to the huge structure gap. In this paper, we explore the way of feature-based distillation for ViT. Based on the nature of feature maps in ViT, we design a series of controlled experiments and derive three practical guidelines for ViT's feature distillation. Some of our findings are even opposite to the practices in the CNN era. Based on the three guidelines, we propose our feature-based method ViTKD which brings consistent and considerable improvement to the student. On ImageNet-1k, we boost DeiT-Tiny from 74.42% to 76.06%, DeiT-Small from 80.55% to 81.95%, and DeiT-Base from 81.76% to 83.46%. Moreover, ViTKD and the logit-based KD method are complementary and can be applied together directly. This combination can further improve the performance of the student. Specifically, the student DeiT-Tiny, Small, and Base achieve 77.78%, 83.59%, and 85.41%, respectively. The code is available at https://github.com/yzd-v/cls_KD.
Analyzing Transformers in Embedding Space
- Authors: Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.02535
- Pdf link: https://arxiv.org/pdf/2209.02535
- Abstract Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods rely on running models over inputs, recent work has shown that a zero-pass approach, where parameters are interpreted directly without a forward/backward pass is feasible for some Transformer parameters, and for two-layer attention networks. In this work, we present a theoretical analysis where all parameters of a trained Transformer are interpreted by projecting them into the embedding space, that is, the space of vocabulary items they operate on. We derive a simple theoretical framework to support our arguments and provide ample evidence for its validity. First, an empirical analysis showing that parameters of both pretrained and fine-tuned models can be interpreted in embedding space. Second, we present two applications of our framework: (a) aligning the parameters of different models that share a vocabulary, and (b) constructing a classifier without training by ``translating'' the parameters of a fine-tuned classifier to parameters of a different model that was only pretrained. Overall, our findings open the door to interpretation methods that, at least in part, abstract away from model specifics and operate in the embedding space only.
Semantic Image Synthesis with Semantically Coupled VQ-Model
- Authors: Stephan Alaniz, Thomas Hummel, Zeynep Akata
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2209.02536
- Pdf link: https://arxiv.org/pdf/2209.02536
- Abstract Semantic image synthesis enables control over unconditional image generation by allowing guidance on what is being generated. We conditionally synthesize the latent space from a vector quantized model (VQ-model) pre-trained to autoencode images. Instead of training an autoregressive Transformer on separately learned conditioning latents and image latents, we find that jointly learning the conditioning and image latents significantly improves the modeling capabilities of the Transformer model. While our jointly trained VQ-model achieves a similar reconstruction performance to a vanilla VQ-model for both semantic and image latents, tying the two modalities at the autoencoding stage proves to be an important ingredient to improve autoregressive modeling performance. We show that our model improves semantic image synthesis using autoregressive models on popular semantic image datasets ADE20k, Cityscapes and COCO-Stuff.
The Outcome of the 2022 Landslide4Sense Competition: Advanced Landslide Detection from Multi-Source Satellite Imagery
- Authors: Omid Ghorbanzadeh, Yonghao Xu, Hengwei Zhao, Junjue Wang, Yanfei Zhong, Dong Zhao, Qi Zang, Shuang Wang, Fahong Zhang, Yilei Shi, Xiao Xiang Zhu, Lin Bai, Weile Li, Weihang Peng, Pedram Ghamisi
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.02556
- Pdf link: https://arxiv.org/pdf/2209.02556
- Abstract The scientific outcomes of the 2022 Landslide4Sense (L4S) competition organized by the Institute of Advanced Research in Artificial Intelligence (IARAI) are presented here. The objective of the competition is to automatically detect landslides based on large-scale multiple sources of satellite imagery collected globally. The 2022 L4S aims to foster interdisciplinary research on recent developments in deep learning (DL) models for the semantic segmentation task using satellite imagery. In the past few years, DL-based models have achieved performance that meets expectations on image interpretation, due to the development of convolutional neural networks (CNNs). The main objective of this article is to present the details and the best-performing algorithms featured in this competition. The winning solutions are elaborated with state-of-the-art models like the Swin Transformer, SegFormer, and U-Net. Advanced machine learning techniques and strategies such as hard example mining, self-training, and mix-up data augmentation are also considered. Moreover, we describe the L4S benchmark data set in order to facilitate further comparisons, and report the results of the accuracy assessment online. The data is accessible on \textit{Future Development Leaderboard} for future evaluation at \url{https://www.iarai.ac.at/landslide4sense/challenge/}, and researchers are invited to submit more prediction results, evaluate the accuracy of their methods, compare them with those of other users, and, ideally, improve the landslide detection results reported in this article.
Keyword: scene understanding
Sequential Cross Attention Based Multi-task Learning
- Authors: Sunkyung Kim, Hyesong Choi, Dongbo Min
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.02518
- Pdf link: https://arxiv.org/pdf/2209.02518
- Abstract In multi-task learning (MTL) for visual scene understanding, it is crucial to transfer useful information between multiple tasks with minimal interferences. In this paper, we propose a novel architecture that effectively transfers informative features by applying the attention mechanism to the multi-scale features of the tasks. Since applying the attention module directly to all possible features in terms of scale and task requires a high complexity, we propose to apply the attention module sequentially for the task and scale. The cross-task attention module (CTAM) is first applied to facilitate the exchange of relevant information between the multiple task features of the same scale. The cross-scale attention module (CSAM) then aggregates useful information from feature maps at different resolutions in the same task. Also, we attempt to capture long range dependencies through the self-attention module in the feature extraction network. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the NYUD-v2 and PASCAL-Context dataset.
Keyword: visual reasoning
Kinova Gemini: Interactive Robot Grasping with Visual Reasoning and Conversational AI
- Authors: Hanxiao Chen, Jiankun Wang, Max Q.-H. Meng
- Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
- Arxiv link: https://arxiv.org/abs/2209.01319
- Pdf link: https://arxiv.org/pdf/2209.01319
- Abstract To facilitate recent advances in robotics and AI for delicate collaboration between humans and machines, we propose the Kinova Gemini, an original robotic system that integrates conversational AI dialogue and visual reasoning to make the Kinova Gen3 lite robot help people retrieve objects or complete perception-based pick-and-place tasks. When a person walks up to Kinova Gen3 lite, our Kinova Gemini is able to fulfill the user's requests in three different applications: (1) It can start a natural dialogue with people to interact and assist humans to retrieve objects and hand them to the user one by one. (2) It detects diverse objects with YOLO v3 and recognize color attributes of the item to ask people if they want to grasp it via the dialogue or enable the user to choose which specific one is required. (3) It applies YOLO v3 to recognize multiple objects and let you choose two items for perception-based pick-and-place tasks such as "Put the banana into the bowl" with visual reasoning and conversational interaction.