arxiv-daily icon indicating copy to clipboard operation
arxiv-daily copied to clipboard

New submissions for Tue, 30 Aug 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Ammunition Component Classification Using Deep Learning

  • Authors: Hadi Ghahremannezhad, Chengjun Liu, Hang Shi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.12863
  • Pdf link: https://arxiv.org/pdf/2208.12863
  • Abstract Ammunition scrap inspection is an essential step in the process of recycling ammunition metal scrap. Most ammunition is composed of a number of components, including case, primer, powder, and projectile. Ammo scrap containing energetics is considered to be potentially dangerous and should be separated before the recycling process. Manually inspecting each piece of scrap is tedious and time-consuming. We have gathered a dataset of ammunition components with the goal of applying artificial intelligence for classifying safe and unsafe scrap pieces automatically. First, two training datasets are manually created from visual and x-ray images of ammo. Second, the x-ray dataset is augmented using the spatial transforms of histogram equalization, averaging, sharpening, power law, and Gaussian blurring in order to compensate for the lack of sufficient training data. Lastly, the representative YOLOv4 object detection method is applied to detect the ammo components and classify the scrap pieces into safe and unsafe classes, respectively. The trained models are tested against unseen data in order to evaluate the performance of the applied method. The experiments demonstrate the feasibility of ammo component detection and classification using deep learning. The datasets and the pre-trained models are available at https://github.com/hadi-ghnd/Scrap-Classification.

YOLOX-PAI: An Improved YOLOX Version by PAI

  • Authors: Xinyi Zou, Ziheng Wu, Wenmeng Zhou, Jun Huang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.13040
  • Pdf link: https://arxiv.org/pdf/2208.13040
  • Abstract We develop an all-in-one computer vision toolbox named EasyCV to facilitate the use of various SOTA computer vision methods. Recently, we add YOLOX-PAI, an improved version of YOLOX, into EasyCV. We conduct ablation studies to investigate the influence of some detection methods on YOLOX. We also provide an easy use for PAI-Blade which is used to accelerate the inference process based on BladeDISC and TensorRT. Finally, we receive 42.8 mAP on COCO dateset within 1.0 ms on a single NVIDIA V100 GPU, which is a bit faster than YOLOv6. A simple but efficient predictor api is also designed in EasyCV to conduct end2end object detection. Codes and models are now available at: https://github.com/alibaba/EasyCV.

Incremental Semantic Localization using Hierarchical Clustering of Object Association Sets

  • Authors: Lan Hu, Zhongwei Luo, Runze Yuan, Yuchen Cao, Jiaxin Wei, Kai Wangand Laurent Kneip
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2208.13210
  • Pdf link: https://arxiv.org/pdf/2208.13210
  • Abstract We present a novel approach for relocalization or place recognition, a fundamental problem to be solved in many robotics, automation, and AR applications. Rather than relying on often unstable appearance information, we consider a situation in which the reference map is given in the form of localized objects. Our localization framework relies on 3D semantic object detections, which are then associated to objects in the map. Possible pair-wise association sets are grown based on hierarchical clustering using a merge metric that evaluates spatial compatibility. The latter notably uses information about relative object configurations, which is invariant with respect to global transformations. Association sets are furthermore updated and expanded as the camera incrementally explores the environment and detects further objects. We test our algorithm in several challenging situations including dynamic scenes, large view-point changes, and scenes with repeated instances. Our experiments demonstrate that our approach outperforms prior art in terms of both robustness and accuracy.

Confidence Estimation for Object Detection in Document Images

  • Authors: Mélodie Boillet, Christopher Kermorvant, Thierry Paquet
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.13391
  • Pdf link: https://arxiv.org/pdf/2208.13391
  • Abstract Deep neural networks are becoming increasingly powerful and large and always require more labelled data to be trained. However, since annotating data is time-consuming, it is now necessary to develop systems that show good performance while learning on a limited amount of data. These data must be correctly chosen to obtain models that are still efficient. For this, the systems must be able to determine which data should be annotated to achieve the best results. In this paper, we propose four estimators to estimate the confidence of object detection predictions. The first two are based on Monte Carlo dropout, the third one on descriptive statistics and the last one on the detector posterior probabilities. In the active learning framework, the three first estimators show a significant improvement in performance for the detection of document physical pages and text lines compared to a random selection of images. We also show that the proposed estimator based on descriptive statistics can replace MC dropout, reducing the computational cost without compromising the performances.

PV-RCNN++: Semantical Point-Voxel Feature Interaction for 3D Object Detection

  • Authors: Peng Wu, Lipeng Gu, Xuefeng Yan, Haoran Xie, Fu Lee Wang, Gary Cheng, Mingqiang Wei
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.13414
  • Pdf link: https://arxiv.org/pdf/2208.13414
  • Abstract Large imbalance often exists between the foreground points (i.e., objects) and the background points in outdoor LiDAR point clouds. It hinders cutting-edge detectors from focusing on informative areas to produce accurate 3D object detection results. This paper proposes a novel object detection network by semantical point-voxel feature interaction, dubbed PV-RCNN++. Unlike most of existing methods, PV-RCNN++ explores the semantic information to enhance the quality of object detection. First, a semantic segmentation module is proposed to retain more discriminative foreground keypoints. Such a module will guide our PV-RCNN++ to integrate more object-related point-wise and voxel-wise features in the pivotal areas. Then, to make points and voxels interact efficiently, we utilize voxel query based on Manhattan distance to quickly sample voxel-wise features around keypoints. Such the voxel query will reduce the time complexity from O(N) to O(K), compared to the ball query. Further, to avoid being stuck in learning only local features, an attention-based residual PointNet module is designed to expand the receptive field to adaptively aggregate the neighboring voxel-wise features into keypoints. Extensive experiments on the KITTI dataset show that PV-RCNN++ achieves 81.60$%$, 40.18$%$, 68.21$%$ 3D mAP on Car, Pedestrian, and Cyclist, achieving comparable or even better performance to the state-of-the-arts.

Light-YOLOv5: A Lightweight Algorithm for Improved YOLOv5 in Complex Fire Scenarios

  • Authors: Hao Xu, Bo Li, Fei Zhong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.13422
  • Pdf link: https://arxiv.org/pdf/2208.13422
  • Abstract In response to the existing object detection algorithms are applied to complex fire scenarios with poor detection accuracy, slow speed and difficult deployment., this paper proposes a lightweight fire detection algorithm of Light-YOLOv5 that achieves a balance of speed and accuracy. First, the last layer of backbone network is replaced with SepViT Block to enhance the contact of backbone network to global information; second, a Light-BiFPN neck network is designed to lighten the model while improving the feature extraction; third, Global Attention Mechanism (GAM) is fused into the network to make the model more focused on global dimensional features; finally, we use the Mish activation function and SIoU loss to increase the convergence speed and improve the accuracy at the same time. The experimental results show that Light-YOLOv5 improves mAP by 3.3% compared to the original algorithm, reduces the number of parameters by 27.1%, decreases the computation by 19.1%, achieves FPS of 91.1. Even compared to the latest YOLOv7-tiny, the mAP of Light-YOLOv5 is 6.8% higher, which shows the effectiveness of the algorithm.

Explainability of Deep Learning models for Urban Space perception

  • Authors: Ruben Sangers, Jan van Gemert, Sander van Cranenburgh
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.13555
  • Pdf link: https://arxiv.org/pdf/2208.13555
  • Abstract Deep learning based computer vision models are increasingly used by urban planners to support decision making for shaping urban environments. Such models predict how people perceive the urban environment quality in terms of e.g. its safety or beauty. However, the blackbox nature of deep learning models hampers urban planners to understand what landscape objects contribute to a particularly high quality or low quality urban space perception. This study investigates how computer vision models can be used to extract relevant policy information about peoples' perception of the urban space. To do so, we train two widely used computer vision architectures; a Convolutional Neural Network and a transformer, and apply GradCAM -- a well-known ex-post explainable AI technique -- to highlight the image regions important for the model's prediction. Using these GradCAM visualizations, we manually annotate the objects relevant to the models' perception predictions. As a result, we are able to discover new objects that are not represented in present object detection models used for annotation in previous studies. Moreover, our methodological results suggest that transformer architectures are better suited to be used in combination with GradCAM techniques. Code is available on Github.

Chosen methods of improving object recognition of small objects with weak recognizable features

  • Authors: Magdalena Stachoń, Marcin Pietroń
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.13591
  • Pdf link: https://arxiv.org/pdf/2208.13591
  • Abstract Many object detection models struggle with several problematic aspects of small object detection including the low number of samples, lack of diversity and low features representation. Taking into account that GANs belong to generative models class, their initial objective is to learn to mimic any data distribution. Using the proper GAN model would enable augmenting low precision data increasing their amount and diversity. This solution could potentially result in improved object detection results. Additionally, incorporating GAN-based architecture inside deep learning model can increase accuracy of small objects recognition. In this work the GAN-based method with augmentation is presented to improve small object detection on VOC Pascal dataset. The method is compared with different popular augmentation strategies like object rotations, shifts etc. The experiments are based on FasterRCNN model.

Open-Set Semi-Supervised Object Detection

  • Authors: Yen-Cheng Liu, Chih-Yao Ma, Xiaoliang Dai, Junjiao Tian, Peter Vajda, Zijian He, Zsolt Kira
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.13722
  • Pdf link: https://arxiv.org/pdf/2208.13722
  • Abstract Recent developments for Semi-Supervised Object Detection (SSOD) have shown the promise of leveraging unlabeled data to improve an object detector. However, thus far these methods have assumed that the unlabeled data does not contain out-of-distribution (OOD) classes, which is unrealistic with larger-scale unlabeled datasets. In this paper, we consider a more practical yet challenging problem, Open-Set Semi-Supervised Object Detection (OSSOD). We first find the existing SSOD method obtains a lower performance gain in open-set conditions, and this is caused by the semantic expansion, where the distracting OOD objects are mispredicted as in-distribution pseudo-labels for the semi-supervised training. To address this problem, we consider online and offline OOD detection modules, which are integrated with SSOD methods. With the extensive studies, we found that leveraging an offline OOD detector based on a self-supervised vision transformer performs favorably against online OOD detectors due to its robustness to the interference of pseudo-labeling. In the experiment, our proposed framework effectively addresses the semantic expansion issue and shows consistent improvements on many OSSOD benchmarks, including large-scale COCO-OpenImages. We also verify the effectiveness of our framework under different OSSOD conditions, including varying numbers of in-distribution classes, different degrees of supervision, and different combinations of unlabeled sets.

Keyword: transformer

Domain-Specific NER via Retrieving Correlated Samples

  • Authors: Xin Zhang, Yong Jiang, Xiaobin Wang, Xuming Hu, Yueheng Sun, Pengjun Xie, Meishan Zhang
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.12995
  • Pdf link: https://arxiv.org/pdf/2208.12995
  • Abstract Successful Machine Learning based Named Entity Recognition models could fail on texts from some special domains, for instance, Chinese addresses and e-commerce titles, where requires adequate background knowledge. Such texts are also difficult for human annotators. In fact, we can obtain some potentially helpful information from correlated texts, which have some common entities, to help the text understanding. Then, one can easily reason out the correct answer by referencing correlated samples. In this paper, we suggest enhancing NER models with correlated samples. We draw correlated samples by the sparse BM25 retriever from large-scale in-domain unlabeled data. To explicitly simulate the human reasoning process, we perform a training-free entity type calibrating by majority voting. To capture correlation features in the training stage, we suggest to model correlated samples by the transformer-based multi-instance cross-encoder. Empirical results on datasets of the above two domains show the efficacy of our methods.

TrojViT: Trojan Insertion in Vision Transformers

  • Authors: Mengxin Zheng, Qian Lou, Lei Jiang
  • Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2208.13049
  • Pdf link: https://arxiv.org/pdf/2208.13049
  • Abstract Vision Transformers (ViTs) have demonstrated the state-of-the-art performance in various vision-related tasks. The success of ViTs motivates adversaries to perform backdoor attacks on ViTs. Although the vulnerability of traditional CNNs to backdoor attacks is well-known, backdoor attacks on ViTs are seldom-studied. Compared to CNNs capturing pixel-wise local features by convolutions, ViTs extract global context information through patches and attentions. Na"ively transplanting CNN-specific backdoor attacks to ViTs yields only a low clean data accuracy and a low attack success rate. In this paper, we propose a stealth and practical ViT-specific backdoor attack $TrojViT$. Rather than an area-wise trigger used by CNN-specific backdoor attacks, TrojViT generates a patch-wise trigger designed to build a Trojan composed of some vulnerable bits on the parameters of a ViT stored in DRAM memory through patch salience ranking and attention-target loss. TrojViT further uses minimum-tuned parameter update to reduce the bit number of the Trojan. Once the attacker inserts the Trojan into the ViT model by flipping the vulnerable bits, the ViT model still produces normal inference accuracy with benign inputs. But when the attacker embeds a trigger into an input, the ViT model is forced to classify the input to a predefined target class. We show that flipping only few vulnerable bits identified by TrojViT on a ViT model using the well-known RowHammer can transform the model into a backdoored one. We perform extensive experiments of multiple datasets on various ViT models. TrojViT can classify $99.64%$ of test images to a target class by flipping $345$ bits on a ViT for ImageNet.

An Access Control Method with Secret Key for Semantic Segmentation Models

  • Authors: Teru Nagamori, Ryota Iijima, Hitoshi Kiya
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.13135
  • Pdf link: https://arxiv.org/pdf/2208.13135
  • Abstract A novel method for access control with a secret key is proposed to protect models from unauthorized access in this paper. We focus on semantic segmentation models with the vision transformer (ViT), called segmentation transformer (SETR). Most existing access control methods focus on image classification tasks, or they are limited to CNNs. By using a patch embedding structure that ViT has, trained models and test images can be efficiently encrypted with a secret key, and then semantic segmentation tasks are carried out in the encrypted domain. In an experiment, the method is confirmed to provide the same accuracy as that of using plain images without any encryption to authorized users with a correct key and also to provide an extremely degraded accuracy to unauthorized users.

ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers

  • Authors: Yutong Xie, Jianpeng Zhang, Yong Xia, Anton van den Hengel, Qi Wu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.13138
  • Pdf link: https://arxiv.org/pdf/2208.13138
  • Abstract Although Transformers have successfully transitioned from their language modelling origins to image-based applications, their quadratic computational complexity remains a challenge, particularly for dense prediction. In this paper we propose a content-based sparse attention method, as an alternative to dense self-attention, aiming to reduce the computation complexity while retaining the ability to model long-range dependencies. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost. Besides, we further extend the clustering-guided attention from single-scale to multi-scale, which is conducive to dense prediction tasks. We label the proposed Transformer architecture ClusTR, and demonstrate that it achieves state-of-the-art performance on various vision tasks but at lower computational cost and with fewer parameters. For instance, our ClusTR small model with 22.7M parameters achieves 83.2% Top-1 accuracy on ImageNet. Source code and ImageNet models will be made publicly available.

Learning Heterogeneous Interaction Strengths by Trajectory Prediction with Graph Neural Network

  • Authors: Seungwoong Ha, Hawoong Jeong
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.13179
  • Pdf link: https://arxiv.org/pdf/2208.13179
  • Abstract Dynamical systems with interacting agents are universal in nature, commonly modeled by a graph of relationships between their constituents. Recently, various works have been presented to tackle the problem of inferring those relationships from the system trajectories via deep neural networks, but most of the studies assume binary or discrete types of interactions for simplicity. In the real world, the interaction kernels often involve continuous interaction strengths, which cannot be accurately approximated by discrete relations. In this work, we propose the relational attentive inference network (RAIN) to infer continuously weighted interaction graphs without any ground-truth interaction strengths. Our model employs a novel pairwise attention (PA) mechanism to refine the trajectory representations and a graph transformer to extract heterogeneous interaction weights for each pair of agents. We show that our RAIN model with the PA mechanism accurately infers continuous interaction strengths for simulated physical systems in an unsupervised manner. Further, RAIN with PA successfully predicts trajectories from motion capture data with an interpretable interaction graph, demonstrating the virtue of modeling unknown dynamics with continuous weights.

Bayesian Neural Network Language Modeling for Speech Recognition

  • Authors: Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu, Helen Meng
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.13259
  • Pdf link: https://arxiv.org/pdf/2208.13259
  • Abstract State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex. They are prone to overfitting and poor generalization when given limited training data. To this end, an overarching full Bayesian learning framework encompassing three methods is proposed in this paper to account for the underlying uncertainty in LSTM-RNN and Transformer LMs. The uncertainty over their model parameters, choice of neural activations and hidden output representations are modeled using Bayesian, Gaussian Process and variational LSTM-RNN or Transformer LMs respectively. Efficient inference approaches were used to automatically select the optimal network internal components to be Bayesian learned using neural architecture search. A minimal number of Monte Carlo parameter samples as low as one was also used. These allow the computational costs incurred in Bayesian NNLM training and evaluation to be minimized. Experiments are conducted on two tasks: AMI meeting transcription and Oxford-BBC LipReading Sentences 2 (LRS2) overlapped speech recognition using state-of-the-art LF-MMI trained factored TDNN systems featuring data augmentation, speaker adaptation and audio-visual multi-channel beamforming for overlapped speech. Consistent performance improvements over the baseline LSTM-RNN and Transformer LMs with point estimated model parameters and drop-out regularization were obtained across both tasks in terms of perplexity and word error rate (WER). In particular, on the LRS2 data, statistically significant WER reductions up to 1.3% and 1.2% absolute (12.1% and 11.3% relative) were obtained over the baseline LSTM-RNN and Transformer LMs respectively after model combination between Bayesian NNLMs and their respective baselines.

naab: A ready-to-use plug-and-play corpus for Farsi

  • Authors: Sadra Sabouri, Elnaz Rahmati, Soroush Gooran, Hossein Sameti
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.13486
  • Pdf link: https://arxiv.org/pdf/2208.13486
  • Abstract Huge corpora of textual data are always known to be a crucial need for training deep models such as transformer-based ones. This issue is emerging more in lower resource languages - like Farsi. We propose naab, the biggest cleaned and ready-to-use open-source textual corpus in Farsi. It contains about 130GB of data, 250 million paragraphs, and 15 billion words. The project name is derived from the Farsi word NAAB K which means pure and high grade. We also provide the raw version of the corpus called naab-raw and an easy-to-use preprocessor that can be employed by those who wanted to make a customized corpus.

Explainability of Deep Learning models for Urban Space perception

  • Authors: Ruben Sangers, Jan van Gemert, Sander van Cranenburgh
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.13555
  • Pdf link: https://arxiv.org/pdf/2208.13555
  • Abstract Deep learning based computer vision models are increasingly used by urban planners to support decision making for shaping urban environments. Such models predict how people perceive the urban environment quality in terms of e.g. its safety or beauty. However, the blackbox nature of deep learning models hampers urban planners to understand what landscape objects contribute to a particularly high quality or low quality urban space perception. This study investigates how computer vision models can be used to extract relevant policy information about peoples' perception of the urban space. To do so, we train two widely used computer vision architectures; a Convolutional Neural Network and a transformer, and apply GradCAM -- a well-known ex-post explainable AI technique -- to highlight the image regions important for the model's prediction. Using these GradCAM visualizations, we manually annotate the objects relevant to the models' perception predictions. As a result, we are able to discover new objects that are not represented in present object detection models used for annotation in previous studies. Moreover, our methodological results suggest that transformer architectures are better suited to be used in combination with GradCAM techniques. Code is available on Github.

Spatio-Temporal Wind Speed Forecasting using Graph Networks and Novel Transformer Architectures

  • Authors: Lars Ødegaard Bentsen, Narada Dilp Warakagoda, Roy Stenbro, Paal Engelstad
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.13585
  • Pdf link: https://arxiv.org/pdf/2208.13585
  • Abstract To improve the security and reliability of wind energy production, short-term forecasting has become of utmost importance. This study focuses on multi-step spatio-temporal wind speed forecasting for the Norwegian continental shelf. A graph neural network (GNN) architecture was used to extract spatial dependencies, with different update functions to learn temporal correlations. These update functions were implemented using different neural network architectures. One such architecture, the Transformer, has become increasingly popular for sequence modelling in recent years. Various alterations of the original architecture have been proposed to better facilitate time-series forecasting, of which this study focused on the Informer, LogSparse Transformer and Autoformer. This is the first time the LogSparse Transformer and Autoformer have been applied to wind forecasting and the first time any of these or the Informer have been formulated in a spatio-temporal setting for wind forecasting. By comparing against spatio-temporal Long Short-Term Memory (LSTM) and Multi-Layer Perceptron (MLP) models, the study showed that the models using the altered Transformer architectures as update functions in GNNs were able to outperform these. Furthermore, we propose the Fast Fourier Transformer (FFTransformer), which is a novel Transformer architecture based on signal decomposition and consists of two separate streams that analyse trend and periodic components separately. The FFTransformer and Autoformer were found to achieve superior results for the 10-minute and 1-hour ahead forecasts, with the FFTransformer significantly outperforming all other models for the 4-hour ahead forecasts. Finally, by varying the degree of connectivity for the graph representations, the study explicitly demonstrates how all models were able to leverage spatial dependencies to improve local short-term wind speed forecasting.

Combating high variance in Data-Scarce Implicit Hate Speech Classification

  • Authors: Debaditya Pal, Kaustubh Chaudhari, Harsh Sharma
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.13595
  • Pdf link: https://arxiv.org/pdf/2208.13595
  • Abstract Hate speech classification has been a long-standing problem in natural language processing. However, even though there are numerous hate speech detection methods, they usually overlook a lot of hateful statements due to them being implicit in nature. Developing datasets to aid in the task of implicit hate speech classification comes with its own challenges; difficulties are nuances in language, varying definitions of what constitutes hate speech, and the labor-intensive process of annotating such data. This had led to a scarcity of data available to train and test such systems, which gives rise to high variance problems when parameter-heavy transformer-based models are used to address the problem. In this paper, we explore various optimization and regularization techniques and develop a novel RoBERTa-based model that achieves state-of-the-art performance.

StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

  • Authors: Jun Ling, Xu Tan, Liyang Chen, Runnan Li, Yuchao Zhang, Sheng Zhao, Li Song
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.13717
  • Pdf link: https://arxiv.org/pdf/2208.13717
  • Abstract While previous speech-driven talking face generation methods have made significant progress in improving the visual quality and lip-sync quality of the synthesized videos, they pay less attention to lip motion jitters which greatly undermine the realness of talking face videos. What causes motion jitters, and how to mitigate the problem? In this paper, we conduct systematic analyses on the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video, and improve the motion stability with a series of effective designs. We find that several issues can lead to jitters in synthesized talking face video: 1) jitters from the input 3D face representations; 2) training-inference mismatch; 3) lack of dependency modeling among video frames. Accordingly, we propose three effective solutions to address this issue: 1) we propose a gaussian-based adaptive smoothing module to smooth the 3D face representations to eliminate jitters in the input; 2) we add augmented erosions on the input data of the neural renderer in training to simulate the distortion in inference to reduce mismatch; 3) we develop an audio-fused transformer generator to model dependency among video frames. Besides, considering there is no off-the-shelf metric for measuring motion jitters in talking face video, we devise an objective metric (Motion Stability Index, MSI), to quantitatively measure the motion jitters by calculating the reciprocal of variance acceleration. Extensive experimental results show the superiority of our method on motion-stable face video generation, with better quality than previous systems.

CounTR: Transformer-based Generalised Visual Counting

  • Authors: Chang Liu, Yujie Zhong, Andrew Zisserman, Weidi Xie
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.13721
  • Pdf link: https://arxiv.org/pdf/2208.13721
  • Abstract In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture for generalised visual object counting, termed as Counting Transformer (CounTR), which explicitly capture the similarity between image patches or with given "exemplars" with the attention mechanism;(2) We adopt a two-stage training regime, that first pre-trains the model with self-supervised learning, and followed by supervised fine-tuning;(3) We propose a simple, scalable pipeline for synthesizing training images with a large number of instances or that from different semantic categories, explicitly forcing the model to make use of the given "exemplars";(4) We conduct thorough ablation studies on the large-scale counting benchmark, e.g. FSC-147, and demonstrate state-of-the-art performance on both zero and few-shot settings.

Open-Set Semi-Supervised Object Detection

  • Authors: Yen-Cheng Liu, Chih-Yao Ma, Xiaoliang Dai, Junjiao Tian, Peter Vajda, Zijian He, Zsolt Kira
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.13722
  • Pdf link: https://arxiv.org/pdf/2208.13722
  • Abstract Recent developments for Semi-Supervised Object Detection (SSOD) have shown the promise of leveraging unlabeled data to improve an object detector. However, thus far these methods have assumed that the unlabeled data does not contain out-of-distribution (OOD) classes, which is unrealistic with larger-scale unlabeled datasets. In this paper, we consider a more practical yet challenging problem, Open-Set Semi-Supervised Object Detection (OSSOD). We first find the existing SSOD method obtains a lower performance gain in open-set conditions, and this is caused by the semantic expansion, where the distracting OOD objects are mispredicted as in-distribution pseudo-labels for the semi-supervised training. To address this problem, we consider online and offline OOD detection modules, which are integrated with SSOD methods. With the extensive studies, we found that leveraging an offline OOD detector based on a self-supervised vision transformer performs favorably against online OOD detectors due to its robustness to the interference of pseudo-labeling. In the experiment, our proposed framework effectively addresses the semantic expansion issue and shows consistent improvements on many OSSOD benchmarks, including large-scale COCO-OpenImages. We also verify the effectiveness of our framework under different OSSOD conditions, including varying numbers of in-distribution classes, different degrees of supervision, and different combinations of unlabeled sets.

Keyword: scene understanding

Neuromorphic Visual Scene Understanding with Resonator Networks

  • Authors: Alpha Renner, Lazar Supic, Andreea Danielescu, Giacomo Indiveri, Bruno A. Olshausen, Yulia Sandamirskaya, Friedrich T. Sommer, E. Paxon Frady
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2208.12880
  • Pdf link: https://arxiv.org/pdf/2208.12880
  • Abstract Inferring the position of objects and their rigid transformations is still an open problem in visual scene understanding. Here we propose a neuromorphic solution that utilizes an efficient factorization network which is based on three key concepts: (1) a computational framework based on Vector Symbolic Architectures (VSA) with complex-valued vectors; (2) the design of Hierarchical Resonator Networks (HRN) to deal with the non-commutative nature of translation and rotation in visual scenes, when both are used in combination; (3) the design of a multi-compartment spiking phasor neuron model for implementing complex-valued vector binding on neuromorphic hardware. The VSA framework uses vector binding operations to produce generative image models in which binding acts as the equivariant operation for geometric transformations. A scene can therefore be described as a sum of vector products, which in turn can be efficiently factorized by a resonator network to infer objects and their poses. The HRN enables the definition of a partitioned architecture in which vector binding is equivariant for horizontal and vertical translation within one partition, and for rotation and scaling within the other partition. The spiking neuron model allows to map the resonator network onto efficient and low-power neuromorphic hardware. In this work, we demonstrate our approach using synthetic scenes composed of simple 2D shapes undergoing rigid geometric transformations and color changes. A companion paper demonstrates this approach in real-world application scenarios for machine vision and robotics.

Keyword: visual reasoning

Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment

  • Authors: Mustafa Shukor, Guillaume Couairon, Matthieu Cord
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.13628
  • Pdf link: https://arxiv.org/pdf/2208.13628
  • Abstract Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks. The current trend is to move towards ever larger models and pretraining datasets. This computational headlong rush does not seem reasonable in the long term to move toward sustainable solutions, and de facto excludes academic laboratories with limited resources. In this work, we propose a new framework, dubbed ViCHA, that efficiently exploits the input data to boost the learning by: (a) a new hierarchical cross-modal alignment loss, (b) new self-supervised scheme based on masked image modeling, (c) leveraging image-level annotations, called Visual Concepts, obtained with existing foundation models such as CLIP to boost the performance of the image encoder. Although pretrained on four times less data, our ViCHA strategy outperforms other approaches on several downstream tasks such as Image-Text Retrieval, VQA, Visual Reasoning, Visual Entailment and Visual Grounding. The code will be made publicly available here: https://github.com/mshukor/ViCHA

DongZhouGu avatar Aug 30 '22 04:08 DongZhouGu