arxiv-daily icon indicating copy to clipboard operation
arxiv-daily copied to clipboard

New submissions for Thu, 6 Oct 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

BayesFT: Bayesian Optimization for Fault Tolerant Neural Network Architecture

  • Authors: Nanyang Ye, Jingbiao Mei, Zhicheng Fang, Yuwen Zhang, Ziqing Zhang, Huaying Wu, Xiaoyao Liang
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2210.01795
  • Pdf link: https://arxiv.org/pdf/2210.01795
  • Abstract To deploy deep learning algorithms on resource-limited scenarios, an emerging device-resistive random access memory (ReRAM) has been regarded as promising via analog computing. However, the practicability of ReRAM is primarily limited due to the weight drifting of ReRAM neural networks due to multi-factor reasons, including manufacturing, thermal noises, and etc. In this paper, we propose a novel Bayesian optimization method for fault tolerant neural network architecture (BayesFT). For neural architecture search space design, instead of conducting neural architecture search on the whole feasible neural architecture search space, we first systematically explore the weight drifting tolerance of different neural network components, such as dropout, normalization, number of layers, and activation functions in which dropout is found to be able to improve the neural network robustness to weight drifting. Based on our analysis, we propose an efficient search space by only searching for dropout rates for each layer. Then, we use Bayesian optimization to search for the optimal neural architecture robust to weight drifting. Empirical experiments demonstrate that our algorithmic framework has outperformed the state-of-the-art methods by up to 10 times on various tasks, such as image classification and object detection.

MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models

  • Authors: Chenglin Yang, Siyuan Qiao, Qihang Yu, Xiaoding Yuan, Yukun Zhu, Alan Yuille, Hartwig Adam, Liang-Chieh Chen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2210.01820
  • Pdf link: https://arxiv.org/pdf/2210.01820
  • Abstract This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike the current works that stack separate mobile convolution and transformer blocks, we effectively merge them into a MOAT block. Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further reorder it before the self-attention operation. The mobile convolution block not only enhances the network representation capacity, but also produces better downsampled features. Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% top-1 accuracy on ImageNet-1K with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large resolution inputs by simply converting the global attention to window attention. Thanks to the mobile convolution that effectively exchanges local information between pixels (and thus cross-windows), MOAT does not need the extra window-shifting mechanism. As a result, on COCO object detection, MOAT achieves 59.2% box AP with 227M model parameters (single-scale inference, and hard NMS), and on ADE20K semantic segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale inference). Finally, the tiny-MOAT family, obtained by simply reducing the channel sizes, also surprisingly outperforms several mobile-specific transformer-based models on ImageNet. We hope our simple yet effective MOAT will inspire more seamless integration of convolution and self-attention. Code is made publicly available.

Centerpoints Are All You Need in Overhead Imagery

  • Authors: James Mason Inder, Mark Lowell, Andrew J. Maltenfort
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2210.01857
  • Pdf link: https://arxiv.org/pdf/2210.01857
  • Abstract Labeling data to use for training object detectors is expensive and time consuming. Publicly available overhead datasets for object detection are labeled with image-aligned bounding boxes, object-aligned bounding boxes, or object masks, but it is not clear whether such detailed labeling is necessary. To test the idea, we developed novel single- and two-stage network architectures that use centerpoints for labeling. In this paper we show that these architectures achieve nearly equivalent performance to approaches using more detailed labeling on three overhead object detection datasets.

Exploring Effective Knowledge Transfer for Few-shot Object Detection

  • Authors: Zhiyuan Zhao, Qingjie Liu, Yunhong Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2210.02021
  • Pdf link: https://arxiv.org/pdf/2210.02021
  • Abstract Recently, few-shot object detection~(FSOD) has received much attention from the community, and many methods are proposed to address this problem from a knowledge transfer perspective. Though promising results have been achieved, these methods fail to achieve shot-stable:~methods that excel in low-shot regimes are likely to struggle in high-shot regimes, and vice versa. We believe this is because the primary challenge of FSOD changes when the number of shots varies. In the low-shot regime, the primary challenge is the lack of inner-class variation. In the high-shot regime, as the variance approaches the real one, the main hindrance to the performance comes from misalignment between learned and true distributions. However, these two distinct issues remain unsolved in most existing FSOD methods. In this paper, we propose to overcome these challenges by exploiting rich knowledge the model has learned and effectively transferring them to the novel classes. For the low-shot regime, we propose a distribution calibration method to deal with the lack of inner-class variation problem. Meanwhile, a shift compensation method is proposed to compensate for possible distribution shift during fine-tuning. For the high-shot regime, we propose to use the knowledge learned from ImageNet as guidance for the feature learning in the fine-tuning stage, which will implicitly align the distributions of the novel classes. Although targeted toward different regimes, these two strategies can work together to further improve the FSOD performance. Experiments on both the VOC and COCO benchmarks show that our proposed method can significantly outperform the baseline method and produce competitive results in both low-shot settings (shot<5) and high-shot settings (shot>=5). Code is available at https://github.com/JulioZhao97/EffTrans_Fsdet.git.

Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders

  • Authors: Youngwan Lee, Jeffrey Willette, Jonghee Kim, Juho Lee, Sung Ju Hwang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2210.02077
  • Pdf link: https://arxiv.org/pdf/2210.02077
  • Abstract Masked image modeling (MIM) has become a popular strategy for self-supervised learning~(SSL) of visual representations with Vision Transformers. A representative MIM model, the masked auto-encoder (MAE), randomly masks a subset of image patches and reconstructs the masked patches given the unmasked patches. Concurrently, many recent works in self-supervised learning utilize the student/teacher paradigm which provides the student with an additional target based on the output of a teacher composed of an exponential moving average (EMA) of previous students. Although common, relatively little is known about the dynamics of the interaction between the student and teacher. Through analysis on a simple linear model, we find that the teacher conditionally removes previous gradient directions based on feature similarities which effectively acts as a conditional momentum regularizer. From this analysis, we present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE. We find that RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training, which may provide a way to enhance the practicality of prohibitively expensive self-supervised learning of Vision Transformer models. Additionally, we show that RC-MAE achieves more robustness and better performance compared to MAE on downstream tasks such as ImageNet-1K classification, object detection, and instance segmentation.

WUDA: Unsupervised Domain Adaptation Based on Weak Source Domain Labels

  • Authors: Shengjie Liu, Chuang Zhu, Wenqi Tang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2210.02088
  • Pdf link: https://arxiv.org/pdf/2210.02088
  • Abstract Unsupervised domain adaptation (UDA) for semantic segmentation addresses the cross-domain problem with fine source domain labels. However, the acquisition of semantic labels has always been a difficult step, many scenarios only have weak labels (e.g. bounding boxes). For scenarios where weak supervision and cross-domain problems coexist, this paper defines a new task: unsupervised domain adaptation based on weak source domain labels (WUDA). To explore solutions for this task, this paper proposes two intuitive frameworks: 1) Perform weakly supervised semantic segmentation in the source domain, and then implement unsupervised domain adaptation; 2) Train an object detection model using source domain data, then detect objects in the target domain and implement weakly supervised semantic segmentation. We observe that the two frameworks behave differently when the datasets change. Therefore, we construct dataset pairs with a wide range of domain shifts and conduct extended experiments to analyze the impact of different domain shifts on the two frameworks. In addition, to measure domain shift, we apply the metric representation shift to urban landscape image segmentation for the first time. The source code and constructed datasets are available at \url{https://github.com/bupt-ai-cz/WUDA}.

Centralized Feature Pyramid for Object Detection

  • Authors: Yu Quan, Dong Zhang, Liyan Zhang, Jinhui Tang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2210.02093
  • Pdf link: https://arxiv.org/pdf/2210.02093
  • Abstract Visual feature pyramid has shown its superiority in both effectiveness and efficiency in a wide range of applications. However, the existing methods exorbitantly concentrate on the inter-layer feature interactions but ignore the intra-layer feature regulations, which are empirically proved beneficial. Although some methods try to learn a compact intra-layer feature representation with the help of the attention mechanism or the vision transformer, they ignore the neglected corner regions that are important for dense prediction tasks. To address this problem, in this paper, we propose a Centralized Feature Pyramid (CFP) for object detection, which is based on a globally explicit centralized feature regulation. Specifically, we first propose a spatial explicit visual center scheme, where a lightweight MLP is used to capture the globally long-range dependencies and a parallel learnable visual center mechanism is used to capture the local corner regions of the input images. Based on this, we then propose a globally centralized regulation for the commonly-used feature pyramid in a top-down fashion, where the explicit visual center information obtained from the deepest intra-layer feature is used to regulate frontal shallow features. Compared to the existing feature pyramids, CFP not only has the ability to capture the global long-range dependencies, but also efficiently obtain an all-round yet discriminative feature representation. Experimental results on the challenging MS-COCO validate that our proposed CFP can achieve the consistent performance gains on the state-of-the-art YOLOv5 and YOLOX object detection baselines.

Spatio-Temporal Learnable Proposals for End-to-End Video Object Detection

  • Authors: Khurram Azeem Hashmi, Didier Stricker, Muhammamd Zeshan Afzal
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2210.02368
  • Pdf link: https://arxiv.org/pdf/2210.02368
  • Abstract This paper presents the novel idea of generating object proposals by leveraging temporal information for video object detection. The feature aggregation in modern region-based video object detectors heavily relies on learned proposals generated from a single-frame RPN. This imminently introduces additional components like NMS and produces unreliable proposals on low-quality frames. To tackle these restrictions, we present SparseVOD, a novel video object detection pipeline that employs Sparse R-CNN to exploit temporal information. In particular, we introduce two modules in the dynamic head of Sparse R-CNN. First, the Temporal Feature Extraction module based on the Temporal RoI Align operation is added to extract the RoI proposal features. Second, motivated by sequence-level semantic aggregation, we incorporate the attention-guided Semantic Proposal Feature Aggregation module to enhance object feature representation before detection. The proposed SparseVOD effectively alleviates the overhead of complicated post-processing methods and makes the overall pipeline end-to-end trainable. Extensive experiments show that our method significantly improves the single-frame Sparse RCNN by 8%-9% in mAP. Furthermore, besides achieving state-of-the-art 80.3% mAP on the ImageNet VID dataset with ResNet-50 backbone, our SparseVOD outperforms existing proposal-based methods by a significant margin on increasing IoU thresholds (IoU > 0.5).

Keyword: transformer

MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models

  • Authors: Chenglin Yang, Siyuan Qiao, Qihang Yu, Xiaoding Yuan, Yukun Zhu, Alan Yuille, Hartwig Adam, Liang-Chieh Chen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2210.01820
  • Pdf link: https://arxiv.org/pdf/2210.01820
  • Abstract This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike the current works that stack separate mobile convolution and transformer blocks, we effectively merge them into a MOAT block. Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further reorder it before the self-attention operation. The mobile convolution block not only enhances the network representation capacity, but also produces better downsampled features. Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% top-1 accuracy on ImageNet-1K with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large resolution inputs by simply converting the global attention to window attention. Thanks to the mobile convolution that effectively exchanges local information between pixels (and thus cross-windows), MOAT does not need the extra window-shifting mechanism. As a result, on COCO object detection, MOAT achieves 59.2% box AP with 227M model parameters (single-scale inference, and hard NMS), and on ADE20K semantic segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale inference). Finally, the tiny-MOAT family, obtained by simply reducing the channel sizes, also surprisingly outperforms several mobile-specific transformer-based models on ImageNet. We hope our simple yet effective MOAT will inspire more seamless integration of convolution and self-attention. Code is made publicly available.

Memory in humans and deep language models: Linking hypotheses for model augmentation

  • Authors: Omri Raccah, Pheobe Chen, Ted L. Willke, David Poeppel, Vy A. Vo
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2210.01869
  • Pdf link: https://arxiv.org/pdf/2210.01869
  • Abstract The computational complexity of the self-attention mechanism in Transformer models significantly limits their ability to generalize over long temporal durations. Memory-augmentation, or the explicit storing of past information in external memory for subsequent predictions, has become a constructive avenue for mitigating this limitation. We argue that memory-augmented Transformers can benefit substantially from considering insights from the memory literature in humans. We detail an approach to integrating evidence from the human memory system through the specification of cross-domain linking hypotheses. We then provide an empirical demonstration to evaluate the use of surprisal as a linking hypothesis, and further identify the limitations of this approach to inform future research.

A Perceptual Quality Metric for Video Frame Interpolation

  • Authors: Qiqi Hou, Abhijay Ghildyal, Feng Liu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2210.01879
  • Pdf link: https://arxiv.org/pdf/2210.01879
  • Abstract Research on video frame interpolation has made significant progress in recent years. However, existing methods mostly use off-the-shelf metrics to measure the quality of interpolation results with the exception of a few methods that employ user studies, which is time-consuming. As video frame interpolation results often exhibit unique artifacts, existing quality metrics sometimes are not consistent with human perception when measuring the interpolation results. Some recent deep learning-based perceptual quality metrics are shown more consistent with human judgments, but their performance on videos is compromised since they do not consider temporal information. In this paper, we present a dedicated perceptual quality metric for measuring video frame interpolation results. Our method learns perceptual features directly from videos instead of individual frames. It compares pyramid features extracted from video frames and employs Swin Transformer blocks-based spatio-temporal modules to extract spatio-temporal information. To train our metric, we collected a new video frame interpolation quality assessment dataset. Our experiments show that our dedicated quality metric outperforms state-of-the-art methods when measuring video frame interpolation results. Our code and model are made publicly available at \url{https://github.com/hqqxyy/VFIPS}.

Multi-view Human Body Mesh Translator

  • Authors: Xiangjian Jiang, Xuecheng Nie, Zitian Wang, Luoqi Liu, Si Liu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2210.01886
  • Pdf link: https://arxiv.org/pdf/2210.01886
  • Abstract Existing methods for human mesh recovery mainly focus on single-view frameworks, but they often fail to produce accurate results due to the ill-posed setup. Considering the maturity of the multi-view motion capture system, in this paper, we propose to solve the prior ill-posed problem by leveraging multiple images from different views, thus significantly enhancing the quality of recovered meshes. In particular, we present a novel \textbf{M}ulti-view human body \textbf{M}esh \textbf{T}ranslator (MMT) model for estimating human body mesh with the help of vision transformer. Specifically, MMT takes multi-view images as input and translates them to targeted meshes in a single-forward manner. MMT fuses features of different views in both encoding and decoding phases, leading to representations embedded with global information. Additionally, to ensure the tokens are intensively focused on the human pose and shape, MMT conducts cross-view alignment at the feature level by projecting 3D keypoint positions to each view and enforcing their consistency in geometry constraints. Comprehensive experiments demonstrate that MMT outperforms existing single or multi-view models by a large margin for human mesh recovery task, notably, 28.8% improvement in MPVE over the current state-of-the-art method on the challenging HUMBI dataset. Qualitative evaluation also verifies the effectiveness of MMT in reconstructing high-quality human mesh. Codes will be made available upon acceptance.

Meta-Ensemble Parameter Learning

  • Authors: Zhengcong Fei, Shuman Tian, Junshi Huang, Xiaoming Wei, Xiaolin Wei
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2210.01973
  • Pdf link: https://arxiv.org/pdf/2210.01973
  • Abstract Ensemble of machine learning models yields improved performance as well as robustness. However, their memory requirements and inference costs can be prohibitively high. Knowledge distillation is an approach that allows a single model to efficiently capture the approximate performance of an ensemble while showing poor scalability as demand for re-training when introducing new teacher models. In this paper, we study if we can utilize the meta-learning strategy to directly predict the parameters of a single model with comparable performance of an ensemble. Hereto, we introduce WeightFormer, a Transformer-based model that can predict student network weights layer by layer in a forward pass, according to the teacher model parameters. The proprieties of WeightFormer are investigated on the CIFAR-10, CIFAR-100, and ImageNet datasets for model structures of VGGNet-11, ResNet-50, and ViT-B/32, where it demonstrates that our method can achieve approximate classification performance of an ensemble and outperforms both the single network and standard knowledge distillation. More encouragingly, we show that WeightFormer results can further exceeds average ensemble with minor fine-tuning. Importantly, our task along with the model and results can potentially lead to a new, more efficient, and scalable paradigm of ensemble networks parameter learning.

Waveformer: Linear-Time Attention with Forward and Backward Wavelet Transform

  • Authors: Yufan Zhuang, Zihan Wang, Fangbo Tao, Jingbo Shang
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2210.01989
  • Pdf link: https://arxiv.org/pdf/2210.01989
  • Abstract We propose Waveformer that learns attention mechanism in the wavelet coefficient space, requires only linear time complexity, and enjoys universal approximating power. Specifically, we first apply forward wavelet transform to project the input sequences to multi-resolution orthogonal wavelet bases, then conduct nonlinear transformations (in this case, a random feature kernel) in the wavelet coefficient space, and finally reconstruct the representation in input space via backward wavelet transform. We note that other non-linear transformations may be used, hence we name the learning paradigm Wavelet transformatIon for Sequence lEarning (WISE). We emphasize the importance of backward reconstruction in the WISE paradigm -- without it, one would be mixing information from both the input space and coefficient space through skip connections, which shall not be considered as mathematically sound. Compared with Fourier transform in recent works, wavelet transform is more efficient in time complexity and better captures local and positional information; we further support this through our ablation studies. Extensive experiments on seven long-range understanding datasets from the Long Range Arena benchmark and code understanding tasks demonstrate that (1) Waveformer achieves competitive and even better accuracy than a number of state-of-the-art Transformer variants and (2) WISE can boost accuracies of various attention approximation methods without increasing the time complexity. These together showcase the superiority of learning attention in a wavelet coefficient space over the input space.

Point Cloud Recognition with Position-to-Structure Attention Transformers

  • Authors: Zheng Ding, James Hou, Zhuowen Tu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2210.02030
  • Pdf link: https://arxiv.org/pdf/2210.02030
  • Abstract In this paper, we present Position-to-Structure Attention Transformers (PS-Former), a Transformer-based algorithm for 3D point cloud recognition. PS-Former deals with the challenge in 3D point cloud representation where points are not positioned in a fixed grid structure and have limited feature description (only 3D coordinates ($x, y, z$) for scattered points). Existing Transformer-based architectures in this domain often require a pre-specified feature engineering step to extract point features. Here, we introduce two new aspects in PS-Former: 1) a learnable condensation layer that performs point downsampling and feature extraction; and 2) a Position-to-Structure Attention mechanism that recursively enriches the structural information with the position attention branch. Compared with the competing methods, while being generic with less heuristics feature designs, PS-Former demonstrates competitive experimental results on three 3D point cloud tasks including classification, part segmentation, and scene segmentation.

Advanced Deep Learning Architectures for Accurate Detection of Subsurface Tile Drainage Pipes from Remote Sensing Images

  • Authors: Tom L. Breitkopf (1), Leonard W. Hackel (1), Mahdyar Ravanbakhsh (1), Anne-Karin Cooke (2), Sandra Willkommen (2), Stefan Broda (2), Begüm Demir (1) ((1) Technische Universität Berlin, (2) Bundesanstalt für Geowissenschaften und Rohstoffe Berlin)
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2210.02071
  • Pdf link: https://arxiv.org/pdf/2210.02071
  • Abstract Subsurface tile drainage pipes provide agronomic, economic and environmental benefits. By lowering the water table of wet soils, they improve the aeration of plant roots and ultimately increase the productivity of farmland. They do however also provide an entryway of agrochemicals into subsurface water bodies and increase nutrition loss in soils. For maintenance and infrastructural development, accurate maps of tile drainage pipe locations and drained agricultural land are needed. However, these maps are often outdated or not present. Different remote sensing (RS) image processing techniques have been applied over the years with varying degrees of success to overcome these restrictions. Recent developments in deep learning (DL) techniques improve upon the conventional techniques with machine learning segmentation models. In this study, we introduce two DL-based models: i) improved U-Net architecture; and ii) Visual Transformer-based encoder-decoder in the framework of tile drainage pipe detection. Experimental results confirm the effectiveness of both models in terms of detection accuracy when compared to a basic U-Net architecture. Our code and models are publicly available at \url{https://git.tu-berlin.de/rsim/drainage-pipes-detection}.

Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders

  • Authors: Youngwan Lee, Jeffrey Willette, Jonghee Kim, Juho Lee, Sung Ju Hwang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2210.02077
  • Pdf link: https://arxiv.org/pdf/2210.02077
  • Abstract Masked image modeling (MIM) has become a popular strategy for self-supervised learning~(SSL) of visual representations with Vision Transformers. A representative MIM model, the masked auto-encoder (MAE), randomly masks a subset of image patches and reconstructs the masked patches given the unmasked patches. Concurrently, many recent works in self-supervised learning utilize the student/teacher paradigm which provides the student with an additional target based on the output of a teacher composed of an exponential moving average (EMA) of previous students. Although common, relatively little is known about the dynamics of the interaction between the student and teacher. Through analysis on a simple linear model, we find that the teacher conditionally removes previous gradient directions based on feature similarities which effectively acts as a conditional momentum regularizer. From this analysis, we present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE. We find that RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training, which may provide a way to enhance the practicality of prohibitively expensive self-supervised learning of Vision Transformer models. Additionally, we show that RC-MAE achieves more robustness and better performance compared to MAE on downstream tasks such as ImageNet-1K classification, object detection, and instance segmentation.

Transformer-based conditional generative adversarial network for multivariate time series generation

  • Authors: Abdellah Madane, Mohamed-djallel Dilmi, Florent Forest, Hanane Azzag, Mustapha Lebbah, Jerome Lacaille
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2210.02089
  • Pdf link: https://arxiv.org/pdf/2210.02089
  • Abstract Conditional generation of time-dependent data is a task that has much interest, whether for data augmentation, scenario simulation, completing missing data, or other purposes. Recent works proposed a Transformer-based Time series generative adversarial network (TTS-GAN) to address the limitations of recurrent neural networks. However, this model assumes a unimodal distribution and tries to generate samples around the expectation of the real data distribution. One of its limitations is that it may generate a random multivariate time series; it may fail to generate samples in the presence of multiple sub-components within an overall distribution. One could train models to fit each sub-component separately to overcome this limitation. Our work extends the TTS-GAN by conditioning its generated output on a particular encoded context allowing the use of one model to fit a mixture distribution with multiple sub-components. Technically, it is a conditional generative adversarial network that models realistic multivariate time series under different types of conditions, such as categorical variables or multivariate time series. We evaluate our model on UniMiB Dataset, which contains acceleration data following the XYZ axes of human activities collected using Smartphones. We use qualitative evaluations and quantitative metrics such as Principal Component Analysis (PCA), and we introduce a modified version of the Frechet inception distance (FID) to measure the performance of our model and the statistical similarities between the generated and the real data distributions. We show that this transformer-based CGAN can generate realistic high-dimensional and long data sequences under different kinds of conditions.

Centralized Feature Pyramid for Object Detection

  • Authors: Yu Quan, Dong Zhang, Liyan Zhang, Jinhui Tang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2210.02093
  • Pdf link: https://arxiv.org/pdf/2210.02093
  • Abstract Visual feature pyramid has shown its superiority in both effectiveness and efficiency in a wide range of applications. However, the existing methods exorbitantly concentrate on the inter-layer feature interactions but ignore the intra-layer feature regulations, which are empirically proved beneficial. Although some methods try to learn a compact intra-layer feature representation with the help of the attention mechanism or the vision transformer, they ignore the neglected corner regions that are important for dense prediction tasks. To address this problem, in this paper, we propose a Centralized Feature Pyramid (CFP) for object detection, which is based on a globally explicit centralized feature regulation. Specifically, we first propose a spatial explicit visual center scheme, where a lightweight MLP is used to capture the globally long-range dependencies and a parallel learnable visual center mechanism is used to capture the local corner regions of the input images. Based on this, we then propose a globally centralized regulation for the commonly-used feature pyramid in a top-down fashion, where the explicit visual center information obtained from the deepest intra-layer feature is used to regulate frontal shallow features. Compared to the existing feature pyramids, CFP not only has the ability to capture the global long-range dependencies, but also efficiently obtain an all-round yet discriminative feature representation. Experimental results on the challenging MS-COCO validate that our proposed CFP can achieve the consistent performance gains on the state-of-the-art YOLOv5 and YOLOX object detection baselines.

MTSMAE: Masked Autoencoders for Multivariate Time-Series Forecasting

  • Authors: Peiwang Tang, Xianchao Zhang
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2210.02199
  • Pdf link: https://arxiv.org/pdf/2210.02199
  • Abstract Large-scale self-supervised pre-training Transformer architecture have significantly boosted the performance for various tasks in natural language processing (NLP) and computer vision (CV). However, there is a lack of researches on processing multivariate time-series by pre-trained Transformer, and especially, current study on masking time-series for self-supervised learning is still a gap. Different from language and image processing, the information density of time-series increases the difficulty of research. The challenge goes further with the invalidity of the previous patch embedding and mask methods. In this paper, according to the data characteristics of multivariate time-series, a patch embedding method is proposed, and we present an self-supervised pre-training approach based on Masked Autoencoders (MAE), called MTSMAE, which can improve the performance significantly over supervised learning without pre-training. Evaluating our method on several common multivariate time-series datasets from different fields and with different characteristics, experiment results demonstrate that the performance of our method is significantly better than the best method currently available.

Spatial-Temporal-Aware Safe Multi-Agent Reinforcement Learning of Connected Autonomous Vehicles in Challenging Scenarios

  • Authors: Zhili Zhang, Songyang Han, Jiangwei Wang, Fei Miao
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
  • Arxiv link: https://arxiv.org/abs/2210.02300
  • Pdf link: https://arxiv.org/pdf/2210.02300
  • Abstract Communication technologies enable coordination among connected and autonomous vehicles (CAVs). However, it remains unclear how to utilize shared information to improve the safety and efficiency of the CAV system. In this work, we propose a framework of constrained multi-agent reinforcement learning (MARL) with a parallel safety shield for CAVs in challenging driving scenarios. The coordination mechanisms of the proposed MARL include information sharing and cooperative policy learning, with Graph Convolutional Network (GCN)-Transformer as a spatial-temporal encoder that enhances the agent's environment awareness. The safety shield module with Control Barrier Functions (CBF)-based safety checking protects the agents from taking unsafe actions. We design a constrained multi-agent advantage actor-critic (CMAA2C) algorithm to train safe and cooperative policies for CAVs. With the experiment deployed in the CARLA simulator, we verify the effectiveness of the safety checking, spatial-temporal encoder, and coordination mechanisms designed in our method by comparative experiments in several challenging scenarios with the defined hazard vehicles (HAZV). Results show that our proposed methodology significantly increases system safety and efficiency in challenging scenarios.

Temporally Consistent Video Transformer for Long-Term Video Prediction

  • Authors: Wilson Yan, Danijar Hafner, Stephen James, Pieter Abbeel
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2210.02396
  • Pdf link: https://arxiv.org/pdf/2210.02396
  • Abstract Generating long, temporally consistent video remains an open challenge in video generation. Primarily due to computational limitations, most prior methods limit themselves to training on a small subset of frames that are then extended to generate longer videos through a sliding window fashion. Although these techniques may produce sharp videos, they have difficulty retaining long-term temporal consistency due to their limited context length. In this work, we present Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics video prediction model that learns compressed representations to efficiently condition on long videos of hundreds of frames during both training and generation. We use a MaskGit prior for dynamics prediction which enables both sharper and faster generations compared to prior work. Our experiments show that TECO outperforms SOTA baselines in a variety of video prediction benchmarks ranging from simple mazes in DMLab, large 3D worlds in Minecraft, and complex real-world videos from Kinetics-600. In addition, to better understand the capabilities of video prediction models in modeling temporal consistency, we introduce several challenging video prediction tasks consisting of agents randomly traversing 3D scenes of varying difficulty. This presents a challenging benchmark for video prediction in partially observable environments where a model must understand what parts of the scenes to re-create versus invent depending on its past observations or generations. Generated videos are available at https://wilson1yan.github.io/teco

Phenaki: Variable Length Video Generation From Open Domain Textual Description

  • Authors: Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, Dumitru Erhan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2210.02399
  • Pdf link: https://arxiv.org/pdf/2210.02399
  • Abstract We present Phenaki, a model capable of realistic video synthesis, given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, limited quantities of high quality text-video data and variable length of videos. To address these issues, we introduce a new model for learning video representation which compresses the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text we are using a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to the previous video generation methods, Phenaki can generate arbitrary long videos conditioned on a sequence of prompts (i.e. time variable text or a story) in open domain. To the best of our knowledge, this is the first time a paper studies generating videos from time variable prompts. In addition, compared to the per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video but results in better spatio-temporal consistency.

Emotion Twenty Questions Dialog System for Lexical Emotional Intelligence

  • Authors: Abe Kazemzadeh, Adedamola Sanusi, Huihui (Summer)Nie
  • Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2210.02400
  • Pdf link: https://arxiv.org/pdf/2210.02400
  • Abstract This paper presents a web-based demonstration of Emotion Twenty Questions (EMO20Q), a dialog game whose purpose is to study how people describe emotions. EMO20Q can also be used to develop artificially intelligent dialog agents that can play the game. In previous work, an EMO20Q agent used a sequential Bayesian machine learning model and could play the question-asking role. Newer transformer-based neural machine learning models have made it possible to develop an agent for the question-answering role. This demo paper describes the recent developments in the question-answering role of the EMO20Q game, which requires the agent to respond to more open-ended inputs. Furthermore, we also describe the design of the system, including the web-based front-end, agent architecture and programming, and updates to earlier software used. The demo system will be available to collect pilot data during the ACII conference and this data will be used to inform future experiments and system design.

Medical Image Retrieval via Nearest Neighbor Search on Pre-trained Image Features

  • Authors: Deepak Gupta, Russell Loane, Soumya Gayen, Dina Demner-Fushman
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2210.02401
  • Pdf link: https://arxiv.org/pdf/2210.02401
  • Abstract Nearest neighbor search (NNS) aims to locate the points in high-dimensional space that is closest to the query point. The brute-force approach for finding the nearest neighbor becomes computationally infeasible when the number of points is large. The NNS has multiple applications in medicine, such as searching large medical imaging databases, disease classification, diagnosis, etc. With a focus on medical imaging, this paper proposes DenseLinkSearch an effective and efficient algorithm that searches and retrieves the relevant images from heterogeneous sources of medical images. Towards this, given a medical database, the proposed algorithm builds the index that consists of pre-computed links of each point in the database. The search algorithm utilizes the index to efficiently traverse the database in search of the nearest neighbor. We extensively tested the proposed NNS approach and compared the performance with state-of-the-art NNS approaches on benchmark datasets and our created medical image datasets. The proposed approach outperformed the existing approach in terms of retrieving accurate neighbors and retrieval speed. We also explore the role of medical image feature representation in content-based medical image retrieval tasks. We propose a Transformer-based feature representation technique that outperformed the existing pre-trained Transformer approach on CLEF 2011 medical image retrieval task. The source code of our experiments are available at https://github.com/deepaknlp/DLS.

Keyword: scene understanding

Image Masking for Robust Self-Supervised Monocular Depth Estimation

  • Authors: Hemang Chawla, Kishaan Jeeveswaran, Elahe Arani, Bahram Zonooz
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2210.02357
  • Pdf link: https://arxiv.org/pdf/2210.02357
  • Abstract Self-supervised monocular depth estimation is a salient task for 3D scene understanding. Learned jointly with monocular ego-motion estimation, several methods have been proposed to predict accurate pixel-wise depth without using labeled data. Nevertheless, these methods focus on improving performance under ideal conditions without natural or digital corruptions. A general absence of occlusions is assumed even for object-specific depth estimation. These methods are also vulnerable to adversarial attacks, which is a pertinent concern for their reliable deployment on robots and autonomous driving systems. We propose MIMDepth, a method that adapts masked image modeling (MIM) for self-supervised monocular depth estimation. While MIM has been used to learn generalizable features during pre-training, we show how it could be adapted for direct training of monocular depth estimation. Our experiments show that MIMDepth is more robust to noise, blur, weather conditions, digital artifacts, occlusions, as well as untargeted and targeted adversarial attacks.

Keyword: visual reasoning

There is no result

DongZhouGu avatar Oct 06 '22 03:10 DongZhouGu