arxiv-daily New submissions for Thu, 13 Oct 22

New submissions for Thu, 13 Oct 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Towards Discriminative and Transferable One-Stage Few-Shot Object Detectors

Authors: Karim Guirguis, Mohamed Abdelsamad, George Eskandar, Ahmed Hendawy, Matthias Kayser, Bin Yang, Juergen Beyerer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2210.05783
Pdf link: https://arxiv.org/pdf/2210.05783
Abstract Recent object detection models require large amounts of annotated data for training a new classes of objects. Few-shot object detection (FSOD) aims to address this problem by learning novel classes given only a few samples. While competitive results have been achieved using two-stage FSOD detectors, typically one-stage FSODs underperform compared to them. We make the observation that the large gap in performance between two-stage and one-stage FSODs are mainly due to their weak discriminability, which is explained by a small post-fusion receptive field and a small number of foreground samples in the loss function. To address these limitations, we propose the Few-shot RetinaNet (FSRN) that consists of: a multi-way support training strategy to augment the number of foreground samples for dense meta-detectors, an early multi-level feature fusion providing a wide receptive field that covers the whole anchor area and two augmentation techniques on query and source images to enhance transferability. Extensive experiments show that the proposed approach addresses the limitations and boosts both discriminability and transferability. FSRN is almost two times faster than two-stage FSODs while remaining competitive in accuracy, and it outperforms the state-of-the-art of one-stage meta-detectors and also some two-stage FSODs on the MS-COCO and PASCAL VOC benchmarks.

Common Corruption Robustness of Point Cloud Detectors: Benchmark and Enhancement

Authors: Shuangzhi Li, Zhijie Wang, Felix Juefei-Xu, Qing Guo, Xingyu Li, Lei Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2210.05896
Pdf link: https://arxiv.org/pdf/2210.05896
Abstract Object detection through LiDAR-based point cloud has recently been important in autonomous driving. Although achieving high accuracy on public benchmarks, the state-of-the-art detectors may still go wrong and cause a heavy loss due to the widespread corruptions in the real world like rain, snow, sensor noise, etc. Nevertheless, there is a lack of a large-scale dataset covering diverse scenes and realistic corruption types with different severities to develop practical and robust point cloud detectors, which is challenging due to the heavy collection costs. To alleviate the challenge and start the first step for robust point cloud detection, we propose the physical-aware simulation methods to generate degraded point clouds under different real-world common corruptions. Then, for the first attempt, we construct a benchmark based on the physical-aware common corruptions for point cloud detectors, which contains a total of 1,122,150 examples covering 7,481 scenes, 25 common corruption types, and 6 severities. With such a novel benchmark, we conduct extensive empirical studies on 8 state-of-the-art detectors that contain 6 different detection frameworks. Thus we get several insight observations revealing the vulnerabilities of the detectors and indicating the enhancement directions. Moreover, we further study the effectiveness of existing robustness enhancement methods based on data augmentation and data denoising. The benchmark can potentially be a new platform for evaluating point cloud detectors, opening a door for developing novel robustness enhancement methods.

PSNet: Parallel Symmetric Network for Video Salient Object Detection

Authors: Runmin Cong, Weiyu Song, Jianjun Lei, Guanghui Yue, Yao Zhao, Sam Kwong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.05912
Pdf link: https://arxiv.org/pdf/2210.05912
Abstract For the video salient object detection (VSOD) task, how to excavate the information from the appearance modality and the motion modality has always been a topic of great concern. The two-stream structure, including an RGB appearance stream and an optical flow motion stream, has been widely used as a typical pipeline for VSOD tasks, but the existing methods usually only use motion features to unidirectionally guide appearance features or adaptively but blindly fuse two modality features. However, these methods underperform in diverse scenarios due to the uncomprehensive and unspecific learning schemes. In this paper, following a more secure modeling philosophy, we deeply investigate the importance of appearance modality and motion modality in a more comprehensive way and propose a VSOD network with up and down parallel symmetry, named PSNet. Two parallel branches with different dominant modalities are set to achieve complete video saliency decoding with the cooperation of the Gather Diffusion Reinforcement (GDR) module and Cross-modality Refinement and Complement (CRC) module. Finally, we use the Importance Perception Fusion (IPF) module to fuse the features from two parallel branches according to their different importance in different scenarios. Experiments on four dataset benchmarks demonstrate that our method achieves desirable and competitive performance.

BoxMask: Revisiting Bounding Box Supervision for Video Object Detection

Authors: Khurram Azeem Hashmi, Alain Pagani, Didier Stricker, Muhammamd Zeshan Afzal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.06008
Pdf link: https://arxiv.org/pdf/2210.06008
Abstract We present a new, simple yet effective approach to uplift video object detection. We observe that prior works operate on instance-level feature aggregation that imminently neglects the refined pixel-level representation, resulting in confusion among objects sharing similar appearance or motion characteristics. To address this limitation, we propose BoxMask, which effectively learns discriminative representations by incorporating class-aware pixel-level information. We simply consider bounding box-level annotations as a coarse mask for each object to supervise our method. The proposed module can be effortlessly integrated into any region-based detector to boost detection. Extensive experiments on ImageNet VID and EPIC KITCHENS datasets demonstrate consistent and significant improvement when we plug our BoxMask module into numerous recent state-of-the-art methods.

Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning

Authors: Fuying Wang, Yuyin Zhou, Shujun Wang, Varut Vardhanabhuti, Lequan Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2210.06044
Pdf link: https://arxiv.org/pdf/2210.06044
Abstract Learning medical visual representations directly from paired radiology reports has become an emerging topic in representation learning. However, existing medical image-text joint learning methods are limited by instance or local supervision analysis, ignoring disease-level semantic correspondences. In this paper, we present a novel Multi-Granularity Cross-modal Alignment (MGCA) framework for generalized medical visual representation learning by harnessing the naturally exhibited semantic correspondences between medical image and radiology reports at three different levels, i.e., pathological region-level, instance-level, and disease-level. Specifically, we first incorporate the instance-wise alignment module by maximizing the agreement between image-report pairs. Further, for token-wise alignment, we introduce a bidirectional cross-attention strategy to explicitly learn the matching between fine-grained visual tokens and text tokens, followed by contrastive learning to align them. More important, to leverage the high-level inter-subject relationship semantic (e.g., disease) correspondences, we design a novel cross-modal disease-level alignment paradigm to enforce the cross-modal cluster assignment consistency. Extensive experimental results on seven downstream medical image datasets covering image classification, object detection, and semantic segmentation tasks demonstrate the stable and superior performance of our framework.

Latency-aware Spatial-wise Dynamic Networks

Authors: Yizeng Han, Zhihang Yuan, Yifan Pu, Chenhao Xue, Shiji Song, Guangyu Sun, Gao Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.06223
Pdf link: https://arxiv.org/pdf/2210.06223
Abstract Spatial-wise dynamic convolution has become a promising approach to improving the inference efficiency of deep networks. By allocating more computation to the most informative pixels, such an adaptive inference paradigm reduces the spatial redundancy in image features and saves a considerable amount of unnecessary computation. However, the theoretical efficiency achieved by previous methods can hardly translate into a realistic speedup, especially on the multi-core processors (e.g. GPUs). The key challenge is that the existing literature has only focused on designing algorithms with minimal computation, ignoring the fact that the practical latency can also be influenced by scheduling strategies and hardware properties. To bridge the gap between theoretical computation and practical efficiency, we propose a latency-aware spatial-wise dynamic network (LASNet), which performs coarse-grained spatially adaptive inference under the guidance of a novel latency prediction model. The latency prediction model can efficiently estimate the inference latency of dynamic networks by simultaneously considering algorithms, scheduling strategies, and hardware properties. We use the latency predictor to guide both the algorithm design and the scheduling optimization on various hardware platforms. Experiments on image classification, object detection and instance segmentation demonstrate that the proposed framework significantly improves the practical inference efficiency of deep networks. For example, the average latency of a ResNet-101 on the ImageNet validation set could be reduced by 36% and 46% on a server GPU (Nvidia Tesla-V100) and an edge device (Nvidia Jetson TX2 GPU) respectively without sacrificing the accuracy. Code is available at https://github.com/LeapLabTHU/LASNet.

MFFN: Multi-view Feature Fusion Network for Camouflaged Object Detection

Authors: Dehua Zheng, Xiaochen Zheng, Laurence T. Yang, Yuan Gao, Chenlu Zhu, Yiheng Ruan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.06361
Pdf link: https://arxiv.org/pdf/2210.06361
Abstract Recent research about camouflaged object detection (COD) aims to segment highly concealed objects hidden in complex surroundings. The tiny, fuzzy camouflaged objects result in visually indistinguishable properties. However, current single-view COD detectors are sensitive to background distractors. Therefore, blurred boundaries and variable shapes of the camouflaged objects are challenging to be fully captured with a single-view detector. To overcome these obstacles, we propose a behavior-inspired framework, called Multi-view Feature Fusion Network (MFFN), which mimics the human behaviors of finding indistinct objects in images, i.e., observing from multiple angles, distances, perspectives. Specifically, the key idea behind it is to generate multiple ways of observation (multi-view) by data augmentation and apply them as inputs. MFFN captures critical edge and semantic information by comparing and fusing extracted multi-view features. In addition, our MFFN exploits the dependence and interaction between views by the designed hierarchical view and channel integration modules. Furthermore, our methods leverage the complementary information between different views through a two-stage attention module called Co-attention of Multi-view (CAMV). And we designed a local-overall module called Channel Fusion Unit (CFU) to explore the channel-wise contextual clues of diverse feature maps in an iterative manner. The experiment results show that our method performs favorably against existing state-of-the-art methods via training with the same data. The code will be available at https: //github.com/dwardzheng/MFFN_COD.

A Unified Framework with Meta-dropout for Few-shot Learning

Authors: Shaobo Lin, Xingyu Zeng, Rui Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2210.06409
Pdf link: https://arxiv.org/pdf/2210.06409
Abstract Conventional training of deep neural networks usually requires a substantial amount of data with expensive human annotations. In this paper, we utilize the idea of meta-learning to explain two very different streams of few-shot learning, i.e., the episodic meta-learning-based and pre-train finetune-based few-shot learning, and form a unified meta-learning framework. In order to improve the generalization power of our framework, we propose a simple yet effective strategy named meta-dropout, which is applied to the transferable knowledge generalized from base categories to novel categories. The proposed strategy can effectively prevent neural units from co-adapting excessively in the meta-training stage. Extensive experiments on the few-shot object detection and few-shot image classification datasets, i.e., Pascal VOC, MS COCO, CUB, and mini-ImageNet, validate the effectiveness of our method.

Self-supervised video pretraining yields strong image representations

Authors: Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.06433
Pdf link: https://arxiv.org/pdf/2210.06433
Abstract Videos contain far more information than still images and hold the potential for learning rich representations of the visual world. Yet, pretraining on image datasets has remained the dominant paradigm for learning representations that capture spatial information, and previous attempts at video pretraining have fallen short on image understanding tasks. In this work we revisit self-supervised learning of image representations from the dynamic evolution of video frames. To that end, we propose a dataset curation procedure that addresses the domain mismatch between video and image datasets, and develop a contrastive learning framework which handles the complex transformations present in natural videos. This simple paradigm for distilling knowledge from videos to image representations, called VITO, performs surprisingly well on a variety of image-based transfer learning tasks. For the first time, our video-pretrained model closes the gap with ImageNet pretraining on semantic segmentation on PASCAL and ADE20K and object detection on COCO and LVIS, suggesting that video-pretraining could become the new default for learning image representations.

Keyword: transformer

Transformers generalize differently from information stored in context vs in weights

Authors: Stephanie C.Y. Chan, Ishita Dasgupta, Junkyung Kim, Dharshan Kumaran, Andrew K. Lampinen, Felix Hill
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.05675
Pdf link: https://arxiv.org/pdf/2210.05675
Abstract Transformer models can use two fundamentally different kinds of information: information stored in weights during training, and information provided ``in-context'' at inference time. In this work, we show that transformers exhibit different inductive biases in how they represent and generalize from the information in these two sources. In particular, we characterize whether they generalize via parsimonious rules (rule-based generalization) or via direct comparison with observed examples (exemplar-based generalization). This is of important practical consequence, as it informs whether to encode information in weights or in context, depending on how we want models to use that information. In transformers trained on controlled stimuli, we find that generalization from weights is more rule-based whereas generalization from context is largely exemplar-based. In contrast, we find that in transformers pre-trained on natural language, in-context learning is significantly rule-based, with larger models showing more rule-basedness. We hypothesise that rule-based generalization from in-context information might be an emergent consequence of large-scale training on language, which has sparse rule-like structure. Using controlled stimuli, we verify that transformers pretrained on data containing sparse rule-like structure exhibit more rule-based generalization.

Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers

Authors: William Held, Diyi Yang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.05709
Pdf link: https://arxiv.org/pdf/2210.05709
Abstract Multilingual transformer-based models demonstrate remarkable zero and few-shot transfer across languages by learning and reusing language-agnostic features. However, as a fixed-size model acquires more languages, its performance across all languages degrades, a phenomenon termed interference. Often attributed to limited model capacity, interference is commonly addressed by adding additional parameters despite evidence that transformer-based models are overparameterized. In this work, we show that it is possible to reduce interference by instead identifying and pruning language-specific parameters. First, we use Shapley Values, a credit allocation metric from coalitional game theory, to identify attention heads that introduce interference. Then, we show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction, seeing gains as large as 24.7%. Finally, we provide insights on language-agnostic and language-specific attention heads using attention visualization.

Curved Representation Space of Vision Transformers

Authors: Juyeop Kim, Junha Park, Songkuk Kim, Jong-Seok Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.05742
Pdf link: https://arxiv.org/pdf/2210.05742
Abstract Neural networks with self-attention (a.k.a. Transformers) like ViT and Swin have emerged as a better alternative to traditional convolutional neural networks (CNNs) for computer vision tasks. However, our understanding of how the new architecture works is still limited. In this paper, we focus on the phenomenon that Transformers show higher robustness against corruptions than CNNs, while not being overconfident (in fact, we find Transformers are actually underconfident). This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction by investigating how the output of the penultimate layer moves in the representation space as the input data moves within a small area. In particular, we show the following. (1) While CNNs exhibit fairly linear relationship between the input and output movements, Transformers show nonlinear relationship for some data. For those data, the output of Transformers moves in a curved trajectory as the input moves linearly. (2) When a data is located in a curved region, it is hard to move it out of the decision region since the output moves along a curved trajectory instead of a straight line to the decision boundary, resulting in high robustness of Transformers. (3) If a data is slightly modified to jump out of the curved region, the movements afterwards become linear and the output goes to the decision boundary directly. Thus, Transformers can be attacked easily after a small random jump and the perturbation in the final attacked data remains imperceptible, i.e., there does exist a decision boundary near the data. This also explains the underconfident prediction of Transformers. (4) The curved regions in the representation space start to form at an early training stage and grow throughout the training course. Some data are trapped in the regions, obstructing Transformers from reducing the training loss.

Streaming Punctuation for Long-form Dictation with Transformers

Authors: Piyush Behre, Sharman Tan, Padma Varadharajan, Shuangyu Chang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2210.05756
Pdf link: https://arxiv.org/pdf/2210.05756
Abstract While speech recognition Word Error Rate (WER) has reached human parity for English, long-form dictation scenarios still suffer from segmentation and punctuation problems resulting from irregular pausing patterns or slow speakers. Transformer sequence tagging models are effective at capturing long bi-directional context, which is crucial for automatic punctuation. A typical Automatic Speech Recognition (ASR) production system, however, is constrained by real-time requirements, making it hard to incorporate the right context when making punctuation decisions. In this paper, we propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows and measure its impact on punctuation and segmentation accuracy in a variety of scenarios. The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%. Streaming punctuation achieves an average BLEU-score gain of 0.66 for the downstream task of Machine Translation (MT).

Scaling Up Deliberation for Multilingual ASR

Authors: Ke Hu, Bo Li, Tara N. Sainath
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2210.05785
Pdf link: https://arxiv.org/pdf/2210.05785
Abstract Multilingual end-to-end automatic speech recognition models are attractive due to its simplicity in training and deployment. Recent work on large-scale training of such models has shown promising results compared to monolingual models. However, the work often focuses on multilingual models themselves in a single-pass setup. In this work, we investigate second-pass deliberation for multilingual speech recognition. Our proposed deliberation is multilingual, i.e., the text encoder encodes hypothesis text from multiple languages, and the decoder attends to multilingual text and audio. We investigate scaling the deliberation text encoder and decoder, and compare scaling the deliberation decoder and the first-pass cascaded encoder. We show that deliberation improves the average WER on 9 languages by 4% relative compared to the single-pass model. By increasing the size of the deliberation up to 1B parameters, the average WER improvement increases to 9%, with up to 14% for certain languages. Our deliberation rescorer is based on transformer layers and can be parallelized during rescoring.

Robustify Transformers with Robust Kernel Density Estimation

Authors: Xing Han, Tongzheng Ren, Tan Minh Nguyen, Khai Nguyen, Joydeep Ghosh, Nhat Ho
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.05794
Pdf link: https://arxiv.org/pdf/2210.05794
Abstract Recent advances in Transformer architecture have empowered its empirical success in various tasks across different domains. However, existing works mainly focus on improving the standard accuracy and computational cost, without considering the robustness of contaminated samples. Existing work has shown that the self-attention mechanism, which is the center of the Transformer architecture, can be viewed as a non-parametric estimator based on the well-known kernel density estimation (KDE). This motivates us to leverage the robust kernel density estimation (RKDE) in the self-attention mechanism, to alleviate the issue of the contamination of data by down-weighting the weight of bad samples in the estimation process. The modified self-attention mechanism can be incorporated into different Transformer variants. Empirical results on language modeling and image classification tasks demonstrate the effectiveness of this approach.

Continuous conditional video synthesis by neural processes

Authors: Xi Ye, Guillaume-Alexandre Bilodeau
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.05810
Pdf link: https://arxiv.org/pdf/2210.05810
Abstract We propose a unified model for multiple conditional video synthesis tasks, including video prediction and video frame interpolation. We show that conditional video synthesis can be formulated as a neural process, which maps input spatio-temporal coordinates to target pixel values given context spatio-temporal coordinates and pixels values. Specifically, we feed an implicit neural representations of coordinates into a Transformer-based non-autoregressive conditional video synthesis model. Our task-specific models outperform previous work for video interpolation on multiple datasets and reach a competitive performance with the state-of-the-art models for video prediction. Importantly, the model is able to interpolate or predict with an arbitrary high frame rate, i.e., continuous synthesis. Our source code is available at \url{https://github.com/NPVS/NPVS}.

SaiT: Sparse Vision Transformers through Adaptive Token Pruning

Authors: Ling Li, David Thorsley, Joseph Hassoun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.05832
Pdf link: https://arxiv.org/pdf/2210.05832
Abstract While vision transformers have achieved impressive results, effectively and efficiently accelerating these models can further boost performances. In this work, we propose a dense/sparse training framework to obtain a unified model, enabling weight sharing across various token densities. Thus one model offers a range of accuracy and throughput tradeoffs for different applications. Besides, we introduce adaptive token pruning to optimize the patch token sparsity based on the input image. In addition, we investigate knowledge distillation to enhance token selection capability in early transformer modules. Sparse adaptive image Transformer (SaiT) offers varying levels of model acceleration by merely changing the token sparsity on the fly. Specifically, SaiT reduces the computation complexity (FLOPs) by 39% - 43% and increases the throughput by 67% - 91% with less than 0.5% accuracy loss for various vision transformer models. Meanwhile, the same model also provides the zero accuracy drop option by skipping the sparsification step. SaiT achieves better accuracy and computation tradeoffs than state-of-the-art transformer and convolutional models.

SEAL : Interactive Tool for Systematic Error Analysis and Labeling

Authors: Nazneen Rajani, Weixin Liang, Lingjiao Chen, Meg Mitchell, James Zou
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2210.05839
Pdf link: https://arxiv.org/pdf/2210.05839
Abstract With the advent of Transformers, large language models (LLMs) have saturated well-known NLP benchmarks and leaderboards with high aggregate performance. However, many times these models systematically fail on tail data or rare groups not obvious in aggregate evaluation. Identifying such problematic data groups is even more challenging when there are no explicit labels (e.g., ethnicity, gender, etc.) and further compounded for NLP datasets due to the lack of visual features to characterize failure modes (e.g., Asian males, animals indoors, waterbirds on land, etc.). This paper introduces an interactive Systematic Error Analysis and Labeling (\seal) tool that uses a two-step approach to first identify high error slices of data and then, in the second step, introduce methods to give human-understandable semantics to those underperforming slices. We explore a variety of methods for coming up with coherent semantics for the error groups using language models for semantic labeling and a text-to-image model for generating visual features. SEAL toolkit and demo screencast is available at https://huggingface.co/spaces/nazneen/seal.

SegViT: Semantic Segmentation with Plain Vision Transformers

Authors: Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, Yifan Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.05844
Pdf link: https://arxiv.org/pdf/2210.05844
Abstract We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose the SegVit. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Differently, we make use of the fundamental component -- attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegVit using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40%$ computations while maintaining competitive performance.

SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Authors: Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, Animesh Garg
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.05861
Pdf link: https://arxiv.org/pdf/2210.05861
Abstract Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer -- a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the unsupervised SlotFormer's dynamics model can be used to improve the performance on supervised downstream tasks, such as Visual Question Answering (VQA), and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high quality visual generation. Besides, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.

Multi-Content Time-Series Popularity Prediction with Multiple-Model Transformers in MEC Networks

Authors: Zohreh HajiAkhondi-Meybodi, Arash Mohammadi, Ming Hou, Elahe Rahimian, Shahin Heidarian, Jamshid Abouei, Konstantinos N. Plataniotis
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2210.05874
Pdf link: https://arxiv.org/pdf/2210.05874
Abstract Coded/uncoded content placement in Mobile Edge Caching (MEC) has evolved as an efficient solution to meet the significant growth of global mobile data traffic by boosting the content diversity in the storage of caching nodes. To meet the dynamic nature of the historical request pattern of multimedia contents, the main focus of recent researches has been shifted to develop data-driven and real-time caching schemes. In this regard and with the assumption that users' preferences remain unchanged over a short horizon, the Top-K popular contents are identified as the output of the learning model. Most existing datadriven popularity prediction models, however, are not suitable for the coded/uncoded content placement frameworks. On the one hand, in coded/uncoded content placement, in addition to classifying contents into two groups, i.e., popular and nonpopular, the probability of content request is required to identify which content should be stored partially/completely, where this information is not provided by existing data-driven popularity prediction models. On the other hand, the assumption that users' preferences remain unchanged over a short horizon only works for content with a smooth request pattern. To tackle these challenges, we develop a Multiple-model (hybrid) Transformer-based Edge Caching (MTEC) framework with higher generalization ability, suitable for various types of content with different time-varying behavior, that can be adapted with coded/uncoded content placement frameworks. Simulation results corroborate the effectiveness of the proposed MTEC caching framework in comparison to its counterparts in terms of the cache-hit ratio, classification accuracy, and the transferred byte volume.

Dynamic Clustering Network for Unsupervised Semantic Segmentation

Authors: Kehan Li, Zhennan Wang, Zesen Cheng, Runyi Yu, Yian Zhao, Guoli Song, Li Yuan, Jie Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.05944
Pdf link: https://arxiv.org/pdf/2210.05944
Abstract Recently, the ability of self-supervised Vision Transformer (ViT) to represent pixel-level semantic relationships promotes the development of unsupervised dense prediction tasks. In this work, we investigate transferring self-supervised ViT to unsupervised semantic segmentation task. According to the analysis that the pixel-level representations of self-supervised ViT within a single image achieve good intra-class compactness and inter-class discrimination, we propose the Dynamic Clustering Network (DCN) to dynamically infer the underlying cluster centers for different images. By training with the proposed modularity loss, the DCN learns to project a set of prototypes to cluster centers for pixel representations in each image and assign pixels to different clusters, resulting on dividing each image to class-agnostic regions. For achieving unsupervised semantic segmentation task, we treat it as a region classification problem. Based on the regions produced by the DCN, we explore different ways to extract region-level representations and classify them in an unsupervised manner. We demonstrate the effectiveness of the proposed method trough experiments on unsupervised semantic segmentation, and achieve state-of-the-art performance on PASCAL VOC 2012 unsupervised semantic segmentation task.

ZITS++: Image Inpainting by Improving the Incremental Transformer on Structural Priors

Authors: Chenjie Cao, Qiaole Dong, Yanwei Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.05950
Pdf link: https://arxiv.org/pdf/2210.05950
Abstract The image inpainting task fills missing areas of a corrupted image. Despite impressive results have been achieved recently, it is still challenging to restore corrupted images with both vivid textures and reasonable structures. Some previous methods only tackle regular textures while losing holistic structures limited by receptive fields of Convolution Neural Networks (CNNs). To this end, we study learning a Zero-initialized residual addition based Incremental Transformer on Structural priors (ZITS++), an improved model over our conference ZITS model. Specifically, given one corrupt image, we present the Transformer Structure Restorer (TSR) module to restore holistic structural priors at low image resolution, which are further upsampled by Simple Structure Upsampler (SSU) module to higher image resolution. Further, to well recover image texture details, we take the Fourier CNN Texture Restoration (FTR) module, which has both the Fourier and large-kernel attention convolutions. Typically, FTR can be independently pre-trained without image structural priors. Furthermore, to enhance the FTR, the upsampled structural priors from TSR are further processed by Structure Feature Encoder (SFE), and updating the FTR by a novel incremental training strategy of Zero-initialized Residual Addition (ZeroRA). Essentially, a new masking positional encoding is proposed to encode the large irregular masks. Extensive experiments on various datasets validate the efficacy of our model compared with other competitors. We also conduct extensive ablation to compare and verify various priors for image inpainting tasks.

Towards Theoretically Inspired Neural Initialization Optimization

Authors: Yibo Yang, Hong Wang, Haobo Yuan, Zhouchen Lin
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.05956
Pdf link: https://arxiv.org/pdf/2210.05956
Abstract Automated machine learning has been widely explored to reduce human efforts in designing neural architectures and looking for proper hyperparameters. In the domain of neural initialization, however, similar automated techniques have rarely been studied. Most existing initialization methods are handcrafted and highly dependent on specific architectures. In this paper, we propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network. Specifically, GradCosine is the cosine similarity of sample-wise gradients with respect to the initialized parameters. By analyzing the sample-wise optimization landscape, we show that both the training and test performance of a network can be improved by maximizing GradCosine under gradient norm constraint. Based on this observation, we further propose the neural initialization optimization (NIO) algorithm. Generalized from the sample-wise analysis into the real batch setting, NIO is able to automatically look for a better initialization with negligible cost compared with the training time. With NIO, we improve the classification performance of a variety of neural architectures on CIFAR-10, CIFAR-100, and ImageNet. Moreover, we find that our method can even help to train large vision Transformer architecture without warmup.

Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

Authors: Zhiying Lu, Hongtao Xie, Chuanbin Liu, Yongdong Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.05958
Pdf link: https://arxiv.org/pdf/2210.05958
Abstract There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is concluded to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, that is, the spatial relevance and diverse channel representation. First, on spatial aspect, objects are locally compact and relevant, thus fine-grained feature needs to be extracted from a token and its neighbors. While the lack of data hinders ViTs to attend the spatial relevance. Second, on channel aspect, representation exhibits diversity on different channels. But the scarce data can not enable ViTs to learn strong enough representation for accurate recognition. To this end, we propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases. On spatial aspect, we adopt a hybrid structure, in which convolution is integrated into patch embedding and multi-layer perceptron module, forcing the model to capture the token features as well as their neighboring features. On channel aspect, we introduce a dynamic feature aggregation module in MLP and a brand new "head token" design in multi-head self-attention module to help re-calibrate channel representation and make different channel group representation interacts with each other. The fusion of weak channel representation forms a strong enough representation for classification. With this design, we successfully eliminate the performance gap between CNNs and ViTs, and our DHVT achieves a series of state-of-the-art performance with a lightweight model, 85.68% on CIFAR-100 with 22.8M parameters, 82.3% on ImageNet-1K with 24.0M parameters. Code is available at https://github.com/ArieSeirack/DHVT.

Human Joint Kinematics Diffusion-Refinement for Stochastic Motion Prediction

Authors: Dong Wei, Huaijiang Sun, Bin Li, Jianfeng Lu, Weiqing Li, Xiaoning Sun, Shengxiang Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.05976
Pdf link: https://arxiv.org/pdf/2210.05976
Abstract Stochastic human motion prediction aims to forecast multiple plausible future motions given a single pose sequence from the past. Most previous works focus on designing elaborate losses to improve the accuracy, while the diversity is typically characterized by randomly sampling a set of latent variables from the latent prior, which is then decoded into possible motions. This joint training of sampling and decoding, however, suffers from posterior collapse as the learned latent variables tend to be ignored by a strong decoder, leading to limited diversity. Alternatively, inspired by the diffusion process in nonequilibrium thermodynamics, we propose MotionDiff, a diffusion probabilistic model to treat the kinematics of human joints as heated particles, which will diffuse from original states to a noise distribution. This process offers a natural way to obtain the "whitened" latents without any trainable parameters, and human motion prediction can be regarded as the reverse diffusion process that converts the noise distribution into realistic future motions conditioned on the observed sequence. Specifically, MotionDiff consists of two parts: a spatial-temporal transformer-based diffusion network to generate diverse yet plausible motions, and a graph convolutional network to further refine the outputs. Experimental results on two datasets demonstrate that our model yields the competitive performance in terms of both accuracy and diversity.

Distilling Knowledge from Language Models for Video-based Action Anticipation

Authors: Sayontan Ghosh, Tanvi Aggarwal, Minh Hoai, Niranjan Balasubramanian
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.05991
Pdf link: https://arxiv.org/pdf/2210.05991
Abstract Anticipating future actions in a video is useful for many autonomous and assistive technologies. Prior action anticipation work mostly treats this as a vision modality problem, where the models learn the task information primarily from the video features in the target action anticipation datasets. In this work, we propose a method to make use of the text-modality that is available during the training, to bring in complementary information that is not present in the target action anticipation datasets. In particular, we leverage pre-trained language models to build a text-modality teacher that is able to predict future actions based on text labels of the past actions extracted from the input video. To further adapt the teacher to the target domain (cooking), we also pretrain the teacher on textual instructions from a recipes dataset (Recipe1M). Then, we distill the knowledge gained by the text-modality teacher into a vision-modality student to further improve it's performance. We empirically evaluate this simple cross-modal distillation strategy on two video datasets EGTEA-GAZE+ and EPIC-KITCHEN 55. Distilling this text-modality knowledge into a strong vision model (Anticipative Vision Transformer) yields consistent gains across both datasets, 3.5% relative improvement on top1 class mean recall for EGTEA-GAZE+, 7.2% on top5 many-shot class mean recall for EPIC-KITCHEN 55 and achieves new state-of-the-results.

BEV Lane Det: Fast Lane Detection on BEV Ground

Authors: Ruihao Wang, Jian Qin, Kaiying Li, Dong Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.06006
Pdf link: https://arxiv.org/pdf/2210.06006
Abstract Recently, 3D lane detection has been an actively developing area in autonomous driving which is the key to routing the vehicle. This work proposes a deployment-oriented monocular 3D lane detector with only naive CNN and FC layers. This detector achieved state-of-the-art results on the Apollo 3D Lane Synthetic dataset and OpenLane real-world dataset with 96 FPS runtime speed. We conduct three techniques in our detector: (1) Virtual Camera eliminates the difference in poses of cameras mounted on different vehicles. (2) Spatial Feature Pyramid Transform as a light-weighed image-view to bird-eye view transformer can utilize scales of image-view featmaps. (3) Yolo Style Lane Representation makes a good balance between bird-eye view resolution and runtime speed. Meanwhile, it can reduce the inefficiency caused by the class imbalance due to the sparsity of the lane detection task during training. Combining these three techniques, we obtained a 58.4% F1-score on the OpenLane dataset, which is a 10.6% improvement over the baseline. On the Apollo dataset, we achieved an F1-score of 96.9%, which is 4% points of supremacy over the best on the leaderboard. The source code will release soon.

JukeDrummer: Conditional Beat-aware Audio-domain Drum Accompaniment Generation via Transformer VQ-VA

Authors: Yueh-Kao Wu, Ching-Yu Chiu, Yi-Hsuan Yang
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2210.06007
Pdf link: https://arxiv.org/pdf/2210.06007
Abstract This paper proposes a model that generates a drum track in the audio domain to play along to a user-provided drum-free recording. Specifically, using paired data of drumless tracks and the corresponding human-made drum tracks, we train a Transformer model to improvise the drum part of an unseen drumless recording. We combine two approaches to encode the input audio. First, we train a vector-quantized variational autoencoder (VQ-VAE) to represent the input audio with discrete codes, which can then be readily used in a Transformer. Second, using an audio-domain beat tracking model, we compute beat-related features of the input audio and use them as embeddings in the Transformer. Instead of generating the drum track directly as waveforms, we use a separate VQ-VAE to encode the mel-spectrogram of a drum track into another set of discrete codes, and train the Transformer to predict the sequence of drum-related discrete codes. The output codes are then converted to a mel-spectrogram with a decoder, and then to the waveform with a vocoder. We report both objective and subjective evaluations of variants of the proposed model, demonstrating that the model with beat information generates drum accompaniment that is rhythmically and stylistically consistent with the input audio.

Non-Autoregressive Machine Translation with Translation Memories

Authors: Jitao Xu, Josep Crego, François Yvon
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2210.06020
Pdf link: https://arxiv.org/pdf/2210.06020
Abstract Non-autoregressive machine translation (NAT) has recently made great progress. However, most works to date have focused on standard translation tasks, even though some edit-based NAT models, such as the Levenshtein Transformer (LevT), seem well suited to translate with a Translation Memory (TM). This is the scenario considered here. We first analyze the vanilla LevT model and explain why it does not do well in this setting. We then propose a new variant, TM-LevT, and show how to effectively train this model. By modifying the data presentation and introducing an extra deletion operation, we obtain performance that are on par with an autoregressive approach, while reducing the decoding load. We also show that incorporating TMs during training dispenses to use knowledge distillation, a well-known trick used to mitigate the multimodality issue.

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Authors: Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.06031
Pdf link: https://arxiv.org/pdf/2210.06031
Abstract Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pretraining mainly focus on short-form videos (i.e., within 30 seconds) and sentences, leaving long-form video-language pre-training rarely explored. Directly learning representation from long-form videos and language may benefit many long-form video-language understanding tasks. However, it is challenging due to the difficulty of modeling long-range relationships and the heavy computational burden caused by more frames. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from an existing public dataset. To effectively capture the rich temporal dynamics and to better align video and language in an efficient end-to-end manner, we introduce two novel designs in our LF-VILA model. We first propose a Multimodal Temporal Contrastive (MTC) loss to learn the temporal relation across different modalities by encouraging fine-grained alignment between long-form videos and paragraphs. Second, we propose a Hierarchical Temporal Window Attention (HTWA) mechanism to effectively capture long-range dependency while reducing computational cost in Transformer. We fine-tune the pre-trained LF-VILA model on seven downstream long-form video-language understanding tasks of paragraph-to-video retrieval and long-form video question-answering, and achieve new state-of-the-art performances. Specifically, our model achieves 16.1% relative improvement on ActivityNet paragraph-to-video retrieval task and 2.4% on How2QA task, respectively. We release our code, dataset, and pre-trained models at https://github.com/microsoft/XPretrain.

Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers

Authors: Moritz Einfalt, Katja Ludwig, Rainer Lienhart
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.06110
Pdf link: https://arxiv.org/pdf/2210.06110
Abstract The state-of-the-art for monocular 3D human pose estimation in videos is dominated by the paradigm of 2D-to-3D pose uplifting. While the uplifting methods themselves are rather efficient, the true computational complexity depends on the per-frame 2D pose estimation. In this paper, we present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences but still produce temporally dense 3D pose estimates. We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks. This allows to decouple the sampling rate of input 2D poses and the target frame rate of the video and drastically decreases the total computational complexity. Additionally, we explore the option of pre-training on large motion capture archives, which has been largely neglected so far. We evaluate our method on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP. With an MPJPE of 45.0 mm and 46.9 mm, respectively, our proposed method can compete with the state-of-the-art while reducing inference time by a factor of 12. This enables real-time throughput with variable consumer hardware in stationary and mobile applications. We release our code and models at https://github.com/goldbricklemon/uplift-upsample-3dhpe

Improved Data Augmentation for Translation Suggestion

Authors: Hongxiao Zhang, Siyu Lai, Songming Zhang, Hui Huang, Yufeng Chen, Jinan Xu, Jian Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2210.06138
Pdf link: https://arxiv.org/pdf/2210.06138
Abstract Translation suggestion (TS) models are used to automatically provide alternative suggestions for incorrect spans in sentences generated by machine translation. This paper introduces the system used in our submission to the WMT'22 Translation Suggestion shared task. Our system is based on the ensemble of different translation architectures, including Transformer, SA-Transformer, and DynamicConv. We use three strategies to construct synthetic data from parallel corpora to compensate for the lack of supervised data. In addition, we introduce a multi-phase pre-training strategy, adding an additional pre-training phase with in-domain data. We rank second and third on the English-German and English-Chinese bidirectional tasks, respectively.

ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding

Authors: Qiming Peng, Yinxu Pan, Wenjin Wang, Bin Luo, Zhenyu Zhang, Zhengjie Huang, Teng Hu, Weichong Yin, Yongfeng Chen, Yin Zhang, Shikun Feng, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2210.06155
Pdf link: https://arxiv.org/pdf/2210.06155
Abstract Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. The code and models are publicly available at this http URL

FCT-GAN: Enhancing Table Synthesis via Fourier Transform

Authors: Zilong Zhao, Robert Birke, Lydia Y. Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.06239
Pdf link: https://arxiv.org/pdf/2210.06239
Abstract Synthetic tabular data emerges as an alternative for sharing knowledge while adhering to restrictive data access regulations, e.g., European General Data Protection Regulation (GDPR). Mainstream state-of-the-art tabular data synthesizers draw methodologies from Generative Adversarial Networks (GANs), which are composed of a generator and a discriminator. While convolution neural networks are shown to be a better architecture than fully connected networks for tabular data synthesizing, two key properties of tabular data are overlooked: (i) the global correlation across columns, and (ii) invariant synthesizing to column permutations of input data. To address the above problems, we propose a Fourier conditional tabular generative adversarial network (FCT-GAN). We introduce feature tokenization and Fourier networks to construct a transformer-style generator and discriminator, and capture both local and global dependencies across columns. The tokenizer captures local spatial features and transforms original data into tokens. Fourier networks transform tokens to frequency domains and element-wisely multiply a learnable filter. Extensive evaluation on benchmarks and real-world data shows that FCT-GAN can synthesize tabular data with high machine learning utility (up to 27.8% better than state-of-the-art baselines) and high statistical similarity to the original data (up to 26.5% better), while maintaining the global correlation across columns, especially on high dimensional dataset.

Language Models are Realistic Tabular Data Generators

Authors: Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, Gjergji Kasneci
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.06280
Pdf link: https://arxiv.org/pdf/2210.06280
Abstract Tabular data is among the oldest and most ubiquitous forms of data. However, the generation of synthetic samples with the original data's characteristics still remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as autoencoders or generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic and yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. We find that GReaT maintains state-of-the-art performance across many real-world data sets with heterogeneous feature types.

Towards Generalized and Explainable Long-Range Context Representation for Dialogue Systems

Authors: Suvodip Dey, Maunendra Sankar Desarkar, P. K. Srijith
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2210.06282
Pdf link: https://arxiv.org/pdf/2210.06282
Abstract Context representation is crucial to both dialogue understanding and generation. Recently, the most popular method for dialog context representation is to concatenate the last-$k$ previous utterances as context and use a large transformer-based model to generate the next response. However, this method may not be ideal for conversations containing long-range dependencies. In this work, we propose DialoGX, a novel encoder-decoder based framework for conversational response generation with a generalized and explainable context representation that can look beyond the last-$k$ utterances. Hence the method is adaptive to conversations with long-range dependencies. Our proposed solution is based on two key ideas: a) computing a dynamic representation of the entire context, and b) finding the previous utterances that are relevant for generating the next response. Instead of last-$k$ utterances, DialoGX uses the concatenation of the dynamic context vector and encoding of the most relevant utterances as input which enables it to represent conversations of any length in a compact and generalized fashion. We conduct our experiments on DailyDialog, a popular open-domain chit-chat dataset. DialoGX achieves comparable performance with the state-of-the-art models on the automated metrics. We also justify our context representation through the lens of psycholinguistics and show that the relevance score of previous utterances agrees well with human cognition which makes DialoGX explainable as well.

FontTransformer: Few-shot High-resolution Chinese Glyph Image Synthesis via Stacked Transformers

Authors: Yitian Liu (1), Zhouhui Lian (1) ((1) Wangxuan Institute of Computer Technology, Peking University, Beijing, China)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.06301
Pdf link: https://arxiv.org/pdf/2210.06301
Abstract Automatic generation of high-quality Chinese fonts from a few online training samples is a challenging task, especially when the amount of samples is very small. Existing few-shot font generation methods can only synthesize low-resolution glyph images that often possess incorrect topological structures or/and incomplete strokes. To address the problem, this paper proposes FontTransformer, a novel few-shot learning model, for high-resolution Chinese glyph image synthesis by using stacked Transformers. The key idea is to apply the parallel Transformer to avoid the accumulation of prediction errors and utilize the serial Transformer to enhance the quality of synthesized strokes. Meanwhile, we also design a novel encoding scheme to feed more glyph information and prior knowledge to our model, which further enables the generation of high-resolution and visually-pleasing glyph images. Both qualitative and quantitative experimental results demonstrate the superiority of our method compared to other existing approaches in the few-shot Chinese font synthesis task.

Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers

Authors: Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2210.06313
Pdf link: https://arxiv.org/pdf/2210.06313
Abstract This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels, as well as for other architectures including MLP-mixers and 2-layer MLPs. We show that sparsity also emerges using training datasets with random labels, or with random inputs, or with infinite amount of data, demonstrating that sparsity is not a result of a specific family of datasets. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate perhaps surprisingly that enforcing an even sparser activation via Top-k thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence.

AISFormer: Amodal Instance Segmentation with Transformer

Authors: Minh Tran, Khoa Vo, Kashu Yamazaki, Arthur Fernandes, Michael Kidd, Ngan Le
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.06323
Pdf link: https://arxiv.org/pdf/2210.06323
Abstract Amodal Instance Segmentation (AIS) aims to segment the region of both visible and possible occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level features coherence due to the limited receptive field. The most recent transformer-based models show impressive performance on vision tasks, even better than Convolution Neural Networks (CNN). In this work, we present AISFormer, an AIS framework, with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract ROI and learn both short-range and long-range visual features. (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings by a transformer decoder (iii) invisible mask embedding: model the coherence between the amodal and visible masks, and (iv) mask predicting: estimate output masks including occluder, visible, amodal and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks i.e. KINS, D2SA, and COCOA-cls to evaluate the effectiveness of AISFormer. The code is available at: https://github.com/UARK-AICV/AISFormer

ViewBirdiformer: Learning to recover ground-plane crowd trajectories and ego-motion from a single ego-centric view

Authors: Mai Nishimura, Shohei Nobuhara, Ko Nishino
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.06332
Pdf link: https://arxiv.org/pdf/2210.06332
Abstract We introduce a novel learning-based method for view birdification, the task of recovering ground-plane trajectories of pedestrians of a crowd and their observer in the same crowd just from the observed ego-centric video. View birdification becomes essential for mobile robot navigation and localization in dense crowds where the static background is hard to see and reliably track. It is challenging mainly for two reasons; i) absolute trajectories of pedestrians are entangled with the movement of the observer which needs to be decoupled from their observed relative movements in the ego-centric video, and ii) a crowd motion model describing the pedestrian movement interactions is specific to the scene yet unknown a priori. For this, we introduce a Transformer-based network referred to as ViewBirdiformer which implicitly models the crowd motion through self-attention and decomposes relative 2D movement observations onto the ground-plane trajectories of the crowd and the camera through cross-attention between views. Most important, ViewBirdiformer achieves view birdification in a single forward pass which opens the door to accurate real-time, always-on situational awareness. Extensive experimental results demonstrate that ViewBirdiformer achieves accuracy similar to or better than state-of-the-art with three orders of magnitude reduction in execution time.

CTL++: Evaluating Generalization on Never-Seen Compositional Patterns of Known Functions, and Compatibility of Neural Representations

Authors: Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2210.06350
Pdf link: https://arxiv.org/pdf/2210.06350
Abstract Well-designed diagnostic tasks have played a key role in studying the failure of neural nets (NNs) to generalize systematically. Famous examples include SCAN and Compositional Table Lookup (CTL). Here we introduce CTL++, a new diagnostic dataset based on compositions of unary symbolic functions. While the original CTL is used to test length generalization or productivity, CTL++ is designed to test systematicity of NNs, that is, their capability to generalize to unseen compositions of known functions. CTL++ splits functions into groups and tests performance on group elements composed in a way not seen during training. We show that recent CTL-solving Transformer variants fail on CTL++. The simplicity of the task design allows for fine-grained control of task difficulty, as well as many insightful analyses. For example, we measure how much overlap between groups is needed by tested NNs for learning to compose. We also visualize how learned symbol representations in outputs of functions from different groups are compatible in case of success but not in case of failure. These results provide insights into failure cases reported on more complex compositions in the natural language domain. Our code is public.

Automatic Discovery of Composite SPMD Partitioning Strategies in PartIR

Authors: Sami Alabed, Dominik Grewe, Juliana Franco, Bart Chrzaszcz, Tom Natan, Tamara Norman, Norman A. Rink, Dimitrios Vytiniotis, Michael Schaarschmidt
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.06352
Pdf link: https://arxiv.org/pdf/2210.06352
Abstract Large neural network models are commonly trained through a combination of advanced parallelism strategies in a single program, multiple data (SPMD) paradigm. For example, training large transformer models requires combining data, model, and pipeline partitioning; and optimizer sharding techniques. However, identifying efficient combinations for many model architectures and accelerator systems requires significant manual analysis. In this work, we present an automatic partitioner that identifies these combinations through a goal-oriented search. Our key findings are that a Monte Carlo Tree Search-based partitioner leveraging partition-specific compiler analysis directly into the search and guided goals matches expert-level strategies for various models.

Transformer-based Text Classification on Unified Bangla Multi-class Emotion Corpus

Authors: Md Sakib Ullah Sourav, Huidong Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2210.06405
Pdf link: https://arxiv.org/pdf/2210.06405
Abstract Because of its importance in studying people's thoughts on various Web 2.0 services, emotion classification (EC) is an important undertaking. Existing research, on the other hand, is mostly focused on the English language, with little work on low-resource languages. Though sentiment analysis, particularly the EC in English, has received a lot of attention in recent years, little study has been done in the context of Bangla, one of the world's most widely spoken languages. We propose a complete set of approaches for identifying and extracting emotions from Bangla texts in this research. We provide a Bangla emotion classifier for six classes (anger, disgust, fear, joy, sadness, and surprise) from Bangla words, using transformer-based models which exhibit phenomenal results in recent days, especially for high resource languages. The "Unified Bangla Multi-class Emotion Corpus (UBMEC)" is used to assess the performance of our models. UBMEC was created by combining two previously released manually labeled datasets of Bangla comments on 6-emotion classes with fresh manually tagged Bangla comments created by us. The corpus dataset and code we used in this work is publicly available.

Foundation Transformers

Authors: Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim, Barun Patra, Zhun Liu, Vishrav Chaudhary, Xia Song, Furu Wei
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.06423
Pdf link: https://arxiv.org/pdf/2210.06423
Abstract A big convergence of model architectures across language, vision, speech, and multimodal is emerging. However, under the same name "Transformers", the above areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal. Specifically, we propose Sub-LayerNorm for good expressivity, and the initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT, and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).

Token-Label Alignment for Vision Transformers

Authors: Han Xiao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.06455
Pdf link: https://arxiv.org/pdf/2210.06455
Abstract Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs). They mix two images as inputs for training and assign them with a mixed label with the same ratio. While they are shown effective for vision transformers (ViTs), we identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies. We empirically observe that the contributions of input tokens fluctuate as forward propagating, which might induce a different mixing ratio in the output tokens. The training target computed by the original data mixing strategy can thus be inaccurate, resulting in less effective training. To address this, we propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token. We reuse the computed attention at each layer for efficient token-label alignment, introducing only negligible additional training costs. Extensive experiments demonstrate that our method improves the performance of ViTs on image classification, semantic segmentation, objective detection, and transfer learning tasks. Code is available at: https://github.com/Euphoria16/TL-Align.

Predictive Querying for Autoregressive Neural Sequence Models

Authors: Alex Boyd, Sam Showalter, Stephan Mandt, Padhraic Smyth
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2210.06464
Pdf link: https://arxiv.org/pdf/2210.06464
Abstract In reasoning about sequential events it is natural to pose probabilistic queries such as "when will event A occur next" or "what is the probability of A occurring before B", with applications in areas such as user modeling, medicine, and finance. However, with machine learning shifting towards neural autoregressive models such as RNNs and transformers, probabilistic querying has been largely restricted to simple cases such as next-event prediction. This is in part due to the fact that future querying involves marginalization over large path spaces, which is not straightforward to do efficiently in such models. In this paper we introduce a general typology for predictive queries in neural autoregressive sequence models and show that such queries can be systematically represented by sets of elementary building blocks. We leverage this typology to develop new query estimation methods based on beam search, importance sampling, and hybrids. Across four large-scale sequence datasets from different application domains, as well as for the GPT-2 language model, we demonstrate the ability to make query answering tractable for arbitrary queries in exponentially-large predictive path-spaces, and find clear differences in cost-accuracy tradeoffs between search and sampling methods.

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

Oct 13 '22 04:10 DongZhouGu

arxiv-daily arxiv-daily copied to clipboard

New submissions for Thu, 13 Oct 22

Keyword: human object interaction

Keyword: visual relation detection

Keyword: object detection

Towards Discriminative and Transferable One-Stage Few-Shot Object Detectors

Common Corruption Robustness of Point Cloud Detectors: Benchmark and Enhancement

PSNet: Parallel Symmetric Network for Video Salient Object Detection

BoxMask: Revisiting Bounding Box Supervision for Video Object Detection

Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning

Latency-aware Spatial-wise Dynamic Networks

MFFN: Multi-view Feature Fusion Network for Camouflaged Object Detection

A Unified Framework with Meta-dropout for Few-shot Learning

Self-supervised video pretraining yields strong image representations

Keyword: transformer

Transformers generalize differently from information stored in context vs in weights

Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers

Curved Representation Space of Vision Transformers

Streaming Punctuation for Long-form Dictation with Transformers

Scaling Up Deliberation for Multilingual ASR

Robustify Transformers with Robust Kernel Density Estimation

Continuous conditional video synthesis by neural processes

SaiT: Sparse Vision Transformers through Adaptive Token Pruning

SEAL : Interactive Tool for Systematic Error Analysis and Labeling

SegViT: Semantic Segmentation with Plain Vision Transformers

SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Multi-Content Time-Series Popularity Prediction with Multiple-Model Transformers in MEC Networks

Dynamic Clustering Network for Unsupervised Semantic Segmentation

ZITS++: Image Inpainting by Improving the Incremental Transformer on Structural Priors

Towards Theoretically Inspired Neural Initialization Optimization

Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

Human Joint Kinematics Diffusion-Refinement for Stochastic Motion Prediction

Distilling Knowledge from Language Models for Video-based Action Anticipation

BEV Lane Det: Fast Lane Detection on BEV Ground

JukeDrummer: Conditional Beat-aware Audio-domain Drum Accompaniment Generation via Transformer VQ-VA

Non-Autoregressive Machine Translation with Translation Memories

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers

Improved Data Augmentation for Translation Suggestion

ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding

FCT-GAN: Enhancing Table Synthesis via Fourier Transform

Language Models are Realistic Tabular Data Generators

Towards Generalized and Explainable Long-Range Context Representation for Dialogue Systems

FontTransformer: Few-shot High-resolution Chinese Glyph Image Synthesis via Stacked Transformers

Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers

AISFormer: Amodal Instance Segmentation with Transformer

ViewBirdiformer: Learning to recover ground-plane crowd trajectories and ego-motion from a single ego-centric view

CTL++: Evaluating Generalization on Never-Seen Compositional Patterns of Known Functions, and Compatibility of Neural Representations

Automatic Discovery of Composite SPMD Partitioning Strategies in PartIR

Transformer-based Text Classification on Unified Bangla Multi-class Emotion Corpus

Foundation Transformers

Token-Label Alignment for Vision Transformers

Predictive Querying for Autoregressive Neural Sequence Models

Keyword: scene understanding

Keyword: visual reasoning

arxiv-daily
arxiv-daily copied to clipboard