arxiv-daily

New submissions for Wed, 17 Aug 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

SemAug: Semantically Meaningful Image Augmentations for Object Detection Through Language Grounding

  • Authors: Morgan Heisler, Amin Banitalebi-Dehkordi, Yong Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.07407
  • Pdf link: https://arxiv.org/pdf/2208.07407
  • Abstract Data augmentation is an essential technique in improving the generalization of deep neural networks. The majority of existing image-domain augmentations either rely on geometric and structural transformations, or apply different kinds of photometric distortions. In this paper, we propose an effective technique for image augmentation by injecting contextually meaningful knowledge into the scenes. Our method of semantically meaningful image augmentation for object detection via language grounding, SemAug, starts by calculating semantically appropriate new objects that can be placed into relevant locations in the image (the what and where problems). Then it embeds these objects into their relevant target locations, thereby promoting diversity of object instance distribution. Our method allows for introducing new object instances and categories that may not even exist in the training set. Furthermore, it does not require the additional overhead of training a context network, so it can be easily added to existing architectures. Our comprehensive set of evaluations showed that the proposed method is very effective in improving the generalization, while the overhead is negligible. In particular, for a wide range of model architectures, our method achieved ~2-4% and ~1-2% mAP improvements for the task of object detection on the Pascal VOC and COCO datasets, respectively.

Object Discovery via Contrastive Learning for Weakly Supervised Object Detection

  • Authors: Jinhwan Seo, Wonho Bae, Danica J. Sutherland, Junhyug Noh, Daijin Kim
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.07576
  • Pdf link: https://arxiv.org/pdf/2208.07576
  • Abstract Weakly Supervised Object Detection (WSOD) is a task that detects objects in an image using a model trained only on image-level annotations. Current state-of-the-art models benefit from self-supervised instance-level supervision, but since weak supervision does not include count or location information, the most common "argmax" labeling method often ignores many instances of objects. To alleviate this issue, we propose a novel multiple instance labeling method called object discovery. We further introduce a new contrastive loss under weak supervision where no instance-level information is available for sampling, called weakly supervised contrastive loss (WSCL). WSCL aims to construct a credible similarity threshold for object discovery by leveraging consistent features for embedding vectors in the same class. As a result, we achieve new state-of-the-art results on MS-COCO 2014 and 2017 as well as PASCAL VOC 2012, and competitive results on PASCAL VOC 2007.
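
To make the contrast between "argmax" labeling and threshold-based multiple instance labeling concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity, the fixed `sim_threshold`, and the toy scores and embeddings are all assumptions made for illustration (the abstract says the threshold is learned with WSCL, which is not modeled here).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def argmax_label(scores):
    """'argmax' labeling: keep only the single highest-scoring proposal."""
    return max(range(len(scores)), key=scores.__getitem__)


def discover_instances(scores, embeddings, sim_threshold=0.9):
    """Hypothetical multi-instance labeling: keep the top proposal plus every
    proposal whose embedding is at least sim_threshold similar to it."""
    top = argmax_label(scores)
    return [
        i
        for i, emb in enumerate(embeddings)
        if i == top or cosine(emb, embeddings[top]) >= sim_threshold
    ]


# Toy data (assumed): four proposals for one class, with 2-D embeddings.
scores = [0.2, 0.9, 0.85, 0.1]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.99], [1.0, 0.1]]

print(argmax_label(scores))                    # only proposal 1 is labeled
print(discover_instances(scores, embeddings))  # proposals 1 and 2 are labeled
```

In this toy case, argmax labeling misses proposal 2 even though its embedding is nearly identical to that of the top proposal, which is the kind of ignored instance the abstract describes.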

Keyword: transformer

SynKB: Semantic Search for Synthetic Procedures

  • Authors: Fan Bai, Alan Ritter, Peter Madrid, Dayne Freitag, John Niekrasz
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.07400
  • Pdf link: https://arxiv.org/pdf/2208.07400
  • Abstract In this paper we present SynKB, an open-source, automatically extracted knowledge base of chemical synthesis protocols. Similar to proprietary chemistry databases such as Reaxys, SynKB allows chemists to retrieve structured knowledge about synthetic procedures. By taking advantage of recent advances in natural language processing for procedural texts, SynKB supports more flexible queries about reaction conditions, and thus has the potential to help chemists search the literature for conditions used in relevant reactions as they design new synthetic routes. Using customized Transformer models to automatically extract information from 6 million synthesis procedures described in U.S. and EU patents, we show that for many queries, SynKB has higher recall than Reaxys, while maintaining high precision. We plan to make SynKB available as an open-source tool; in contrast, proprietary chemistry databases require costly subscriptions.

Invariant Inference With Provable Complexity From the Monotone Theory

  • Authors: Yotam M. Y. Feldman, Sharon Shoham
  • Subjects: Programming Languages (cs.PL)
  • Arxiv link: https://arxiv.org/abs/2208.07451
  • Pdf link: https://arxiv.org/pdf/2208.07451
  • Abstract Invariant inference algorithms such as interpolation-based inference and IC3/PDR show that it is feasible, in practice, to find inductive invariants for many interesting systems, but non-trivial upper bounds on the computational complexity of such algorithms are scarce, and limited to simple syntactic forms of invariants. In this paper we achieve invariant inference algorithms, in the domain of propositional transition systems, with provable upper bounds on the number of SAT calls. We do this by building on the monotone theory, developed by Bshouty for exact learning Boolean formulas. We prove results for two invariant inference frameworks: (i) model-based interpolation, where we show an algorithm that, under certain conditions about reachability, efficiently infers invariants when they have both short CNF and DNF representations (transcending previous results about monotone invariants); and (ii) abstract interpretation in a domain based on the monotone theory that was previously studied in relation to property-directed reachability, where we propose an efficient implementation of the best abstract transformer, leading to overall complexity bounds on the number of SAT calls. These results build on a novel procedure for computing least monotone overapproximations.

Conv-Adapter: Exploring Parameter Efficient Transfer Learning for ConvNets

  • Authors: Hao Chen, Ran Tao, Han Zhang, Yidong Wang, Wei Ye, Jindong Wang, Guosheng Hu, Marios Savvides
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.07463
  • Pdf link: https://arxiv.org/pdf/2208.07463
  • Abstract While parameter efficient tuning (PET) methods have shown great potential with transformer architectures on Natural Language Processing (NLP) tasks, their effectiveness is still under-studied with large-scale ConvNets on Computer Vision (CV) tasks. This paper proposes Conv-Adapter, a PET module designed for ConvNets. Conv-Adapter is light-weight, domain-transferable, and architecture-agnostic with generalized performance on different tasks. When transferring to downstream tasks, Conv-Adapter learns task-specific feature modulation of the intermediate representations of the backbone while keeping the pre-trained parameters frozen. By introducing only a tiny amount of learnable parameters, e.g., only 3.5% of the full fine-tuning parameters of ResNet50, Conv-Adapter outperforms previous PET baseline methods and matches or surpasses the performance of full fine-tuning on 23 classification tasks of various domains. It also presents superior performance on few-shot classification, with an average margin of 3.39%. Beyond classification, Conv-Adapter can generalize to detection and segmentation tasks with more than 50% reduction of parameters but comparable performance to traditional full fine-tuning.

Temporal Action Localization with Multi-temporal Scales

  • Authors: Zan Gao, Xinglei Cui, Tao Zhuo, Zhiyong Cheng, An-An Liu, Meng Wang, Shenyong Chen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.07493
  • Pdf link: https://arxiv.org/pdf/2208.07493
  • Abstract Temporal action localization plays an important role in video analysis, which aims to localize and classify actions in untrimmed videos. Previous methods often predict actions on a feature space of a single temporal scale. However, the temporal features of a low-level scale lack enough semantics for action classification, while a high-level scale cannot provide rich details of the action boundaries. To address this issue, we propose to predict actions on a feature space of multi-temporal scales. Specifically, we use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales. Besides, to establish the long temporal scale of the entire video, we use a spatial-temporal transformer encoder to capture the long-range dependencies of video frames. Then the refined features with long-range dependencies are fed into a classifier for the coarse action prediction. Finally, to further improve the prediction accuracy, we propose to use a frame-level self-attention module to refine the classification and boundaries of each action instance. Extensive experiments show that the proposed method can outperform state-of-the-art approaches on the THUMOS14 dataset and achieves comparable performance on the ActivityNet1.3 dataset. Compared with A2Net (TIP20, Avg{0.3:0.7}), Sub-Action (CSVT2022, Avg{0.1:0.5}), and AFSD (CVPR21, Avg{0.3:0.7}) on the THUMOS14 dataset, the proposed method can achieve improvements of 12.6%, 17.4% and 2.2%, respectively.
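
The notation Avg{0.3:0.7} above is not defined in the abstract. A common reading, assumed here, is that mAP is averaged over temporal IoU (tIoU) thresholds from 0.3 to 0.7 in steps of 0.1. The sketch below only illustrates how a single predicted segment is matched against a ground-truth segment at those thresholds. The function name and the example segments are hypothetical, and this is not the paper's evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in the same time unit."""
    (p_start, p_end), (g_start, g_end) = pred, gt
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


# Assumed threshold grid for "Avg{0.3:0.7}": 0.3, 0.4, 0.5, 0.6, 0.7.
thresholds = [i / 10 for i in range(3, 8)]

# Hypothetical segments in seconds: a prediction and a ground-truth action.
pred, gt = (2.0, 6.0), (3.0, 8.0)
tiou = temporal_iou(pred, gt)

# Thresholds at which this prediction would count as a correct localization.
matched = [t for t in thresholds if tiou >= t]
print(tiou, matched)
```

A stricter threshold demands tighter action boundaries, so averaging over the grid rewards methods that localize boundaries precisely rather than just roughly.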

Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis

  • Authors: Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao
  • Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2208.07589
  • Pdf link: https://arxiv.org/pdf/2208.07589
  • Abstract With the proliferation of user-generated online videos, Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently. Despite significant progress, there are still two major challenges on the way towards robust MSA: 1) inefficiency when modeling cross-modal interactions in unaligned multimodal data; and 2) vulnerability to randomly missing modality features, which typically occurs in realistic settings. In this paper, we propose a generic and unified framework to address them, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR). Concretely, EMT employs utterance-level representations from each modality as the global multimodal context to interact with local unimodal features and mutually promote each other. It not only avoids the quadratic scaling cost of previous local-local cross-modal interaction methods but also leads to better performance. To improve model robustness in the incomplete modality setting, on the one hand, DLFR performs low-level feature reconstruction to implicitly encourage the model to learn semantic information from incomplete data. On the other hand, it innovatively regards complete and incomplete data as two different views of one sample and utilizes siamese representation learning to explicitly attract their high-level representations. Comprehensive experiments on three popular datasets demonstrate that our method achieves superior performance in both complete and incomplete modality settings.

Mask and Reason: Pre-Training Knowledge Graph Transformers for Complex Logical Queries

  • Authors: Xiao Liu, Shiyu Zhao, Kai Su, Yukuo Cen, Jiezhong Qiu, Mengdi Zhang, Wei Wu, Yuxiao Dong, Jie Tang
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.07638
  • Pdf link: https://arxiv.org/pdf/2208.07638
  • Abstract Knowledge graph (KG) embeddings have been a mainstream approach for reasoning over incomplete KGs. However, limited by their inherently shallow and static architectures, they can hardly deal with the rising focus on complex logical queries, which comprise logical operators, imputed edges, multiple source entities, and unknown intermediate entities. In this work, we present the Knowledge Graph Transformer (kgTransformer) with masked pre-training and fine-tuning strategies. We design a KG triple transformation method to enable Transformer to handle KGs, which is further strengthened by the Mixture-of-Experts (MoE) sparse activation. We then formulate the complex logical queries as masked prediction and introduce a two-stage masked pre-training strategy to improve transferability and generalizability. Extensive experiments on two benchmarks demonstrate that kgTransformer can consistently outperform both KG embedding-based baselines and advanced encoders on nine in-domain and out-of-domain reasoning tasks. Additionally, kgTransformer can reason with explainability via providing the full reasoning paths to interpret given answers.

Towards Local Underexposed Photo Enhancement

  • Authors: Yizhan Huang, Xiaogang Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.07711
  • Pdf link: https://arxiv.org/pdf/2208.07711
  • Abstract Inspired by the ability of deep generative models to generate highly realistic images, much recent work has made progress in enhancing underexposed images globally. However, local image enhancement approaches have not been explored, although they are requisite in real-world scenarios, e.g., fixing local underexposure. In this work, we define a new task setting for underexposed image enhancement where users are able to control which region is to be enlightened with an input mask. As indicated by the mask, an image can be divided into three areas: Masked Area A, Transition Area B, and Unmasked Area C. As a result, Area A should be enlightened to the desired lighting, and there shall be a smooth transition (Area B) from the enlightened area (Area A) to the unchanged region (Area C). To accomplish this task, we propose two methods: concatenating the mask as additional channels (MConcat) and Mask-based Normalization (MNorm). While MConcat simply appends the mask channels to the input images, MNorm can dynamically enhance the spatially varying pixels, guaranteeing that the enhanced images are consistent with the requirement indicated by the input mask. Moreover, MConcat serves as a plug-and-play module and can be incorporated into existing networks that globally enhance images to achieve local enhancement. The overall network can be trained with three kinds of loss functions in Area A, Area B, and Area C, which are unified for various model structures. We perform extensive experiments on public datasets with various parametric approaches for low-light enhancement, including Convolutional-Neural-Network-based and Transformer-based models, demonstrating the effectiveness of our methods.

Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model

  • Authors: Xiulong Yang, Sheng-Min Shih, Yinlin Fu, Xiaoting Zhao, Shihao Ji
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.07791
  • Pdf link: https://arxiv.org/pdf/2208.07791
  • Abstract Denoising Diffusion Probabilistic Models (DDPM) and Vision Transformer (ViT) have demonstrated significant progress in generative tasks and discriminative tasks, respectively, and thus far these models have largely been developed in their own domains. In this paper, we establish a direct connection between DDPM and ViT by integrating the ViT architecture into DDPM, and introduce a new generative model called Generative ViT (GenViT). The modeling flexibility of ViT enables us to further extend GenViT to hybrid discriminative-generative modeling, and introduce a Hybrid ViT (HybViT). Our work is among the first to explore a single ViT for image generation and classification jointly. We conduct a series of experiments to analyze the performance of the proposed models and demonstrate their superiority over prior state-of-the-art methods in both generative and discriminative tasks. Our code and pre-trained models can be found at https://github.com/sndnyang/Diffusion_ViT.

BERT(s) to Detect Multiword Expressions

  • Authors: Damith Premasiri, Tharindu Ranasinghe
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.07832
  • Pdf link: https://arxiv.org/pdf/2208.07832
  • Abstract Multiword expressions (MWEs) are groups of words in which the meaning of the whole is not derived from the meaning of its parts. The task of processing MWEs is crucial in many natural language processing (NLP) applications, including machine translation and terminology extraction. Therefore, detecting MWEs is a popular research theme. In this paper, we explore state-of-the-art neural transformers in the task of detecting MWEs. We empirically evaluate several transformer models on the dataset for SemEval-2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM). We show that transformer models outperform the previous neural models based on long short-term memory (LSTM). The code and pre-trained model will be made freely available to the community.

Parallel Hierarchical Transformer with Attention Alignment for Abstractive Multi-Document Summarization

  • Authors: Ye Ma, Lu Zong
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.07845
  • Pdf link: https://arxiv.org/pdf/2208.07845
  • Abstract In comparison to single-document summarization, abstractive Multi-Document Summarization (MDS) brings challenges on the representation and coverage of its lengthy and linked sources. This study develops a Parallel Hierarchical Transformer (PHT) with attention alignment for MDS. By incorporating word- and paragraph-level multi-head attentions, the hierarchical architecture of PHT allows better processing of dependencies at both token and document levels. To guide the decoding towards a better coverage of the source documents, the attention-alignment mechanism is then introduced to calibrate beam search with predicted optimal attention distributions. Based on the WikiSum data, a comprehensive evaluation is conducted to test improvements on MDS by the proposed architecture. By better handling the inner- and cross-document information, results in both ROUGE and human evaluation suggest that our hierarchical model generates summaries of higher quality relative to other Transformer-based baselines at relatively low computational cost.

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

DongZhouGu, Aug 17 '22 03:08