
New submissions for Mon, 5 Sep 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Kraken: A Direct Event/Frame-Based Multi-sensor Fusion SoC for Ultra-Efficient Visual Processing in Nano-UAVs

  • Authors: Alfio Di Mauro, Moritz Scherer, Davide Rossi, Luca Benini
  • Subjects: Hardware Architecture (cs.AR); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2209.01065
  • Pdf link: https://arxiv.org/pdf/2209.01065
  • Abstract Small-size unmanned aerial vehicles (UAV) have the potential to dramatically increase safety and reduce cost in applications like critical infrastructure maintenance and post-disaster search and rescue. Many scenarios require UAVs to shrink toward nano and pico-size form factors. The key open challenge to achieve true autonomy on Nano-UAVs is to run complex visual tasks like object detection, tracking, navigation and obstacle avoidance fully on board, at high speed and robustness, under tight payload and power constraints. With the Kraken SoC, fabricated in 22nm FDX technology, we demonstrate a multi-visual-sensor capability exploiting both event-based and BW/RGB imagers, combining their output for multi-functional visual tasks previously impossible on a single low-power chip for Nano-UAVs. Kraken is an ultra-low-power, heterogeneous SoC architecture integrating three acceleration engines and a vast set of peripherals to enable efficient interfacing with standard frame-based sensors and novel event-based DVS. Kraken enables highly sparse event-driven sub-uJ/inf SNN inference on a dedicated neuromorphic energy-proportional accelerator. Moreover, it can perform frame-based inference by combining a 1.8 TOp/s/W 8-core RISC-V processor cluster with mixed-precision DNN extensions with a 1036 TOp/s/W TNN accelerator.

Keyword: transformer

Temporal Conditional VAE for Distributional Drift Adaptation in Multivariate Time Series

  • Authors: Hui He, Qi Zhang, Kun Yi, Kaize Shi, Simeng Bai, Zhendong Niu, Longbin Cao
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2209.00654
  • Pdf link: https://arxiv.org/pdf/2209.00654
  • Abstract Due to the nonstationary nature, the distribution of real-world multivariate time series (MTS) changes over time, which is known as distribution drift. Most existing MTS forecasting models greatly suffer from the distribution drift and degrade the forecasting performance over time. Existing methods address distribution drift via adapting to the latest arrived data or self-correcting per the meta knowledge derived from future data. Despite their great success in MTS forecasting, these methods hardly capture the intrinsic distribution changes especially from a distributional perspective. Accordingly, we propose a novel framework temporal conditional variational autoencoder (TCVAE) to model the dynamic distributional dependencies over time between historical observations and future data in MTS and infer the dependencies as a temporal conditional distribution to leverage latent variables. Specifically, a novel temporal Hawkes attention mechanism represents temporal factors subsequently fed into feed-forward networks to estimate the prior Gaussian distribution of latent variables. The representation of temporal factors further dynamically adjusts the structures of Transformer-based encoder and decoder to distribution changes by leveraging a gated attention mechanism. Moreover, we introduce conditional continuous normalization flow to transform the prior Gaussian to a complex and form-free distribution to facilitate flexible inference of the temporal conditional distribution. Extensive experiments conducted on six real-world MTS datasets demonstrate the TCVAE's superior robustness and effectiveness over the state-of-the-art MTS forecasting baselines. We further illustrate the TCVAE applicability through multifaceted case studies and visualization in real-world scenarios.

Detection of False Data Injection Attacks in Smart Grid: A Secure Federated Deep Learning Approach

  • Authors: Yang Li, Xinhao Wei, Yuanzheng Li, Zhaoyang Dong, Mohammad Shahidehpour
  • Subjects: Cryptography and Security (cs.CR); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2209.00778
  • Pdf link: https://arxiv.org/pdf/2209.00778
  • Abstract As an important cyber-physical system (CPS), smart grid is highly vulnerable to cyber attacks. Amongst various types of attacks, false data injection attack (FDIA) proves to be one of the top-priority cyber-related issues and has received increasing attention in recent years. However, so far little attention has been paid to privacy preservation issues in the detection of FDIAs in smart grid. Inspired by federated learning, an FDIA detection method based on secure federated deep learning is proposed in this paper by combining Transformer, federated learning and Paillier cryptosystem. The Transformer, as a detector deployed in edge nodes, delves deep into the connection between individual electrical quantities by using its multi-head self-attention mechanism. By using federated learning framework, our approach utilizes the data from all nodes to collaboratively train a detection model while preserving data privacy by keeping the data locally during training. To improve the security of federated learning, a secure federated learning scheme is designed by combining Paillier cryptosystem with federated learning. Through extensive experiments on the IEEE 14-bus and 118-bus test systems, the effectiveness and superiority of the proposed method are verified.
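
The abstract above combines federated learning with the Paillier cryptosystem. As a rough illustration of why Paillier suits that role, the sketch below shows its additive homomorphism: multiplying ciphertexts yields an encryption of the sum of the plaintexts, so a server could aggregate client updates without seeing any single one. This is a toy under loud assumptions, not the paper's implementation: the function names (`keygen`, `encrypt`, `decrypt`, `add_encrypted`) are hypothetical, the primes are far too small to be secure, and the updates are assumed to be already quantized to small non-negative integers.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Build a toy Paillier key pair from two small primes (insecure, demo only)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)          # Carmichael-style exponent
    g = n + 1                             # standard simple choice of generator
    # mu is the modular inverse of L(g^lam mod n^2), where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pub, m):
    """Encrypt an integer 0 <= m < n with fresh randomness r."""
    n, g = pub
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(pub, priv, c):
    """Recover the plaintext integer from ciphertext c."""
    n, _ = pub
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def add_encrypted(pub, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    n, _ = pub
    return (c1 * c2) % (n * n)


# Hypothetical scenario: three edge nodes each hold one quantized update.
pub, priv = keygen()
client_updates = [3, 5, 7]
ciphertexts = [encrypt(pub, u) for u in client_updates]

# The aggregator only ever multiplies ciphertexts.
aggregate = ciphertexts[0]
for c in ciphertexts[1:]:
    aggregate = add_encrypted(pub, aggregate, c)

total = decrypt(pub, priv, aggregate)
print(total)  # 15, the sum of the three updates
```

In a real deployment the key holder and the aggregator would be separate parties, and the modulus would be thousands of bits long; both points are outside what the abstract specifies.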

Geometry Aligned Variational Transformer for Image-conditioned Layout Generation

  • Authors: Yunning Cao, Ye Ma, Min Zhou, Chuanbin Liu, Hongtao Xie, Tiezheng Ge, Yuning Jiang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2209.00852
  • Pdf link: https://arxiv.org/pdf/2209.00852
  • Abstract Layout generation is a novel task in computer vision, which combines the challenges in both object localization and aesthetic appraisal, widely used in advertisements, posters, and slides design. An accurate and pleasant layout should consider both the intra-domain relationship within layout elements and the inter-domain relationship between layout elements and the image. However, most previous methods simply focus on image-content-agnostic layout generation, without leveraging the complex visual information from the image. To this end, we explore a novel paradigm entitled image-conditioned layout generation, which aims to add text overlays to an image in a semantically coherent manner. Specifically, we propose an Image-Conditioned Variational Transformer (ICVT) that autoregressively generates various layouts in an image. First, self-attention mechanism is adopted to model the contextual relationship within layout elements, while cross-attention mechanism is used to fuse the visual information of conditional images. Subsequently, we take them as building blocks of conditional variational autoencoder (CVAE), which demonstrates appealing diversity. Second, in order to alleviate the gap between layout elements domain and visual domain, we design a Geometry Alignment module, in which the geometric information of the image is aligned with the layout representation. In addition, we construct a large-scale advertisement poster layout designing dataset with delicate layout and saliency map annotations. Experimental results show that our model can adaptively generate layouts in the non-intrusive area of the image, resulting in a harmonious layout design.

Vision-Language Adaptive Mutual Decoder for OOV-STR

  • Authors: Jinshui Hu, Chenyu Liu, Qiandong Yan, Xuyang Zhu, Fengli Yu, Jiajia Wu, Bing Yin
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2209.00859
  • Pdf link: https://arxiv.org/pdf/2209.00859
  • Abstract Recent works have shown huge success of deep learning models for common in-vocabulary (IV) scene text recognition. However, in real-world scenarios, out-of-vocabulary (OOV) words are of great importance and SOTA recognition models usually perform poorly on OOV settings. Inspired by the intuition that learned language priors have limited OOV performance, we design a framework named Vision Language Adaptive Mutual Decoder (VLAMD) to partly tackle OOV problems. VLAMD consists of three main components. Firstly, we build an attention-based LSTM decoder with two adaptively merged visual-only modules, yielding a vision-language balanced main branch. Secondly, we add an auxiliary query-based autoregressive transformer decoding head for common visual and language prior representation learning. Finally, we couple these two designs with bidirectional training for more diverse language modeling, and do mutual sequential decoding to get more robust results. Our approach achieved 70.31% and 59.61% word accuracy on IV+OOV and OOV settings respectively on the Cropped Word Recognition Task of the OOV-ST Challenge at ECCV 2022 TiE Workshop, where we got 1st place on both settings.

Real-time 3D Single Object Tracking with Transformer

  • Authors: Jiayao Shan, Sifan Zhou, Yubo Cui, Zheng Fang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2209.00860
  • Pdf link: https://arxiv.org/pdf/2209.00860
  • Abstract LiDAR-based 3D single object tracking is a challenging issue in robotics and autonomous driving. Currently, existing approaches usually suffer from the problem that objects at long distance often have very sparse or partially-occluded point clouds, which makes the features extracted by the model ambiguous. Ambiguous features will make it hard to locate the target object and finally lead to bad tracking results. To solve this problem, we utilize the powerful Transformer architecture and propose a Point-Track-Transformer (PTT) module for point cloud-based 3D single object tracking task. Specifically, PTT module generates fine-tuned attention features by computing attention weights, which guides the tracker focusing on the important features of the target and improves the tracking ability in complex scenarios. To evaluate our PTT module, we embed PTT into the dominant method and construct a novel 3D SOT tracker named PTT-Net. In PTT-Net, we embed PTT into the voting stage and proposal generation stage, respectively. PTT module in the voting stage could model the interactions among point patches, which learns context-dependent features. Meanwhile, PTT module in the proposal generation stage could capture the contextual information between object and background. We evaluate our PTT-Net on KITTI and NuScenes datasets. Experimental results demonstrate the effectiveness of PTT module and the superiority of PTT-Net, which surpasses the baseline by a noticeable margin, ~10% in the Car category. Meanwhile, our method also has a significant performance improvement in sparse scenarios. In general, the combination of transformer and tracking pipeline enables our PTT-Net to achieve state-of-the-art performance on both two datasets. Additionally, PTT-Net could run in real-time at 40FPS on NVIDIA 1080Ti GPU. Our code is open-sourced for the research community at https://github.com/shanjiayao/PTT.

SATformer: Transformers for SAT Solving

  • Authors: Zhengyuan Shi, Min Li, Sadaf Khan, Hui-Ling Zhen, Mingxuan Yuan, Qiang Xu
  • Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
  • Arxiv link: https://arxiv.org/abs/2209.00953
  • Pdf link: https://arxiv.org/pdf/2209.00953
  • Abstract In this paper, we propose SATformer, a novel Transformer-based solution for Boolean satisfiability (SAT) solving. Different from existing learning-based SAT solvers that learn at the problem instance level, SATformer learns the minimum unsatisfiable cores (MUC) of unsatisfiable problem instances, which provide rich information for the causality of such problems. Specifically, we apply a graph neural network (GNN) to obtain the embeddings of the clauses in the conjunctive normal format (CNF). A hierarchical Transformer architecture is applied on the clause embeddings to capture the relationships among clauses, and the self-attention weight is learned to be high when those clauses forming UNSAT cores are attended together, and set to be low otherwise. By doing so, SATformer effectively learns the correlations among clauses for SAT prediction. Experimental results show that SATformer is more powerful than existing end-to-end learning-based SAT solvers.
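
The abstract above centers on minimum unsatisfiable cores (MUC) of CNF instances. For readers unfamiliar with the term, the sketch below finds a smallest unsatisfiable subset of clauses by brute force on a tiny formula. It illustrates the concept only and is not SATformer's learning-based method; the function names and the example formula are hypothetical, literals follow the common signed-integer convention (`2` means x2, `-2` means NOT x2), and the exhaustive search is exponential, so it is only viable for toy inputs.

```python
from itertools import combinations, product


def satisfiable(clauses, num_vars):
    """Brute-force SAT check: try every assignment of num_vars Booleans."""
    for bits in product([False, True], repeat=num_vars):
        # A clause holds if at least one of its literals is true.
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False


def minimum_unsat_core(clauses, num_vars):
    """Return indices of a smallest unsatisfiable clause subset, or None if SAT."""
    for size in range(1, len(clauses) + 1):
        for subset in combinations(range(len(clauses)), size):
            if not satisfiable([clauses[i] for i in subset], num_vars):
                return list(subset)
    return None


# Hypothetical formula over x1..x3:
#   (x1) AND (NOT x1 OR x2) AND (NOT x2) AND (x3 OR x1)
cnf = [[1], [-1, 2], [-2], [3, 1]]
core = minimum_unsat_core(cnf, num_vars=3)
print(core)  # [0, 1, 2]: the first three clauses already contradict each other
```

The fourth clause is irrelevant to the conflict, which is the kind of structure the abstract says the attention weights are trained to reflect.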

INTERACTION: A Generative XAI Framework for Natural Language Inference Explanations

  • Authors: Jialin Yu, Alexandra I. Cristea, Anoushka Harit, Zhongtian Sun, Olanrewaju Tahir Aduragba, Lei Shi, Noura Al Moubayed
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2209.01061
  • Pdf link: https://arxiv.org/pdf/2209.01061
  • Abstract XAI with natural language processing aims to produce human-readable explanations as evidence for AI decision-making, which addresses explainability and transparency. However, from an HCI perspective, the current approaches only focus on delivering a single explanation, which fails to account for the diversity of human thoughts and experiences in language. This paper thus addresses this gap, by proposing a generative XAI framework, INTERACTION (explaIn aNd predicT thEn queRy with contextuAl CondiTional varIational autO-eNcoder). Our novel framework presents explanation in two steps: (step one) Explanation and Label Prediction; and (step two) Diverse Evidence Generation. We conduct intensive experiments with the Transformer architecture on a benchmark dataset, e-SNLI. Our method achieves competitive or better performance against state-of-the-art baseline models on explanation generation (up to 4.7% gain in BLEU) and prediction (up to 4.4% gain in accuracy) in step one; it can also generate multiple diverse explanations in step two.

Type-Directed Synthesis of Visualizations from Natural Language Queries

  • Authors: Qiaochu Chen, Shankara Pailoor, Celeste Barnaby, Abby Criswell, Chenglong Wang, Greg Durrett, Isil Dillig
  • Subjects: Programming Languages (cs.PL)
  • Arxiv link: https://arxiv.org/abs/2209.01081
  • Pdf link: https://arxiv.org/pdf/2209.01081
  • Abstract We propose a new technique based on program synthesis for automatically generating visualizations from natural language queries. Our method parses the natural language query into a refinement type specification using the intents-and-slots paradigm and leverages type-directed synthesis to generate a set of visualization programs that are most likely to meet the user's intent. Our refinement type system captures useful hints present in the natural language query and allows the synthesis algorithm to reject visualizations that violate well-established design guidelines for the input data set. We have implemented our ideas in a tool called Graphy and evaluated it on NLVCorpus, which consists of 3 popular datasets and over 700 real-world natural language queries. Our experiments show that Graphy significantly outperforms state-of-the-art natural-language-based visualization tools, including transformer and rule-based ones.

Back-to-Bones: Rediscovering the Role of Backbones in Domain Generalization

  • Authors: Simone Angarano, Mauro Martini, Francesco Salvetti, Vittorio Mazzia, Marcello Chiaberge
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2209.01121
  • Pdf link: https://arxiv.org/pdf/2209.01121
  • Abstract Domain Generalization (DG) studies the capability of a deep learning model to generalize to out-of-training distributions. In the last decade, literature has been massively filled with a collection of training methodologies that claim to obtain more abstract and robust data representations to tackle domain shifts. Recent research has provided a reproducible benchmark for DG, pointing out the effectiveness of naive empirical risk minimization (ERM) over existing algorithms. Nevertheless, researchers persist in using the same outdated feature extractors, and no attention has been given to the effects of different backbones yet. In this paper, we start back to backbones proposing a comprehensive analysis of their intrinsic generalization capabilities, so far ignored by the research community. We evaluate a wide variety of feature extractors, from standard residual solutions to transformer-based architectures, finding an evident linear correlation between large-scale single-domain classification accuracy and DG capability. Our extensive experimentation shows that by adopting competitive backbones in conjunction with effective data augmentation, plain ERM outperforms recent DG solutions and achieves state-of-the-art accuracy. Moreover, our additional qualitative studies reveal that novel backbones give more similar representations to same-class samples, separating different domains in the feature space. This boost in generalization capabilities leaves marginal room for DG algorithms and suggests a new paradigm for investigating the problem, placing backbones in the spotlight and encouraging the development of consistent algorithms on top of them.

ARST: Auto-Regressive Surgical Transformer for Phase Recognition from Laparoscopic Videos

  • Authors: Xiaoyang Zou, Wenyong Liu, Junchen Wang, Rong Tao, Guoyan Zheng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2209.01148
  • Pdf link: https://arxiv.org/pdf/2209.01148
  • Abstract Phase recognition plays an essential role for surgical workflow analysis in computer assisted intervention. Transformer, originally proposed for sequential data modeling in natural language processing, has been successfully applied to surgical phase recognition. Existing works based on transformer mainly focus on modeling attention dependency, without introducing auto-regression. In this work, an Auto-Regressive Surgical Transformer, referred to as ARST, is first proposed for on-line surgical phase recognition from laparoscopic videos, modeling the inter-phase correlation implicitly by conditional probability distribution. To reduce inference bias and to enhance phase consistency, we further develop a consistency constraint inference strategy based on auto-regression. We conduct comprehensive validations on a well-known public dataset, Cholec80. Experimental results show that our method outperforms the state-of-the-art methods both quantitatively and qualitatively, and achieves an inference rate of 66 frames per second (fps).

Extend and Explain: Interpreting Very Long Language Models

  • Authors: Joel Stremmel, Brian L. Hill, Jeffrey Hertzberg, Jaime Murillo, Llewelyn Allotey, Eran Halperin
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2209.01174
  • Pdf link: https://arxiv.org/pdf/2209.01174
  • Abstract While Transformer language models (LMs) are state-of-the-art for information extraction, long text introduces computational challenges requiring suboptimal preprocessing steps or alternative model architectures. Sparse-attention LMs can represent longer sequences, overcoming performance hurdles. However, it remains unclear how to explain predictions from these models, as not all tokens attend to each other in the self-attention layers, and long sequences pose computational challenges for explainability algorithms when runtime depends on document length. These challenges are severe in the medical context where documents can be very long, and machine learning (ML) models must be auditable and trustworthy. We introduce a novel Masked Sampling Procedure (MSP) to identify the text blocks that contribute to a prediction, apply MSP in the context of predicting diagnoses from medical text, and validate our approach with a blind review by two clinicians. Our method identifies about 1.7x more clinically informative text blocks than the previous state-of-the-art, runs up to 100x faster, and is tractable for generating important phrase pairs. MSP is particularly well-suited to long LMs but can be applied to any text classifier. We provide a general implementation of MSP.

Transformers in Remote Sensing: A Survey

  • Authors: Abdulaziz Amer Aleissaee, Amandeep Kumar, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal, Gui-Song Xia, Fahad Shahbaz Khan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2209.01206
  • Pdf link: https://arxiv.org/pdf/2209.01206
  • Abstract Deep learning-based algorithms have seen a massive popularity in different areas of remote sensing image analysis over the past decade. Recently, transformers-based architectures, originally introduced in natural language processing, have pervaded computer vision field where the self-attention mechanism has been utilized as a replacement to the popular convolution operator for capturing long-range dependencies. Inspired by recent advances in computer vision, remote sensing community has also witnessed an increased exploration of vision transformers for a diverse set of tasks. Although a number of surveys have focused on transformers in computer vision in general, to the best of our knowledge we are the first to present a systematic review of recent advances based on transformers in remote sensing. Our survey covers more than 60 recent transformers-based methods for different remote sensing problems in sub-areas of remote sensing: very high-resolution (VHR), hyperspectral (HSI) and synthetic aperture radar (SAR) imagery. We conclude the survey by discussing different challenges and open issues of transformers in remote sensing. Additionally, we intend to frequently update and maintain the latest transformers in remote sensing papers with their respective code at: https://github.com/VIROBO-15/Transformer-in-Remote-Sensing

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

DongZhouGu, Sep 05 '22 04:09