arxiv-daily icon indicating copy to clipboard operation
arxiv-daily copied to clipboard

New submissions for Mon, 28 Nov 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

How do Cross-View and Cross-Modal Alignment Affect Representations in Contrastive Learning?

  • Authors: Thomas M. Hehn, Julian F.P. Kooij, Dariu M. Gavrila
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.13309
  • Pdf link: https://arxiv.org/pdf/2211.13309
  • Abstract Various state-of-the-art self-supervised visual representation learning approaches take advantage of data from multiple sensors by aligning the feature representations across views and/or modalities. In this work, we investigate how aligning representations affects the visual features obtained from cross-view and cross-modal contrastive learning on images and point clouds. On five real-world datasets and on five tasks, we train and evaluate 108 models based on four pretraining variations. We find that cross-modal representation alignment discards complementary visual information, such as color and texture, and instead emphasizes redundant depth cues. The depth cues obtained from pretraining improve downstream depth prediction performance. Also overall, cross-modal alignment leads to more robust encoders than pre-training by cross-view alignment, especially on depth prediction, instance segmentation, and object detection.

Automated Driving Systems Data Acquisition and Processing Platform

  • Authors: Xin Xia, Zonglin Meng, Xu Han, Hanzhao Li, Takahiro Tsukiji, Runsheng Xu, Zhaoliang Zhang, Jiaqi Ma
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2211.13425
  • Pdf link: https://arxiv.org/pdf/2211.13425
  • Abstract This paper presents an automated driving system (ADS) data acquisition and processing platform for vehicle trajectory extraction, reconstruction, and evaluation based on connected automated vehicle (CAV) cooperative perception. This platform presents a holistic pipeline from the raw advanced sensory data collection to data processing, which can process the sensor data from multiple CAVs and extract the objects' Identity (ID) number, position, speed, and orientation information in the map and Frenet coordinates. First, the ADS data acquisition and analytics platform are presented. Specifically, the experimental CAVs platform and sensor configuration are shown, and the processing software, including a deep-learning-based object detection algorithm using LiDAR information, a late fusion scheme to leverage cooperative perception to fuse the detected objects from multiple CAVs, and a multi-object tracking method is introduced. To further enhance the object detection and tracking results, high definition maps consisting of point cloud and vector maps are generated and forwarded to a world model to filter out the objects off the road and extract the objects' coordinates in Frenet coordinates and the lane information. In addition, a post-processing method is proposed to refine trajectories from the object tracking algorithms. Aiming to tackle the ID switch issue of the object tracking algorithm, a fuzzy-logic-based approach is proposed to detect the discontinuous trajectories of the same object. Finally, results, including object detection and tracking and a late fusion scheme, are presented, and the post-processing algorithm's improvements in noise level and outlier removal are discussed, confirming the functionality and effectiveness of the proposed holistic data collection and processing platform.

Towards computer vision technologies: Semi-automated reading of automated utility meters

  • Authors: Maria Spichkova, Johan van Zyl
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13483
  • Pdf link: https://arxiv.org/pdf/2211.13483
  • Abstract In this report we analysed a possibility of using computer vision techniques for automated reading of utility meters. In our study, we focused on two computer vision techniques: an open-source solution Tensorflow Object Detection (Tensorflow) and a commercial solution Anyline. This report extends our previous publication: We start with presentation of a structured analysis of related approaches. After that we provide a detailed comparison of two computer vision technologies, Tensorflow Object Detection (Tensorflow) and Anyline, applied to semi-automated reading of utility meters. In this paper, we discuss limitations and benefits of each solution applied to utility meters reading, especially focusing on aspects such as accuracy and inference time. Our goal was to determine the solution that is the most suitable for this particular application area, where there are several specific challenges.

Few-shot Object Detection with Refined Contrastive Learning

  • Authors: Zeyu Shangguan, Lian Huai, Tong Liu, Xingqun Jiang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.13495
  • Pdf link: https://arxiv.org/pdf/2211.13495
  • Abstract Due to the scarcity of sampling data in reality, few-shot object detection (FSOD) has drawn more and more attention because of its ability to quickly train new detection concepts with less data. However, there are still failure identifications due to the difficulty in distinguishing confusable classes. We also notice that the high standard deviation of average precisions reveals the inconsistent detection performance. To this end, we propose a novel FSOD method with Refined Contrastive Learning (FSRC). A pre-determination component is introduced to find out the Resemblance Group (GR) from novel classes which contains confusable classes. Afterwards, refined contrastive learning (RCL) is pointedly performed on this group of classes in order to increase the inter-class distances among them. In the meantime, the detection results distribute more uniformly which further improve the performance. Experimental results based on PASCAL VOC and COCO datasets demonstrate our proposed method outperforms the current state-of-the-art research. FSRC can not only decouple the relevance of confusable classes to get a better performance, but also makes predictions more consistent by reducing the standard deviation of the AP of classes to be detected.

1st Workshop on Maritime Computer Vision (MaCVi) 2023: Challenge Results

  • Authors: Benjamin Kiefer, Matej Kristan, Janez Perš, Lojze Žust, Fabio Poiesi, Fabio Augusto de Alcantara Andrade, Alexandre Bernardino, Matthew Dawkins, Jenni Raitoharju, Yitong Quan, Adem Atmaca, Timon Höfer, Qiming Zhang, Yufei Xu, Jing Zhang, Dacheng Tao, Lars Sommer, Raphael Spraul, Hangyue Zhao, Hongpu Zhang, Yanyun Zhao, Jan Lukas Augustin, Eui-ik Jeon, Impyeong Lee, Luca Zedda, Andrea Loddo, Cecilia Di Ruberto, Sagar Verma, Siddharth Gupta, Shishir Muralidhara, Niharika Hegde, Daitao Xing, Nikolaos Evangeliou, Anthony Tzes, Vojtěch Bartl, Jakub Špaňhel, Adam Herout, Neelanjan Bhowmik, Toby P. Breckon, Shivanand Kundargi, Tejas Anvekar, Chaitra Desai, Ramesh Ashok Tabib, Uma Mudengudi, Arpita Vats, Yang Song, Delong Liu, Yonglin Li, Shuman Li, Chenhao Tan, et al. (23 additional authors not shown)
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2211.13508
  • Pdf link: https://arxiv.org/pdf/2211.13508
  • Abstract The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. %This report summarizes the main findings of the individual subchallenges, which are (1) UAV-based Maritime Object Detection, (2) UAV-based Maritime Object Tracking, (3) USV-based Maritime Obstacle Segmentation and (4) USV-based Maritime Obstacle Detection. This report summarizes the main findings of the individual subchallenges and introduces %Furthermore, we introduce a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. %The tech report for most of the top performing methods is attached. The datasets, evaluation code and the %competition's final standing leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.

Roboflow 100: A Rich, Multi-Domain Object Detection Benchmark

  • Authors: Floriana Ciaglia, Francesco Saverio Zuppichini, Paul Guerrie, Mark McQuade, Jacob Solawetz
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13523
  • Pdf link: https://arxiv.org/pdf/2211.13523
  • Abstract The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set of datasets, e.g. Microsoft COCO and Pascal VOC. Due to image retrieval and annotation costs, these datasets consist largely of images found on the web and do not represent many real-life domains that are being modelled in practice, e.g. satellite, microscopic and gaming, making it difficult to assert the degree of generalization learned by the model. We introduce the Roboflow-100 (RF100) consisting of 100 datasets, 7 imagery domains, 224,714 images, and 805 class labels with over 11,170 labelling hours. We derived RF100 from over 90,000 public datasets, 60 million public images that are actively being assembled and labelled by computer vision practitioners in the open on the web application Roboflow Universe. By releasing RF100, we aim to provide a semantically diverse, multi-domain benchmark of datasets to help researchers test their model's generalizability with real-life data. RF100 download and benchmark replication are available on GitHub.

3D Dual-Fusion: Dual-Domain Dual-Query Camera-LiDAR Fusion for 3D Object Detection

  • Authors: Yecheol Kim, Konyul Park, Minwook Kim, Dongsuk Kum, Jun Won Choi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13529
  • Pdf link: https://arxiv.org/pdf/2211.13529
  • Abstract Fusing data from cameras and LiDAR sensors is an essential technique to achieve robust 3D object detection. One key challenge in camera-LiDAR fusion involves mitigating the large domain gap between the two sensors in terms of coordinates and data distribution when fusing their features. In this paper, we propose a novel camera-LiDAR fusion architecture called, 3D Dual-Fusion, which is designed to mitigate the gap between the feature representations of camera and LiDAR data. The proposed method fuses the features of the camera-view and 3D voxel-view domain and models their interactions through deformable attention. We redesign the transformer fusion encoder to aggregate the information from the two domains. Two major changes include 1) dual query-based deformable attention to fuse the dual-domain features interactively and 2) 3D local self-attention to encode the voxel-domain queries prior to dual-query decoding. The results of an experimental evaluation show that the proposed camera-LiDAR fusion architecture achieved competitive performance on the KITTI and nuScenes datasets, with state-of-the-art performances in some 3D object detection benchmarks categories.

Combating noisy labels in object detection datasets

  • Authors: Krystian Chachuła, Adam Popowicz, Jakub Łyskawa, Bartłomiej Olber, Piotr Frątczak, Krystian Radlak
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.13993
  • Pdf link: https://arxiv.org/pdf/2211.13993
  • Abstract The quality of training datasets for deep neural networks is a key factor contributing to the accuracy of resulting models. This is even more important in difficult tasks such as object detection. Dealing with errors in these datasets was in the past limited to accepting that some fraction of examples is incorrect or predicting their confidence and assigning appropriate weights during training. In this work, we propose a different approach. For the first time, we extended the confident learning algorithm to the object detection task. By focusing on finding incorrect labels in the original training datasets, we can eliminate erroneous examples in their root. Suspicious bounding boxes can be re-annotated in order to improve the quality of the dataset itself, thus leading to better models without complicating their already complex architectures. We can effectively point out 99% of artificially disturbed bounding boxes with FPR below 0.3. We see this method as a promising path to correcting well-known object detection datasets.

CAD2Render: A Modular Toolkit for GPU-accelerated Photorealistic Synthetic Data Generation for the Manufacturing Industry

  • Authors: Steven Moonen, Bram Vanherle, Joris de Hoog, Taoufik Bourgana, Abdellatif Bey-Temsamani, Nick Michiels
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.14054
  • Pdf link: https://arxiv.org/pdf/2211.14054
  • Abstract The use of computer vision for product and assembly quality control is becoming ubiquitous in the manufacturing industry. Lately, it is apparent that machine learning based solutions are outperforming classical computer vision algorithms in terms of performance and robustness. However, a main drawback is that they require sufficiently large and labeled training datasets, which are often not available or too tedious and too time consuming to acquire. This is especially true for low-volume and high-variance manufacturing. Fortunately, in this industry, CAD models of the manufactured or assembled products are available. This paper introduces CAD2Render, a GPU-accelerated synthetic data generator based on the Unity High Definition Render Pipeline (HDRP). CAD2Render is designed to add variations in a modular fashion, making it possible for high customizable data generation, tailored to the needs of the industrial use case at hand. Although CAD2Render is specifically designed for manufacturing use cases, it can be used for other domains as well. We validate CAD2Render by demonstrating state of the art performance in two industrial relevant setups. We demonstrate that the data generated by our approach can be used to train object detection and pose estimation models with a high enough accuracy to direct a robot. The code for CAD2Render is available at https://github.com/EDM-Research/CAD2Render.

Language-Assisted 3D Feature Learning for Semantic Scene Understanding

  • Authors: Junbo Zhang, Guofan Fan, Guanghan Wang, Zhengyuan Su, Kaisheng Ma, Li Yi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.14091
  • Pdf link: https://arxiv.org/pdf/2211.14091
  • Abstract Learning descriptive 3D features is crucial for understanding 3D scenes with diverse objects and complex structures. However, it is usually unknown whether important geometric attributes and scene context obtain enough emphasis in an end-to-end trained 3D scene understanding network. To guide 3D feature learning toward important geometric attributes and scene context, we explore the help of textual scene descriptions. Given some free-form descriptions paired with 3D scenes, we extract the knowledge regarding the object relationships and object attributes. We then inject the knowledge to 3D feature learning through three classification-based auxiliary tasks. This language-assisted training can be combined with modern object detection and instance segmentation methods to promote 3D semantic scene understanding, especially in a label-deficient regime. Moreover, the 3D feature learned with language assistance is better aligned with the language features, which can benefit various 3D-language multimodal tasks. Experiments on several benchmarks of 3D-only and 3D-language tasks demonstrate the effectiveness of our language-assisted 3D feature learning. Code is available at https://github.com/Asterisci/Language-Assisted-3D.

Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations

  • Authors: Tan Yu, Ping Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.14255
  • Pdf link: https://arxiv.org/pdf/2211.14255
  • Abstract The formidable accomplishment of Transformers in natural language processing has motivated the researchers in the computer vision community to build Vision Transformers. Compared with the Convolution Neural Networks (CNN), a Vision Transformer has a larger receptive field which is capable of characterizing the long-range dependencies. Nevertheless, the large receptive field of Vision Transformer is accompanied by the huge computational cost. To boost efficiency, the window-based Vision Transformers emerge. They crop an image into several local windows, and the self-attention is conducted within each window. To bring back the global receptive field, window-based Vision Transformers have devoted a lot of efforts to achieving cross-window communications by developing several sophisticated operations. In this work, we check the necessity of the key design element of Swin Transformer, the shifted window partitioning. We discover that a simple depthwise convolution is sufficient for achieving effective cross-window communications. Specifically, with the existence of the depthwise convolution, the shifted window configuration in Swin Transformer cannot lead to an additional performance improvement. Thus, we degenerate the Swin Transformer to a plain Window-based (Win) Transformer by discarding sophisticated shifted window partitioning. The proposed Win Transformer is conceptually simpler and easier for implementation than Swin Transformer. Meanwhile, our Win Transformer achieves consistently superior performance than Swin Transformer on multiple computer vision tasks, including image recognition, semantic segmentation, and object detection.

Keyword: transformer

RNTrajRec: Road Network Enhanced Trajectory Recovery with Spatial-Temporal Transformer

  • Authors: Yuqi Chen, Hanyuan Zhang, Weiwei Sun, Baihua Zheng
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.13234
  • Pdf link: https://arxiv.org/pdf/2211.13234
  • Abstract GPS trajectories are the essential foundations for many trajectory-based applications, such as travel time estimation, traffic prediction and trajectory similarity measurement. Most applications require a large amount of high sample rate trajectories to achieve a good performance. However, many real-life trajectories are collected with low sample rate due to energy concern or other constraints.We study the task of trajectory recovery in this paper as a means for increasing the sample rate of low sample trajectories. Currently, most existing works on trajectory recovery follow a sequence-to-sequence diagram, with an encoder to encode a trajectory and a decoder to recover real GPS points in the trajectory. However, these works ignore the topology of road network and only use grid information or raw GPS points as input. Therefore, the encoder model is not able to capture rich spatial information of the GPS points along the trajectory, making the prediction less accurate and lack spatial consistency. In this paper, we propose a road network enhanced transformer-based framework, namely RNTrajRec, for trajectory recovery. RNTrajRec first uses a graph model, namely GridGNN, to learn the embedding features of each road segment. It next develops a Sub-Graph Generation module to represent each GPS point as a sub-graph structure of the road network around the GPS point. It then introduces a spatial-temporal transformer model, namely GPSFormer, to learn rich spatial and temporal features. It finally forwards the outputs of encoder model into a multi-task decoder model to recover the missing GPS points. Extensive experiments based on three large-scale real-life trajectory datasets confirm the effectiveness of our approach.

HouseDiffusion: Vector Floorplan Generation via a Diffusion Model with Discrete and Continuous Denoising

  • Authors: Mohammad Amin Shabani, Sepidehsadat Hosseini, Yasutaka Furukawa
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.13287
  • Pdf link: https://arxiv.org/pdf/2211.13287
  • Abstract The paper presents a novel approach for vector-floorplan generation via a diffusion model, which denoises 2D coordinates of room/door corners with two inference objectives: 1) a single-step noise as the continuous quantity to precisely invert the continuous forward process; and 2) the final 2D coordinate as the discrete quantity to establish geometric incident relationships such as parallelism, orthogonality, and corner-sharing. Our task is graph-conditioned floorplan generation, a common workflow in floorplan design. We represent a floorplan as 1D polygonal loops, each of which corresponds to a room or a door. Our diffusion model employs a Transformer architecture at the core, which controls the attention masks based on the input graph-constraint and directly generates vector-graphics floorplans via a discrete and continuous denoising process. We have evaluated our approach on RPLAN dataset. The proposed approach makes significant improvements in all the metrics against the state-of-the-art with significant margins, while being capable of generating non-Manhattan structures and controlling the exact number of corners per room. A project website with supplementary video and document is here https://aminshabani.github.io/housediffusion.

Rank-One Editing of Encoder-Decoder Models

  • Authors: Vikas Raunak, Arul Menezes
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.13317
  • Pdf link: https://arxiv.org/pdf/2211.13317
  • Abstract Large sequence to sequence models for tasks such as Neural Machine Translation (NMT) are usually trained over hundreds of millions of samples. However, training is just the origin of a model's life-cycle. Real-world deployments of models require further behavioral adaptations as new requirements emerge or shortcomings become known. Typically, in the space of model behaviors, behavior deletion requests are addressed through model retrainings whereas model finetuning is done to address behavior addition requests, both procedures being instances of data-based model intervention. In this work, we present a preliminary study investigating rank-one editing as a direct intervention method for behavior deletion requests in encoder-decoder transformer models. We propose four editing tasks for NMT and show that the proposed editing algorithm achieves high efficacy, while requiring only a single instance of positive example to fix an erroneous (negative) model behavior.

MPT: Mesh Pre-Training with Transformers for Human Pose and Mesh Reconstruction

  • Authors: Kevin Lin, Chung-Ching Lin, Lin Liang, Zicheng Liu, Lijuan Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13357
  • Pdf link: https://arxiv.org/pdf/2211.13357
  • Abstract We present Mesh Pre-Training (MPT), a new pre-training framework that leverages 3D mesh data such as MoCap data for human pose and mesh reconstruction from a single image. Existing work in 3D pose and mesh reconstruction typically requires image-mesh pairs as the training data, but the acquisition of 2D-to-3D annotations is difficult. In this paper, we explore how to leverage 3D mesh data such as MoCap data, that does not have RGB images, for pre-training. The key idea is that even though 3D mesh data cannot be used for end-to-end training due to a lack of the corresponding RGB images, it can be used to pre-train the mesh regression transformer subnetwork. We observe that such pre-training not only improves the accuracy of mesh reconstruction from a single image, but also enables zero-shot capability. We conduct mesh pre-training using 2 million meshes. Experimental results show that MPT advances the state-of-the-art results on Human3.6M and 3DPW datasets. We also show that MPT enables transformer models to have zero-shot capability of human mesh reconstruction from real images. In addition, we demonstrate the generalizability of MPT to 3D hand reconstruction, achieving state-of-the-art results on FreiHAND dataset.

On the Importance of Image Encoding in Automated Chest X-Ray Report Generation

  • Authors: Otabek Nazarov, Mohammad Yaqub, Karthik Nandakumar
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.13465
  • Pdf link: https://arxiv.org/pdf/2211.13465
  • Abstract Chest X-ray is one of the most popular medical imaging modalities due to its accessibility and effectiveness. However, there is a chronic shortage of well-trained radiologists who can interpret these images and diagnose the patient's condition. Therefore, automated radiology report generation can be a very helpful tool in clinical practice. A typical report generation workflow consists of two main steps: (i) encoding the image into a latent space and (ii) generating the text of the report based on the latent image embedding. Many existing report generation techniques use a standard convolutional neural network (CNN) architecture for image encoding followed by a Transformer-based decoder for medical text generation. In most cases, CNN and the decoder are trained jointly in an end-to-end fashion. In this work, we primarily focus on understanding the relative importance of encoder and decoder components. Towards this end, we analyze four different image encoding approaches: direct, fine-grained, CLIP-based, and Cluster-CLIP-based encodings in conjunction with three different decoders on the large-scale MIMIC-CXR dataset. Among these encoders, the cluster CLIP visual encoder is a novel approach that aims to generate more discriminative and explainable representations. CLIP-based encoders produce comparable results to traditional CNN-based encoders in terms of NLP metrics, while fine-grained encoding outperforms all other encoders both in terms of NLP and clinical accuracy metrics, thereby validating the importance of image encoder to effectively extract semantic information. GitHub repository: https://github.com/mudabek/encoding-cxr-report-gen

NQE: N-ary Query Embedding for Complex Query Answering over Hyper-relational Knowledge Graphs

  • Authors: Haoran Luo, Haihong E, Yuhao Yang, Gengxian Zhou, Yikai Guo, Tianyu Yao, Zichen Tang, Xueyuan Lin, Kaiyang Wan
  • Subjects: Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.13469
  • Pdf link: https://arxiv.org/pdf/2211.13469
  • Abstract Complex query answering (CQA) is an essential task for multi-hop and logical reasoning on knowledge graphs (KGs). Currently, most approaches are limited to queries among binary relational facts and pay less attention to n-ary facts (n>=2) containing more than two entities, which are more prevalent in the real world. Moreover, previous CQA methods can only make predictions for a few given types of queries and cannot be flexibly extended to more complex logical queries, which significantly limits their applications. To overcome these challenges, in this work, we propose a novel N-ary Query Embedding (NQE) model for CQA over hyper-relational knowledge graphs (HKGs), which include massive n-ary facts. The NQE utilizes a dual-heterogeneous Transformer encoder and fuzzy logic theory to satisfy all n-ary FOL queries, including existential quantifiers, conjunction, disjunction, and negation. We also propose a parallel processing algorithm that can train or predict arbitrary n-ary FOL queries in a single batch, regardless of the kind of each query, with good flexibility and extensibility. In addition, we generate a new CQA dataset WD50K-NFOL, including diverse n-ary FOL queries over WD50K. Experimental results on WD50K-NFOL and other standard CQA datasets show that NQE is the state-of-the-art CQA method over HKGs with good generalization capability. Our code and dataset are publicly available.

Efficient Zero-shot Visual Search via Target and Context-aware Transformer

  • Authors: Zhiwei Ding, Xuezhe Ren, Erwan David, Melissa Vo, Gabriel Kreiman, Mengmi Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.13470
  • Pdf link: https://arxiv.org/pdf/2211.13470
  • Abstract Visual search is a ubiquitous challenge in natural vision, including daily tasks such as finding a friend in a crowd or searching for a car in a parking lot. Human rely heavily on relevant target features to perform goal-directed visual search. Meanwhile, context is of critical importance for locating a target object in complex scenes as it helps narrow down the search area and makes the search process more efficient. However, few works have combined both target and context information in visual search computational models. Here we propose a zero-shot deep learning architecture, TCT (Target and Context-aware Transformer), that modulates self attention in the Vision Transformer with target and contextual relevant information to enable human-like zero-shot visual search performance. Target modulation is computed as patch-wise local relevance between the target and search images, whereas contextual modulation is applied in a global fashion. We conduct visual search experiments on TCT and other competitive visual search models on three natural scene datasets with varying levels of difficulty. TCT demonstrates human-like performance in terms of search efficiency and beats the SOTA models in challenging visual search tasks. Importantly, TCT generalizes well across datasets with novel objects without retraining or fine-tuning. Furthermore, we also introduce a new dataset to benchmark models for invariant visual search under incongruent contexts. TCT manages to search flexibly via target and context modulation, even under incongruent contexts.

Beyond Mahalanobis-Based Scores for Textual OOD Detection

  • Authors: Pierre Colombo, Eduardo D. C. Gomes, Guillaume Staerman, Nathan Noiry, Pablo Piantanida
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2211.13527
  • Pdf link: https://arxiv.org/pdf/2211.13527
  • Abstract Deep learning methods have boosted the adoption of NLP systems in real-life applications. However, they turn out to be vulnerable to distribution shifts over time which may cause severe dysfunctions in production systems, urging practitioners to develop tools to detect out-of-distribution (OOD) samples through the lens of the neural network. In this paper, we introduce TRUSTED, a new OOD detector for classifiers based on Transformer architectures that meets operational requirements: it is unsupervised and fast to compute. The efficiency of TRUSTED relies on the fruitful idea that all hidden layers carry relevant information to detect OOD examples. Based on this, for a given input, TRUSTED consists in (i) aggregating this information and (ii) computing a similarity score by exploiting the training distribution, leveraging the powerful concept of data depth. Our extensive numerical experiments involve 51k model configurations, including various checkpoints, seeds, and datasets, and demonstrate that TRUSTED achieves state-of-the-art performances. In particular, it improves previous AUROC over 3 points.

3D Dual-Fusion: Dual-Domain Dual-Query Camera-LiDAR Fusion for 3D Object Detection

  • Authors: Yecheol Kim, Konyul Park, Minwook Kim, Dongsuk Kum, Jun Won Choi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13529
  • Pdf link: https://arxiv.org/pdf/2211.13529
  • Abstract Fusing data from cameras and LiDAR sensors is an essential technique to achieve robust 3D object detection. One key challenge in camera-LiDAR fusion involves mitigating the large domain gap between the two sensors in terms of coordinates and data distribution when fusing their features. In this paper, we propose a novel camera-LiDAR fusion architecture called, 3D Dual-Fusion, which is designed to mitigate the gap between the feature representations of camera and LiDAR data. The proposed method fuses the features of the camera-view and 3D voxel-view domain and models their interactions through deformable attention. We redesign the transformer fusion encoder to aggregate the information from the two domains. Two major changes include 1) dual query-based deformable attention to fuse the dual-domain features interactively and 2) 3D local self-attention to encode the voxel-domain queries prior to dual-query decoding. The results of an experimental evaluation show that the proposed camera-LiDAR fusion architecture achieved competitive performance on the KITTI and nuScenes datasets, with state-of-the-art performances in some 3D object detection benchmarks categories.

More comprehensive facial inversion for more effective expression recognition

  • Authors: Jiawei Mao, Guangyi Zhao, Yuanqi Chang, Xuesong Yin, Xiaogang Peng, Rui Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13564
  • Pdf link: https://arxiv.org/pdf/2211.13564
  • Abstract Facial expression recognition (FER) plays a significant role in the ubiquitous application of computer vision. We revisit this problem with a new perspective on whether it can acquire useful representations that improve FER performance in the image generation process, and propose a novel generative method based on the image inversion mechanism for the FER task, termed Inversion FER (IFER). Particularly, we devise a novel Adversarial Style Inversion Transformer (ASIT) towards IFER to comprehensively extract features of generated facial images. In addition, ASIT is equipped with an image inversion discriminator that measures the cosine similarity of semantic features between source and generated images, constrained by a distribution alignment loss. Finally, we introduce a feature modulation module to fuse the structural code and latent codes from ASIT for the subsequent FER work. We extensively evaluate ASIT on facial datasets such as FFHQ and CelebA-HQ, showing that our approach achieves state-of-the-art facial inversion performance. IFER also achieves competitive results in facial expression recognition datasets such as RAF-DB, SFEW and AffectNet. The code and models are available at https://github.com/Talented-Q/IFER-master.

Ham2Pose: Animating Sign Language Notation into Pose Sequences

  • Authors: Rotem Shalev-Arkushin, Amit Moryossef, Ohad Fried
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2211.13613
  • Pdf link: https://arxiv.org/pdf/2211.13613
  • Abstract Translating spoken languages into Sign languages is necessary for open communication between the hearing and hearing-impaired communities. To achieve this goal, we propose the first method for animating a text written in HamNoSys, a lexical Sign language notation, into signed pose sequences. As HamNoSys is universal, our proposed method offers a generic solution invariant to the target Sign language. Our method gradually generates pose predictions using transformer encoders that create meaningful representations of the text and poses while considering their spatial and temporal information. We use weak supervision for the training process and show that our method succeeds in learning from partial and inaccurate data. Additionally, we offer a new distance measurement for pose sequences, normalized Dynamic Time Warping (nDTW), based on DTW over normalized keypoints trajectories, and validate its correctness using AUTSL, a large-scale Sign language dataset. We show that it measures the distance between pose sequences more accurately than existing measurements and use it to assess the quality of our generated pose sequences. Code for the data pre-processing, the model, and the distance measurement is publicly released for future research.

Cross Aggregation Transformer for Image Restoration

  • Authors: Chen Zheng, Yulun Zhang, Jinjin Gu, Yongbing Zhang, Linghe Kong, Xin Yuan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13654
  • Pdf link: https://arxiv.org/pdf/2211.13654
  • Abstract Recently, Transformer architecture has been introduced into image restoration to replace convolution neural network (CNN) with surprising results. Considering the high computational complexity of Transformer with global attention, some methods use the local square window to limit the scope of self-attention. However, these methods lack direct interaction among different windows, which limits the establishment of long-range dependencies. To address the above issue, we propose a new image restoration model, Cross Aggregation Transformer (CAT). The core of our CAT is the Rectangle-Window Self-Attention (Rwin-SA), which utilizes horizontal and vertical rectangle window attention in different heads parallelly to expand the attention area and aggregate the features cross different windows. We also introduce the Axial-Shift operation for different window interactions. Furthermore, we propose the Locality Complementary Module to complement the self-attention mechanism, which incorporates the inductive bias of CNN (e.g., translation invariance and locality) into Transformer, enabling global-local coupling. Extensive experiments demonstrate that our CAT outperforms recent state-of-the-art methods on several image restoration applications. The code and models are available at https://github.com/zhengchen1999/CAT.

On designing light-weight object trackers through network pruning: Use CNNs or transformers?

  • Authors: Saksham Aggarwal, Taneesh Gupta, Pawan Kumar Sahu, Arnav Chavan, Rishabh Tiwari, Dilip K. Prasad, Deepak K. Gupta
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.13769
  • Pdf link: https://arxiv.org/pdf/2211.13769
  • Abstract Object trackers deployed on low-power devices need to be light-weight, however, most of the current state-of-the-art (SOTA) methods rely on using compute-heavy backbones built using CNNs or transformers. Large sizes of such models do not allow their deployment in low-power conditions and designing compressed variants of large tracking models is of great importance. This paper demonstrates how highly compressed light-weight object trackers can be designed using neural architectural pruning of large CNN and transformer based trackers. Further, a comparative study on architectural choices best suited to design light-weight trackers is provided. A comparison between SOTA trackers using CNNs, transformers as well as the combination of the two is presented to study their stability at various compression ratios. Finally results for extreme pruning scenarios going as low as 1% in some cases are shown to study the limits of network pruning in object tracking. This work provides deeper insights into designing highly efficient trackers from existing SOTA methods.

Question Answering and Question Generation for Finnish

  • Authors: Ilmari Kylliäinen, Roman Yangarber
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2211.13794
  • Pdf link: https://arxiv.org/pdf/2211.13794
  • Abstract Recent advances in the field of language modeling have improved the state-of-the-art in question answering (QA) and question generation (QG). However, the development of modern neural models, their benchmarks, and datasets for training them has mainly focused on English. Finnish, like many other languages, faces a shortage of large QA/QG model training resources, which has prevented experimenting with state-of-the-art QA/QG fine-tuning methods. We present the first neural QA and QG models that work with Finnish. To train the models, we automatically translate the SQuAD dataset and then use normalization methods to reduce the amount of problematic data created during the translation. Using the synthetic data, together with the Finnish partition of the TyDi-QA dataset, we fine-tune several transformer-based models to both QA and QG and evaluate their performance. To the best of our knowledge, the resulting dataset is the first large-scale QA/QG resource for Finnish. This paper also sets the initial benchmarks for Finnish-language QA and QG.

Adaptive Attention Link-based Regularization for Vision Transformers

  • Authors: Heegon Jin, Jongwon Choi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13852
  • Pdf link: https://arxiv.org/pdf/2211.13852
  • Abstract Although transformer networks are recently employed in various vision tasks with outperforming performance, extensive training data and a lengthy training time are required to train a model to disregard an inductive bias. Using trainable links between the channel-wise spatial attention of a pre-trained Convolutional Neural Network (CNN) and the attention head of Vision Transformers (ViT), we present a regularization technique to improve the training efficiency of ViT. The trainable links are referred to as the attention augmentation module, which is trained simultaneously with ViT, boosting the training of ViT and allowing it to avoid the overfitting issue caused by a lack of data. From the trained attention augmentation module, we can extract the relevant relationship between each CNN activation map and each ViT attention head, and based on this, we also propose an advanced attention augmentation module. Consequently, even with a small amount of data, the suggested method considerably improves the performance of ViT while achieving faster convergence during training.

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

  • Authors: Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, Bin Cui
  • Subjects: Machine Learning (cs.LG); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2211.13878
  • Pdf link: https://arxiv.org/pdf/2211.13878
  • Abstract Transformer models have achieved state-of-the-art performance on various domains of applications and gradually becomes the foundations of the advanced large deep learning (DL) models. However, how to train these models over multiple GPUs efficiently is still challenging due to a large number of parallelism choices. Existing DL systems either rely on manual efforts to make distributed training plans or apply parallelism combinations within a very limited search space. In this approach, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such a rarely huge search space, we 1) involve a decision tree to make decomposition and pruning based on some reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan. Evaluations on four representative Transformer workloads show that Galvatron could perform automatically distributed training with different GPU memory budgets. Among all evluated scenarios, Galvatron always achieves superior system throughput compared to previous work with limited parallelism.

AFR-Net: Attention-Driven Fingerprint Recognition Network

  • Authors: Steven A. Grosz, Anil K. Jain
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13897
  • Pdf link: https://arxiv.org/pdf/2211.13897
  • Abstract The use of vision transformers (ViT) in computer vision is increasing due to limited inductive biases (e.g., locality, weight sharing, etc.) and increased scalability compared to other deep learning methods (e.g., convolutional neural networks (CNN)). This has led to some initial studies on the use of ViT for biometric recognition, including fingerprint recognition. In this work, we improve on these initial studies for transformers in fingerprint recognition by i.) evaluating additional attention-based architectures in addition to vanilla ViT, ii.) scaling to larger and more diverse training and evaluation datasets, and iii.) combining the complimentary representations of attention-based and CNN-based embeddings for improved state-of-the-art (SOTA) fingerprint recognition for both authentication (1:1 comparisons) and identification (1:N comparisions). Our combined architecture, AFR-Net (Attention-Driven Fingerprint Recognition Network), outperforms several baseline transformer and CNN-based models, including a SOTA commercial fingerprint system, Verifinger v12.3, across many intra-sensor, cross-sensor (including contact to contactless), and latent to rolled fingerprint matching datasets. Additionally, we propose a realignment strategy using local embeddings extracted from intermediate feature maps within the networks to refine the global embeddings in low certainty situations, which boosts the overall recognition accuracy significantly for all the evaluations across each of the models. This realignment strategy requires no additional training and can be applied as a wrapper to any existing deep learning network (including attention-based, CNN-based, or both) to boost its performance.

Comparison Study Between Token Classification and Sequence Classification In Text Classification

  • Authors: Amir Jafari
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2211.13899
  • Pdf link: https://arxiv.org/pdf/2211.13899
  • Abstract Unsupervised Machine Learning techniques have been applied to Natural Language Processing tasks and surpasses the benchmarks such as GLUE with great success. Building language models approach achieves good results in one language and it can be applied to multiple NLP task such as classification, summarization, generation and etc as an out of box model. Among all the of the classical approaches used in NLP, the masked language modeling is the most used. In general, the only requirement to build a language model is presence of the large corpus of textual data. Text classification engines uses a variety of models from classical and state of art transformer models to classify texts for in order to save costs. Sequence Classifiers are mostly used in the domain of text classification. However Token classifiers also are viable candidate models as well. Sequence Classifiers and Token Classifier both tend to improve the classification predictions due to the capturing the context information differently. This work aims to compare the performance of Sequence Classifier and Token Classifiers and evaluate each model on the same set of data. In this work, we are using a pre-trained model as the base model and Token Classifier and Sequence Classier heads results of these two scoring paradigms with be compared..

A Deep Learning Anomaly Detection Method in Textual Data

  • Authors: Amir Jafari
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2211.13900
  • Pdf link: https://arxiv.org/pdf/2211.13900
  • Abstract In this article, we propose using deep learning and transformer architectures combined with classical machine learning algorithms to detect and identify text anomalies in texts. Deep learning model provides a very crucial context information about the textual data which all textual context are converted to a numerical representation. We used multiple machine learning methods such as Sentence Transformers, Auto Encoders, Logistic Regression and Distance calculation methods to predict anomalies. The method are tested on the texts data and we used syntactic data from different source injected into the original text as anomalies or use them as target. Different methods and algorithm are explained in the field of outlier detection and the results of the best technique is presented. These results suggest that our algorithm could potentially reduce false positive rates compared with other anomaly detection methods that we are testing.

TAOTF: A Two-stage Approximately Orthogonal Training Framework in Deep Neural Networks

  • Authors: Taoyong Cui, Jianze Li, Yuhan Dong, Li Liu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.13902
  • Pdf link: https://arxiv.org/pdf/2211.13902
  • Abstract The orthogonality constraints, including the hard and soft ones, have been used to normalize the weight matrices of Deep Neural Network (DNN) models, especially the Convolutional Neural Network (CNN) and Vision Transformer (ViT), to reduce model parameter redundancy and improve training stability. However, the robustness to noisy data of these models with constraints is not always satisfactory. In this work, we propose a novel two-stage approximately orthogonal training framework (TAOTF) to find a trade-off between the orthogonal solution space and the main task solution space to solve this problem in noisy data scenarios. In the first stage, we propose a novel algorithm called polar decomposition-based orthogonal initialization (PDOI) to find a good initialization for the orthogonal optimization. In the second stage, unlike other existing methods, we apply soft orthogonal constraints for all layers of DNN model. We evaluate the proposed model-agnostic framework both on the natural image and medical image datasets, which show that our method achieves stable and superior performances to existing methods.

Towards Good Practices for Missing Modality Robust Action Recognition

  • Authors: Sangmin Woo, Sumin Lee, Yeonju Park, Muhammad Adi Nugroho, Changick Kim
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.13916
  • Pdf link: https://arxiv.org/pdf/2211.13916
  • Abstract Standard multi-modal models assume the use of the same modalities in training and inference stages. However, in practice, the environment in which multi-modal models operate may not satisfy such assumption. As such, their performances degrade drastically if any modality is missing in the inference stage. We ask: how can we train a model that is robust to missing modalities? This paper seeks a set of good practices for multi-modal action recognition, with a particular interest in circumstances where some modalities are not available at an inference time. First, we study how to effectively regularize the model during training (e.g., data augmentation). Second, we investigate on fusion methods for robustness to missing modalities: we find that transformer-based fusion shows better robustness for missing modality than summation or concatenation. Third, we propose a simple modular network, ActionMAE, which learns missing modality predictive coding by randomly dropping modality features and tries to reconstruct them with the remaining modality features. Coupling these good practices, we build a model that is not only effective in multi-modal action recognition but also robust to modality missing. Our model achieves the state-of-the-arts on multiple benchmarks and maintains competitive performances even in missing modality scenarios. Codes are available at https://github.com/sangminwoo/ActionMAE.

Mutual Guidance and Residual Integration for Image Enhancement

  • Authors: Kun Zhou, KenKun Liu, Wenbo Li, Xiaoguang Han, Jiangbo Lu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13919
  • Pdf link: https://arxiv.org/pdf/2211.13919
  • Abstract Previous studies show the necessity of global and local adjustment for image enhancement. However, existing convolutional neural networks (CNNs) and transformer-based models face great challenges in balancing the computational efficiency and effectiveness of global-local information usage. Especially, existing methods typically adopt the global-to-local fusion mode, ignoring the importance of bidirectional interactions. To address those issues, we propose a novel mutual guidance network (MGN) to perform effective bidirectional global-local information exchange while keeping a compact architecture. In our design, we adopt a two-branch framework where one branch focuses more on modeling global relations while the other is committed to processing local information. Then, we develop an efficient attention-based mutual guidance approach throughout our framework for bidirectional global-local interactions. As a result, both the global and local branches can enjoy the merits of mutual information aggregation. Besides, to further refine the results produced by our MGN, we propose a novel residual integration scheme following the divide-and-conquer philosophy. The extensive experiments demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance on several public image enhancement benchmarks.

UperFormer: A Multi-scale Transformer-based Decoder for Semantic Segmentation

  • Authors: Jing Xu, Wentao Shi, Pan Gao, Zhengwei Wang, Qizhu Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13928
  • Pdf link: https://arxiv.org/pdf/2211.13928
  • Abstract While a large number of recent works on semantic segmentation focus on designing and incorporating a transformer-based encoder, much less attention and vigor have been devoted to transformer-based decoders. For such a task whose hallmark quest is pixel-accurate prediction, we argue that the decoder stage is just as crucial as that of the encoder in achieving superior segmentation performance, by disentangling and refining the high-level cues and working out object boundaries with pixel-level precision. In this paper, we propose a novel transformer-based decoder called UperFormer, which is plug-and-play for hierarchical encoders and attains high quality segmentation results regardless of encoder architecture. UperFormer is equipped with carefully designed multi-head skip attention units and novel upsampling operations. Multi-head skip attention is able to fuse multi-scale features from backbones with those in decoders. The upsampling operation, which incorporates feature from encoder, can be more friendly for object localization. It brings a 0.4% to 3.2% increase compared with traditional upsampling methods. By combining UperFormer with Swin Transformer (Swin-T), a fully transformer-based symmetric network is formed for semantic segmentation tasks. Extensive experiments show that our proposed approach is highly effective and computationally efficient. On Cityscapes dataset, we achieve state-of-the-art performance. On the more challenging ADE20K dataset, our best model yields a single-scale mIoU of 50.18, and a multi-scale mIoU of 51.8, which is on-par with the current state-of-art model, while we drastically cut the number of FLOPs by 53.5%. Our source code and models are publicly available at: https://github.com/shiwt03/UperFormer

TRAC: A Textual Benchmark for Reasoning about Actions and Change

  • Authors: Weinan He, Canming Huang, Zhanhao Xiao, Yongmei Liu
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.13930
  • Pdf link: https://arxiv.org/pdf/2211.13930
  • Abstract Reasoning about actions and change (RAC) is essential to understand and interact with the ever-changing environment. Previous AI research has shown the importance of fundamental and indispensable knowledge of actions, i.e., preconditions and effects. However, traditional methods rely on logical formalization which hinders practical applications. With recent transformer-based language models (LMs), reasoning over text is desirable and seemingly feasible, leading to the question of whether LMs can effectively and efficiently learn to solve RAC problems. We propose four essential RAC tasks as a comprehensive textual benchmark and generate problems in a way that minimizes the influence of other linguistic requirements (e.g., grounding) to focus on RAC. The resulting benchmark, TRAC, encompassing problems of various complexities, facilitates a more granular evaluation of LMs, precisely targeting the structural generalization ability much needed for RAC. Experiments with three high-performing transformers indicates that additional efforts are needed to tackle challenges raised by TRAC.

Spatial-Temporal Attention Network for Open-Set Fine-Grained Image Recognition

  • Authors: Jiayin Sun, Hong Wang, Qiulei Dong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13940
  • Pdf link: https://arxiv.org/pdf/2211.13940
  • Abstract Triggered by the success of transformers in various visual tasks, the spatial self-attention mechanism has recently attracted more and more attention in the computer vision community. However, we empirically found that a typical vision transformer with the spatial self-attention mechanism could not learn accurate attention maps for distinguishing different categories of fine-grained images. To address this problem, motivated by the temporal attention mechanism in brains, we propose a spatial-temporal attention network for learning fine-grained feature representations, called STAN, where the features learnt by implementing a sequence of spatial self-attention operations corresponding to multiple moments are aggregated progressively. The proposed STAN consists of four modules: a self-attention backbone module for learning a sequence of features with self-attention operations, a spatial feature self-organizing module for facilitating the model training, a spatial-temporal feature learning module for aggregating the re-organized features via a Long Short-Term Memory network, and a context-aware module that is implemented as the forget block of the spatial-temporal feature learning module for preserving/forgetting the long-term memory by utilizing contextual information. Then, we propose a STAN-based method for open-set fine-grained recognition by integrating the proposed STAN network with a linear classifier, called STAN-OSFGR. Extensive experimental results on 3 fine-grained datasets and 2 coarse-grained datasets demonstrate that the proposed STAN-OSFGR outperforms 9 state-of-the-art open-set recognition methods significantly in most cases.

MPCViT: Searching for MPC-friendly Vision Transformer with Heterogeneous Attention

  • Authors: Wenxuan Zeng, Meng Li, Wenjie Xiong, Wenjie Lu, Jin Tan, Runsheng Wang, Ru Huang
  • Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.13955
  • Pdf link: https://arxiv.org/pdf/2211.13955
  • Abstract Secure multi-party computation (MPC) enables computation directly on encrypted data on non-colluding untrusted servers and protects both data and model privacy in deep learning inference. However, existing neural network (NN) architectures, including Vision Transformers (ViTs), are not designed or optimized for MPC protocols and incur significant latency overhead due to the Softmax function in the multi-head attention (MHA). In this paper, we propose an MPC-friendly ViT, dubbed MPCViT, to enable accurate yet efficient ViT inference in MPC. We systematically compare different attention variants in MPC and propose a heterogeneous attention search space, which combines the high-accuracy and MPC-efficient attentions with diverse structure granularities. We further propose a simple yet effective differentiable neural architecture search (NAS) algorithm for fast ViT optimization. MPCViT significantly outperforms prior-art ViT variants in MPC. With the proposed NAS algorithm, our extensive experiments demonstrate that MPCViT achieves 7.9x and 2.8x latency reduction with better accuracy compared to Linformer and MPCFormer on the Tiny-ImageNet dataset, respectively. Further, with proper knowledge distillation (KD), MPCViT even achieves 1.9% better accuracy compared to the baseline ViT with 9.9x latency reduction on the Tiny-ImageNet dataset.

Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers

  • Authors: Khaled Koutini, Shahed Masoudian, Florian Schmid, Hamid Eghbal-zadeh, Jan Schlüter, Gerhard Widmer
  • Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2211.13956
  • Pdf link: https://arxiv.org/pdf/2211.13956
  • Abstract The success of supervised deep learning methods is largely due to their ability to learn relevant features from raw data. Deep Neural Networks (DNNs) trained on large-scale datasets are capable of capturing a diverse set of features, and learning a representation that can generalize onto unseen tasks and datasets that are from the same domain. Hence, these models can be used as powerful feature extractors, in combination with shallower models as classifiers, for smaller tasks and datasets where the amount of training data is insufficient for learning an end-to-end model from scratch. During the past years, Convolutional Neural Networks (CNNs) have largely been the method of choice for audio processing. However, recently attention-based transformer models have demonstrated great potential in supervised settings, outperforming CNNs. In this work, we investigate the use of audio transformers trained on large-scale datasets to learn general-purpose representations. We study how the different setups in these audio transformers affect the quality of their embeddings. We experiment with the models' time resolution, extracted embedding level, and receptive fields in order to see how they affect performance on a variety of tasks and datasets, following the HEAR 2021 NeurIPS challenge evaluation setup. Our results show that representations extracted by audio transformers outperform CNN representations. Furthermore, we will show that transformers trained on Audioset can be extremely effective representation extractors for a wide range of downstream tasks.

BatmanNet: Bi-branch Masked Graph Transformer Autoencoder for Molecular Representation

  • Authors: Zhen Wang, Zheng Feng, Yanjun Li, Bowen Li, Yongrui Wang, Chulin Sha, Min He, Xiaolin Li
  • Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
  • Arxiv link: https://arxiv.org/abs/2211.13979
  • Pdf link: https://arxiv.org/pdf/2211.13979
  • Abstract Although substantial efforts have been made using graph neural networks (GNNs) for AI-driven drug discovery (AIDD), effective molecular representation learning remains an open challenge, especially in the case of insufficient labeled molecules. Recent studies suggest that big GNN models pre-trained by self-supervised learning on unlabeled datasets enable better transfer performance in downstream molecular property prediction tasks. However, they often require large-scale datasets and considerable computational resources, which is time-consuming, computationally expensive, and environmentally unfriendly. To alleviate these limitations, we propose a novel pre-training model for molecular representation learning, Bi-branch Masked Graph Transformer Autoencoder (BatmanNet). BatmanNet features two tailored and complementary graph autoencoders to reconstruct the missing nodes and edges from a masked molecular graph. To our surprise, BatmanNet discovered that the highly masked proportion (60%) of the atoms and bonds achieved the best performance. We further propose an asymmetric graph-based encoder-decoder architecture for either nodes and edges, where a transformer-based encoder only takes the visible subset of nodes or edges, and a lightweight decoder reconstructs the original molecule from the latent representation and mask tokens. With this simple yet effective asymmetrical design, our BatmanNet can learn efficiently even from a much smaller-scale unlabeled molecular dataset to capture the underlying structural and semantic information, overcoming a major limitation of current deep neural networks for molecular representation learning. For instance, using only 250K unlabelled molecules as pre-training data, our BatmanNet with 2.575M parameters achieves a 0.5% improvement on the average AUC compared with the current state-of-the-art method with 100M parameters pre-trained on 11M molecules.

Aggregated Text Transformer for Scene Text Detection

  • Authors: Zhao Zhou, Xiangcheng Du, Yingbin Zheng, Cheng Jin
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13984
  • Pdf link: https://arxiv.org/pdf/2211.13984
  • Abstract This paper explores the multi-scale aggregation strategy for scene text detection in natural images. We present the Aggregated Text TRansformer(ATTR), which is designed to represent texts in scene images with a multi-scale self-attention mechanism. Starting from the image pyramid with multiple resolutions, the features are first extracted at different scales with shared weight and then fed into an encoder-decoder architecture of Transformer. The multi-scale image representations are robust and contain rich information on text contents of various sizes. The text Transformer aggregates these features to learn the interaction across different scales and improve text representation. The proposed method detects scene texts by representing each text instance as an individual binary mask, which is tolerant of curve texts and regions with dense instances. Extensive experiments on public scene text detection datasets demonstrate the effectiveness of the proposed framework.

CoMFormer: Continual Learning in Semantic and Panoptic Segmentation

  • Authors: Fabio Cermelli, Matthieu Cord, Arthur Douillard
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.13999
  • Pdf link: https://arxiv.org/pdf/2211.13999
  • Abstract Continual learning for segmentation has recently seen increasing interest. However, all previous works focus on narrow semantic segmentation and disregard panoptic segmentation, an important task with real-world impacts. %a In this paper, we present the first continual learning model capable of operating on both semantic and panoptic segmentation. Inspired by recent transformer approaches that consider segmentation as a mask-classification problem, we design CoMFormer. Our method carefully exploits the properties of transformer architectures to learn new classes over time. Specifically, we propose a novel adaptive distillation loss along with a mask-based pseudo-labeling technique to effectively prevent forgetting. To evaluate our approach, we introduce a novel continual panoptic segmentation benchmark on the challenging ADE20K dataset. Our CoMFormer outperforms all the existing baselines by forgetting less old classes but also learning more effectively new classes. In addition, we also report an extensive evaluation in the large-scale continual semantic segmentation scenario showing that CoMFormer also significantly outperforms state-of-the-art methods.

Molecular Joint Representation Learning via Multi-modal Information

  • Authors: Tianyu Wu, Yang Tang, Qiyu Sun, Luolin Xiong
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
  • Arxiv link: https://arxiv.org/abs/2211.14042
  • Pdf link: https://arxiv.org/pdf/2211.14042
  • Abstract In recent years, artificial intelligence has played an important role on accelerating the whole process of drug discovery. Various of molecular representation schemes of different modals (e.g. textual sequence or graph) are developed. By digitally encoding them, different chemical information can be learned through corresponding network structures. Molecular graphs and Simplified Molecular Input Line Entry System (SMILES) are popular means for molecular representation learning in current. Previous works have done attempts by combining both of them to solve the problem of specific information loss in single-modal representation on various tasks. To further fusing such multi-modal imformation, the correspondence between learned chemical feature from different representation should be considered. To realize this, we propose a novel framework of molecular joint representation learning via Multi-Modal information of SMILES and molecular Graphs, called MMSG. We improve the self-attention mechanism by introducing bond level graph representation as attention bias in Transformer to reinforce feature correspondence between multi-modal information. We further propose a Bidirectional Message Communication Graph Neural Network (BMC GNN) to strengthen the information flow aggregated from graphs for further combination. Numerous experiments on public property prediction datasets have demonstrated the effectiveness of our model.

Spatial-Spectral Transformer for Hyperspectral Image Denoising

  • Authors: Miaoyu Li, Ying Fu, Yulun Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2211.14090
  • Pdf link: https://arxiv.org/pdf/2211.14090
  • Abstract Hyperspectral image (HSI) denoising is a crucial preprocessing procedure for the subsequent HSI applications. Unfortunately, though witnessing the development of deep learning in HSI denoising area, existing convolution-based methods face the trade-off between computational efficiency and capability to model non-local characteristics of HSI. In this paper, we propose a Spatial-Spectral Transformer (SST) to alleviate this problem. To fully explore intrinsic similarity characteristics in both spatial dimension and spectral dimension, we conduct non-local spatial self-attention and global spectral self-attention with Transformer architecture. The window-based spatial self-attention focuses on the spatial similarity beyond the neighboring region. While, spectral self-attention exploits the long-range dependencies between highly correlative bands. Experimental results show that our proposed method outperforms the state-of-the-art HSI denoising methods in quantitative quality and visual results.

Interval-censored Transformer Hawkes: Detecting Information Operations using the Reaction of Social Systems

  • Authors: Quyu Kong, Pio Calderon, Rohit Ram, Olga Boichak, Marian-Andrei Rizoiu
  • Subjects: Social and Information Networks (cs.SI)
  • Arxiv link: https://arxiv.org/abs/2211.14114
  • Pdf link: https://arxiv.org/pdf/2211.14114
  • Abstract Social media is being increasingly weaponized by state-backed actors to elicit reactions, push narratives and sway public opinion. These are known as Information Operations (IO). The covert nature of IO makes their detection difficult. This is further amplified by missing data due to the user and content removal and privacy requirements. This work advances the hypothesis that the very reactions that Information Operations seek to elicit within the target social systems can be used to detect them. We propose an Interval-censored Transformer Hawkes (IC-TH) architecture and a novel data encoding scheme to account for both observed and missing data. We derive a novel log-likelihood function that we deploy together with a contrastive learning procedure. We showcase the performance of IC-TH on three real-world Twitter datasets and two learning tasks: future popularity prediction and item category prediction. The latter is particularly significant. Using the retweeting timing and patterns solely, we can predict the category of YouTube videos, guess whether news publishers are reputable or controversial and, most importantly, identify state-backed IO agent accounts. Additional qualitative investigations uncover that the automatically discovered clusters of Russian-backed agents appear to coordinate their behavior, activating simultaneously to push specific narratives.

PoET: Pose Estimation Transformer for Single-View, Multi-Object 6D Pose Estimation

  • Authors: Thomas Jantos, Mohamed Amin Hamdad, Wolfgang Granig, Stephan Weiss, Jan Steinbrener
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.14125
  • Pdf link: https://arxiv.org/pdf/2211.14125
  • Abstract Accurate 6D object pose estimation is an important task for a variety of robotic applications such as grasping or localization. It is a challenging task due to object symmetries, clutter and occlusion, but it becomes more challenging when additional information, such as depth and 3D models, is not provided. We present a transformer-based approach that takes an RGB image as input and predicts a 6D pose for each object in the image. Besides the image, our network does not require any additional information such as depth maps or 3D object models. First, the image is passed through an object detector to generate feature maps and to detect objects. Then, the feature maps are fed into a transformer with the detected bounding boxes as additional information. Afterwards, the output object queries are processed by a separate translation and rotation head. We achieve state-of-the-art results for RGB-only approaches on the challenging YCB-V dataset. We illustrate the suitability of the resulting model as pose sensor for a 6-DoF state estimation task. Code is available at https://github.com/aau-cns/poet.

Interaction Visual Transformer for Egocentric Action Anticipation

  • Authors: Debaditya Roy, Ramanathan Rajendiran, Basura Fernando
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.14154
  • Pdf link: https://arxiv.org/pdf/2211.14154
  • Abstract Human-object interaction is one of the most important visual cues that has not been explored for egocentric action anticipation. We propose a novel Transformer variant to model interactions by computing the change in the appearance of objects and human hands due to the execution of the actions and use those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT which achieves state-of-the-art action anticipation performance on large-scale egocentric datasets EPICKTICHENS100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual transformer-based methods including object-centric video representation. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission) where it outperforms the second-best model by 3.3% on mean-top5 recall.

Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations

  • Authors: Tan Yu, Ping Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.14255
  • Pdf link: https://arxiv.org/pdf/2211.14255
  • Abstract The formidable accomplishment of Transformers in natural language processing has motivated the researchers in the computer vision community to build Vision Transformers. Compared with the Convolution Neural Networks (CNN), a Vision Transformer has a larger receptive field which is capable of characterizing the long-range dependencies. Nevertheless, the large receptive field of Vision Transformer is accompanied by the huge computational cost. To boost efficiency, the window-based Vision Transformers emerge. They crop an image into several local windows, and the self-attention is conducted within each window. To bring back the global receptive field, window-based Vision Transformers have devoted a lot of efforts to achieving cross-window communications by developing several sophisticated operations. In this work, we check the necessity of the key design element of Swin Transformer, the shifted window partitioning. We discover that a simple depthwise convolution is sufficient for achieving effective cross-window communications. Specifically, with the existence of the depthwise convolution, the shifted window configuration in Swin Transformer cannot lead to an additional performance improvement. Thus, we degenerate the Swin Transformer to a plain Window-based (Win) Transformer by discarding sophisticated shifted window partitioning. The proposed Win Transformer is conceptually simpler and easier for implementation than Swin Transformer. Meanwhile, our Win Transformer achieves consistently superior performance than Swin Transformer on multiple computer vision tasks, including image recognition, semantic segmentation, and object detection.

A System for Morphology-Task Generalization via Unified Representation and Behavior Distillation

  • Authors: Hiroki Furuta, Yusuke Iwasawa, Yutaka Matsuo, Shixiang Shane Gu
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2211.14296
  • Pdf link: https://arxiv.org/pdf/2211.14296
  • Abstract The rise of generalist large-scale models in natural language and vision has made us expect that a massive data-driven approach could achieve broader generalization in other domains such as continuous control. In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data. In order to align input-output (IO) interface among multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce morphology-task graph, which treats observations, actions and goals/task in a unified graph representation. We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations with a minimal blueprint and hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find out that a morphology-task graph representation coupled with Transformer architecture improves the multi-task performances compared to other baselines including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer or sample efficiency in downstream multi-task imitation learning. Our work suggests large diverse offline datasets, unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization.

RUST: Latent Neural Scene Representations from Unposed Imagery

  • Authors: Mehdi S. M. Sajjadi, Aravindh Mahendran, Thomas Kipf, Etienne Pot, Daniel Duckworth, Mario Lucic, Klaus Greff
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2211.14306
  • Pdf link: https://arxiv.org/pdf/2211.14306
  • Abstract Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model which can provide latent representations which effectively generalize beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves similar quality as methods which have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations.

MAEDAY: MAE for few and zero shot AnomalY-Detection

  • Authors: Eli Schwartz, Assaf Arbelle, Leonid Karlinsky, Sivan Harary, Florian Scheidegger, Sivan Doveh, Raja Giryes
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.14307
  • Pdf link: https://arxiv.org/pdf/2211.14307
  • Abstract The goal of Anomaly-Detection (AD) is to identify outliers, or outlying regions, from some unknown distribution given only a set of positive (good) examples. Few-Shot AD (FSAD) aims to solve the same task with a minimal amount of normal examples. Recent embedding-based methods, that compare the embedding vectors of queries to a set of reference embeddings, have demonstrated impressive results for FSAD, where as little as one good example is provided. A different approach, image-reconstruction-based, has been historically used for AD. The idea is to train a model to recover normal images from corrupted observations, assuming that the model will fail to recover regions when encountered with an out-of-distribution image. However, image-reconstruction-based methods were not yet used in the low-shot regime as they need to be trained on a diverse set of normal images in order to properly perform. We suggest using Masked Auto-Encoder (MAE), a self-supervised transformer model trained for recovering missing image regions based on their surroundings for FSAD. We show that MAE performs well by pre-training on an arbitrary set of natural images (ImageNet) and only fine-tuning on a small set of normal images. We name this method MAEDAY. We further find that MAEDAY provides an orthogonal signal to the embedding-based methods and the ensemble of the two approaches achieves very strong SOTA results. We also present a novel task of Zero-Shot AD (ZSAD) where no normal samples are available at training time. We show that MAEDAY performs surprisingly well at this task. Finally, we provide a new dataset for detecting foreign objects on the ground and demonstrate superior results for this task as well. Code is available at https://github.com/EliSchwartz/MAEDAY .

Keyword: scene understanding

Language-Assisted 3D Feature Learning for Semantic Scene Understanding

  • Authors: Junbo Zhang, Guofan Fan, Guanghan Wang, Zhengyuan Su, Kaisheng Ma, Li Yi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.14091
  • Pdf link: https://arxiv.org/pdf/2211.14091
  • Abstract Learning descriptive 3D features is crucial for understanding 3D scenes with diverse objects and complex structures. However, it is usually unknown whether important geometric attributes and scene context obtain enough emphasis in an end-to-end trained 3D scene understanding network. To guide 3D feature learning toward important geometric attributes and scene context, we explore the help of textual scene descriptions. Given some free-form descriptions paired with 3D scenes, we extract the knowledge regarding the object relationships and object attributes. We then inject the knowledge to 3D feature learning through three classification-based auxiliary tasks. This language-assisted training can be combined with modern object detection and instance segmentation methods to promote 3D semantic scene understanding, especially in a label-deficient regime. Moreover, the 3D feature learned with language assistance is better aligned with the language features, which can benefit various 3D-language multimodal tasks. Experiments on several benchmarks of 3D-only and 3D-language tasks demonstrate the effectiveness of our language-assisted 3D feature learning. Code is available at https://github.com/Asterisci/Language-Assisted-3D.

Learning 3D Scene Priors with 2D Supervision

  • Authors: Yinyu Nie, Angela Dai, Xiaoguang Han, Matthias Nießner
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.14157
  • Pdf link: https://arxiv.org/pdf/2211.14157
  • Abstract Holistic 3D scene understanding entails estimation of both layout configuration and object geometry in a 3D environment. Recent works have shown advances in 3D scene estimation from various input modalities (e.g., images, 3D scans), by leveraging 3D supervision (e.g., 3D bounding boxes or CAD models), for which collection at scale is expensive and often intractable. To address this shortcoming, we propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth. Instead, we rely on 2D supervision from multi-view RGB images. Our method represents a 3D scene as a latent vector, from which we can progressively decode to a sequence of objects characterized by their class categories, 3D bounding boxes, and meshes. With our trained autoregressive decoder representing the scene prior, our method facilitates many downstream applications, including scene synthesis, interpolation, and single-view reconstruction. Experiments on 3D-FRONT and ScanNet show that our method outperforms state of the art in single-view reconstruction, and achieves state-of-the-art results in scene synthesis against baselines which require for 3D supervision.

Keyword: visual reasoning

There is no result

DongZhouGu avatar Nov 28 '22 02:11 DongZhouGu