Paper-Daily-Notice
New submissions for Thu, 4 Aug 22
Keyword: SLAM
Present and Future of SLAM in Extreme Underground Environments
- Authors: Kamak Ebadi, Lukas Bernreiter, Harel Biggie, Gavin Catt, Yun Chang, Arghya Chatterjee, Christopher E. Denniston, Simon-Pierre Deschênes, Kyle Harlow, Shehryar Khattak, Lucas Nogueira, Matteo Palieri, Pavel Petráček, Matěj Petrlík, Andrzej Reinke, Vít Krátký, Shibo Zhao, Ali-akbar Agha-mohammadi, Kostas Alexis, Christoffer Heckman, Kasra Khosoussi, Navinda Kottege, Benjamin Morrell, Marco Hutter, Fred Pauling, François Pomerleau, Martin Saska, Sebastian Scherer, Roland Siegwart, Jason L. Williams, Luca Carlone
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2208.01787
- Pdf link: https://arxiv.org/pdf/2208.01787
- Abstract This paper reports on the state of the art in underground SLAM by discussing different SLAM strategies and results across six teams that participated in the three-year-long SubT competition. In particular, the paper has four main goals. First, we review the algorithms, architectures, and systems adopted by the teams; particular emphasis is put on lidar-centric SLAM solutions (the go-to approach for virtually all teams in the competition), heterogeneous multi-robot operation (including both aerial and ground robots), and real-world underground operation (from the presence of obscurants to the need to handle tight computational constraints). We do not shy away from discussing the dirty details behind the different SubT SLAM systems, which are often omitted from technical papers. Second, we discuss the maturity of the field by highlighting what is possible with the current SLAM systems and what we believe is within reach with some good systems engineering. Third, we outline what we believe are fundamental open problems that are likely to require further research to break through. Finally, we provide a list of open-source SLAM implementations and datasets that have been produced during the SubT challenge and related efforts, and that constitute a useful resource for researchers and practitioners.
Evaluation and comparison of eight popular Lidar and Visual SLAM algorithms
- Authors: Bharath Garigipati, Nataliya Strokina, Reza Ghabcheloo
- Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.02063
- Pdf link: https://arxiv.org/pdf/2208.02063
- Abstract In this paper, we evaluate eight popular and open-source 3D Lidar and visual SLAM (Simultaneous Localization and Mapping) algorithms, namely LOAM, Lego LOAM, LIO SAM, HDL Graph, ORB SLAM3, Basalt VIO, and SVO2. We devised indoor and outdoor experiments to investigate the effects of i) the mounting positions of the sensors, ii) terrain type and vibration, and iii) motion (variation in linear and angular speed). We compare the algorithms' performance in terms of relative and absolute pose error, and we also compare their required computational resources. We thoroughly analyse and discuss the results and identify the best-performing system for each environment using our multi-camera and multi-Lidar indoor and outdoor datasets. We hope our findings help readers choose a sensor and SLAM algorithm combination suited to their target environment.
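- Sketch: the comparison relies on relative and absolute pose error; the snippet below is a minimal, generic illustration of how such trajectory metrics are commonly computed (it assumes timestamp-matched, pre-aligned trajectories and is not the authors' evaluation code):
```python
import numpy as np

def absolute_trajectory_error(gt_xyz, est_xyz):
    """RMSE of translational error between aligned, timestamp-matched
    trajectories (alignment, e.g. Umeyama, is assumed done upstream)."""
    diff = gt_xyz - est_xyz                      # (N, 3) position residuals
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))

def relative_pose_error(gt_poses, est_poses, delta=1):
    """Translational RPE over a fixed frame offset `delta`.
    Poses are (N, 4, 4) homogeneous transforms in the world frame."""
    errors = []
    for i in range(len(gt_poses) - delta):
        gt_rel = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        est_rel = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]
        err = np.linalg.inv(gt_rel) @ est_rel    # residual relative motion
        errors.append(np.linalg.norm(err[:3, 3]))
    return np.sqrt(np.mean(np.square(errors)))
```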
Keyword: odometry
There are no results.
Keyword: livox
There are no results.
Keyword: loam
Evaluation and comparison of eight popular Lidar and Visual SLAM algorithms
- Authors: Bharath Garigipati, Nataliya Strokina, Reza Ghabcheloo
- Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.02063
- Pdf link: https://arxiv.org/pdf/2208.02063
- Abstract In this paper, we evaluate eight popular and open-source 3D Lidar and visual SLAM (Simultaneous Localization and Mapping) algorithms, namely LOAM, Lego LOAM, LIO SAM, HDL Graph, ORB SLAM3, Basalt VIO, and SVO2. We devised indoor and outdoor experiments to investigate the effects of i) the mounting positions of the sensors, ii) terrain type and vibration, and iii) motion (variation in linear and angular speed). We compare the algorithms' performance in terms of relative and absolute pose error, and we also compare their required computational resources. We thoroughly analyse and discuss the results and identify the best-performing system for each environment using our multi-camera and multi-Lidar indoor and outdoor datasets. We hope our findings help readers choose a sensor and SLAM algorithm combination suited to their target environment.
Keyword: lidar
Present and Future of SLAM in Extreme Underground Environments
- Authors: Kamak Ebadi, Lukas Bernreiter, Harel Biggie, Gavin Catt, Yun Chang, Arghya Chatterjee, Christopher E. Denniston, Simon-Pierre Deschênes, Kyle Harlow, Shehryar Khattak, Lucas Nogueira, Matteo Palieri, Pavel Petráček, Matěj Petrlík, Andrzej Reinke, Vít Krátký, Shibo Zhao, Ali-akbar Agha-mohammadi, Kostas Alexis, Christoffer Heckman, Kasra Khosoussi, Navinda Kottege, Benjamin Morrell, Marco Hutter, Fred Pauling, François Pomerleau, Martin Saska, Sebastian Scherer, Roland Siegwart, Jason L. Williams, Luca Carlone
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2208.01787
- Pdf link: https://arxiv.org/pdf/2208.01787
- Abstract This paper reports on the state of the art in underground SLAM by discussing different SLAM strategies and results across six teams that participated in the three-year-long SubT competition. In particular, the paper has four main goals. First, we review the algorithms, architectures, and systems adopted by the teams; particular emphasis is put on lidar-centric SLAM solutions (the go-to approach for virtually all teams in the competition), heterogeneous multi-robot operation (including both aerial and ground robots), and real-world underground operation (from the presence of obscurants to the need to handle tight computational constraints). We do not shy away from discussing the dirty details behind the different SubT SLAM systems, which are often omitted from technical papers. Second, we discuss the maturity of the field by highlighting what is possible with the current SLAM systems and what we believe is within reach with some good systems engineering. Third, we outline what we believe are fundamental open problems that are likely to require further research to break through. Finally, we provide a list of open-source SLAM implementations and datasets that have been produced during the SubT challenge and related efforts, and that constitute a useful resource for researchers and practitioners.
SuperLine3D: Self-supervised Line Segmentation and Description for LiDAR Point Cloud
- Authors: Xiangrui Zhao, Sheng Yang, Tianxin Huang, Jun Chen, Teng Ma, Mingyang Li, Yong Liu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.01925
- Pdf link: https://arxiv.org/pdf/2208.01925
- Abstract Poles and building edges are frequently observable objects on urban roads, conveying reliable hints for various computer vision tasks. To repetitively extract them as features and associate them between discrete LiDAR frames for registration, we propose the first learning-based feature segmentation and description model for 3D lines in LiDAR point clouds. To train our model without the time-consuming and tedious data labeling process, we first generate synthetic primitives for the basic appearance of target lines, and build an iterative line auto-labeling process to gradually refine line labels on real LiDAR scans. Our segmentation model can extract lines under arbitrary scale perturbations, and we use shared EdgeConv encoder layers to train the segmentation and descriptor heads jointly. Based on this model, we can build a highly available global registration module for point cloud registration in conditions without initial transformation hints. Experiments have demonstrated that our line-based registration method is highly competitive with state-of-the-art point-based approaches. Our code is available at https://github.com/zxrzju/SuperLine3D.git.
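- Sketch: the abstract mentions generating synthetic primitives for target lines and training under arbitrary scale perturbations; the toy sampler below illustrates that idea only (point counts, noise level, and scale range are assumptions, not the authors' recipe):
```python
import numpy as np

def synthetic_line_points(n_points=128, length=5.0, noise=0.02, rng=None):
    """Sample noisy points along a random 3D line segment, then apply a
    random scale perturbation so learned features become scale-robust."""
    rng = rng or np.random.default_rng()
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    t = rng.uniform(0.0, length, size=(n_points, 1))
    points = t * direction + rng.normal(scale=noise, size=(n_points, 3))
    scale = rng.uniform(0.5, 2.0)                 # arbitrary scale perturbation
    return points * scale

cloud = synthetic_line_points()                   # (128, 3) labeled as "line"
labels = np.ones(len(cloud), dtype=np.int64)
```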
Evaluation and comparison of eight popular Lidar and Visual SLAM algorithms
- Authors: Bharath Garigipati, Nataliya Strokina, Reza Ghabcheloo
- Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.02063
- Pdf link: https://arxiv.org/pdf/2208.02063
- Abstract In this paper, we evaluate eight popular and open-source 3D Lidar and visual SLAM (Simultaneous Localization and Mapping) algorithms, namely LOAM, Lego LOAM, LIO SAM, HDL Graph, ORB SLAM3, Basalt VIO, and SVO2. We devised indoor and outdoor experiments to investigate the effects of i) the mounting positions of the sensors, ii) terrain type and vibration, and iii) motion (variation in linear and angular speed). We compare the algorithms' performance in terms of relative and absolute pose error, and we also compare their required computational resources. We thoroughly analyse and discuss the results and identify the best-performing system for each environment using our multi-camera and multi-Lidar indoor and outdoor datasets. We hope our findings help readers choose a sensor and SLAM algorithm combination suited to their target environment.
Keyword: loop detection
There are no results.
Keyword: nerf
There are no results.
Keyword: mapping
Cross-Modal Alignment Learning of Vision-Language Conceptual Systems
- Authors: Taehyeong Kim, Hyeonseop Song, Byoung-Tak Zhang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2208.01744
- Pdf link: https://arxiv.org/pdf/2208.01744
- Abstract Human infants learn the names of objects and develop their own conceptual systems without explicit supervision. In this study, we propose methods for learning aligned vision-language conceptual systems inspired by infants' word learning mechanisms. The proposed model learns the associations of visual objects and words online and gradually constructs cross-modal relational graph networks. Additionally, we propose an aligned cross-modal representation learning method that learns semantic representations of visual objects and words in a self-supervised manner based on the cross-modal relational graph networks. It allows entities of different modalities with conceptually the same meaning to have similar semantic representation vectors. We quantitatively and qualitatively evaluate our method, including object-to-word mapping and zero-shot learning tasks, showing that the proposed model significantly outperforms the baselines and that each conceptual system is topologically aligned.
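- Sketch: the model is described as learning word-object associations online into a cross-modal relational graph; the toy counter below only illustrates the general idea of online co-occurrence edges (the class, normalization, and scoring are assumptions, not the paper's update rule):
```python
from collections import defaultdict

class CrossModalGraph:
    """Incrementally count co-occurrences between visual object labels and
    words; edge weights serve as a crude association strength."""
    def __init__(self):
        self.edges = defaultdict(float)
        self.object_counts = defaultdict(int)

    def observe(self, objects, words):
        for o in objects:
            self.object_counts[o] += 1
            for w in words:
                self.edges[(o, w)] += 1.0

    def association(self, obj, word):
        # co-occurrence normalized by how often the object was seen
        return self.edges[(obj, word)] / max(self.object_counts[obj], 1)

g = CrossModalGraph()
g.observe(["ball"], ["red", "ball"])
g.observe(["ball"], ["round", "ball"])
print(g.association("ball", "ball"))   # 1.0
```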
BPMN4sML: A BPMN Extension for Serverless Machine Learning. Technology Independent and Interoperable Modeling of Machine Learning Workflows and their Serverless Deployment Orchestration
- Authors: Laurens Martin Tetzlaff
- Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2208.02030
- Pdf link: https://arxiv.org/pdf/2208.02030
- Abstract Machine learning (ML) continues to permeate all layers of academia, industry and society. Despite its successes, mental frameworks to capture and represent machine learning workflows in a consistent and coherent manner are lacking. For instance, the de facto process modeling standard, Business Process Model and Notation (BPMN), managed by the Object Management Group, is widely accepted and applied. However, it lacks specific support for representing machine learning workflows. Further, the number of heterogeneous tools for deployment of machine learning solutions can easily overwhelm practitioners. Research is needed to align the process from modeling to deploying ML workflows. We analyze requirements for standards-based conceptual modeling of machine learning workflows and their serverless deployment. Confronting the shortcomings with respect to consistent and coherent modeling of ML workflows in a technology-independent and interoperable manner, we extend BPMN's Meta-Object Facility (MOF) metamodel and the corresponding notation and introduce BPMN4sML (BPMN for serverless machine learning). Our extension BPMN4sML follows the same outline referenced by the Object Management Group (OMG) for BPMN. We further address the heterogeneity in deployment by proposing a conceptual mapping to convert BPMN4sML models to corresponding deployment models using TOSCA. BPMN4sML allows technology-independent and interoperable modeling of machine learning workflows of various granularity and complexity across the entire machine learning lifecycle. It aids in arriving at a shared and standardized language to communicate ML solutions. Moreover, it takes the first steps toward enabling conversion of ML workflow model diagrams to corresponding deployment models for serverless deployment via TOSCA.
Evaluation and comparison of eight popular Lidar and Visual SLAM algorithms
- Authors: Bharath Garigipati, Nataliya Strokina, Reza Ghabcheloo
- Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.02063
- Pdf link: https://arxiv.org/pdf/2208.02063
- Abstract In this paper, we evaluate eight popular and open-source 3D Lidar and visual SLAM (Simultaneous Localization and Mapping) algorithms, namely LOAM, Lego LOAM, LIO SAM, HDL Graph, ORB SLAM3, Basalt VIO, and SVO2. We devised indoor and outdoor experiments to investigate the effects of i) the mounting positions of the sensors, ii) terrain type and vibration, and iii) motion (variation in linear and angular speed). We compare the algorithms' performance in terms of relative and absolute pose error, and we also compare their required computational resources. We thoroughly analyse and discuss the results and identify the best-performing system for each environment using our multi-camera and multi-Lidar indoor and outdoor datasets. We hope our findings help readers choose a sensor and SLAM algorithm combination suited to their target environment.
Keyword: localization
Distributed Event-Triggered Nonlinear Fusion Estimation under Resource Constraints
- Authors: Rusheng Wang, Bo Chen, Zhongyao Hu, Li Yu
- Subjects: Systems and Control (eess.SY); Networking and Internet Architecture (cs.NI)
- Arxiv link: https://arxiv.org/abs/2208.01812
- Pdf link: https://arxiv.org/pdf/2208.01812
- Abstract This paper studies the event-triggered distributed fusion estimation problems for a class of nonlinear networked multisensor fusion systems without noise statistical characteristics. When considering the limited resource problems of two kinds of communication channels (i.e., sensor-to-remote estimator channel and smart sensor-to-fusion center channel), an event-triggered strategy and a dimensionality reduction strategy are introduced in a unified networked framework to lighten the communication burden. Then, two kinds of compensation strategies in terms of a unified model are designed to restructure the untransmitted information, and the local/fusion estimators are proposed based on the compensation information. Furthermore, the linearization errors caused by the Taylor expansion are modeled by the state-dependent matrices with uncertain parameters when establishing estimation error systems, and then different robust recursive optimization problems are constructed to determine the estimator gains and the fusion criteria. Meanwhile, the stability conditions are also proposed such that the square errors of the designed nonlinear estimators are bounded. Finally, a vehicle localization system is employed to demonstrate the effectiveness and advantages of the proposed methods.
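- Sketch: a common form of event-triggered transmission is a send-on-delta rule, shown below as a minimal illustration of the kind of trigger the abstract refers to (the threshold, norm, and class are assumptions, not the paper's trigger design):
```python
import numpy as np

class EventTrigger:
    """Transmit a measurement only when it deviates from the last transmitted
    one by more than a threshold; otherwise the remote estimator falls back
    on a compensated (predicted) value."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.last_sent = None

    def should_send(self, measurement):
        if self.last_sent is None or \
           np.linalg.norm(measurement - self.last_sent) > self.threshold:
            self.last_sent = measurement.copy()
            return True
        return False

trigger = EventTrigger(threshold=0.5)
for z in (np.array([0.1, 0.0]), np.array([0.2, 0.1]), np.array([1.0, 0.4])):
    print(trigger.should_send(z))   # True, False, True
```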
Statistical Attention Localization (SAL): Methodology and Application to Object Classification
- Authors: Yijing Yang, Vasileios Magoulianitis, Xinyu Wang, C.-C. Jay Kuo
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.01823
- Pdf link: https://arxiv.org/pdf/2208.01823
- Abstract A statistical attention localization (SAL) method is proposed to facilitate the object classification task in this work. SAL consists of three steps: 1) preliminary attention window selection via decision statistics, 2) attention map refinement, and 3) rectangular attention region finalization. SAL computes soft-decision scores of local squared windows and uses them to identify salient regions in Step 1. To accommodate objects of various sizes and shapes, SAL refines the preliminary result and obtains an attention map of more flexible shape in Step 2. Finally, SAL yields a rectangular attention region using the refined attention map and bounding box regularization in Step 3. As an application, we adopt E-PixelHop, an object classification solution based on successive subspace learning (SSL), as the baseline. We apply SAL to obtain a cropped-out and resized attention region as an alternative input. Classification results of the whole image as well as the attention region are ensembled to achieve the highest classification accuracy. Experiments on the CIFAR-10 dataset are given to demonstrate the advantage of the SAL-assisted object classification method.
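- Sketch: a toy version of the three steps (score local windows, threshold into an attention map, and finalize a rectangular region); the window size, scoring, and threshold are placeholders, not the paper's decision statistics:
```python
import numpy as np

def attention_region(score_map, win=8, thresh=0.6):
    """Step 1: average scores over local windows; Step 2: keep windows above
    a threshold as a rough attention map; Step 3: return the bounding box of
    the kept region as the rectangular attention window."""
    h, w = score_map.shape
    win_scores = np.zeros((h - win + 1, w - win + 1))
    for y in range(h - win + 1):
        for x in range(w - win + 1):
            win_scores[y, x] = score_map[y:y + win, x:x + win].mean()
    mask = win_scores >= thresh * win_scores.max()
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max() + win, ys.max() + win   # x0, y0, x1, y1

region = attention_region(np.random.rand(32, 32))
```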
Re-Attention Transformer for Weakly Supervised Object Localization
- Authors: Hui Su, Yue Ye, Zhiwei Chen, Mingli Song, Lechao Cheng
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.01838
- Pdf link: https://arxiv.org/pdf/2208.01838
- Abstract Weakly supervised object localization is a challenging task that aims to localize objects using only coarse annotations such as image categories. Existing deep network approaches are mainly based on the class activation map, which focuses on highlighting the discriminative local region while ignoring the full object. In addition, the emerging transformer-based techniques constantly put a lot of emphasis on the background, which impedes the ability to identify complete objects. To address these issues, we present a re-attention mechanism termed the token refinement transformer (TRT) that captures object-level semantics to guide the localization well. Specifically, TRT introduces a novel module named the token priority scoring module (TPSM) to suppress the effects of background noise while focusing on the target object. Then, we incorporate the class activation map as the semantically aware input to restrain the attention map to the target object. Extensive experiments on two benchmarks showcase the superiority of our proposed method over existing methods with image category annotations. Source code is available at https://github.com/su-hui-zz/ReAttentionTransformer.
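- Sketch: a minimal illustration of restraining a transformer attention map with a class activation map to obtain a localization box; the elementwise-product fusion and threshold here are assumptions, not TRT's exact module:
```python
import numpy as np

def localize(attention_map, cam, thresh=0.4):
    """Fuse a token attention map with a class activation map and take the
    bounding box of the above-threshold region as the predicted location."""
    att = attention_map / (attention_map.max() + 1e-8)
    cam = cam / (cam.max() + 1e-8)
    fused = att * cam                    # suppress background-only responses
    mask = fused >= thresh * fused.max()
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()

box = localize(np.random.rand(14, 14), np.random.rand(14, 14))
```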
Joint Sensing and Communications for Deep Reinforcement Learning-based Beam Management in 6G
- Authors: Yujie Yao, Hao Zhou, Melike Erol-Kantarci
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2208.01880
- Pdf link: https://arxiv.org/pdf/2208.01880
- Abstract User location is critical information for network management and control. However, location uncertainty is unavoidable in certain settings, leading to localization errors. In this paper, we consider user location uncertainty in mmWave networks, and investigate joint vision-aided sensing and communications using deep reinforcement learning-based beam management for future 6G networks. In particular, we first extract pixel characteristic-based features from satellite images to improve localization accuracy. Then we propose a UK-medoids based method for user clustering under location uncertainty, and the clustering results are consequently used for beam management. Finally, we apply the DRL algorithm for intra-beam radio resource allocation. The simulations first show that our proposed vision-aided method can substantially reduce the localization error. The proposed UK-medoids and DRL based scheme (UKM-DRL) is compared with two other schemes: K-means based clustering and DRL based resource allocation (K-DRL) and UK-means based clustering and DRL based resource allocation (UK-DRL). The proposed method has 17.2% higher throughput and 7.7% lower delay than UK-DRL, and more than doubled throughput and 55.8% lower delay than K-DRL.
Negative Frames Matter in Egocentric Visual Query 2D Localization
- Authors: Mengmeng Xu, Cheng-Yang Fu, Yanghao Li, Bernard Ghanem, Juan-Manuel Perez-Rua, Tao Xiang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.01949
- Pdf link: https://arxiv.org/pdf/2208.01949
- Abstract The recently released Ego4D dataset and benchmark significantly scale and diversify first-person visual perception data. In Ego4D, the Visual Queries 2D Localization task aims to retrieve objects that appeared in the past from a recording in the first-person view. This task requires a system to spatially and temporally localize the most recent appearance of a given object query, where the query is registered by a single tight visual crop of the object in a different scene. Our study is based on the three-stage baseline introduced in the Episodic Memory benchmark. The baseline solves the problem by detection and tracking: detect similar objects in all frames, then run a tracker from the most confident detection result. In the VQ2D challenge, we identified two limitations of the current baseline. (1) The training configuration has redundant computation. Although the training set has millions of instances, most of them are repetitive and the number of unique objects is only around 14.6k. The repeated gradient computation on the same objects leads to inefficient training. (2) The false positive rate is high on background frames. This is due to the distribution gap between training and evaluation: during training, the model only sees clean, stable, and labeled frames, but egocentric videos also contain noisy, blurry, or unlabeled background frames. To this end, we developed a more efficient and effective solution. Concretely, we bring the training loop from ~15 days to less than 24 hours, and we achieve 0.17% spatial-temporal AP, which is 31% higher than the baseline. Our solution ranked first on the public leaderboard. Our code is publicly available at https://github.com/facebookresearch/vq2d_cvpr.
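- Sketch: one natural way to remove the redundancy the authors describe (millions of instances but only ~14.6k unique objects) is to cap the number of sampled instances per unique object; the dedup helper below is illustrative only, and the `object_id` key and per-object cap are hypothetical:
```python
from collections import defaultdict
import random

def dedup_instances(instances, max_per_object=3, seed=0):
    """Group training instances by a (hypothetical) unique object id and keep
    only a few per object, reducing repeated gradient computation."""
    random.seed(seed)
    by_object = defaultdict(list)
    for inst in instances:
        by_object[inst["object_id"]].append(inst)
    kept = []
    for group in by_object.values():
        random.shuffle(group)
        kept.extend(group[:max_per_object])
    return kept

sample = [{"object_id": i % 5, "frame": i} for i in range(100)]
print(len(dedup_instances(sample)))   # at most 5 objects * 3 instances = 15
```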
Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos
- Authors: Juncheng Li, Junlin Xie, Linchao Zhu, Long Qian, Siliang Tang, Wenqiao Zhang, Haochen Shi, Shengyu Zhang, Longhui Wei, Qi Tian, Yueting Zhuang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.01954
- Pdf link: https://arxiv.org/pdf/2208.01954
- Abstract Understanding human emotions is a crucial ability for intelligent robots to provide better human-robot interactions. The existing works are limited to trimmed video-level emotion classification, failing to locate the temporal window corresponding to the emotion. In this paper, we introduce a new task, named Temporal Emotion Localization in videos~(TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles. TEL presents three unique challenges compared to temporal action localization: 1) The emotions have extremely varied temporal dynamics; 2) The emotion cues are embedded in both appearances and complex plots; 3) The fine-grained temporal annotations are complicated and labor-intensive. To address the first two challenges, we propose a novel dilated context integrated network with a coarse-fine two-stream architecture. The coarse stream captures varied temporal dynamics by modeling multi-granularity temporal contexts. The fine stream achieves complex plots understanding by reasoning the dependency between the multi-granularity temporal contexts from the coarse stream and adaptively integrates them into fine-grained video segment features. To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitle to achieve weakly-supervised learning. We contribute a new testing set with 3,000 manually-annotated temporal boundaries so that future research on the TEL problem can be quantitatively evaluated. Extensive experiments show the effectiveness of our approach on temporal emotion localization. The repository of this work is at https://github.com/YYJMJC/Temporal-Emotion-Localization-in-Videos.
Localization and Classification of Parasitic Eggs in Microscopic Images Using an EfficientDet Detector
- Authors: Nouar AlDahoul (1), Hezerul Abdul Karim (1), Shaira Limson Kee (2), Myles Joshua Toledo Tan (2 and 3) ((1) Faculty of Engineering, Multimedia University, Cyberjaya, Malaysia, (2) Department of Natural Sciences, University of St. La Salle, Bacolod City, Philippines, (3) Department of Chemical Engineering, University of St. La Salle, Bacolod City, Philippines)
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2208.01963
- Pdf link: https://arxiv.org/pdf/2208.01963
- Abstract Intestinal parasitic infections (IPIs) caused by protozoan and helminth parasites are among the most common infections in humans in low- and middle-income countries (LMICs). They are regarded as a severe public health concern, as they cause a wide array of potentially detrimental health conditions. Researchers have been developing pattern recognition techniques for the automatic identification of parasite eggs in microscopic images. Existing solutions still need improvements to reduce diagnostic errors and generate fast, efficient, and accurate results. Our paper addresses this and proposes a multi-modal learning detector to localize parasitic eggs and categorize them into 11 categories. The experiments were conducted on the novel Chula-ParasiteEgg-11 dataset, which was used to train both an EfficientDet model with an EfficientNet-v2 backbone and an EfficientNet-B7+SVM. The dataset has 11,000 microscopic training images from 11 categories. Our results show robust performance with an accuracy of 92% and an F1 score of 93%. Additionally, the IoU distribution illustrates the high localization capability of the detector.
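- Sketch: the reported IoU distribution measures localization quality; below is a standard intersection-over-union helper for axis-aligned boxes, shown for reference and not tied to the authors' code:
```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    x0 = max(box_a[0], box_b[0])
    y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2])
    y1 = min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # ~0.14
```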
Evaluation and comparison of eight popular Lidar and Visual SLAM algorithms
- Authors: Bharath Garigipati, Nataliya Strokina, Reza Ghabcheloo
- Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.02063
- Pdf link: https://arxiv.org/pdf/2208.02063
- Abstract In this paper, we evaluate eight popular and open-source 3D Lidar and visual SLAM (Simultaneous Localization and Mapping) algorithms, namely LOAM, Lego LOAM, LIO SAM, HDL Graph, ORB SLAM3, Basalt VIO, and SVO2. We devised indoor and outdoor experiments to investigate the effects of i) the mounting positions of the sensors, ii) terrain type and vibration, and iii) motion (variation in linear and angular speed). We compare the algorithms' performance in terms of relative and absolute pose error, and we also compare their required computational resources. We thoroughly analyse and discuss the results and identify the best-performing system for each environment using our multi-camera and multi-Lidar indoor and outdoor datasets. We hope our findings help readers choose a sensor and SLAM algorithm combination suited to their target environment.
DAHiTrA: Damage Assessment Using a Novel Hierarchical Transformer Architecture
- Authors: Navjot Kaur, Cheng-Chun Lee, Ali Mostafavi, Ali Mahdavi-Amiri
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.02205
- Pdf link: https://arxiv.org/pdf/2208.02205
- Abstract This paper presents DAHiTrA, a novel deep-learning model with hierarchical transformers to classify building damage based on satellite images in the aftermath of hurricanes. Automated building damage assessment provides critical information for decision making and resource allocation in rapid emergency response. Satellite imagery provides real-time, high-coverage information and offers opportunities to inform large-scale post-disaster building damage assessment. In addition, deep-learning methods have been shown to be promising for classifying building damage. In this work, a novel transformer-based network is proposed for assessing building damage. This network leverages hierarchical spatial features at multiple resolutions and captures temporal differences in the feature domain after applying a transformer encoder to the spatial features. The proposed network achieves state-of-the-art performance when tested on a large-scale disaster damage dataset (xBD) for building localization and damage classification, as well as on the LEVIR-CD dataset for change detection tasks. In addition, we introduce a new high-resolution satellite imagery dataset, Ida-BD (related to Hurricane Ida in Louisiana in 2021), for domain adaptation to further evaluate the capability of the model to be applied to newly damaged areas with scarce data. The domain adaptation results indicate that the proposed model can be adapted to a new event with only limited fine-tuning. Hence, the proposed model advances the current state of the art through better performance and domain adaptation. Also, Ida-BD provides a higher-resolution annotated dataset for future studies in this field.
Keyword: transformer
Two-Stream Transformer Architecture for Long Video Understanding
- Authors: Edward Fish, Jon Weinbren, Andrew Gilbert
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
- Arxiv link: https://arxiv.org/abs/2208.01753
- Pdf link: https://arxiv.org/pdf/2208.01753
- Abstract Pure vision transformer architectures are highly effective for short video classification and action recognition tasks. However, due to the quadratic complexity of self-attention and the lack of inductive bias, transformers are resource-intensive and suffer from data inefficiencies. Long-form video understanding tasks amplify the data and memory efficiency problems in transformers, making current approaches infeasible to implement on data- or memory-restricted domains. This paper introduces an efficient Spatio-Temporal Attention Network (STAN) which uses a two-stream transformer architecture to model dependencies between static image features and temporal contextual features. Our proposed approach can classify videos up to two minutes in length on a single GPU, is data efficient, and achieves SOTA performance on several long video understanding tasks.
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
- Authors: Jun Wang, Mingfei Gao, Yuqian Hu, Ramprasaath R. Selvaraju, Chetan Ramaiah, Ran Xu, Joseph F. JaJa, Larry S. Davis
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.01813
- Pdf link: https://arxiv.org/pdf/2208.01813
- Abstract Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient human-labeled question-answer (QA) pairs. However, we observe that, in general, the scene text is not fully exploited in the existing datasets -- only a small portion of text in each image participates in the annotated QA activities. This results in a huge waste of useful information. To address this deficiency, we develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image. Specifically, we propose TAG, a text-aware visual question-answer generation architecture that learns to produce meaningful and accurate QA samples using a multimodal transformer. The architecture exploits underexplored scene text information and enhances scene understanding of Text-VQA models by combining the generated QA pairs with the initial training data. Extensive experimental results on two well-known Text-VQA benchmarks (TextVQA and ST-VQA) demonstrate that our proposed TAG effectively enlarges the training data, which helps improve the Text-VQA performance without extra labeling effort. Moreover, our model outperforms state-of-the-art approaches that are pre-trained with extra large-scale data. Code will be made publicly available.
Learning Prior Feature and Attention Enhanced Image Inpainting
- Authors: Chenjie Cao, Qiaole Dong, Yanwei Fu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.01837
- Pdf link: https://arxiv.org/pdf/2208.01837
- Abstract Many recent inpainting works have achieved impressive results by leveraging Deep Neural Networks (DNNs) to model various priors for image restoration. Unfortunately, the performance of these methods is largely limited by the representation ability of vanilla Convolutional Neural Network (CNN) backbones. On the other hand, Vision Transformers (ViT) with self-supervised pre-training have shown great potential for many visual recognition and object detection tasks. A natural question is whether the inpainting task can greatly benefit from the ViT backbone. However, it is nontrivial to directly replace the backbones in inpainting networks, as inpainting is an inverse problem fundamentally different from recognition tasks. To this end, this paper incorporates the pre-training-based Masked AutoEncoder (MAE) into the inpainting model, which enjoys richer informative priors to enhance the inpainting process. Moreover, we propose to use attention priors from MAE to make the inpainting model learn more long-distance dependencies between masked and unmasked regions. Extensive ablations of the inpainting and self-supervised pre-training models are discussed in this paper. Besides, experiments on both Places2 and FFHQ demonstrate the effectiveness of our proposed model. Code and pre-trained models are released at https://github.com/ewrfcas/MAE-FAR.
Re-Attention Transformer for Weakly Supervised Object Localization
- Authors: Hui Su, Yue Ye, Zhiwei Chen, Mingli Song, Lechao Cheng
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.01838
- Pdf link: https://arxiv.org/pdf/2208.01838
- Abstract Weakly supervised object localization is a challenging task that aims to localize objects using only coarse annotations such as image categories. Existing deep network approaches are mainly based on the class activation map, which focuses on highlighting the discriminative local region while ignoring the full object. In addition, the emerging transformer-based techniques constantly put a lot of emphasis on the background, which impedes the ability to identify complete objects. To address these issues, we present a re-attention mechanism termed the token refinement transformer (TRT) that captures object-level semantics to guide the localization well. Specifically, TRT introduces a novel module named the token priority scoring module (TPSM) to suppress the effects of background noise while focusing on the target object. Then, we incorporate the class activation map as the semantically aware input to restrain the attention map to the target object. Extensive experiments on two benchmarks showcase the superiority of our proposed method over existing methods with image category annotations. Source code is available at https://github.com/su-hui-zz/ReAttentionTransformer.
Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition
- Authors: Mei Chee Leong, Haosong Zhang, Hui Li Tan, Liyuan Li, Joo Hwee Lim
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.01897
- Pdf link: https://arxiv.org/pdf/2208.01897
- Abstract Fine-grained action recognition is a challenging task in computer vision. As fine-grained datasets have small inter-class variations in spatial and temporal space, fine-grained action recognition models require good temporal reasoning and discrimination of attribute action semantics. Leveraging CNNs' ability to capture high-level spatio-temporal feature representations and the Transformer's efficiency in modeling latent semantics and global dependencies, we investigate two frameworks that combine a CNN vision backbone and a Transformer encoder to enhance fine-grained action recognition: 1) a vision-based encoder to learn latent temporal semantics, and 2) a multi-modal video-text cross encoder to exploit additional text input and learn cross associations between visual and text semantics. Our experimental results show that both Transformer encoder frameworks effectively learn latent temporal semantics and cross-modality associations, with improved recognition performance over the CNN vision model. We achieve new state-of-the-art performance on the FineGym benchmark dataset for both proposed architectures.
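- Sketch: a compact PyTorch-style outline of the first framework described (per-frame CNN features fed to a Transformer encoder for temporal reasoning); the ResNet-18 backbone, layer sizes, and mean pooling are assumptions, not the authors' exact architecture:
```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CNNTransformer(nn.Module):
    """Per-frame CNN features -> Transformer encoder over time -> class logits."""
    def __init__(self, num_classes, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop the fc head
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, video):                       # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).flatten(1)   # (B*T, 512)
        encoded = self.encoder(feats.view(b, t, -1))        # temporal reasoning
        return self.head(encoded.mean(dim=1))                # pool over time

logits = CNNTransformer(num_classes=99)(torch.randn(2, 8, 3, 112, 112))
```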
SSformer: A Lightweight Transformer for Semantic Segmentation
- Authors: Wentao Shi, Jing Xu, Pan Gao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.02034
- Pdf link: https://arxiv.org/pdf/2208.02034
- Abstract It is well believed that Transformers perform better in semantic segmentation than convolutional neural networks. Nevertheless, the original Vision Transformer may lack the inductive biases of local neighborhoods and possess a high time complexity. Recently, the Swin Transformer set a new record in various vision tasks by using a hierarchical architecture and shifted windows while being more efficient. However, as the Swin Transformer is specifically designed for image classification, it may achieve suboptimal performance on dense prediction-based segmentation tasks. Further, simply combining the Swin Transformer with existing methods would increase the model size and parameter count of the final segmentation model. In this paper, we rethink the Swin Transformer for semantic segmentation and design a lightweight yet effective transformer model, called SSformer. In this model, considering the inherent hierarchical design of the Swin Transformer, we propose a decoder to aggregate information from different layers, thus obtaining both local and global attention. Experimental results show that the proposed SSformer yields mIoU performance comparable to state-of-the-art models while maintaining a smaller model size and lower compute.
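- Sketch: a rough illustration of the kind of lightweight decoder described, projecting multi-scale encoder features to a common width, upsampling, and fusing them; the channel sizes and fusion rule are assumptions, not SSformer's exact design:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightDecoder(nn.Module):
    """Aggregate hierarchical features from a Swin-like encoder into one
    segmentation map via per-stage 1x1 projections and a fusion conv."""
    def __init__(self, in_channels=(96, 192, 384, 768), embed=128, num_classes=19):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, embed, 1) for c in in_channels)
        self.fuse = nn.Conv2d(embed * len(in_channels), embed, 1)
        self.classify = nn.Conv2d(embed, num_classes, 1)

    def forward(self, feats):                       # list of (B, C_i, H_i, W_i)
        target = feats[0].shape[-2:]                # upsample to the finest scale
        ups = [F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.classify(self.fuse(torch.cat(ups, dim=1)))

feats = [torch.randn(1, c, 56 // 2 ** i, 56 // 2 ** i)
         for i, c in enumerate((96, 192, 384, 768))]
out = LightDecoder()(feats)                         # (1, 19, 56, 56)
```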
KPI-BERT: A Joint Named Entity Recognition and Relation Extraction Model for Financial Reports
- Authors: Lars Hillebrand, Tobias Deußer, Tim Dilmaghani, Bernd Kliem, Rüdiger Loitz, Christian Bauckhage, Rafet Sifa
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2208.02140
- Pdf link: https://arxiv.org/pdf/2208.02140
- Abstract We present KPI-BERT, a system which employs novel methods of named entity recognition (NER) and relation extraction (RE) to extract and link key performance indicators (KPIs), e.g. "revenue" or "interest expenses", of companies from real-world German financial documents. Specifically, we introduce an end-to-end trainable architecture that is based on Bidirectional Encoder Representations from Transformers (BERT) combining a recurrent neural network (RNN) with conditional label masking to sequentially tag entities before it classifies their relations. Our model also introduces a learnable RNN-based pooling mechanism and incorporates domain expert knowledge by explicitly filtering impossible relations. We achieve a substantially higher prediction performance on a new practical dataset of German financial reports, outperforming several strong baselines including a competing state-of-the-art span-based entity tagging approach.
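- Sketch: a small illustration of the domain-knowledge filtering mentioned in the abstract, dropping candidate relations whose entity types cannot co-occur; the relation and type names below are hypothetical, not KPI-BERT's actual schema:
```python
# Allowed (head type, tail type) pairs per relation -- a hypothetical schema
# standing in for the domain expert knowledge the model encodes.
ALLOWED = {
    "has_value": {("kpi", "monetary_value")},
    "refers_to_period": {("kpi", "time_period")},
}

def filter_relations(candidates):
    """Drop relation candidates that violate the allowed type combinations."""
    return [c for c in candidates
            if (c["head_type"], c["tail_type"]) in ALLOWED.get(c["relation"], set())]

candidates = [
    {"relation": "has_value", "head_type": "kpi", "tail_type": "monetary_value"},
    {"relation": "has_value", "head_type": "time_period", "tail_type": "kpi"},
]
print(filter_relations(candidates))   # keeps only the first candidate
```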
GPPF: A General Perception Pre-training Framework via Sparsely Activated Multi-Task Learning
- Authors: Benyuan Sun, Jin Dai, Zihao Liang, Congying Liu, Yi Yang, Bo Bai
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.02148
- Pdf link: https://arxiv.org/pdf/2208.02148
- Abstract Pre-training over mixed multi-task, multi-domain, and multi-modal data remains an open challenge in vision perception pre-training. In this paper, we propose GPPF, a General Perception Pre-training Framework, that pre-trains a task-level dynamic network, composed of knowledge "legos" in each layer, on labeled multi-task and multi-domain datasets. By inspecting humans' innate ability to learn in complex environments, we recognize and transfer three critical elements to deep networks: (1) simultaneous exposure to diverse cross-task and cross-domain information in each batch; (2) partitioned knowledge storage in separate lego units driven by knowledge sharing; (3) sparse activation of a subset of lego units for both pre-training and downstream tasks. Notably, the joint training of disparate vision tasks is non-trivial due to their differences in input shapes, loss functions, output formats, data distributions, etc. Therefore, we develop a plug-and-play multi-task training algorithm, which supports Single Iteration Multiple Tasks (SIMT) concurrent training. SIMT lays the foundation for pre-training with large-scale multi-task multi-domain datasets and proves essential for stable training in our GPPF experiments. Excitingly, the exhaustive experiments show that our GPPF-R50 model achieves significant improvements of 2.5-5.8 over a strong baseline on the 8 pre-training tasks in GPPF-15M and harvests a range of SOTAs over the 22 downstream tasks with similar computation budgets. We also validate the generalization ability of GPPF to SOTA vision transformers with consistent improvements. These solid experimental results demonstrate the effective knowledge learning, storing, sharing, and transfer provided by our novel GPPF framework.
DAHiTrA: Damage Assessment Using a Novel Hierarchical Transformer Architecture
- Authors: Navjot Kaur, Cheng-Chun Lee, Ali Mostafavi, Ali Mahdavi-Amiri
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.02205
- Pdf link: https://arxiv.org/pdf/2208.02205
- Abstract This paper presents DAHiTrA, a novel deep-learning model with hierarchical transformers to classify building damage based on satellite images in the aftermath of hurricanes. Automated building damage assessment provides critical information for decision making and resource allocation in rapid emergency response. Satellite imagery provides real-time, high-coverage information and offers opportunities to inform large-scale post-disaster building damage assessment. In addition, deep-learning methods have been shown to be promising for classifying building damage. In this work, a novel transformer-based network is proposed for assessing building damage. This network leverages hierarchical spatial features at multiple resolutions and captures temporal differences in the feature domain after applying a transformer encoder to the spatial features. The proposed network achieves state-of-the-art performance when tested on a large-scale disaster damage dataset (xBD) for building localization and damage classification, as well as on the LEVIR-CD dataset for change detection tasks. In addition, we introduce a new high-resolution satellite imagery dataset, Ida-BD (related to Hurricane Ida in Louisiana in 2021), for domain adaptation to further evaluate the capability of the model to be applied to newly damaged areas with scarce data. The domain adaptation results indicate that the proposed model can be adapted to a new event with only limited fine-tuning. Hence, the proposed model advances the current state of the art through better performance and domain adaptation. Also, Ida-BD provides a higher-resolution annotated dataset for future studies in this field.
Keyword: autonomous driving
MixNet: Structured Deep Neural Motion Prediction for Autonomous Racing
- Authors: Phillip Karle, Ferenc Török, Maximilian Geisslinger, Markus Lienkamp
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2208.01862
- Pdf link: https://arxiv.org/pdf/2208.01862
- Abstract Reliably predicting the motion of contestant vehicles surrounding an autonomous racecar is crucial for effective and performant planning. Although highly expressive, deep neural networks are black-box models, making their usage challenging in safety-critical applications, such as autonomous driving. In this paper, we introduce a structured way of forecasting the movement of opposing racecars with deep neural networks. The resulting set of possible output trajectories is constrained. Hence quality guarantees about the prediction can be given. We report the performance of the model by evaluating it together with an LSTM-based encoder-decoder architecture on data acquired from high-fidelity Hardware-in-the-Loop simulations. The proposed approach outperforms the baseline regarding the prediction accuracy but still fulfills the quality guarantees. Thus, a robust real-world application of the model is proven. The presented model was deployed on the racecar of the Technical University of Munich for the Indy Autonomous Challenge 2021. The code used in this research is available as open-source software at www.github.com/TUMFTM/MixNet.
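- Sketch: one way to constrain a neural predictor's output set, in the spirit the abstract describes, is to predict convex weights over a fixed set of base trajectories so every output stays within their span; the base-curve choice and softmax weighting below are assumptions, not MixNet's published formulation:
```python
import numpy as np

def mix_trajectories(base_curves, logits):
    """Blend base trajectories with softmax weights so the prediction is a
    convex combination -- it cannot leave the region the bases span."""
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return np.tensordot(weights, base_curves, axes=1)     # (T, 2)

t = np.linspace(0.0, 1.0, 50)
left   = np.stack([t, np.full_like(t,  3.0)], axis=1)     # e.g. left track bound
right  = np.stack([t, np.full_like(t, -3.0)], axis=1)     # e.g. right track bound
center = np.stack([t, np.zeros_like(t)], axis=1)          # e.g. centerline
pred = mix_trajectories(np.stack([left, right, center]), np.array([0.2, 0.1, 1.5]))
```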