
New submissions for Thu, 11 Aug 22

zhuhu00 opened this issue 2 years ago • 0 comments


Keyword: SLAM

There is no result

Keyword: odometry

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer

  • Authors: Zhipeng Luo, Changqing Zhou, Liang Pan, Gongjie Zhang, Tianrui Liu, Yueru Luo, Haiyu Zhao, Ziwei Liu, Shijian Lu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.05216
  • Pdf link: https://arxiv.org/pdf/2208.05216
  • Abstract With the prevalence of LiDAR sensors in autonomous driving, 3D object tracking has received increasing attention. In a point cloud sequence, 3D object tracking aims to predict the location and orientation of an object in consecutive frames given an object template. Motivated by the success of transformers, we propose Point Tracking TRansformer (PTTR), which efficiently predicts high-quality 3D tracking results in a coarse-to-fine manner with the help of transformer operations. PTTR consists of three novel designs. 1) Instead of random sampling, we design Relation-Aware Sampling to preserve relevant points to the given template during subsampling. 2) We propose a Point Relation Transformer for effective feature aggregation and feature matching between the template and search region. 3) Based on the coarse tracking results, we employ a novel Prediction Refinement Module to obtain the final refined prediction through local feature pooling. In addition, motivated by the favorable properties of the Bird's-Eye View (BEV) of point clouds in capturing object motion, we further design a more advanced framework named PTTR++, which incorporates both the point-wise view and BEV representation to exploit their complementary effect in generating high-quality tracking results. PTTR++ substantially boosts the tracking performance on top of PTTR with low computational overhead. Extensive experiments over multiple datasets show that our proposed approaches achieve superior 3D tracking accuracy and efficiency.
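The Relation-Aware Sampling idea above (keep the search-region points most relevant to the template during subsampling, instead of sampling at random) can be sketched roughly as follows. The feature dimensions and the cosine-similarity criterion are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def relation_aware_sample(search_feats, template_feats, k):
    """Keep the k search-region points whose features are most similar
    to any template point (cosine similarity), instead of random sampling."""
    s = search_feats / np.linalg.norm(search_feats, axis=1, keepdims=True)
    t = template_feats / np.linalg.norm(template_feats, axis=1, keepdims=True)
    sim = s @ t.T                  # (N_search, N_template) similarity matrix
    score = sim.max(axis=1)        # best match to the template, per point
    return np.argsort(-score)[:k]  # indices of the k most relevant points

rng = np.random.default_rng(0)
template = rng.normal(size=(64, 32))    # toy template point features
search = rng.normal(size=(1024, 32))    # toy search-region point features
idx = relation_aware_sample(search, template, k=128)
```

Points near the template survive subsampling, so the subsequent feature matching operates on a relevant subset.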

Keyword: loop detection

There is no result

Keyword: nerf

There is no result

Keyword: mapping

Model-Free Generative Replay for Lifelong Reinforcement Learning: Application to Starcraft-2

  • Authors: Zachary Daniels, Aswin Raghavan, Jesse Hostetler, Abrar Rahman, Indranil Sur, Michael Piacentino, Ajay Divakaran
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.05056
  • Pdf link: https://arxiv.org/pdf/2208.05056
  • Abstract One approach to meet the challenges of deep lifelong reinforcement learning (LRL) is careful management of the agent's learning experiences, in order to learn (without forgetting) and build internal meta-models (of the tasks, environments, agents, and world). Generative replay (GR) is a biologically-inspired replay mechanism that augments learning experiences with self-labelled examples drawn from an internal generative model that is updated over time. In this paper, we present a version of GR for LRL that satisfies two desiderata: (a) Introspective density modelling of the latent representations of policies learned using deep RL, and (b) Model-free end-to-end learning. In this work, we study three deep learning architectures for model-free GR. We evaluate our proposed algorithms on three different scenarios comprising tasks from the StarCraft2 and Minigrid domains. We report several key findings showing the impact of the design choices on quantitative metrics that include transfer learning, generalization to unseen tasks, fast adaptation after task change, performance comparable to a task expert, and minimizing catastrophic forgetting. We observe that our GR prevents drift in the features-to-action mapping from the latent vector space of a deep actor-critic agent. We also show improvements in established lifelong learning metrics. We find that the introduction of a small random replay buffer is needed to significantly increase the stability of training, when used in conjunction with the replay buffer and the generated replay buffer. Overall, we find that "hidden replay" (a well-known architecture for class-incremental classification) is the most promising approach that pushes the state-of-the-art in GR for LRL.
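A toy sketch of the generative-replay loop described above, under strong simplifying assumptions: the "generative model" here is just a single Gaussian fitted to latent vectors, standing in for the paper's learned introspective density models.

```python
import numpy as np

class GaussianReplay:
    """Toy generative replay: fit a Gaussian to latent vectors seen so far,
    then draw 'replayed' latents to mix into future training batches.
    (A stand-in for the paper's density models, not their architecture.)"""
    def __init__(self, dim):
        self.n = 0

    def update(self, latents):
        # Refit on all data seen so far (a real system updates incrementally).
        self._seen = latents if self.n == 0 else np.vstack([self._seen, latents])
        self.n = len(self._seen)
        self.mean = self._seen.mean(axis=0)
        self.cov = np.cov(self._seen, rowvar=False) + 1e-6 * np.eye(self._seen.shape[1])

    def replay(self, k, rng):
        return rng.multivariate_normal(self.mean, self.cov, size=k)

rng = np.random.default_rng(0)
gr = GaussianReplay(dim=8)
task1 = rng.normal(loc=2.0, size=(500, 8))      # latents from an earlier task
gr.update(task1)
task2 = rng.normal(loc=-2.0, size=(32, 8))      # the new task's batch
mixed = np.vstack([task2, gr.replay(32, rng)])  # train on new + replayed data
```

Mixing replayed latents into each batch is what counteracts drift in the features-to-action mapping when the task distribution changes.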

Learning to Complete Object Shapes for Object-level Mapping in Dynamic Scenes

  • Authors: Binbin Xu, Andrew J. Davison, Stefan Leutenegger
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2208.05067
  • Pdf link: https://arxiv.org/pdf/2208.05067
  • Abstract In this paper, we propose a novel object-level mapping system that can simultaneously segment, track, and reconstruct objects in dynamic scenes. It can further predict and complete their full geometries by conditioning on reconstructions from depth inputs and a category-level shape prior with the aim that completed object geometry leads to better object reconstruction and tracking accuracy. For each incoming RGB-D frame, we perform instance segmentation to detect objects and build data associations between the detection and the existing object maps. A new object map will be created for each unmatched detection. For each matched object, we jointly optimise its pose and latent geometry representations using geometric residual and differential rendering residual towards its shape prior and completed geometry. Our approach shows better tracking and reconstruction performance compared to methods using traditional volumetric mapping or learned shape prior approaches. We evaluate its effectiveness by quantitatively and qualitatively testing it in both synthetic and real-world sequences.

Prior Knowledge based Advanced Persistent Threats Detection for IoT in a Realistic Benchmark

  • Authors: Yu Shen, Murat Simsek, Burak Kantarci, Hussein T. Mouftah, Mehran Bagheri, Petar Djukic
  • Subjects: Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2208.05089
  • Pdf link: https://arxiv.org/pdf/2208.05089
  • Abstract The number of Internet of Things (IoT) devices being deployed into networks is growing at a phenomenal level, which makes IoT networks more vulnerable in the wireless medium. Advanced Persistent Threat (APT) is malicious to most of the network facilities and the available attack data for training the machine learning-based Intrusion Detection System (IDS) is limited when compared to the normal traffic. Therefore, it is quite challenging to enhance the detection performance in order to mitigate the influence of APT. To this end, Prior Knowledge Input (PKI) models are proposed and tested using the SCVIC-APT-2021 dataset. To obtain prior knowledge, the proposed PKI model pre-classifies the original dataset with an unsupervised clustering method. Then, the obtained prior knowledge is incorporated into the supervised model to decrease training complexity and assist the supervised model in determining the optimal mapping between the raw data and true labels. The experimental findings indicate that the PKI model outperforms the supervised baseline, with the best macro average F1-score of 81.37%, which is 10.47% higher than the baseline.
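The PKI pipeline (pre-cluster the data unsupervised, then feed the cluster assignment to the supervised model as an extra input feature) might look roughly like this. The tiny hand-rolled k-means and the simple feature concatenation are illustrative, not the paper's actual pipeline.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means; returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():        # guard against empty clusters
                centers[j] = X[labels == j].mean(0)
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 4)),   # toy "normal traffic"
               rng.normal(6, 1, (100, 4))])  # toy "attack traffic"
prior = kmeans(X, k=2)                       # unsupervised "prior knowledge"
X_pki = np.hstack([X, prior[:, None]])       # append cluster id as a feature
```

A supervised classifier trained on `X_pki` then only has to refine the clustering into true labels, rather than learn the mapping from scratch.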

A broken FEEC framework for electromagnetic problems on mapped multipatch domains

  • Authors: Yaman Güçlü, Said Hadjout, Martin Campos Pinto
  • Subjects: Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
  • Arxiv link: https://arxiv.org/abs/2208.05238
  • Pdf link: https://arxiv.org/pdf/2208.05238
  • Abstract We present a framework for the structure-preserving approximation of partial differential equations on mapped multipatch domains, extending the classical theory of finite element exterior calculus (FEEC) to discrete de Rham sequences which are broken, i.e., fully discontinuous across the patch interfaces. Following the Conforming/Nonconforming Galerkin (CONGA) schemes developed in [arXiv:2109.02553], our approach is based on: (i) the identification of a conforming discrete de Rham sequence with stable commuting projection operators, (ii) the relaxation of the continuity constraints between patches, and (iii) the construction of conforming projections mapping back to the conforming subspaces, allowing us to define discrete differentials on the broken sequence. This framework combines the advantages of conforming FEEC discretizations (e.g. commuting projections, discrete duality and Hodge-Helmholtz decompositions) with the data locality and implementation simplicity of interior penalty methods for discontinuous Galerkin discretizations. We apply it to several initial- and boundary-value problems, as well as eigenvalue problems arising in electromagnetics. In each case our formulations are shown to be well posed thanks to an appropriate stabilization of the jumps across the interfaces, and the solutions are extremely robust with respect to the stabilization parameter. Finally we describe a construction using tensor-product splines on mapped cartesian patches, and we detail the associated matrix operators. Our numerical experiments confirm the accuracy and stability of this discrete framework, and they allow us to verify that expected structure-preserving properties such as divergence or harmonic constraints are respected to floating-point accuracy.
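For reference, the standard 3D de Rham diagram underlying FEEC, with the stable commuting projections of item (i) linking the continuous spaces to a conforming discrete sequence, can be written as follows (the broken sequence and the conforming projections of item (iii) are not shown here):

```latex
% Continuous and discrete de Rham sequences linked by commuting projections \Pi^k:
\begin{equation*}
\begin{array}{ccccccc}
H^1(\Omega) & \xrightarrow{\ \operatorname{grad}\ } & H(\operatorname{curl};\Omega)
& \xrightarrow{\ \operatorname{curl}\ } & H(\operatorname{div};\Omega)
& \xrightarrow{\ \operatorname{div}\ } & L^2(\Omega) \\[2pt]
\big\downarrow \Pi^0 & & \big\downarrow \Pi^1 & & \big\downarrow \Pi^2 & & \big\downarrow \Pi^3 \\[2pt]
V_h^0 & \xrightarrow{\ \operatorname{grad}\ } & V_h^1
& \xrightarrow{\ \operatorname{curl}\ } & V_h^2
& \xrightarrow{\ \operatorname{div}\ } & V_h^3
\end{array}
\end{equation*}
```

Commutativity of this diagram is what transfers the structure-preserving properties (exactness, Hodge-Helmholtz decompositions) from the continuous level to the discrete one.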

Empirical Formal Methods: Guidelines for Performing Empirical Studies on Formal Methods

  • Authors: Maurice H. ter Beek, Alessio Ferrari
  • Subjects: Software Engineering (cs.SE); Formal Languages and Automata Theory (cs.FL)
  • Arxiv link: https://arxiv.org/abs/2208.05266
  • Pdf link: https://arxiv.org/pdf/2208.05266
  • Abstract Empirical studies on formal methods and tools are rare. In this paper, we provide guidelines for such studies. We mention their main ingredients and then define nine different study strategies (laboratory experiments with software and human subjects, usability testing, surveys, qualitative studies, judgment studies, case studies, systematic literature reviews, and systematic mapping studies) and discuss for each of them their crucial characteristics, the difficulties of applying them to formal methods and tools, typical threats to validity, their maturity in formal methods, pointers to external guidelines, and pointers to studies in other fields. We conclude with a number of challenges for empirical formal methods.

Keyword: localization

An Integrated Actuation-Perception Framework for Robotic Leaf Retrieval: Detection, Localization, and Cutting

  • Authors: Merrick Campbell, Amel Dechemi, Konstantinos Karydis
  • Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2208.05032
  • Pdf link: https://arxiv.org/pdf/2208.05032
  • Abstract Contemporary robots in precision agriculture focus primarily on automated harvesting or remote sensing to monitor crop health. Comparatively less work has been performed with respect to collecting physical leaf samples in the field and retaining them for further analysis. Typically, orchard growers manually collect sample leaves and utilize them for stem water potential measurements to analyze tree health and determine irrigation routines. While this technique benefits orchard management, the process of collecting, assessing, and interpreting measurements requires significant human labor and often leads to infrequent sampling. Automated sampling can provide highly accurate and timely information to growers. The first step in such automated in-situ leaf analysis is identifying and cutting a leaf from a tree. This retrieval process requires new methods for actuation and perception. We present a technique for detecting and localizing candidate leaves using point cloud data from a depth camera. This technique is tested on both indoor and outdoor point clouds from avocado trees. We then use a custom-built leaf-cutting end-effector on a 6-DOF robotic arm to test the proposed detection and localization technique by cutting leaves from an avocado tree. Experimental testing with a real avocado tree demonstrates our proposed approach can enable our mobile manipulator and custom end-effector system to successfully detect, localize, and cut leaves.

Quadrotor Autonomous Landing on Moving Platform

  • Authors: Pengyu Wang, Chaoqun Wang, Jiankun Wang, Max Q.-H. Meng
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2208.05201
  • Pdf link: https://arxiv.org/pdf/2208.05201
  • Abstract This paper introduces a quadrotor's autonomous take-off and landing system on a moving platform. The designed system addresses three challenging problems: fast pose estimation, restricted external localization, and effective obstacle avoidance. Specifically, first, we design a landing recognition and positioning system based on the AruCo marker to help the quadrotor quickly calculate the relative pose; second, we leverage a gradient-based local motion planner to generate collision-free reference trajectories rapidly for the quadrotor; third, we build an autonomous state machine that enables the quadrotor to complete its take-off, tracking and landing tasks in full autonomy; finally, we conduct experiments in simulated, real-world indoor and outdoor environments to verify the system's effectiveness and demonstrate its potential.

Consistency-based Self-supervised Learning for Temporal Anomaly Localization

  • Authors: Aniello Panariello, Angelo Porrello, Simone Calderara, Rita Cucchiara
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2208.05251
  • Pdf link: https://arxiv.org/pdf/2208.05251
  • Abstract This work tackles Weakly Supervised Anomaly detection, in which a predictor is allowed to learn not only from normal examples but also from a few labeled anomalies made available during training. In particular, we deal with the localization of anomalous activities within the video stream: this is a very challenging scenario, as training examples come only with video-level annotations (and not frame-level). Several recent works have proposed various regularization terms to address it, i.e., by enforcing sparsity and smoothness constraints over the weakly-learned frame-level anomaly scores. In this work, we draw inspiration from recent advances in the field of self-supervised learning and ask the model to yield the same scores for different augmentations of the same video sequence. We show that enforcing such an alignment improves the performance of the model on XD-Violence.

Location Sensing and Beamforming Design for IRS-Enabled Multi-User ISAC Systems

  • Authors: Zhouyuan Yu, Xiaoling Hu, Chenxi Liu, Mugen Peng, Caijun Zhong
  • Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2208.05300
  • Pdf link: https://arxiv.org/pdf/2208.05300
  • Abstract This paper explores the potential of the intelligent reflecting surface (IRS) in realizing multi-user concurrent communication and localization, using the same time-frequency resources. Specifically, we propose an IRS-enabled multi-user integrated sensing and communication (ISAC) framework, where a distributed semi-passive IRS assists the uplink data transmission from multiple users to the base station (BS) and conducts multi-user localization, simultaneously. We first design an ISAC transmission protocol, where the whole transmission period consists of two periods, i.e., the ISAC period for simultaneous uplink communication and multi-user localization, and the pure communication (PC) period for only uplink data transmission. For the ISAC period, we propose a multi-user location sensing algorithm, which utilizes the uplink communication signals unknown to the IRS, thus removing the requirement of dedicated positioning reference signals in conventional location sensing methods. Based on the sensed users' locations, we propose two novel beamforming algorithms for the ISAC period and PC period, respectively, which can work with discrete phase shifts and require no channel state information (CSI) acquisition. Numerical results show that the proposed multi-user location sensing algorithm can achieve up to millimeter-level positioning accuracy, indicating the advantage of the IRS-enabled ISAC framework. Moreover, the proposed beamforming algorithms with sensed location information and discrete phase shifts can achieve comparable performance to the benchmark considering perfect CSI acquisition and continuous phase shifts, demonstrating how the location information can ensure the communication performance.

Proceedings End-to-End Compositional Models of Vector-Based Semantics

  • Authors: Michael Moortgat (Utrecht University), Gijs Wijnholds (Utrecht University)
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.05313
  • Pdf link: https://arxiv.org/pdf/2208.05313
  • Abstract The workshop End-to-End Compositional Models of Vector-Based Semantics was held at NUI Galway on 15 and 16 August 2022 as part of the 33rd European Summer School in Logic, Language and Information (ESSLLI 2022). The workshop was sponsored by the research project 'A composition calculus for vector-based semantic modelling with a localization for Dutch' (Dutch Research Council 360-89-070, 2017-2022). The workshop program was made up of two parts, the first part reporting on the results of the aforementioned project, the second part consisting of contributed papers on related approaches. The present volume collects the contributed papers and the abstracts of the invited talks.

IRS-Aided Non-Orthogonal ISAC Systems: Performance Analysis and Beamforming Design

  • Authors: Zhouyuan Yu, Xiaoling Hu, Chenxi Liu, Mugen Peng, Caijun Zhong
  • Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2208.05324
  • Pdf link: https://arxiv.org/pdf/2208.05324
  • Abstract Intelligent reflecting surface (IRS) has shown its effectiveness in facilitating orthogonal time-division integrated sensing and communications (TD-ISAC), in which the sensing task and the communication task occupy orthogonal time-frequency resources, while the role of IRS in the more interesting scenarios of non-orthogonal ISAC (NO-ISAC) systems has so far remained unclear. In this paper, we consider an IRS-aided NO-ISAC system, where a distributed IRS is deployed to assist concurrent communication and location sensing for a blind-zone user, occupying non-orthogonal/overlapped time-frequency resources. We first propose a modified Cramer-Rao lower bound (CRLB) to characterize the performances of both communication and location sensing in a unified manner. We further derive the closed-form expressions of the modified CRLB in our considered NO-ISAC system, enabling us to identify the fundamental trade-off between the communication and location sensing performances. In addition, by exploiting the modified CRLB, we propose a joint active and passive beamforming design algorithm that achieves a good communication and location sensing trade-off. Through numerical results, we demonstrate the superiority of the IRS-aided NO-ISAC systems over the IRS-aided TD-ISAC systems, in terms of both communication and localization performances. Besides, it is shown that the IRS-aided NO-ISAC system with random communication signals can achieve comparable localization performance to the IRS-aided localization system with dedicated positioning reference signals. Moreover, we investigate the trade-off between communication performance and localization performance and show how the performance of the NO-ISAC system can be significantly boosted by increasing the number of the IRS elements.

MD-Net: Multi-Detector for Local Feature Extraction

  • Authors: Emanuele Santellani (1), Christian Sormann (1), Mattia Rossi (2), Andreas Kuhn (2), Friedrich Fraundorfer (1) ((1) Graz University of Technology, (2) Sony Europe B.V.)
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.05350
  • Pdf link: https://arxiv.org/pdf/2208.05350
  • Abstract Establishing a sparse set of keypoint correspondences between images is a fundamental task in many computer vision pipelines. Often, this translates into a computationally expensive nearest neighbor search, where every keypoint descriptor in one image must be compared with all the descriptors in the others. In order to lower the computational cost of the matching phase, we propose a deep feature extraction network capable of detecting a predefined number of complementary sets of keypoints at each image. Since only the descriptors within the same set need to be compared across the different images, the matching phase computational complexity decreases with the number of sets. We train our network to predict the keypoints and compute the corresponding descriptors jointly. In particular, in order to learn complementary sets of keypoints, we introduce a novel unsupervised loss which penalizes intersections among the different sets. Additionally, we propose a novel descriptor-based weighting scheme meant to penalize the detection of keypoints with non-discriminative descriptors. With extensive experiments we show that our feature extraction network, trained only on synthetically warped images and in a fully unsupervised manner, achieves competitive results on 3D reconstruction and re-localization tasks at a reduced matching complexity.
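The complexity argument (comparing descriptors only within matching sets) can be illustrated as follows: splitting N keypoints into S sets cuts the brute-force comparison count roughly by a factor of S. The round-robin set assignment here is a placeholder for the network's learned complementary detectors.

```python
import numpy as np

def match_within_sets(desc_a, sets_a, desc_b, sets_b, n_sets):
    """Nearest-neighbour matching restricted to same-set descriptors.
    Returns (index_in_a, index_in_b) pairs and the comparison count."""
    matches, comparisons = [], 0
    for s in range(n_sets):
        ia = np.flatnonzero(sets_a == s)
        ib = np.flatnonzero(sets_b == s)
        if len(ia) == 0 or len(ib) == 0:
            continue
        d = np.linalg.norm(desc_a[ia, None] - desc_b[None, ib], axis=-1)
        comparisons += d.size
        matches += [(ia[i], ib[j]) for i, j in enumerate(d.argmin(1))]
    return matches, comparisons

rng = np.random.default_rng(0)
A, B = rng.normal(size=(300, 16)), rng.normal(size=(300, 16))
sets = np.arange(300) % 3               # stand-in for the learned sets
m, comps = match_within_sets(A, sets, B, sets, n_sets=3)
# Brute force would need 300*300 = 90000 comparisons; per-set needs 1/3 of that.
```

With balanced sets the cost drops from N² to S·(N/S)² = N²/S distance evaluations.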

Keyword: transformer

Attention Hijacking in Trojan Transformers

  • Authors: Weimin Lyu, Songzhu Zheng, Tengfei Ma, Haibin Ling, Chao Chen
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2208.04946
  • Pdf link: https://arxiv.org/pdf/2208.04946
  • Abstract Trojan attacks pose a severe threat to AI systems. Transformer models have recently gained explosive popularity, and their self-attention mechanisms are now indispensable. This raises a central question: Can we reveal the Trojans through attention mechanisms in BERTs and ViTs? In this paper, we investigate the attention hijacking pattern in Trojan AIs, i.e., the trigger token "kidnaps" the attention weights when a specific trigger is present. We observe the consistent attention hijacking pattern in Trojan Transformers from both Natural Language Processing (NLP) and Computer Vision (CV) domains. This intriguing property helps us to understand the Trojan mechanism in BERTs and ViTs. We also propose an Attention-Hijacking Trojan Detector (AHTD) to discriminate the Trojan AIs from the clean ones.
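A toy way to quantify the described "hijacking" pattern (attention mass concentrating on a trigger position) is to compare the average attention each token receives in a clean versus a Trojaned attention map. This score only illustrates the pattern; it is not the paper's AHTD detector.

```python
import numpy as np

def attention_received(attn):
    """attn: (n_tokens, n_tokens) row-stochastic attention matrix.
    Returns the average attention mass each token receives (column means)."""
    return attn.mean(axis=0)

n = 8
clean = np.full((n, n), 1.0 / n)              # benign: roughly uniform attention
trojan = np.full((n, n), 0.1 / (n - 1))
trojan[:, 3] = 0.9                            # trigger at position 3 "kidnaps" every row
trojan /= trojan.sum(axis=1, keepdims=True)   # keep each row a valid distribution

# Gap between the most-attended token in the Trojaned vs clean map:
hijack_score = attention_received(trojan).max() - attention_received(clean).max()
```

A large gap flags that one position is absorbing an outsized share of attention, the signature the abstract describes.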

CoViT: Real-time phylogenetics for the SARS-CoV-2 pandemic using Vision Transformers

  • Authors: Zuher Jahshan, Leonid Yavits
  • Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
  • Arxiv link: https://arxiv.org/abs/2208.05004
  • Pdf link: https://arxiv.org/pdf/2208.05004
  • Abstract Real-time viral genome detection, taxonomic classification and phylogenetic analysis are critical for efficient tracking and control of viral pandemics such as Covid-19. However, the unprecedented and still growing amounts of viral genome data create a computational bottleneck, which effectively prevents real-time pandemic tracking. We are attempting to alleviate this bottleneck by modifying and applying Vision Transformer, a recently developed neural network model for image recognition, to taxonomic classification and placement of viral genomes, such as SARS-CoV-2. Our solution, CoViT, places newly acquired samples onto the tree of SARS-CoV-2 lineages. One of the two potential placements returned by CoViT is the true one with the probability of 99.0%. The probability of the correct placement to be found among five potential placements generated by CoViT is 99.8%. The placement time is 1.45ms per individual genome running on NVIDIA's GeForce RTX 2080 Ti GPU. We make CoViT available to the research community through GitHub: https://github.com/zuherJahshan/covit.

Collaborative Feature Maps of Networks and Hosts for AI-driven Intrusion Detection

  • Authors: Jinxin Liu, Murat Simsek, Burak Kantarci, Mehran Bagheri, Petar Djukic
  • Subjects: Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2208.05085
  • Pdf link: https://arxiv.org/pdf/2208.05085
  • Abstract Intrusion Detection Systems (IDS) are critical security mechanisms that protect against a wide variety of network threats and malicious behaviors on networks or hosts. As both Network-based IDS (NIDS) or Host-based IDS (HIDS) have been widely investigated, this paper aims to present a Combined Intrusion Detection System (CIDS) that integrates network and host data in order to improve IDS performance. Due to the scarcity of datasets that include both network packet and host data, we present a novel CIDS dataset formation framework that can handle log files from a variety of operating systems and align log entities with network flows. A new CIDS dataset named SCVIC-CIDS-2021 is derived from the meta-data from the well-known benchmark dataset, CIC-IDS-2018 by utilizing the proposed framework. Furthermore, a transformer-based deep learning model named CIDS-Net is proposed that can take network flow and host features as inputs and outperform baseline models that rely on network flow features only. Experimental results to evaluate the proposed CIDS-Net under the SCVIC-CIDS-2021 dataset support the hypothesis for the benefits of combining host and flow features as the proposed CIDS-Net can improve the macro F1 score of baseline solutions by 6.36% (up to 99.89%).

Ghost-free High Dynamic Range Imaging with Context-aware Transformer

  • Authors: Zhen Liu, Yinglong Wang, Bing Zeng, Shuaicheng Liu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.05114
  • Pdf link: https://arxiv.org/pdf/2208.05114
  • Abstract High dynamic range (HDR) deghosting algorithms aim to generate ghost-free HDR images with realistic details. Restricted by the locality of the receptive field, existing CNN-based methods are typically prone to producing ghosting artifacts and intensity distortions in the presence of large motion and severe saturation. In this paper, we propose a novel Context-Aware Vision Transformer (CA-ViT) for ghost-free high dynamic range imaging. The CA-ViT is designed as a dual-branch architecture, which can jointly capture both global and local dependencies. Specifically, the global branch employs a window-based Transformer encoder to model long-range object movements and intensity variations to solve ghosting. For the local branch, we design a local context extractor (LCE) to capture short-range image features and use the channel attention mechanism to select informative local details across the extracted features to complement the global branch. By incorporating the CA-ViT as basic components, we further build the HDR-Transformer, a hierarchical network to reconstruct high-quality ghost-free HDR images. Extensive experiments on three benchmark datasets show that our approach outperforms state-of-the-art methods qualitatively and quantitatively with considerably reduced computational budgets. Codes are available at https://github.com/megvii-research/HDR-Transformer

Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization

  • Authors: Zhengang Li, Mengshu Sun, Alec Lu, Haoyu Ma, Geng Yuan, Yanyue Xie, Hao Tang, Yanyu Li, Miriam Leeser, Zhangyang Wang, Xue Lin, Zhenman Fang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2208.05163
  • Pdf link: https://arxiv.org/pdf/2208.05163
  • Abstract Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demand impose urgent needs for new hardware accelerator design methodology. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, this is the first FPGA-based ViT acceleration framework exploring model quantization. Compared with state-of-the-art ViT quantization work (algorithmic approach only without hardware acceleration), our quantization achieves 0.47% to 1.36% higher Top-1 accuracy under the same bit-width. Compared with the 32-bit floating-point baseline FPGA accelerator, our accelerator achieves around 5.6x improvement on the frame rate (i.e., 56.8 FPS vs. 10.0 FPS) with 0.71% accuracy drop on ImageNet dataset for DeiT-base.
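The "mixed-scheme" idea (combining fixed-point quantization with power-of-two quantization, whose multiplies reduce to bit-shifts in FPGA logic) can be illustrated in a few lines. The 4-bit setting and the per-output-row scheme choice are assumptions for this sketch, not the paper's exact scheme-assignment algorithm.

```python
import numpy as np

def quant_fixed(w, bits=4):
    """Uniform (fixed-point) quantization to 2^bits evenly spaced levels."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def quant_pow2(w, bits=4):
    """Power-of-two quantization: magnitudes snap to 2^k (shift-only multiply)."""
    sign = np.sign(w)
    mag = np.maximum(np.abs(w), 1e-12)
    k = np.clip(np.round(np.log2(mag)), -(2 ** (bits - 1)), 0)
    return sign * 2.0 ** k

rng = np.random.default_rng(0)
W = rng.normal(scale=0.25, size=(8, 16))        # toy weight matrix
# Mixed scheme: per output row, keep whichever scheme gives lower error.
q_fixed, q_pow2 = quant_fixed(W), quant_pow2(W)
err_f = ((W - q_fixed) ** 2).sum(axis=1)
err_p = ((W - q_pow2) ** 2).sum(axis=1)
W_q = np.where((err_f <= err_p)[:, None], q_fixed, q_pow2)
```

By construction the mixed result is never worse than either scheme alone on any row, while the hardware keeps a mix of cheap shift-based and DSP-based multipliers.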

Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer

  • Authors: Zhipeng Luo, Changqing Zhou, Liang Pan, Gongjie Zhang, Tianrui Liu, Yueru Luo, Haiyu Zhao, Ziwei Liu, Shijian Lu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.05216
  • Pdf link: https://arxiv.org/pdf/2208.05216
  • Abstract With the prevalence of LiDAR sensors in autonomous driving, 3D object tracking has received increasing attention. In a point cloud sequence, 3D object tracking aims to predict the location and orientation of an object in consecutive frames given an object template. Motivated by the success of transformers, we propose Point Tracking TRansformer (PTTR), which efficiently predicts high-quality 3D tracking results in a coarse-to-fine manner with the help of transformer operations. PTTR consists of three novel designs. 1) Instead of random sampling, we design Relation-Aware Sampling to preserve relevant points to the given template during subsampling. 2) We propose a Point Relation Transformer for effective feature aggregation and feature matching between the template and search region. 3) Based on the coarse tracking results, we employ a novel Prediction Refinement Module to obtain the final refined prediction through local feature pooling. In addition, motivated by the favorable properties of the Bird's-Eye View (BEV) of point clouds in capturing object motion, we further design a more advanced framework named PTTR++, which incorporates both the point-wise view and BEV representation to exploit their complementary effect in generating high-quality tracking results. PTTR++ substantially boosts the tracking performance on top of PTTR with low computational overhead. Extensive experiments over multiple datasets show that our proposed approaches achieve superior 3D tracking accuracy and efficiency.

Multi-scale Feature Aggregation for Crowd Counting

  • Authors: Xiaoheng Jiang, Xinyi Wu, Hisham Cholakkal, Rao Muhammad Anwer, Jiale Cao, Mingliang Xu, Bing Zhou, Yanwei Pang, Fahad Shahbaz Khan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.05256
  • Pdf link: https://arxiv.org/pdf/2208.05256
  • Abstract Convolutional Neural Network (CNN) based crowd counting methods have achieved promising results in the past few years. However, the scale variation problem is still a huge challenge for accurate count estimation. In this paper, we propose a multi-scale feature aggregation network (MSFANet) that can alleviate this problem to some extent. Specifically, our approach consists of two feature aggregation modules: the short aggregation (ShortAgg) and the skip aggregation (SkipAgg). The ShortAgg module aggregates the features of the adjacent convolution blocks. Its purpose is to make features with different receptive fields fused gradually from the bottom to the top of the network. The SkipAgg module directly propagates features with small receptive fields to features with much larger receptive fields. Its purpose is to promote the fusion of features with small and large receptive fields. Especially, the SkipAgg module introduces the local self-attention features from the Swin Transformer blocks to incorporate rich spatial information. Furthermore, we present a local-and-global based counting loss by considering the non-uniform crowd distribution. Extensive experiments on four challenging datasets (ShanghaiTech dataset, UCF_CC_50 dataset, UCF-QNRF Dataset, WorldExpo'10 dataset) demonstrate the proposed easy-to-implement MSFANet can achieve promising results when compared with the previous state-of-the-art approaches.

Arbitrary Point Cloud Upsampling with Spherical Mixture of Gaussians

  • Authors: Anthony Dell'Eva, Marco Orsingher, Massimo Bertozzi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.05274
  • Pdf link: https://arxiv.org/pdf/2208.05274
  • Abstract Generating dense point clouds from sparse raw data benefits downstream 3D understanding tasks, but existing models are limited to a fixed upsampling ratio or to a short range of integer values. In this paper, we present APU-SMOG, a Transformer-based model for Arbitrary Point cloud Upsampling (APU). The sparse input is first mapped to a Spherical Mixture of Gaussians (SMOG) distribution, from which an arbitrary number of points can be sampled. Then, these samples are fed as queries to the Transformer decoder, which maps them back to the target surface. Extensive qualitative and quantitative evaluations show that APU-SMOG outperforms state-of-the-art fixed-ratio methods, while effectively enabling upsampling with any scaling factor, including non-integer values, with a single trained model. The code will be made available.
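The key property enabling arbitrary ratios is that a mixture of Gaussians can be sampled any number of times. A minimal sketch of that sampling step (the mixture parameters here are random placeholders; in APU-SMOG a network predicts them and a Transformer decoder maps the samples back to the surface):

```python
import numpy as np

def sample_smog(means, sigmas, weights, n_points, rng):
    """Draw an arbitrary number of samples from a spherical mixture of
    Gaussians, i.e. each component has isotropic covariance sigma^2 * I.

    Sketch of the sampling idea only, not the paper's implementation.
    """
    # Pick a mixture component per sample, then add isotropic noise.
    comp = rng.choice(len(weights), size=n_points, p=weights)
    noise = rng.normal(size=(n_points, means.shape[1]))
    return means[comp] + sigmas[comp, None] * noise

rng = np.random.default_rng(0)
means = rng.normal(size=(4, 3))      # toy 4-component mixture in 3D
sigmas = np.full(4, 0.1)
weights = np.full(4, 0.25)
# Any count works, so non-integer upsampling ratios are no problem.
dense = sample_smog(means, sigmas, weights, n_points=137, rng=rng)
print(dense.shape)  # (137, 3)
```

Because `n_points` is free, the same trained model can realize a 2.5x ratio as easily as a 4x one.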

Multi-task Active Learning for Pre-trained Transformer-based Models

  • Authors: Guy Rotman, Roi Reichart
  • Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.05379
  • Pdf link: https://arxiv.org/pdf/2208.05379
  • Abstract Multi-task learning, in which several tasks are jointly learned by a single model, allows NLP models to share information from multiple annotations and may facilitate better predictions when the tasks are inter-related. This technique, however, requires annotating the same text with multiple annotation schemes, which may be costly and laborious. Active learning (AL) has been demonstrated to optimize annotation processes by iteratively selecting unlabeled examples whose annotation is most valuable for the NLP model. Yet, multi-task active learning (MT-AL) has not been applied to state-of-the-art pre-trained Transformer-based NLP models. This paper aims to close this gap. We explore various multi-task selection criteria in three realistic multi-task scenarios, reflecting different relations between the participating tasks, and demonstrate the effectiveness of multi-task compared to single-task selection. Our results suggest that MT-AL can be effectively used to minimize annotation efforts for multi-task NLP models.
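One plausible multi-task selection criterion of the kind the abstract alludes to is to score each unlabeled example by its average predictive uncertainty across tasks and annotate the top-k. This is a hedged illustration of the general idea; the paper compares several criteria, not necessarily this exact one:

```python
import numpy as np

def entropy(p):
    """Predictive entropy per example from class-probability rows."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def multi_task_select(task_probs, k):
    """Pick the k unlabeled examples with the highest mean per-task
    predictive entropy (one uncertainty-based MT-AL criterion)."""
    scores = np.mean([entropy(p) for p in task_probs], axis=0)
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
n = 50
# Toy per-task class probabilities for the same 50 unlabeled texts.
task_a = rng.dirichlet(np.ones(3), size=n)   # e.g. a 3-class task
task_b = rng.dirichlet(np.ones(5), size=n)   # e.g. a 5-class task
chosen = multi_task_select([task_a, task_b], k=8)
print(chosen.shape)  # (8,)
```

The chosen indices would be sent for annotation under all schemes, and the model retrained before the next AL round.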

Keyword: autonomous driving

Robust Continual Test-time Adaptation: Instance-aware BN and Prediction-balanced Memory

  • Authors: Taesik Gong, Jongheon Jeong, Taewon Kim, Yewon Kim, Jinwoo Shin, Sung-Ju Lee
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.05117
  • Pdf link: https://arxiv.org/pdf/2208.05117
  • Abstract Test-time adaptation (TTA) is an emerging paradigm that addresses distributional shifts between training and testing phases without additional data acquisition or labeling cost; only unlabeled test data streams are used for continual model adaptation. Previous TTA schemes assume that the test samples are independent and identically distributed (i.i.d.), even though they are often temporally correlated (non-i.i.d.) in application scenarios, e.g., autonomous driving. We discover that most existing TTA methods fail dramatically under such scenarios. Motivated by this, we present a new test-time adaptation scheme that is robust against non-i.i.d. test data streams. Our novelty is mainly two-fold: (a) Instance-Aware Batch Normalization (IABN) that corrects normalization for out-of-distribution samples, and (b) Prediction-balanced Reservoir Sampling (PBRS) that simulates an i.i.d. data stream from a non-i.i.d. stream in a class-balanced manner. Our evaluation with various datasets, including real-world non-i.i.d. streams, demonstrates that the proposed robust TTA not only outperforms state-of-the-art TTA algorithms in the non-i.i.d. setting, but also achieves comparable performance to those algorithms under the i.i.d. assumption.
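The PBRS component can be sketched as a per-class reservoir: each predicted class gets an equal slice of the memory, so even a long temporally correlated run of one class cannot crowd out the others. A toy sketch of that idea (the class names, quota rule, and replacement rule here are assumptions, not the authors' exact algorithm):

```python
import numpy as np

class PredictionBalancedMemory:
    """Toy prediction-balanced reservoir: a fixed per-class quota keeps
    the stored buffer class-balanced under a non-i.i.d. test stream."""

    def __init__(self, capacity, num_classes, rng):
        self.per_class = capacity // num_classes
        self.buffers = {c: [] for c in range(num_classes)}
        self.seen = {c: 0 for c in range(num_classes)}
        self.rng = rng

    def add(self, x, pred):
        self.seen[pred] += 1
        buf = self.buffers[pred]
        if len(buf) < self.per_class:
            buf.append(x)
        else:
            # Classic reservoir sampling within the class: replace a
            # stored item with probability per_class / seen[pred].
            j = self.rng.integers(self.seen[pred])
            if j < self.per_class:
                buf[j] = x

rng = np.random.default_rng(0)
mem = PredictionBalancedMemory(capacity=8, num_classes=2, rng=rng)
# A temporally correlated stream: a long run of class 0, then class 1.
for i in range(100):
    mem.add(i, pred=0)
for i in range(5):
    mem.add(100 + i, pred=1)
print(len(mem.buffers[0]), len(mem.buffers[1]))  # 4 4
```

Despite the 100-vs-5 class imbalance in the stream, the memory ends up balanced, which is what lets adaptation statistics approximate the i.i.d. case.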

Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer

  • Authors: Zhipeng Luo, Changqing Zhou, Liang Pan, Gongjie Zhang, Tianrui Liu, Yueru Luo, Haiyu Zhao, Ziwei Liu, Shijian Lu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.05216
  • Pdf link: https://arxiv.org/pdf/2208.05216
  • Abstract With the prevalence of LiDAR sensors in autonomous driving, 3D object tracking has received increasing attention. In a point cloud sequence, 3D object tracking aims to predict the location and orientation of an object in consecutive frames given an object template. Motivated by the success of transformers, we propose Point Tracking TRansformer (PTTR), which efficiently predicts high-quality 3D tracking results in a coarse-to-fine manner with the help of transformer operations. PTTR consists of three novel designs. 1) Instead of random sampling, we design Relation-Aware Sampling to preserve points relevant to the given template during subsampling. 2) We propose a Point Relation Transformer for effective feature aggregation and feature matching between the template and search region. 3) Based on the coarse tracking results, we employ a novel Prediction Refinement Module to obtain the final refined prediction through local feature pooling. In addition, motivated by the favorable properties of the Bird's-Eye View (BEV) of point clouds in capturing object motion, we further design a more advanced framework named PTTR++, which incorporates both the point-wise view and the BEV representation to exploit their complementary effect in generating high-quality tracking results. PTTR++ substantially boosts tracking performance on top of PTTR with low computational overhead. Extensive experiments on multiple datasets show that our proposed approaches achieve superior 3D tracking accuracy and efficiency.

zhuhu00 — Aug 11 '22 03:08