Paper-Daily-Notice

New submissions for Mon, 11 Jul 22


Keyword: SLAM

RWT-SLAM: Robust Visual SLAM for Highly Weak-textured Environments

  • Authors: Qihao Peng, Zhiyu Xiang, YuanGang Fan, Tengqi Zhao, Xijun Zhao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03539
  • Pdf link: https://arxiv.org/pdf/2207.03539
  • Abstract As a fundamental task for intelligent robots, visual SLAM has made great progress over the past decades. However, robust SLAM in highly weak-textured environments remains very challenging. In this paper, we propose a novel visual SLAM system named RWT-SLAM to tackle this problem. We modify the LoFTR network, which can produce dense point matching in low-textured scenes, to generate feature descriptors. To integrate the new features into the popular ORB-SLAM framework, we develop feature masks to filter out unreliable features and employ a KNN strategy to strengthen matching robustness. We also retrain the visual vocabulary on the new descriptors for efficient loop closing. The resulting RWT-SLAM is tested on various public datasets such as TUM and OpenLORIS, as well as our own data. The results show very promising performance in highly weak-textured environments.
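
To make the masking-plus-KNN matching step concrete, here is a minimal sketch (not the authors' code: the brute-force 2-NN search, the Lowe-style ratio threshold, and the boolean reliability masks are assumptions about how such a filter could look).

```python
import numpy as np

def knn_match_with_mask(desc_a, desc_b, mask_a, mask_b, ratio=0.8):
    """Brute-force 2-NN descriptor matching with a reliability mask and a ratio test.

    desc_a, desc_b: (N, D) and (M, D) L2-normalized descriptors.
    mask_a, mask_b: boolean arrays marking features considered reliable.
    """
    matches = []
    idx_a = np.flatnonzero(mask_a)
    idx_b = np.flatnonzero(mask_b)
    if len(idx_a) == 0 or len(idx_b) < 2:
        return matches
    da, db = desc_a[idx_a], desc_b[idx_b]
    # Pairwise Euclidean distances between the reliable descriptors only.
    dists = np.linalg.norm(da[:, None, :] - db[None, :, :], axis=-1)
    for i, row in enumerate(dists):
        nn = np.argsort(row)[:2]                 # two nearest neighbours
        if row[nn[0]] < ratio * row[nn[1]]:      # ratio test rejects ambiguous matches
            matches.append((idx_a[i], idx_b[nn[0]], row[nn[0]]))
    return matches

# Toy usage with random descriptors (real ones would come from the modified LoFTR network).
rng = np.random.default_rng(0)
a = rng.normal(size=(50, 256)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(60, 256)); b /= np.linalg.norm(b, axis=1, keepdims=True)
print(len(knn_match_with_mask(a, b, np.ones(50, bool), np.ones(60, bool))))
```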

Distributed Ranging SLAM for Multiple Robots with Ultra-WideBand and Odometry Measurements

  • Authors: Ran Liu, Zhongyuan Deng, Zhiqiang Cao, Muhammad Shalihan, Billy Pik Lik Lau, Kaixiang Chen, Kaushik Bhowmik, Chau Yuen, U-Xuan Tan
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2207.03700
  • Pdf link: https://arxiv.org/pdf/2207.03700
  • Abstract To accomplish tasks efficiently in a multi-robot system, a problem that has to be addressed is Simultaneous Localization and Mapping (SLAM). LiDAR (Light Detection and Ranging) has been used for many SLAM solutions due to its superb accuracy, but its performance degrades in featureless environments such as tunnels or long corridors. Centralized SLAM solves the problem with a cloud server, which requires a huge amount of computational resources and lacks robustness against central-node failure. To address these issues, we present a distributed SLAM solution that estimates the trajectory of a group of robots using Ultra-WideBand (UWB) ranging and odometry measurements. The proposed approach distributes the processing among the robot team and significantly mitigates the computational burden of centralized SLAM. Our solution determines the relative pose (also known as a loop closure) between two robots by minimizing the UWB ranging residuals taken at different positions when the robots are in close proximity. UWB provides a good distance measure in line-of-sight conditions, but retrieving a precise pose estimate remains a challenge due to ranging noise and the unpredictable path traveled by the robot. To deal with suspicious loop closures, we use Pairwise Consistency Maximization (PCM) to examine the quality of loop closures and perform outlier rejection. The filtered loop closures are then fused with odometry in a distributed pose graph optimization (DPGO) module to recover the full trajectory of the robot team. Extensive experiments are conducted to validate the effectiveness of the proposed approach.
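
The loop-closure step described above amounts to recovering a relative pose from a batch of inter-robot range measurements. A minimal planar sketch of that idea, assuming 2D poses, perfectly associated measurements, and a SciPy least-squares solver (none of which is taken from the paper itself):

```python
import numpy as np
from scipy.optimize import least_squares

def estimate_relative_pose(pts_a, pts_b_local, ranges):
    """Estimate the planar pose (x, y, yaw) of robot B's odometry frame expressed in
    robot A's frame by minimizing UWB ranging residuals. pts_a holds A's positions in
    its own frame, pts_b_local holds B's positions in B's frame, and ranges holds the
    UWB distances measured between the robots at the same time instants."""
    def residuals(params):
        x, y, yaw = params
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s], [s, c]])
        pts_b_in_a = pts_b_local @ R.T + np.array([x, y])
        return np.linalg.norm(pts_b_in_a - pts_a, axis=1) - ranges
    return least_squares(residuals, x0=np.zeros(3)).x

# Toy check with a known ground-truth transform (yaw kept small so the zero
# initialization converges to the right basin).
rng = np.random.default_rng(1)
true_xy, true_yaw = np.array([1.0, 0.5]), 0.3
pts_a = rng.uniform(-3, 3, size=(30, 2))
pts_b_local = rng.uniform(-3, 3, size=(30, 2))
c, s = np.cos(true_yaw), np.sin(true_yaw)
pts_b_in_a = pts_b_local @ np.array([[c, -s], [s, c]]).T + true_xy
ranges = np.linalg.norm(pts_b_in_a - pts_a, axis=1)
print(np.round(estimate_relative_pose(pts_a, pts_b_local, ranges), 3))  # ~ [1.0, 0.5, 0.3]
```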

Continuous Target-free Extrinsic Calibration of a Multi-Sensor System from a Sequence of Static Viewpoints

  • Authors: Philipp Glira, Christoph Weidinger, Johann Weichselbaum
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03785
  • Pdf link: https://arxiv.org/pdf/2207.03785
  • Abstract Mobile robotic applications need precise information about the geometric position of the individual sensors on the platform. This information is given by the extrinsic calibration parameters, which define how each sensor is rotated and translated with respect to a fixed reference coordinate system. Erroneous calibration parameters have a negative impact on typical robotic estimation tasks, e.g. SLAM. In this work we propose a new method for continuous estimation of the calibration parameters during operation of the robot. The parameter estimation is based on the matching of point clouds which are acquired by the sensors from multiple static viewpoints. Consequently, our method does not need any special calibration targets and is applicable to any sensor whose measurements can be converted to point clouds. We demonstrate the suitability of our method by calibrating a multi-sensor system composed of two lidar sensors, three cameras, and an imaging radar sensor.
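
Since the method estimates how one sensor is rotated and translated relative to another by matching point clouds, a useful mental model is the classic rigid alignment of corresponding points. The sketch below is a simplification that assumes known point correspondences (the paper matches full clouds without this assumption):

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Kabsch/Procrustes estimate of the rigid transform (R, t) with dst ~ R @ src + t.
    A simplified stand-in for the point-cloud matching step described above: here the
    point correspondences are assumed known."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Toy check: recover a known sensor-to-sensor extrinsic from noisy overlapping points.
rng = np.random.default_rng(2)
pts_sensor_a = rng.uniform(-5, 5, size=(200, 3))
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.1, 0.3, -0.2])
pts_sensor_b = pts_sensor_a @ R_true.T + t_true + 0.01 * rng.normal(size=(200, 3))
R_est, t_est = estimate_rigid_transform(pts_sensor_a, pts_sensor_b)
print(np.round(R_est, 2), np.round(t_est, 2))
```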

BlindSpotNet: Seeing Where We Cannot See

  • Authors: Taichi Fukuda, Kotaro Hasegawa, Shinya Ishizaki, Shohei Nobuhara, Ko Nishino
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03870
  • Pdf link: https://arxiv.org/pdf/2207.03870
  • Abstract We introduce 2D blind spot estimation as a critical visual task for road scene understanding. By automatically detecting road regions that are occluded from the vehicle's vantage point, we can proactively alert a manual driver or a self-driving system to potential causes of accidents (e.g., draw attention to a road region from which a child may spring out). Detecting blind spots in full 3D would be challenging, as 3D reasoning on the fly, even if the car is equipped with LiDAR, would be prohibitively expensive and error-prone. We instead propose to learn to estimate blind spots in 2D, just from a monocular camera. We achieve this in two steps. We first introduce an automatic method for generating "ground-truth" blind spot training data for arbitrary driving videos by leveraging monocular depth estimation, semantic segmentation, and SLAM. The key idea is to reason in 3D but from 2D images by defining blind spots as those road regions that are currently invisible but become visible in the near future. We construct a large-scale dataset with this automatic offline blind spot estimation, which we refer to as the Road Blind Spot (RBS) dataset. Next, we introduce BlindSpotNet (BSN), a simple network that fully leverages this dataset for fully automatic estimation of frame-wise blind spot probability maps for arbitrary driving videos. Extensive experimental results demonstrate the validity of our RBS dataset and the effectiveness of our BSN.
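
The ground-truth rule quoted above ("road regions that are currently invisible but become visible in the near future") reduces to a mask difference once the visibility masks are available. A tiny illustrative sketch, with the upstream depth/segmentation/SLAM processing assumed to have been done already:

```python
import numpy as np

def blind_spot_label(visible_road_now, visible_road_future_warped):
    """2D blind-spot label following the definition quoted above: road regions that are
    not visible in the current frame but become visible in a near-future frame once that
    frame has been warped back into the current view. Producing the two boolean masks
    (from monocular depth, semantic segmentation, and SLAM poses) is assumed to have
    been done upstream; this function only encodes the set difference."""
    return visible_road_future_warped & ~visible_road_now

# Toy example: columns 4-6 are occluded now; columns 4-5 become visible in the future frame.
now = np.array([[1, 1, 1, 1, 0, 0, 0, 1]], dtype=bool)
future = np.array([[1, 1, 1, 1, 1, 1, 0, 1]], dtype=bool)
print(blind_spot_label(now, future).astype(int))   # [[0 0 0 0 1 1 0 0]]
```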

Keyword: odometry

Distributed Ranging SLAM for Multiple Robots with Ultra-WideBand and Odometry Measurements

  • Authors: Ran Liu, Zhongyuan Deng, Zhiqiang Cao, Muhammad Shalihan, Billy Pik Lik Lau, Kaixiang Chen, Kaushik Bhowmik, Chau Yuen, U-Xuan Tan
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2207.03700
  • Pdf link: https://arxiv.org/pdf/2207.03700
  • Abstract To accomplish tasks efficiently in a multi-robot system, a problem that has to be addressed is Simultaneous Localization and Mapping (SLAM). LiDAR (Light Detection and Ranging) has been used for many SLAM solutions due to its superb accuracy, but its performance degrades in featureless environments such as tunnels or long corridors. Centralized SLAM solves the problem with a cloud server, which requires a huge amount of computational resources and lacks robustness against central-node failure. To address these issues, we present a distributed SLAM solution that estimates the trajectory of a group of robots using Ultra-WideBand (UWB) ranging and odometry measurements. The proposed approach distributes the processing among the robot team and significantly mitigates the computational burden of centralized SLAM. Our solution determines the relative pose (also known as a loop closure) between two robots by minimizing the UWB ranging residuals taken at different positions when the robots are in close proximity. UWB provides a good distance measure in line-of-sight conditions, but retrieving a precise pose estimate remains a challenge due to ranging noise and the unpredictable path traveled by the robot. To deal with suspicious loop closures, we use Pairwise Consistency Maximization (PCM) to examine the quality of loop closures and perform outlier rejection. The filtered loop closures are then fused with odometry in a distributed pose graph optimization (DPGO) module to recover the full trajectory of the robot team. Extensive experiments are conducted to validate the effectiveness of the proposed approach.

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

Distributed Ranging SLAM for Multiple Robots with Ultra-WideBand and Odometry Measurements

  • Authors: Ran Liu, Zhongyuan Deng, Zhiqiang Cao, Muhammad Shalihan, Billy Pik Lik Lau, Kaixiang Chen, Kaushik Bhowmik, Chau Yuen, U-Xuan Tan
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2207.03700
  • Pdf link: https://arxiv.org/pdf/2207.03700
  • Abstract To accomplish tasks efficiently in a multi-robot system, a problem that has to be addressed is Simultaneous Localization and Mapping (SLAM). LiDAR (Light Detection and Ranging) has been used for many SLAM solutions due to its superb accuracy, but its performance degrades in featureless environments such as tunnels or long corridors. Centralized SLAM solves the problem with a cloud server, which requires a huge amount of computational resources and lacks robustness against central-node failure. To address these issues, we present a distributed SLAM solution that estimates the trajectory of a group of robots using Ultra-WideBand (UWB) ranging and odometry measurements. The proposed approach distributes the processing among the robot team and significantly mitigates the computational burden of centralized SLAM. Our solution determines the relative pose (also known as a loop closure) between two robots by minimizing the UWB ranging residuals taken at different positions when the robots are in close proximity. UWB provides a good distance measure in line-of-sight conditions, but retrieving a precise pose estimate remains a challenge due to ranging noise and the unpredictable path traveled by the robot. To deal with suspicious loop closures, we use Pairwise Consistency Maximization (PCM) to examine the quality of loop closures and perform outlier rejection. The filtered loop closures are then fused with odometry in a distributed pose graph optimization (DPGO) module to recover the full trajectory of the robot team. Extensive experiments are conducted to validate the effectiveness of the proposed approach.

SST-Calib: Simultaneous Spatial-Temporal Parameter Calibration between LIDAR and Camera

  • Authors: Akio Kodaira, Yiyang Zhou, Pengwei Zang, Wei Zhan, Masayoshi Tomizuka
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2207.03704
  • Pdf link: https://arxiv.org/pdf/2207.03704
  • Abstract With information from multiple input modalities, sensor fusion-based algorithms usually outperform their single-modality counterparts in robotics. Camera and LIDAR, with complementary semantic and depth information, are the typical choices for detection tasks in complicated driving environments. For most camera-LIDAR fusion algorithms, however, the calibration of the sensor suite greatly impacts performance. More specifically, the detection algorithm usually requires an accurate geometric relationship among the sensors as input, and it is often assumed that the contents from these sensors are captured at the same time. Preparing such sensor suites involves carefully designed calibration rigs and accurate synchronization mechanisms, and the preparation process is usually done offline. In this work, a segmentation-based framework is proposed to jointly estimate the geometric and temporal parameters in the calibration of a camera-LIDAR suite. A semantic segmentation mask is first applied to both sensor modalities, and the calibration parameters are optimized through a pixel-wise bidirectional loss. We specifically incorporate velocity information from optical flow for the temporal parameters. Since supervision is only performed at the segmentation level, no calibration label is needed within the framework. The proposed algorithm is tested on the KITTI dataset, and the results show accurate real-time calibration of both geometric and temporal parameters.
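
As a rough picture of how a segmentation-level objective can drive calibration, the sketch below projects labelled lidar points with candidate extrinsics and counts label disagreements with the camera segmentation mask. It is only an analogue of the paper's loss (which is pixel-wise, bidirectional, and differentiable); the intrinsics, labels, and mismatch count here are illustrative assumptions.

```python
import numpy as np

def segmentation_alignment_loss(points, point_labels, seg_mask, K, R, t):
    """Project labelled lidar points into the camera with candidate extrinsics (R, t)
    and intrinsics K, then report the fraction of label disagreements with the camera
    segmentation mask. Lower values indicate better-aligned calibration parameters."""
    cam = points @ R.T + t                       # lidar frame -> camera frame
    in_front = cam[:, 2] > 0.1
    cam, labels = cam[in_front], point_labels[in_front]
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]                  # perspective projection
    h, w = seg_mask.shape
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    if not valid.any():
        return 1.0
    return float(np.mean(seg_mask[v[valid], u[valid]] != labels[valid]))

# Toy usage: points labelled 1 projected with an identity pose into an all-ones mask.
pts = np.random.default_rng(6).uniform([-2, -2, 2], [2, 2, 8], size=(100, 3))
K = np.array([[200.0, 0.0, 160.0], [0.0, 200.0, 120.0], [0.0, 0.0, 1.0]])
mask = np.ones((240, 320), dtype=int)
print(segmentation_alignment_loss(pts, np.ones(100, dtype=int), mask, K, np.eye(3), np.zeros(3)))
```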

Continuous Target-free Extrinsic Calibration of a Multi-Sensor System from a Sequence of Static Viewpoints

  • Authors: Philipp Glira, Christoph Weidinger, Johann Weichselbaum
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03785
  • Pdf link: https://arxiv.org/pdf/2207.03785
  • Abstract Mobile robotic applications need precise information about the geometric position of the individual sensors on the platform. This information is given by the extrinsic calibration parameters, which define how each sensor is rotated and translated with respect to a fixed reference coordinate system. Erroneous calibration parameters have a negative impact on typical robotic estimation tasks, e.g. SLAM. In this work we propose a new method for continuous estimation of the calibration parameters during operation of the robot. The parameter estimation is based on the matching of point clouds which are acquired by the sensors from multiple static viewpoints. Consequently, our method does not need any special calibration targets and is applicable to any sensor whose measurements can be converted to point clouds. We demonstrate the suitability of our method by calibrating a multi-sensor system composed of two lidar sensors, three cameras, and an imaging radar sensor.

Decision Trees for Analyzing Influences on the Accuracy of Indoor Localization Systems

  • Authors: Jakob Schyga, Swantje Plambeck, Johannes Hinckeldeyn, Görschwin Fey, Jochen Kreutzfeldt
  • Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2207.03853
  • Pdf link: https://arxiv.org/pdf/2207.03853
  • Abstract Absolute position accuracy is the key performance criterion of an Indoor Localization System (ILS). Since ILS are heterogeneous and complex cyber-physical systems, the localization accuracy depends on various influences from the environment, system configuration, and the application processes. To determine the position accuracy of a system in a reproducible, comparable, and realistic manner, these factors must be taken into account. We propose a strategy for analyzing the influences on the position accuracy of ILS using decision trees in combination with application-related or technology-related categorization. The proposed strategy is validated using empirical data from 120 experiments. The accuracy of an Ultra-Wideband and a LiDAR-based ILS was determined under different application-driven influencing factors, considering the application of autonomous mobile robots in warehouses. Finally, the opportunities and limitations of analyzing decision trees to compare system performance, find a suitable system, optimize the environment or system configuration, and understand the relevance of different influencing factors are presented.
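
For intuition on relating influencing factors to localization accuracy with decision trees, here is a small sketch using scikit-learn on synthetic data; the factor names and the error model are invented for illustration, since the paper's 120-experiment dataset is not reproduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in: each row is one experiment, columns are influencing factors,
# and the target is the measured position error of the localization system.
rng = np.random.default_rng(3)
n = 120
speed = rng.uniform(0.0, 2.0, n)          # robot speed [m/s]
occlusion = rng.integers(0, 2, n)         # line of sight blocked (0/1)
tag_height = rng.uniform(0.5, 2.5, n)     # sensor/tag mounting height [m]
error = 0.05 + 0.10 * occlusion + 0.03 * speed + 0.02 * rng.normal(size=n)

X = np.column_stack([speed, occlusion, tag_height])
tree = DecisionTreeRegressor(max_depth=3).fit(X, error)
# The printed tree shows which factors dominate the splits, i.e. which influences
# matter most for accuracy in this (synthetic) setting.
print(export_text(tree, feature_names=["speed", "occlusion", "tag_height"]))
```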

BlindSpotNet: Seeing Where We Cannot See

  • Authors: Taichi Fukuda, Kotaro Hasegawa, Shinya Ishizaki, Shohei Nobuhara, Ko Nishino
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03870
  • Pdf link: https://arxiv.org/pdf/2207.03870
  • Abstract We introduce 2D blind spot estimation as a critical visual task for road scene understanding. By automatically detecting road regions that are occluded from the vehicle's vantage point, we can proactively alert a manual driver or a self-driving system to potential causes of accidents (e.g., draw attention to a road region from which a child may spring out). Detecting blind spots in full 3D would be challenging, as 3D reasoning on the fly, even if the car is equipped with LiDAR, would be prohibitively expensive and error-prone. We instead propose to learn to estimate blind spots in 2D, just from a monocular camera. We achieve this in two steps. We first introduce an automatic method for generating "ground-truth" blind spot training data for arbitrary driving videos by leveraging monocular depth estimation, semantic segmentation, and SLAM. The key idea is to reason in 3D but from 2D images by defining blind spots as those road regions that are currently invisible but become visible in the near future. We construct a large-scale dataset with this automatic offline blind spot estimation, which we refer to as the Road Blind Spot (RBS) dataset. Next, we introduce BlindSpotNet (BSN), a simple network that fully leverages this dataset for fully automatic estimation of frame-wise blind spot probability maps for arbitrary driving videos. Extensive experimental results demonstrate the validity of our RBS dataset and the effectiveness of our BSN.

Keyword: loop detection

There is no result

Keyword: nerf

There is no result

Keyword: mapping

On Non-Linear operators for Geometric Deep Learning

  • Authors: Grégoire Sergeant-Perthuis (LML), Jakob Maier, Joan Bruna (CIMS), Edouard Oyallon (ISIR)
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
  • Arxiv link: https://arxiv.org/abs/2207.03485
  • Pdf link: https://arxiv.org/pdf/2207.03485
  • Abstract This work studies operators mapping vector and scalar fields defined over a manifold $\mathcal{M}$, and which commute with its group of diffeomorphisms $\text{Diff}(\mathcal{M})$. We prove that in the case of scalar fields $L^p_\omega(\mathcal{M},\mathbb{R})$, those operators correspond to point-wise non-linearities, recovering and extending known results on $\mathbb{R}^d$. In the context of Neural Networks defined over $\mathcal{M}$, it indicates that point-wise non-linear operators are the only universal family that commutes with any group of symmetries, and justifies their systematic use in combination with dedicated linear operators commuting with specific symmetries. In the case of vector fields $L^p_\omega(\mathcal{M},T\mathcal{M})$, we show that those operators are solely the scalar multiplication. It indicates that $\text{Diff}(\mathcal{M})$ is too rich and that there is no universal class of non-linear operators to motivate the design of Neural Networks over the symmetries of $\mathcal{M}$.

Deep Learning to Jointly Schema Match, Impute, and Transform Databases

  • Authors: Sandhya Tripathi, Bradley A. Fritz, Mohamed Abdelhack, Michael S. Avidan, Yixin Chen, Christopher R. King
  • Subjects: Databases (cs.DB); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2207.03536
  • Pdf link: https://arxiv.org/pdf/2207.03536
  • Abstract An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in health care. We approach this issue in the common but difficult case of numeric features such as nearly Gaussian and binary features, where unit changes and variable shift make simple matching of univariate summaries unsuccessful. We develop two novel procedures to address this problem. First, we demonstrate multiple methods of "fingerprinting" a feature based on its associations to other features. In the setting of even modest prior information, this allows most shared features to be accurately identified. Second, we demonstrate a deep learning algorithm for translation between databases. Unlike prior approaches, our algorithm takes advantage of discovered mappings while identifying surrogates for unshared features and learning transformations. In synthetic and real-world experiments using two electronic health record databases, our algorithms outperform existing baselines for matching variable sets, while jointly learning to impute unshared or transformed variables.
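
A toy sketch of the fingerprinting idea: each unmapped column is described by its correlations with a small set of already-matched anchor columns, and columns are matched across databases by fingerprint similarity. The Pearson-correlation fingerprint and the greedy cosine matching are simplifying assumptions, not the paper's exact procedures.

```python
import numpy as np

def fingerprints(X_unknown, X_anchors):
    """Fingerprint each unmapped column by its Pearson correlations with a set of
    already-matched anchor columns (the anchors play the role of the 'modest prior
    information' the abstract mentions)."""
    n, m = X_unknown.shape[1], X_anchors.shape[1]
    fp = np.empty((n, m))
    for j in range(n):
        for k in range(m):
            fp[j, k] = np.corrcoef(X_unknown[:, j], X_anchors[:, k])[0, 1]
    return fp

def match_columns(fp_a, fp_b):
    """Match each column of database A to its most similar column of database B
    by cosine similarity of fingerprints (greedy; one-to-one not enforced)."""
    a = fp_a / np.linalg.norm(fp_a, axis=1, keepdims=True)
    b = fp_b / np.linalg.norm(fp_b, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)

# Toy example: database B holds the same three hidden features as A, permuted and rescaled.
# Correlation-based fingerprints are invariant to the unit change, so matching succeeds.
rng = np.random.default_rng(4)
anchors = rng.normal(size=(500, 4))
hidden = anchors @ rng.normal(size=(4, 3)) + 0.3 * rng.normal(size=(500, 3))
db_a, db_b = hidden, 10.0 * hidden[:, [2, 0, 1]] + 5.0
print(match_columns(fingerprints(db_a, anchors), fingerprints(db_b, anchors)))  # expect [1 2 0]
```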

Hyper-Universal Policy Approximation: Learning to Generate Actions from a Single Image using Hypernets

  • Authors: Dimitrios C. Gklezakos, Rishi Jha, Rajesh P. N. Rao
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2207.03593
  • Pdf link: https://arxiv.org/pdf/2207.03593
  • Abstract Inspired by Gibson's notion of object affordances in human vision, we ask the question: how can an agent learn to predict an entire action policy for a novel object or environment given only a single glimpse? To tackle this problem, we introduce the concept of Universal Policy Functions (UPFs) which are state-to-action mappings that generalize not only to new goals but most importantly to novel, unseen environments. Specifically, we consider the problem of efficiently learning such policies for agents with limited computational and communication capacity, constraints that are frequently encountered in edge devices. We propose the Hyper-Universal Policy Approximator (HUPA), a hypernetwork-based model to generate small task- and environment-conditional policy networks from a single image, with good generalization properties. Our results show that HUPAs significantly outperform an embedding-based alternative for generated policies that are size-constrained. Although this work is restricted to a simple map-based navigation task, future work includes applying the principles behind HUPAs to learning more general affordances for objects and environments.
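
For readers unfamiliar with hypernetworks, the sketch below shows the general pattern of generating a small policy network's weights from an image embedding. All sizes and the single-linear-layer policy are invented for illustration; the actual HUPA architecture differs.

```python
import torch
import torch.nn as nn

class HyperPolicy(nn.Module):
    """Minimal hypernetwork sketch: an image embedding is mapped to the weights and
    biases of a tiny state-to-action policy, which is then applied to the state."""

    def __init__(self, embed_dim=64, state_dim=8, action_dim=4):
        super().__init__()
        self.state_dim, self.action_dim = state_dim, action_dim
        n_params = state_dim * action_dim + action_dim   # weights + biases of the policy
        self.hyper = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                   nn.Linear(128, n_params))

    def forward(self, image_embedding, state):
        params = self.hyper(image_embedding)
        W = params[: self.state_dim * self.action_dim].view(self.action_dim, self.state_dim)
        b = params[self.state_dim * self.action_dim:]
        return state @ W.T + b    # action logits from the generated policy

policy = HyperPolicy()
print(policy(torch.randn(64), torch.randn(8)).shape)   # torch.Size([4])
```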

Abs-CAM: A Gradient Optimization Interpretable Approach for Explanation of Convolutional Neural Networks

  • Authors: Chunyan Zeng, Kang Yan, Zhifeng Wang, Yan Yu, Shiyan Xia, Nan Zhao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2207.03648
  • Pdf link: https://arxiv.org/pdf/2207.03648
  • Abstract The black-box nature of Deep Neural Networks (DNNs) severely hinders their performance improvement and application in specific scenes. In recent years, class activation mapping-based methods have been widely used to interpret the internal decisions of models in computer vision tasks. However, when these methods use backpropagation to obtain gradients, noise appears in the saliency map, and features that are irrelevant to the decision may even be located. In this paper, we propose an Absolute value Class Activation Mapping-based (Abs-CAM) method, which optimizes the gradients derived from backpropagation and turns all of them into positive gradients to enhance the visual features of the output neurons' activation and improve the localization ability of the saliency map. The framework of Abs-CAM is divided into two phases: generating the initial saliency map and generating the final saliency map. The first phase improves the localization ability of the saliency map by optimizing the gradients, and the second phase linearly combines the initial saliency map with the original image to enhance its semantic information. We conduct qualitative and quantitative evaluations of the proposed method, including Deletion, Insertion, and Pointing Game. The experimental results show that Abs-CAM clearly eliminates the noise in the saliency map, better locates the features related to the decision, and is superior to previous methods in recognition and localization tasks.
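
The two-phase procedure is simple enough to sketch directly. Below, activations and gradients are assumed to have been extracted from a target convolutional layer already, the upsampling is nearest-neighbour, and the 0.5/0.5 blending weights in phase 2 are assumptions rather than the paper's choice.

```python
import numpy as np

def abs_cam(activations, gradients, image):
    """Sketch of the two-phase procedure for one image.

    activations: (C, H, W) feature maps from a target convolutional layer.
    gradients:   (C, H, W) gradients of the class score w.r.t. those maps.
    image:       (H_img, W_img) grayscale image in [0, 1] (toy; the paper uses RGB).

    Phase 1: turn all gradients positive (absolute value), average them per channel,
    and use the result to weight the activation maps.
    Phase 2: linearly combine the upsampled saliency map with the original image.
    """
    weights = np.abs(gradients).mean(axis=(1, 2))            # per-channel positive weights
    saliency = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    saliency = saliency / (saliency.max() + 1e-8)
    # Nearest-neighbour upsampling to image resolution (bilinear in practice).
    ry = image.shape[0] // saliency.shape[0]
    rx = image.shape[1] // saliency.shape[1]
    saliency_up = np.kron(saliency, np.ones((ry, rx)))
    return 0.5 * saliency_up + 0.5 * image                    # phase 2: linear combination

# Toy usage with random tensors standing in for a real CNN's activations/gradients.
rng = np.random.default_rng(5)
print(abs_cam(rng.normal(size=(8, 7, 7)), rng.normal(size=(8, 7, 7)),
              rng.uniform(size=(28, 28))).shape)
```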

Stability of Aggregation Graph Neural Networks

  • Authors: Alejandro Parada-Mayorga, Zhiyang Wang, Fernando Gama, Alejandro Ribeiro
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2207.03678
  • Pdf link: https://arxiv.org/pdf/2207.03678
  • Abstract In this paper we study the stability properties of aggregation graph neural networks (Agg-GNNs) considering perturbations of the underlying graph. An Agg-GNN is a hybrid architecture where information is defined on the nodes of a graph, but it is processed block-wise by Euclidean CNNs on the nodes after several diffusions on the graph shift operator. We derive stability bounds for the mapping operator associated to a generic Agg-GNN, and we specify conditions under which such operators can be stable to deformations. We prove that the stability bounds are defined by the properties of the filters in the first layer of the CNN that acts on each node. Additionally, we show that there is a close relationship between the number of aggregations, the filter's selectivity, and the size of the stability constants. We also conclude that in Agg-GNNs the selectivity of the mapping operators is tied to the properties of the filters only in the first layer of the CNN stage. This shows a substantial difference with respect to the stability properties of selection GNNs, where the selectivity of the filters in all layers is constrained by their stability. We provide numerical evidence corroborating the results derived, testing the behavior of Agg-GNNs in real life application scenarios considering perturbations of different magnitude.

Distributed Ranging SLAM for Multiple Robots with Ultra-WideBand and Odometry Measurements

  • Authors: Ran Liu, Zhongyuan Deng, Zhiqiang Cao, Muhammad Shalihan, Billy Pik Lik Lau, Kaixiang Chen, Kaushik Bhowmik, Chau Yuen, U-Xuan Tan
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2207.03700
  • Pdf link: https://arxiv.org/pdf/2207.03700
  • Abstract To accomplish tasks efficiently in a multi-robot system, a problem that has to be addressed is Simultaneous Localization and Mapping (SLAM). LiDAR (Light Detection and Ranging) has been used for many SLAM solutions due to its superb accuracy, but its performance degrades in featureless environments such as tunnels or long corridors. Centralized SLAM solves the problem with a cloud server, which requires a huge amount of computational resources and lacks robustness against central-node failure. To address these issues, we present a distributed SLAM solution that estimates the trajectory of a group of robots using Ultra-WideBand (UWB) ranging and odometry measurements. The proposed approach distributes the processing among the robot team and significantly mitigates the computational burden of centralized SLAM. Our solution determines the relative pose (also known as a loop closure) between two robots by minimizing the UWB ranging residuals taken at different positions when the robots are in close proximity. UWB provides a good distance measure in line-of-sight conditions, but retrieving a precise pose estimate remains a challenge due to ranging noise and the unpredictable path traveled by the robot. To deal with suspicious loop closures, we use Pairwise Consistency Maximization (PCM) to examine the quality of loop closures and perform outlier rejection. The filtered loop closures are then fused with odometry in a distributed pose graph optimization (DPGO) module to recover the full trajectory of the robot team. Extensive experiments are conducted to validate the effectiveness of the proposed approach.

A Deep Learning-Based Framework for Low Complexity Multi-User MIMO Precoding Design

  • Authors: Maojun Zhang, Jiabao Gao, Caijun Zhong
  • Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2207.03765
  • Pdf link: https://arxiv.org/pdf/2207.03765
  • Abstract Using precoding to suppress multi-user interference is a well-known technique to improve spectral efficiency in multiuser multiple-input multiple-output (MU-MIMO) systems, and the pursuit of high-performance, low-complexity precoding methods has been a focus over the last decade. Traditional algorithms, including the zero-forcing (ZF) algorithm and the weighted minimum mean square error (WMMSE) algorithm, fail to achieve a satisfactory trade-off between complexity and performance. In this paper, leveraging the power of deep learning, we propose a low-complexity precoding design framework for MU-MIMO systems. The key idea is to transform the MIMO precoding problem into the multiple-input single-output precoding problem, where the optimal precoding structure can be obtained in closed form. A customized deep neural network is designed to fit the mapping from the channels to the precoding matrix. In addition, input dimensionality reduction, network pruning, and recovery module compression are used to further improve computational efficiency. Furthermore, the extension to the practical MIMO orthogonal frequency-division multiplexing (MIMO-OFDM) system is studied. Simulation results show that the proposed low-complexity precoding scheme achieves similar performance to the WMMSE algorithm with very low computational complexity.

Keyword: localization

Abs-CAM: A Gradient Optimization Interpretable Approach for Explanation of Convolutional Neural Networks

  • Authors: Chunyan Zeng, Kang Yan, Zhifeng Wang, Yan Yu, Shiyan Xia, Nan Zhao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2207.03648
  • Pdf link: https://arxiv.org/pdf/2207.03648
  • Abstract The black-box nature of Deep Neural Networks (DNNs) severely hinders their performance improvement and application in specific scenes. In recent years, class activation mapping-based methods have been widely used to interpret the internal decisions of models in computer vision tasks. However, when these methods use backpropagation to obtain gradients, noise appears in the saliency map, and features that are irrelevant to the decision may even be located. In this paper, we propose an Absolute value Class Activation Mapping-based (Abs-CAM) method, which optimizes the gradients derived from backpropagation and turns all of them into positive gradients to enhance the visual features of the output neurons' activation and improve the localization ability of the saliency map. The framework of Abs-CAM is divided into two phases: generating the initial saliency map and generating the final saliency map. The first phase improves the localization ability of the saliency map by optimizing the gradients, and the second phase linearly combines the initial saliency map with the original image to enhance its semantic information. We conduct qualitative and quantitative evaluations of the proposed method, including Deletion, Insertion, and Pointing Game. The experimental results show that Abs-CAM clearly eliminates the noise in the saliency map, better locates the features related to the decision, and is superior to previous methods in recognition and localization tasks.

Learning High-quality Proposals for Acne Detection

  • Authors: Jianwei Zhang, Lei Zhang, Junyou Wang, Xin Wei, Jiaqi Li, Xian Jiang, Dan Du
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03674
  • Pdf link: https://arxiv.org/pdf/2207.03674
  • Abstract Acne detection is crucial for interpretative diagnosis and precise treatment of skin disease. The arbitrary boundaries and small sizes of acne lesions lead to a significant number of poor-quality proposals in two-stage detection. In this paper, we propose a novel head structure for the Region Proposal Network to improve the proposals' quality in two ways. First, a Spatial Aware Double Head (SADH) structure is proposed to disentangle the representation learning for classification and localization from two different spatial perspectives. The proposed SADH ensures a steeper classification confidence gradient and suppresses proposals having low intersection-over-union (IoU) with the matched ground truth. Then, we propose a Normalized Wasserstein Distance prediction branch to improve the correlation between the proposals' classification scores and IoUs. In addition, to facilitate further research on acne detection, we construct a new dataset named AcneSCU, with high-resolution images, precise annotations, and fine-grained lesion categories. Extensive experiments are conducted on both AcneSCU and the public dataset ACNE04, and the results demonstrate that the proposed method can improve the proposals' quality, consistently outperforming state-of-the-art approaches. Code and the collected dataset are available at https://github.com/pingguokiller/acnedetection.

Distributed Ranging SLAM for Multiple Robots with Ultra-WideBand and Odometry Measurements

  • Authors: Ran Liu, Zhongyuan Deng, Zhiqiang Cao, Muhammad Shalihan, Billy Pik Lik Lau, Kaixiang Chen, Kaushik Bhowmik, Chau Yuen, U-Xuan Tan
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2207.03700
  • Pdf link: https://arxiv.org/pdf/2207.03700
  • Abstract To accomplish tasks efficiently in a multi-robot system, a problem that has to be addressed is Simultaneous Localization and Mapping (SLAM). LiDAR (Light Detection and Ranging) has been used for many SLAM solutions due to its superb accuracy, but its performance degrades in featureless environments such as tunnels or long corridors. Centralized SLAM solves the problem with a cloud server, which requires a huge amount of computational resources and lacks robustness against central-node failure. To address these issues, we present a distributed SLAM solution that estimates the trajectory of a group of robots using Ultra-WideBand (UWB) ranging and odometry measurements. The proposed approach distributes the processing among the robot team and significantly mitigates the computational burden of centralized SLAM. Our solution determines the relative pose (also known as a loop closure) between two robots by minimizing the UWB ranging residuals taken at different positions when the robots are in close proximity. UWB provides a good distance measure in line-of-sight conditions, but retrieving a precise pose estimate remains a challenge due to ranging noise and the unpredictable path traveled by the robot. To deal with suspicious loop closures, we use Pairwise Consistency Maximization (PCM) to examine the quality of loop closures and perform outlier rejection. The filtered loop closures are then fused with odometry in a distributed pose graph optimization (DPGO) module to recover the full trajectory of the robot team. Extensive experiments are conducted to validate the effectiveness of the proposed approach.

Decision Trees for Analyzing Influences on the Accuracy of Indoor Localization Systems

  • Authors: Jakob Schyga, Swantje Plambeck, Johannes Hinckeldeyn, Görschwin Fey, Jochen Kreutzfeldt
  • Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2207.03853
  • Pdf link: https://arxiv.org/pdf/2207.03853
  • Abstract Absolute position accuracy is the key performance criterion of an Indoor Localization System (ILS). Since ILS are heterogeneous and complex cyber-physical systems, the localization accuracy depends on various influences from the environment, system configuration, and the application processes. To determine the position accuracy of a system in a reproducible, comparable, and realistic manner, these factors must be taken into account. We propose a strategy for analyzing the influences on the position accuracy of ILS using decision trees in combination with application-related or technology-related categorization. The proposed strategy is validated using empirical data from 120 experiments. The accuracy of an Ultra-Wideband and a LiDAR-based ILS was determined under different application-driven influencing factors, considering the application of autonomous mobile robots in warehouses. Finally, the opportunities and limitations of analyzing decision trees to compare system performance, find a suitable system, optimize the environment or system configuration, and understand the relevance of different influencing factors are presented.

BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization

  • Authors: Sheng Kuang, Kiki van der Heijden, Siamak Mehrkanoon
  • Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2207.03927
  • Pdf link: https://arxiv.org/pdf/2207.03927
  • Abstract Accurate sound localization in a reverberant environment is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been utilized to model the binaural human auditory pathway. However, CNNs show barriers in capturing global acoustic features. To address this issue, we propose a novel end-to-end Binaural Audio Spectrogram Transformer (BAST) model to predict the sound azimuth in both anechoic and reverberant environments. Two modes of implementation, i.e., BAST-SP and BAST-NSP, corresponding to the BAST model with shared and non-shared parameters respectively, are explored. Our model with subtraction interaural integration and hybrid loss achieves an angular distance of 1.29 degrees and a Mean Square Error of 1e-3 across all azimuths, significantly surpassing CNN-based models. The exploratory analysis of BAST's performance on the left-right hemifields and in anechoic and reverberant environments shows its generalization ability as well as the feasibility of binaural Transformers in sound localization. Furthermore, an analysis of the attention maps is provided to give additional insight into the interpretation of the localization process in a natural reverberant environment.

Keyword: transformer

Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection

  • Authors: Xiurong Jiang, Lin Zhu, Yifan Hou, Hui Tian
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03558
  • Pdf link: https://arxiv.org/pdf/2207.03558
  • Abstract RGB-thermal salient object detection (RGB-T SOD) aims to locate the common prominent objects of an aligned visible and thermal infrared image pair and accurately segment all the pixels belonging to those objects. It is promising in challenging scenes such as nighttime and complex backgrounds due to the insensitivity of thermal images to lighting conditions. Thus, the key problem of RGB-T SOD is to make the features from the two modalities complement and adjust each other flexibly, since it is inevitable that either modality of an RGB-T image pair may fail in challenging scenes such as extreme lighting conditions and thermal crossover. In this paper, we propose a novel mirror complementary Transformer network (MCNet) for RGB-T SOD. Specifically, we introduce a Transformer-based feature extraction module to effectively extract hierarchical features of RGB and thermal images. Then, through attention-based feature interaction and serial multiscale dilated convolution (SDC) based feature fusion modules, the proposed model achieves complementary interaction of low-level features and semantic fusion of deep features. Finally, based on the mirror complementary structure, the salient regions of the two modalities can be accurately extracted even when one modality is invalid. To demonstrate the robustness of the proposed model under challenging scenes in the real world, we build a novel RGB-T SOD dataset, VT723, based on a large public semantic segmentation RGB-T dataset used in the autonomous driving domain. Extensive experiments on benchmark and VT723 datasets show that the proposed method outperforms state-of-the-art approaches, including CNN-based and Transformer-based methods. The code and dataset will be released later at https://github.com/jxr326/SwinMCNet.

More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity

  • Authors: Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Mykola Pechenizkiy, Decebal Mocanu, Zhangyang Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03620
  • Pdf link: https://arxiv.org/pdf/2207.03620
  • Abstract Transformers have quickly shined in the computer vision world since the emergence of Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) seems to be challenged by increasingly effective transformer-based models. Very recently, a couple of advanced convolutional models strike back with large kernels motivated by the local but large attention mechanism, showing appealing performance and efficiency. While one of them, i.e. RepLKNet, impressively manages to scale the kernel size to 31x31 with improved performance, the performance starts to saturate as the kernel size continues growing, compared to the scaling trend of advanced ViTs such as Swin Transformer. In this paper, we explore the possibility of training extreme convolutions larger than 31x31 and test whether the performance gap can be eliminated by strategically enlarging convolutions. This study ends up with a recipe for applying extremely large kernels from the perspective of sparsity, which can smoothly scale up kernels to 61x61 with better performance. Built on this recipe, we propose Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with 51x51 kernels that can perform on par with or better than state-of-the-art hierarchical Transformers and modern ConvNet architectures like ConvNeXt and RepLKNet, on ImageNet classification as well as typical downstream tasks. Our code is available here https://github.com/VITA-Group/SLaK.
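
Purely as an illustration of "large kernel + sparsity" (not the SLaK recipe itself, which also involves kernel decomposition and dynamic sparse training), the sketch below masks most weights of an extreme depthwise kernel with a fixed random sparsity pattern:

```python
import torch
import torch.nn as nn

class SparseLargeKernelDWConv(nn.Module):
    """Illustrative only: a 51x51 depthwise convolution whose weights are masked by a
    fixed random sparsity pattern, showing how sparsity keeps the effective parameter
    count of an extreme kernel manageable. The actual SLaK recipe is more involved."""

    def __init__(self, channels, kernel_size=51, density=0.3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels, bias=False)
        mask = (torch.rand_like(self.conv.weight) < density).float()
        self.register_buffer("mask", mask)     # fixed sparsity pattern (static, not dynamic)

    def forward(self, x):
        # Apply the mask at every forward pass so pruned weights stay at zero.
        return nn.functional.conv2d(x, self.conv.weight * self.mask,
                                    padding=self.conv.padding,
                                    groups=self.conv.groups)

x = torch.randn(1, 16, 64, 64)
print(SparseLargeKernelDWConv(16)(x).shape)   # torch.Size([1, 16, 64, 64])
```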

Music-driven Dance Regeneration with Controllable Key Pose Constraints

  • Authors: Junfu Pu, Ying Shan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2207.03682
  • Pdf link: https://arxiv.org/pdf/2207.03682
  • Abstract In this paper, we propose a novel framework for music-driven dance motion synthesis with controllable key pose constraints. In contrast to methods that generate dance motion sequences based only on music without any other controllable conditions, this work targets synthesizing high-quality dance motion driven by music as well as customized poses performed by users. Our model involves two single-modal transformer encoders for music and motion representations and a cross-modal transformer decoder for dance motion generation. The cross-modal transformer decoder achieves the capability of synthesizing smooth dance motion sequences that keep consistency with the key poses at the corresponding positions by introducing a local neighbor position embedding. This mechanism makes the decoder more sensitive to key poses and their corresponding positions. Our dance synthesis model achieves satisfactory performance on both quantitative and qualitative evaluations with extensive experiments, which demonstrates the effectiveness of our proposed method.

VidConv: A modernized 2D ConvNet for Efficient Video Recognition

  • Authors: Chuong H. Nguyen, Su Huynh, Vinh Nguyen, Ngoc Nguyen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03782
  • Pdf link: https://arxiv.org/pdf/2207.03782
  • Abstract Since being introduced in 2020, Vision Transformers (ViT) have been steadily breaking the record for many vision tasks and are often described as "all-you-need" to replace ConvNets. Despite that, ViTs are generally computationally expensive, memory-consuming, and unfriendly for embedded devices. In addition, recent research shows that standard ConvNets, if redesigned and trained appropriately, can compete favorably with ViT in terms of accuracy and scalability. In this paper, we adopt the modernized structure of ConvNet to design a new backbone for action recognition. Particularly, our main target is to serve industrial product deployment, such as FPGA boards on which only standard operations are supported. Therefore, our network simply consists of 2D convolutions, without using any 3D convolution, long-range attention plugin, or Transformer blocks. While being trained with far fewer epochs (5x-10x), our backbone surpasses methods using (2+1)D and 3D convolution, and achieves comparable results with ViT on two benchmark datasets.

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

  • Authors: Yongqi Wang, Zhou Zhao
  • Subjects: Sound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2207.03800
  • Pdf link: https://arxiv.org/pdf/2207.03800
  • Abstract Unconstrained lip-to-speech synthesis aims to generate corresponding speeches from silent videos of talking faces with no restriction on head poses or vocabulary. Current works mainly use sequence-to-sequence models to solve this problem, either in an autoregressive architecture or a flow-based non-autoregressive architecture. However, these models suffer from several drawbacks: 1) Instead of directly generating audios, they use a two-stage pipeline that first generates mel-spectrograms and then reconstructs audios from the spectrograms. This causes cumbersome deployment and degradation of speech quality due to error propagation; 2) The audio reconstruction algorithm used by these models limits the inference speed and audio quality, while neural vocoders are not available for these models since their output spectrograms are not accurate enough; 3) The autoregressive model suffers from high inference latency, while the flow-based model has high memory occupancy: neither of them is efficient enough in both time and memory usage. To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency, and has a relatively small model size. Besides, different from the widely used 3D-CNN visual frontend for lip movement encoding, we for the first time propose a transformer-based visual frontend for this task. Experiments show that our model achieves $19.76\times$ speedup for audio waveform generation compared with the current autoregressive model on input sequences of 3 seconds, and obtains superior audio quality.

Boosting Zero-shot Learning via Contrastive Optimization of Attribute Representations

  • Authors: Yu Du, Miaojing Shi, Fangyun Wei, Guoqi Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03824
  • Pdf link: https://arxiv.org/pdf/2207.03824
  • Abstract Zero-shot learning (ZSL) aims to recognize classes that do not have samples in the training set. One representative solution is to directly learn an embedding function associating visual features with corresponding class semantics for recognizing new classes. Many methods extend upon this solution, and recent ones are especially keen on extracting rich features from images, e.g. attribute features. These attribute features are normally extracted within each individual image; however, the common traits for features across images yet belonging to the same attribute are not emphasized. In this paper, we propose a new framework to boost ZSL by explicitly learning attribute prototypes beyond images and contrastively optimizing them with attribute-level features within images. Besides the novel architecture, two elements are highlighted for attribute representations: a new prototype generation module is designed to generate attribute prototypes from attribute semantics; a hard example-based contrastive optimization scheme is introduced to reinforce attribute-level features in the embedding space. We explore two alternative backbones, CNN-based and transformer-based, to build our framework and conduct experiments on three standard benchmarks, CUB, SUN, AwA2. Results on these benchmarks demonstrate that our method improves the state of the art by a considerable margin. Our codes will be available at https://github.com/dyabel/CoAR-ZSL.git

Consecutive Pretraining: A Knowledge Transfer Learning Strategy with Relevant Unlabeled Data for Remote Sensing Domain

  • Authors: Tong Zhang, Peng Gao, Hao Dong, Yin Zhuang, Guanqun Wang, Wei Zhang, He Chen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03860
  • Pdf link: https://arxiv.org/pdf/2207.03860
  • Abstract Currently, under supervised learning, a model pretrained by a large-scale nature scene dataset and then fine-tuned on a few specific task labeling data is the paradigm that has dominated the knowledge transfer learning. It has reached the status of consensus solution for task-aware model training in remote sensing domain (RSD). Unfortunately, due to different categories of imaging data and stiff challenges of data annotation, there is not a large enough and uniform remote sensing dataset to support large-scale pretraining in RSD. Moreover, pretraining models on large-scale nature scene datasets by supervised learning and then directly fine-tuning on diverse downstream tasks seems to be a crude method, which is easily affected by inevitable labeling noise, severe domain gaps and task-aware discrepancies. Thus, in this paper, considering the self-supervised pretraining and powerful vision transformer (ViT) architecture, a concise and effective knowledge transfer learning strategy called ConSecutive PreTraining (CSPT) is proposed based on the idea of not stopping pretraining in natural language processing (NLP), which can gradually bridge the domain gap and transfer knowledge from the nature scene domain to the RSD. The proposed CSPT also can release the huge potential of unlabeled data for task-aware model training. Finally, extensive experiments are carried out on twelve datasets in RSD involving three types of downstream tasks (e.g., scene classification, object detection and land cover classification) and two types of imaging data (e.g., optical and SAR). The results show that by utilizing the proposed CSPT for task-aware model training, almost all downstream tasks in RSD can outperform the previous method of supervised pretraining-then-fine-tuning and even surpass the state-of-the-art (SOTA) performance without any expensive labeling consumption and careful model design.

Learning Sequential Descriptors for Sequence-based Visual Place Recognition

  • Authors: Riccardo Mereu, Gabriele Trivigno, Gabriele Berton, Carlo Masone, Barbara Caputo
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03868
  • Pdf link: https://arxiv.org/pdf/2207.03868
  • Abstract In robotics, Visual Place Recognition is a continuous process that receives as input a video stream to produce a hypothesis of the robot's current position within a map of known places. This task requires robust, scalable, and efficient techniques for real applications. This work proposes a detailed taxonomy of techniques using sequential descriptors, highlighting different mechanisms to fuse the information from the individual images. This categorization is supported by a complete benchmark of experimental results that provides evidence on the strengths and weaknesses of these different architectural choices. In comparison to existing sequential descriptor methods, we further investigate the viability of Transformers instead of CNN backbones, and we propose a new ad-hoc sequence-level aggregator called SeqVLAD, which outperforms the prior state of the art on different datasets. The code is available at https://github.com/vandal-vpr/vg-transformers.

RePFormer: Refinement Pyramid Transformer for Robust Facial Landmark Detection

  • Authors: Jinpeng Li, Haibo Jin, Shengcai Liao, Ling Shao, Pheng-Ann Heng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03917
  • Pdf link: https://arxiv.org/pdf/2207.03917
  • Abstract This paper presents a Refinement Pyramid Transformer (RePFormer) for robust facial landmark detection. Most facial landmark detectors focus on learning representative image features. However, these CNN-based feature representations are not robust enough to handle complex real-world scenarios due to ignoring the internal structure of landmarks, as well as the relations between landmarks and context. In this work, we formulate the facial landmark detection task as refining landmark queries along pyramid memories. Specifically, a pyramid transformer head (PTH) is introduced to build both homologous relations among landmarks and heterologous relations between landmarks and cross-scale contexts. Besides, a dynamic landmark refinement (DLR) module is designed to decompose the landmark regression into an end-to-end refinement procedure, where the dynamically aggregated queries are transformed to residual coordinates predictions. Extensive experimental results on four facial landmark detection benchmarks and their various subsets demonstrate the superior performance and high robustness of our framework.

BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization

  • Authors: Sheng Kuang, Kiki van der Heijden, Siamak Mehrkanoon
  • Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2207.03927
  • Pdf link: https://arxiv.org/pdf/2207.03927
  • Abstract Accurate sound localization in a reverberation environment is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been utilized to model the binaural human auditory pathway. However, CNN shows barriers in capturing the global acoustic features. To address this issue, we propose a novel end-to-end Binaural Audio Spectrogram Transformer (BAST) model to predict the sound azimuth in both anechoic and reverberation environments. Two modes of implementation, i.e. BAST-SP and BAST-NSP corresponding to BAST model with shared and non-shared parameters respectively, are explored. Our model with subtraction interaural integration and hybrid loss achieves an angular distance of 1.29 degrees and a Mean Square Error of 1e-3 at all azimuths, significantly surpassing CNN based model. The exploratory analysis of the BAST's performance on the left-right hemifields and anechoic and reverberation environments shows its generalization ability as well as the feasibility of binaural Transformers in sound localization. Furthermore, the analysis of the attention maps is provided to give additional insights on the interpretation of the localization process in a natural reverberant environment.

CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination

  • Authors: Hyounghun Kim, Abhay Zala, Mohit Bansal
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03961
  • Pdf link: https://arxiv.org/pdf/2207.03961
  • Abstract As humans, we can modify our assumptions about a scene by imagining alternative objects or concepts in our minds. For example, we can easily anticipate the implications of the sun being overcast by rain clouds (e.g., the street will get wet) and accordingly prepare for that. In this paper, we introduce a new task/dataset called Commonsense Reasoning for Counterfactual Scene Imagination (CoSIm) which is designed to evaluate the ability of AI systems to reason about scene change imagination. In this task/dataset, models are given an image and an initial question-response pair about the image. Next, a counterfactual imagined scene change (in textual form) is applied, and the model has to predict the new response to the initial question based on this scene change. We collect 3.5K high-quality and challenging data instances, with each instance consisting of an image, a commonsense question with a response, a description of a counterfactual change, a new response to the question, and three distractor responses. Our dataset contains various complex scene change types (such as object addition/removal/state change, event description, environment change, etc.) that require models to imagine many different scenarios and reason about the changed scenes. We present a baseline model based on a vision-language Transformer (i.e., LXMERT) and ablation studies. Through human evaluation, we demonstrate a large human-model performance gap, suggesting room for promising future work on this challenging counterfactual, scene imagination task. Our code and dataset are publicly available at: https://github.com/hyounghk/CoSIm

k-means Mask Transformer

  • Authors: Qihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.04044
  • Pdf link: https://arxiv.org/pdf/2207.04044
  • Abstract The rise of transformers in vision tasks not only advances network backbone designs, but also starts a brand-new page to achieve end-to-end image recognition (e.g., object detection and panoptic segmentation). Originating from Natural Language Processing (NLP), transformer architectures, consisting of self-attention and cross-attention, effectively learn long-range interactions between elements in a sequence. However, we observe that most existing transformer-based vision models simply borrow the idea from NLP, neglecting the crucial difference between languages and images, particularly the extremely large sequence length of spatially flattened pixel features. This subsequently impedes the learning in cross-attention between pixel features and object queries. In this paper, we rethink the relationship between pixels and object queries and propose to reformulate the cross-attention learning as a clustering process. Inspired by the traditional k-means clustering algorithm, we develop a k-means Mask Xformer (kMaX-DeepLab) for segmentation tasks, which not only improves the state of the art, but also enjoys a simple and elegant design. As a result, our kMaX-DeepLab achieves a new state-of-the-art performance on the COCO val set with 58.0% PQ, and the Cityscapes val set with 68.4% PQ, 44.0% AP, and 83.5% mIoU, without test-time augmentation or an external dataset. We hope our work can shed some light on designing transformers tailored for vision tasks. Code and models are available at https://github.com/google-research/deeplab2
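
The clustering view of cross-attention can be sketched in a few lines: pixels are hard-assigned to their most similar object query, and each query is updated as the mean of its assigned pixel features, exactly as in one k-means step. The shapes, the plain dot-product affinity, and the absence of residual/MLP blocks are simplifications of the real kMaX-DeepLab block.

```python
import torch
import torch.nn.functional as F

def kmeans_cross_attention(queries, pixel_feats):
    """One k-means-style cross-attention step.

    queries:     (K, D) object queries / cluster centers.
    pixel_feats: (N, D) flattened pixel features.
    Each pixel is hard-assigned (argmax over queries) to its most similar query, and
    each query is updated as the mean of the pixel features assigned to it.
    """
    logits = pixel_feats @ queries.T                                     # (N, K) affinities
    assign = F.one_hot(logits.argmax(dim=1), queries.shape[0]).float()   # hard assignment
    counts = assign.sum(dim=0).clamp(min=1.0)                            # pixels per cluster
    updated = (assign.T @ pixel_feats) / counts[:, None]                 # cluster-wise mean
    return updated, assign

q = torch.randn(8, 32)
p = torch.randn(1000, 32)
new_q, a = kmeans_cross_attention(q, p)
print(new_q.shape, a.shape)   # torch.Size([8, 32]) torch.Size([1000, 8])
```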

Keyword: autonomous driving

Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection

  • Authors: Xiurong Jiang, Lin Zhu, Yifan Hou, Hui Tian
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.03558
  • Pdf link: https://arxiv.org/pdf/2207.03558
  • Abstract RGB-thermal salient object detection (RGB-T SOD) aims to locate the common prominent objects of an aligned visible and thermal infrared image pair and accurately segment all the pixels belonging to those objects. It is promising in challenging scenes such as nighttime and complex backgrounds due to the insensitivity of thermal images to lighting conditions. Thus, the key problem of RGB-T SOD is to make the features from the two modalities complement and adjust each other flexibly, since it is inevitable that either modality of an RGB-T image pair may fail in challenging scenes such as extreme lighting conditions and thermal crossover. In this paper, we propose a novel mirror complementary Transformer network (MCNet) for RGB-T SOD. Specifically, we introduce a Transformer-based feature extraction module to effectively extract hierarchical features of RGB and thermal images. Then, through attention-based feature interaction and serial multiscale dilated convolution (SDC) based feature fusion modules, the proposed model achieves complementary interaction of low-level features and semantic fusion of deep features. Finally, based on the mirror complementary structure, the salient regions of the two modalities can be accurately extracted even when one modality is invalid. To demonstrate the robustness of the proposed model under challenging scenes in the real world, we build a novel RGB-T SOD dataset, VT723, based on a large public semantic segmentation RGB-T dataset used in the autonomous driving domain. Extensive experiments on benchmark and VT723 datasets show that the proposed method outperforms state-of-the-art approaches, including CNN-based and Transformer-based methods. The code and dataset will be released later at https://github.com/jxr326/SwinMCNet.

Efficient Game-Theoretic Planning with Prediction Heuristic for Socially-Compliant Autonomous Driving

  • Authors: Chenran Li, Tu Trinh, Letian Wang, Changliu Liu, Masayoshi Tomizuka, Wei Zhan
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2207.03673
  • Pdf link: https://arxiv.org/pdf/2207.03673
  • Abstract Planning under social interactions with other agents is an essential problem for autonomous driving. As the actions of the autonomous vehicle in these interactions both affect and are affected by other agents, autonomous vehicles need to efficiently infer the reactions of the other agents. Most existing approaches formulate the problem as a generalized Nash equilibrium problem solved by optimization-based methods. However, these methods demand too many computational resources and easily fall into local minima due to non-convexity. Monte Carlo Tree Search (MCTS) successfully tackles such issues in game-theoretic problems. However, as the interaction game tree grows exponentially, general MCTS still requires a huge number of iterations to reach the optimum. In this paper, we introduce an efficient game-theoretic trajectory planning algorithm based on general MCTS by incorporating a prediction algorithm as a heuristic. On top of it, a social-compliance reward and a Bayesian inference algorithm are designed to generate diverse driving behaviors and identify the other driver's driving preferences. Results demonstrate the effectiveness of the proposed framework on datasets containing naturalistic driving behavior in highly interactive scenarios.
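
One common way to let a prediction model act as a heuristic inside MCTS is to use its output as a prior in the selection rule (PUCT-style). Whether the paper integrates its predictor exactly this way is not stated in the abstract, so the sketch below is only illustrative.

```python
import math

def puct_select(children, prior, c_puct=1.5):
    """PUCT-style child selection: balance the empirical value of each child against an
    exploration bonus scaled by the prediction model's prior probability, so candidates
    favoured by the predictor are explored first. `children` is a list of dicts with
    visit counts 'n' and accumulated value 'w'; `prior` gives the predictor's
    probability for each child."""
    total_n = sum(ch["n"] for ch in children) + 1
    def score(i):
        ch = children[i]
        q = ch["w"] / ch["n"] if ch["n"] > 0 else 0.0
        u = c_puct * prior[i] * math.sqrt(total_n) / (1 + ch["n"])
        return q + u
    return max(range(len(children)), key=score)

# Toy usage: three candidate maneuvers; the predictor strongly favours the second one.
children = [{"n": 0, "w": 0.0}, {"n": 0, "w": 0.0}, {"n": 0, "w": 0.0}]
print(puct_select(children, prior=[0.1, 0.8, 0.1]))   # -> 1
```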
