
New submissions for Fri, 6 May 22


Keyword: SLAM

BodySLAM: Joint Camera Localisation, Mapping, and Human Motion Tracking

  • Authors: Dorian Henning, Tristan Laidlow, Stefan Leutenegger
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2205.02301
  • Pdf link: https://arxiv.org/pdf/2205.02301
  • Abstract Estimating human motion from video is an active research area due to its many potential applications. Most state-of-the-art methods predict human shape and posture estimates for individual images and do not leverage the temporal information available in video. Many "in the wild" sequences of human motion are captured by a moving camera, which adds the complication of conflated camera and human motion to the estimation. We therefore present BodySLAM, a monocular SLAM system that jointly estimates the position, shape, and posture of human bodies, as well as the camera trajectory. We also introduce a novel human motion model to constrain sequential body postures and observe the scale of the scene. Through a series of experiments on video sequences of human motion captured by a moving monocular camera, we demonstrate that BodySLAM improves estimates of all human body parameters and camera poses when compared to estimating these separately.
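
The joint-estimation idea lends itself to a toy illustration. The sketch below is not the BodySLAM system: it reduces everything to 1D, uses a hand-written constant-velocity prior as a stand-in for the learned human motion model, and fabricates the measurements, but it shows why coupling camera and body residuals in a single optimization constrains both estimates.

```python
# Toy 1D sketch (not the authors' code): jointly optimize camera and body
# trajectories from relative observations, camera odometry, and a
# constant-velocity human motion prior, in one least-squares problem.
import numpy as np
from scipy.optimize import least_squares

T = 20
rng = np.random.default_rng(0)
cam_true = np.cumsum(rng.normal(0.1, 0.02, T))        # camera positions
body_true = 1.0 + 0.3 * np.arange(T)                  # body at constant velocity

obs = body_true - cam_true + rng.normal(0, 0.05, T)   # body observed from camera
odo = np.diff(cam_true) + rng.normal(0, 0.03, T - 1)  # noisy camera odometry

def residuals(x):
    cam, body = x[:T], x[T:]
    r_anchor = [cam[0] - cam_true[0]]       # fix the gauge (first camera pose)
    r_obs = (body - cam) - obs              # camera-to-body measurement term
    r_odo = np.diff(cam) - odo              # camera motion term
    r_motion = 5.0 * np.diff(body, n=2)     # motion prior: constant velocity
    return np.concatenate([r_anchor, r_obs, r_odo, r_motion])

sol = least_squares(residuals, np.zeros(2 * T))
cam_est, body_est = sol.x[:T], sol.x[T:]
print("camera RMSE:", np.sqrt(np.mean((cam_est - cam_true) ** 2)))
print("body RMSE:  ", np.sqrt(np.mean((body_est - body_true) ** 2)))
```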

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Accelerating Path Planning for Autonomous Driving with Hardware-assisted Memorization

  • Authors: Mulong Luo, G. Edward Suh
  • Subjects: Robotics (cs.RO); Hardware Architecture (cs.AR)
  • Arxiv link: https://arxiv.org/abs/2205.02754
  • Pdf link: https://arxiv.org/pdf/2205.02754
  • Abstract Path planning for autonomous driving with dynamic obstacles poses a challenge because it needs to perform a higher-dimensional search (including time) while still meeting real-time constraints. This paper proposes an algorithm-hardware co-optimization approach to accelerate path planning with a high-dimensional search space. First, we reduce the time for nearest neighbor search and collision detection by mapping nodes and obstacles to a lower-dimensional space and memorizing recent search results. Then, we propose a hardware extension for efficient memorization. The experimental results on a modern processor and a cycle-level simulator show that the execution time is reduced significantly.
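
The memorization idea has a compact software-only illustration. The sketch below does not model the paper's hardware extension, and the quantization step and obstacle are made up for the example: states are projected to a low-dimensional quantized key so that repeated collision queries hit an LRU cache.

```python
# Minimal sketch of memoized collision checking: quantize states into a
# lower-dimensional cache key so nearby queries reuse a previous answer.
from functools import lru_cache

CELL = 0.5  # quantization step: coarser cells give more hits, less precision

def project(state):
    # drop the time dimension and quantize (x, y) into a low-dimensional key
    x, y, t = state
    return (round(x / CELL), round(y / CELL))

def expensive_collision_check(key):
    # placeholder for the real geometric test against obstacles
    x, y = key
    return (x - 10) ** 2 + (y - 6) ** 2 < 9  # a disc obstacle, for illustration

@lru_cache(maxsize=4096)
def collision_cached(key):
    return expensive_collision_check(key)

def is_colliding(state):
    return collision_cached(project(state))

# a planner queries many nearby (x, y, t) states; the quantized key collapses
# them onto a handful of cached cells
for t in range(100):
    is_colliding((5.01 + 0.001 * t, 3.0, t))
print(collision_cached.cache_info())  # hits dominate misses (99 hits, 1 miss)
```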

Keyword: mapping

Fine-Grained Address Segmentation for Attention-Based Variable-Degree Prefetching

  • Authors: Pengmiao Zhang, Ajitesh Srivastava, Anant V. Nori, Rajgopal Kannan, Viktor K. Prasanna
  • Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2205.02269
  • Pdf link: https://arxiv.org/pdf/2205.02269
  • Abstract Machine learning algorithms have shown potential to improve prefetching performance by accurately predicting future memory accesses. Existing approaches are based on the modeling of text prediction, considering prefetching as a classification problem for sequence prediction. However, the vast and sparse memory address space leads to a large vocabulary, which makes this modeling impractical. The number and order of outputs for multiple cache line prefetching are also fundamentally different from text prediction. We propose TransFetch, a novel way to model prefetching. To reduce the vocabulary size, we use fine-grained address segmentation as input. To predict unordered sets of future addresses, we use delta bitmaps for multiple outputs. We apply an attention-based network to learn the mapping between input and output. Prediction experiments demonstrate that address segmentation achieves a 26%-36% higher F1-score than delta inputs and a 15%-24% higher F1-score than page & offset inputs for the SPEC 2006, SPEC 2017, and GAP benchmarks. Simulation results show that TransFetch achieves a 38.75% IPC improvement compared with no prefetching, outperforming the best-performing rule-based prefetcher BOP by 10.44% and the ML-based prefetcher Voyager by 6.64%.
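
The two data-representation ideas, address segmentation on the input side and delta bitmaps on the output side, can be sketched directly. The segment width, segment count, and bitmap range below are assumptions for illustration, not the paper's configuration.

```python
# Sketch of TransFetch's input/output representations as we read the abstract.
SEG_BITS = 4          # width of each address segment (assumed)
NUM_SEGS = 16         # a 64-bit address becomes 16 four-bit tokens
BITMAP_RANGE = 64     # predict cache-line deltas in [1, 64) (assumed)

def segment_address(addr):
    """Split a 64-bit address into small fixed-width tokens (tiny vocabulary)."""
    mask = (1 << SEG_BITS) - 1
    return [(addr >> (SEG_BITS * i)) & mask for i in reversed(range(NUM_SEGS))]

def delta_bitmap(current_line, future_lines):
    """Multi-hot label: bit d is set iff some future access is d lines ahead."""
    bits = [0] * BITMAP_RANGE
    for line in future_lines:
        d = line - current_line
        if 0 < d < BITMAP_RANGE:
            bits[d] = 1
    return bits

tokens = segment_address(0x7FFF5FBFF6A0)
label = delta_bitmap(100, [101, 103, 200])  # delta 100 is out of range: ignored
print(tokens)                    # 16 tokens, each in [0, 15]
print(label[1], label[3])        # 1 1: an unordered set of future deltas
```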

Operator inference for non-intrusive model reduction with nonlinear manifolds

  • Authors: Rudy Geelen, Stephen Wright, Karen Willcox
  • Subjects: Numerical Analysis (math.NA)
  • Arxiv link: https://arxiv.org/abs/2205.02304
  • Pdf link: https://arxiv.org/pdf/2205.02304
  • Abstract This paper proposes a novel approach for learning a data-driven quadratic manifold from high-dimensional data, then employing the quadratic manifold to derive efficient physics-based reduced-order models. The key ingredient of the approach is a polynomial mapping between high-dimensional states and a low-dimensional embedding. This mapping comprises two parts: a representation in a linear subspace (computed in this work using the proper orthogonal decomposition) and a quadratic component. The approach can be viewed as a form of data-driven closure modeling, since the quadratic component introduces directions into the approximation that lie in the orthogonal complement of the linear subspace, but without introducing any additional degrees of freedom to the low-dimensional representation. Combining the quadratic manifold approximation with the operator inference method for projection-based model reduction leads to a scalable non-intrusive approach for learning reduced-order models of dynamical systems. Applying the new approach to transport-dominated systems of partial differential equations illustrates the gains in efficiency that can be achieved over approximation in a linear subspace.
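
As we read the abstract, the approximation is x ≈ V z + W q(z), where V is a POD basis, z = V^T x are the linear coordinates, and q(z) collects the quadratic monomials of z. A minimal numpy sketch of fitting such a quadratic manifold by least squares follows; the snapshot data are random stand-ins, and this is an illustration of the construction, not the paper's implementation.

```python
# Fit a data-driven quadratic manifold: POD basis plus a least-squares
# quadratic correction that lives in the orthogonal complement of the basis.
import numpy as np

rng = np.random.default_rng(1)
n, k, r = 200, 50, 5                      # state dim, snapshots, reduced dim
X = rng.standard_normal((n, k))           # snapshot matrix (columns are states)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = U[:, :r]                              # POD basis
Z = V.T @ X                               # linear (POD) coordinates

def quad_features(Z):
    # unique quadratic monomials z_i * z_j with i <= j
    i, j = np.triu_indices(Z.shape[0])
    return Z[i] * Z[j]

Q = quad_features(Z)                      # (r(r+1)/2, k)
E = X - V @ Z                             # residual, orthogonal to range(V)
W = np.linalg.lstsq(Q.T, E.T, rcond=None)[0].T   # minimize ||E - W Q||_F

X_lin = V @ Z                             # linear-subspace reconstruction
X_quad = X_lin + W @ Q                    # quadratic-manifold reconstruction
print("linear error:   ", np.linalg.norm(X - X_lin))
print("quadratic error:", np.linalg.norm(X - X_quad))
```

Note that W's columns are combinations of the residual columns, so the correction adds directions from the orthogonal complement without enlarging the low-dimensional coordinate z, matching the closure-modeling view in the abstract.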

Towards Robust and Semantically Organised Latent Representations for Unsupervised Text Style Transfer

  • Authors: Sharan Narasimhan, Suvodip Dey, Maunendra Sankar Desarkar
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2205.02309
  • Pdf link: https://arxiv.org/pdf/2205.02309
  • Abstract Recent studies show that auto-encoder based approaches successfully perform language generation, smooth sentence interpolation, and style transfer over unseen attributes using unlabelled datasets in a zero-shot manner. The latent space geometry of such models is organised well enough to perform on datasets where the style is "coarse-grained", i.e. a small fraction of the words in a sentence is enough to determine the overall style label. A recent study uses a discrete token-based perturbation approach to map "similar" sentences ("similar" meaning low Levenshtein distance/high word overlap) close by in latent space. This definition of "similarity" does not look into the underlying nuances of the constituent words while mapping latent space neighbourhoods, and therefore fails to recognise sentences with different style-based semantics while mapping latent neighbourhoods. We introduce EPAAEs (Embedding Perturbed Adversarial AutoEncoders), which complete this perturbation model by adding a finely adjustable noise component on the continuous embedding space. We empirically show that this (a) produces a better organised latent space that clusters stylistically similar sentences together, (b) performs better on a diverse set of text style transfer tasks than similar denoising-inspired baselines, and (c) is capable of fine-grained control of style transfer strength. We also extend the text style transfer tasks to NLI datasets and show that these more complex definitions of style are learned best by EPAAE. To the best of our knowledge, extending style transfer to NLI tasks has not been explored before.
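
The key mechanical difference from token-level perturbation is easy to show: noise is injected in the continuous embedding space with a tunable scale. A schematic sketch follows; the encoder, decoder, and adversarial prior term are only named in comments, not implemented.

```python
# Sketch of embedding-space perturbation as described in the abstract:
# instead of discrete token edits, add finely adjustable Gaussian noise to
# the continuous token embeddings before the autoencoder reconstructs them.
import numpy as np

def perturb_embeddings(emb, sigma, rng=None):
    """emb: (seq_len, dim) continuous token embeddings; sigma: noise scale."""
    if rng is None:
        rng = np.random.default_rng()
    return emb + sigma * rng.standard_normal(emb.shape)

# denoising-autoencoder training step (schematic, components not implemented):
#   z = encoder(perturb_embeddings(embed(x), sigma))
#   loss = reconstruction_loss(decoder(z), x) + adversarial_prior_term(z)
emb = np.random.default_rng(1).standard_normal((12, 256))
noisy = perturb_embeddings(emb, sigma=0.1, rng=np.random.default_rng(0))
print(np.linalg.norm(noisy - emb) / np.linalg.norm(emb))  # relative perturbation
```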

FastRE: Towards Fast Relation Extraction with Convolutional Encoder and Improved Cascade Binary Tagging Framework

  • Authors: Guozheng Li, Xu Chen, Peng Wang, Jiafeng Xie, Qiqing Luo
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2205.02490
  • Pdf link: https://arxiv.org/pdf/2205.02490
  • Abstract Recent work on extracting relations from texts has achieved excellent performance. However, most existing methods pay little attention to efficiency, making it challenging to quickly extract relations from massive or streaming text data in realistic scenarios. The main efficiency bottleneck is that these methods use a Transformer-based pre-trained language model for encoding, which heavily affects training and inference speed. To address this issue, we propose a fast relation extraction model (FastRE) based on a convolutional encoder and an improved cascade binary tagging framework. Compared to previous work, FastRE employs several innovations to improve efficiency while keeping promising performance. Concretely, FastRE adopts a novel convolutional encoder architecture combining dilated convolution, gated units, and residual connections, which significantly reduces the computation cost of training and inference while maintaining satisfactory performance. Moreover, to improve the cascade binary tagging framework, FastRE first introduces a type-relation mapping mechanism to accelerate tagging and alleviate relation redundancy, and then utilizes a position-dependent adaptive thresholding strategy to obtain higher tagging accuracy and better model generalization. Experimental results demonstrate that FastRE strikes a good balance between efficiency and performance: it achieves 3-10x faster training and 7-15x faster inference with 1/100 of the parameters compared to state-of-the-art models, while its performance remains competitive.
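
Two of the tagging-side ideas can be sketched without the model itself. In the sketch below, the type-relation mapping and the per-position thresholds are invented values for illustration; the paper learns or derives its own.

```python
# Illustrative sketch of two FastRE tagging ideas (all values are made up):
# (1) a type-relation mapping that prunes candidate relations by entity type,
# (2) position-dependent adaptive thresholds instead of one global cut-off.
import numpy as np

TYPE2REL = {"PER": ["born_in", "works_for"], "ORG": ["located_in"]}

def candidate_relations(subject_type):
    # only tag relations compatible with the subject's entity type
    return TYPE2REL.get(subject_type, [])

def tag_positions(scores, thresholds):
    # a position is tagged as a span boundary iff its score clears the
    # threshold assigned to that specific position
    return np.flatnonzero(scores > thresholds)

scores = np.array([0.10, 0.62, 0.55, 0.30, 0.71, 0.49])
global_thr = np.full(6, 0.50)
adaptive_thr = np.array([0.50, 0.65, 0.50, 0.50, 0.60, 0.50])

print(candidate_relations("PER"))           # ['born_in', 'works_for']
print(tag_positions(scores, global_thr))    # [1 2 4]
print(tag_positions(scores, adaptive_thr))  # [2 4]: position 1 is cut stricter
```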

Parametric Reshaping of Portraits in Videos

  • Authors: Xiangjun Tang, Wenxin Sun, Yong-Liang Yang, Xiaogang Jin
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2205.02538
  • Pdf link: https://arxiv.org/pdf/2205.02538
  • Abstract Sharing short personalized videos on social media networks has become quite popular in recent years. This raises the need for digital retouching of portraits in videos. However, applying portrait image editing directly to individual video frames cannot generate smooth and stable video sequences. To this end, we present a robust and easy-to-use parametric method to reshape the portrait in a video and produce smooth retouched results. Given an input portrait video, our method consists of two main stages: stabilized face reconstruction and continuous video reshaping. In the first stage, we start by estimating rigid face pose transformations across video frames. Then we jointly optimize multiple frames to reconstruct an accurate face identity, followed by recovering face expressions over the entire video. In the second stage, we first reshape the reconstructed 3D face using a parametric reshaping model reflecting the weight change of the face, and then utilize the reshaped 3D face to guide the warping of video frames. We develop a novel signed distance function based dense mapping method for the warping between face contours before and after reshaping, resulting in stable warped video frames with minimal distortion. In addition, we use the 3D structure of the face to correct the dense mapping and achieve temporal consistency. We generate the final result by minimizing the background distortion through optimizing a content-aware warping mesh. Extensive experiments show that our method is able to create visually pleasing results by adjusting a simple reshaping parameter, which facilitates portrait video editing for social media and visual effects.
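
Of the pipeline above, the contour-to-contour dense mapping is the piece that is easiest to isolate. The sketch below replaces the paper's signed distance function machinery with a plain closest-point lookup (a scipy KD-tree) on synthetic contours; it only illustrates the idea of warping pixels by the displacement of their nearest contour point.

```python
# Toy illustration of a dense mapping between contours before and after
# reshaping, built from closest-point correspondences on synthetic curves.
import numpy as np
from scipy.spatial import cKDTree

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
contour_before = np.c_[np.cos(theta), np.sin(theta)]             # original
contour_after = np.c_[0.9 * np.cos(theta), 1.1 * np.sin(theta)]  # reshaped

tree = cKDTree(contour_before)

def warp(points):
    """Move each pixel by the displacement of its closest contour point."""
    _, idx = tree.query(points)
    return points + (contour_after[idx] - contour_before[idx])

pixels = np.array([[1.02, 0.0], [0.0, 0.98], [0.5, 0.5]])
print(warp(pixels))
```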

Real-time Controllable Motion Transition for Characters

  • Authors: Xiangjun Tang, He Wang, Bo Hu, Xu Gong, Ruifan Yi, Qilong Kou, Xiaogang Jin
  • Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2205.02540
  • Pdf link: https://arxiv.org/pdf/2205.02540
  • Abstract Real-time in-between motion generation is universally required in games and highly desirable in existing animation pipelines. Its core challenge lies in the need to satisfy three critical conditions simultaneously: quality, controllability, and speed, which renders any method that needs offline computation (or post-processing) or cannot incorporate (often unpredictable) user control undesirable. To this end, we propose a new real-time transition method to address the aforementioned challenges. Our approach consists of two key components: a motion manifold and conditional transitioning. The former learns important low-level motion features and their dynamics, while the latter synthesizes transitions conditioned on a target frame and the desired transition duration. We first learn a motion manifold that explicitly models the intrinsic transition stochasticity in human motions via a multi-modal mapping mechanism. Then, during generation, we design a transition model which is essentially a sampling strategy over the learned manifold, based on the target frame and the desired transition duration. We validate our method on different datasets in tasks where no post-processing or offline computation is allowed. Through exhaustive evaluation and comparison, we show that our method is able to generate high-quality motions measured under multiple metrics. Our method is also robust under various target frames (including extreme cases).
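
A drastically simplified toy of duration-conditioned transition sampling follows: latent-space interpolation with a stochastic offset standing in for the learned manifold's multi-modality. The real method samples from a learned manifold; everything below is schematic.

```python
# Schematic toy (far simpler than the paper's learned manifold): generate an
# in-between latent trajectory conditioned on a target frame and a duration.
import numpy as np

def sample_transition(z_start, z_target, duration, noise=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    alphas = np.linspace(0, 1, duration)[:, None]
    path = (1 - alphas) * z_start + alphas * z_target   # deterministic blend
    jitter = noise * rng.standard_normal(path.shape)    # stand-in stochasticity
    jitter[0] = jitter[-1] = 0       # endpoints must match current/target frame
    return path + jitter             # decode each code to a pose downstream

z0, z1 = np.zeros(32), np.ones(32)
print(sample_transition(z0, z1, duration=8).shape)      # (8, 32)
```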

Accelerating Path Planning for Autonomous Driving with Hardware-assisted Memorization

  • Authors: Mulong Luo, G. Edward Suh
  • Subjects: Robotics (cs.RO); Hardware Architecture (cs.AR)
  • Arxiv link: https://arxiv.org/abs/2205.02754
  • Pdf link: https://arxiv.org/pdf/2205.02754
  • Abstract Path planning for autonomous driving with dynamic obstacles poses a challenge because it needs to perform a higher-dimensional search (including time) while still meeting real-time constraints. This paper proposes an algorithm-hardware co-optimization approach to accelerate path planning with a high-dimensional search space. First, we reduce the time for nearest neighbor search and collision detection by mapping nodes and obstacles to a lower-dimensional space and memorizing recent search results. Then, we propose a hardware extension for efficient memorization. The experimental results on a modern processor and a cycle-level simulator show that the execution time is reduced significantly.

Dual Octree Graph Networks for Learning Adaptive Volumetric Shape Representations

  • Authors: Peng-Shuai Wang, Yang Liu, Xin Tong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2205.02825
  • Pdf link: https://arxiv.org/pdf/2205.02825
  • Abstract We present an adaptive deep representation of volumetric fields of 3D shapes and an efficient approach to learn this deep representation for high-quality 3D shape reconstruction and auto-encoding. Our method encodes the volumetric field of a 3D shape with an adaptive feature volume organized by an octree and applies a compact multilayer perceptron network for mapping the features to the field value at each 3D position. An encoder-decoder network is designed to learn the adaptive feature volume based on the graph convolutions over the dual graph of octree nodes. The core of our network is a new graph convolution operator defined over a regular grid of features fused from irregular neighboring octree nodes at different levels, which not only reduces the computational and memory cost of the convolutions over irregular neighboring octree nodes, but also improves the performance of feature learning. Our method effectively encodes shape details, enables fast 3D shape reconstruction, and exhibits good generality for modeling 3D shapes out of training categories. We evaluate our method on a set of reconstruction tasks of 3D shapes and scenes and validate its superiority over other existing approaches. Our code, data, and trained models are available at https://wang-ps.github.io/dualocnn.
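
The decoding step (interpolating stored features at a query point, then mapping them through a compact MLP to a field value) can be sketched with a regular grid standing in for the adaptive octree. The octree and the dual-graph convolutions, the paper's actual contribution, are not reproduced here, and all weights below are random placeholders.

```python
# Stand-in sketch of the field decoding described in the abstract: trilinearly
# interpolate a feature volume at a 3D query point, then apply a small MLP.
import numpy as np

rng = np.random.default_rng(0)
G, C = 8, 16                              # grid resolution, feature channels
feat = rng.standard_normal((G, G, G, C))  # "learned" feature volume (random)
W1, b1 = rng.standard_normal((C, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 1)), np.zeros(1)

def interp(p):
    """Trilinear interpolation of the feature volume at p in [0, 1]^3."""
    q = np.clip(p * (G - 1), 0, G - 1 - 1e-6)
    i = q.astype(int)
    f = q - i
    out = np.zeros(C)
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = (f[0] if dx else 1 - f[0]) * \
                    (f[1] if dy else 1 - f[1]) * \
                    (f[2] if dz else 1 - f[2])
                out += w * feat[i[0] + dx, i[1] + dy, i[2] + dz]
    return out

def field_value(p):
    h = np.maximum(interp(p) @ W1 + b1, 0)    # compact MLP with ReLU
    return (h @ W2 + b2)[0]                   # e.g. an occupancy/SDF value

print(field_value(np.array([0.3, 0.7, 0.5])))
```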

Cross-view Transformers for real-time Map-view Semantic Segmentation

  • Authors: Brady Zhou, Philipp Krähenbühl
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2205.02833
  • Pdf link: https://arxiv.org/pdf/2205.02833
  • Abstract We present cross-view transformers, an efficient attention-based model for map-view semantic segmentation from multiple cameras. Our architecture implicitly learns a mapping from individual camera views into a canonical map-view representation using a camera-aware cross-view attention mechanism. Each camera uses positional embeddings that depend on its intrinsic and extrinsic calibration. These embeddings allow a transformer to learn the mapping across different views without ever explicitly modeling it geometrically. The architecture consists of a convolutional image encoder for each view and cross-view transformer layers to infer a map-view semantic segmentation. Our model is simple, easily parallelizable, and runs in real-time. The presented architecture performs at state-of-the-art on the nuScenes dataset, with 4x faster inference speeds. Code is available at https://github.com/bradyz/cross_view_transformers.
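
The camera-aware part of the positional embeddings boils down to standard projective geometry: each pixel's viewing ray, expressed in the shared map frame from the camera's calibration. A minimal sketch of that back-projection follows; the learned embedding and the transformer layers themselves are omitted.

```python
# Geometry behind camera-aware positional embeddings: back-project pixels
# through the intrinsics K and rotate into the common (map) frame via R.
import numpy as np

def pixel_rays_world(K, R, pixels):
    """pixels: (N, 2) image points -> (N, 3) unit rays in the map frame."""
    pts_h = np.c_[pixels, np.ones(len(pixels))]     # homogeneous pixel coords
    rays_cam = (np.linalg.inv(K) @ pts_h.T).T       # back-project to camera frame
    rays_world = (R.T @ rays_cam.T).T               # camera frame -> map frame
    return rays_world / np.linalg.norm(rays_world, axis=1, keepdims=True)

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])  # toy intrinsics
R = np.eye(3)                                                # toy extrinsics
print(pixel_rays_world(K, R, np.array([[320.0, 240.0], [0.0, 0.0]])))
```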

Keyword: localization

Uncertainty-Based Non-Parametric Active Peak Detection

  • Authors: Praneeth Narayanamurthy, Urbashi Mitra
  • Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2205.02376
  • Pdf link: https://arxiv.org/pdf/2205.02376
  • Abstract Active, non-parametric peak detection is considered. As a use case, active source localization is examined, and an uncertainty-based sampling scheme is designed to effectively localize the peak from a few energy measurements. It is shown that under very mild conditions, the source localization error with $m$ actively chosen energy measurements scales as $O(\log^2 m/m)$. Numerically, it is shown that in low-sample regimes the proposed method enjoys superior performance on several types of data, outperforming state-of-the-art passive source localization approaches, and can outperform greedy methods as well.
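
As a stand-in for the paper's estimator, the toy loop below illustrates uncertainty-based active sampling: fit a surrogate (here a scikit-learn Gaussian process, our choice, not the paper's method) to the energy measurements so far, always measure where the posterior standard deviation is largest, and report the argmax of the fitted mean as the source location.

```python
# Toy uncertainty-based active sampling for peak localization (illustrative
# stand-in; the paper's non-parametric estimator is not reproduced).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def energy(x):
    # unknown energy field with a peak at 0.62 (toy ground truth)
    return np.exp(-60 * (x - 0.62) ** 2)

grid = np.linspace(0, 1, 400)[:, None]
X, y = [[0.0], [1.0]], [energy(0.0), energy(1.0)]   # two seed measurements

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1))
for _ in range(15):                                 # m actively chosen samples
    gp.fit(np.array(X), np.array(y))
    mean, std = gp.predict(grid, return_std=True)
    x_next = grid[np.argmax(std)][0]                # most uncertain location
    X.append([x_next])
    y.append(energy(x_next))

mean, _ = gp.predict(grid, return_std=True)
print("estimated peak:", grid[np.argmax(mean)][0])  # close to 0.62
```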

ImPosIng: Implicit Pose Encoding for Efficient Camera Pose Estimation

  • Authors: Arthur Moreau, Thomas Gilles, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, Arnaud de La Fortelle
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2205.02638
  • Pdf link: https://arxiv.org/pdf/2205.02638
  • Abstract We propose a novel learning-based formulation for camera pose estimation that can perform relocalization accurately and in real time in city-scale environments. Camera pose estimation algorithms determine the position and orientation from which an image has been captured, using a set of geo-referenced images or a 3D scene representation. Our new localization paradigm, named Implicit Pose Encoding (ImPosing), embeds images and camera poses into a common latent representation with two separate neural networks, such that we can compute a similarity score for each image-pose pair. By evaluating candidates through the latent space in a hierarchical manner, the camera position and orientation are not directly regressed but incrementally refined. Compared to the representation used in structure-based relocalization methods, our implicit map is memory-bounded and can be properly explored to improve localization performance over learning-based regression approaches. In this paper, we describe how to effectively optimize our learned modules, how to combine them to achieve real-time localization, and demonstrate results on diverse large-scale scenarios that significantly outperform prior work in accuracy and computational efficiency.
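
The score-and-refine loop can be sketched with placeholder encoders. Below, the "pose encoder" is a random distance-preserving projection and the image embedding is fabricated to agree with it, so only the hierarchical search logic is meaningful; none of this is the paper's architecture.

```python
# Schematic sketch (not ImPosing's networks) of latent-similarity pose search:
# candidates are scored in a shared latent space and refined coarse-to-fine
# instead of regressing the pose directly.
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 3)))   # placeholder pose encoder

def encode_pose(poses):                             # (N, 3) -> (N, 64)
    return np.atleast_2d(poses) @ Q.T

def localize(z_img, center, radius, levels=5, grid=5):
    """Hierarchically rescore a (grid x grid) neighborhood, then zoom in."""
    best = np.asarray(center, dtype=float)
    for _ in range(levels):
        offs = np.linspace(-radius, radius, grid)
        cands = np.array([best + [dx, dy, 0.0] for dx in offs for dy in offs])
        scores = -np.linalg.norm(encode_pose(cands) - z_img, axis=1)
        best = cands[np.argmax(scores)]             # keep the best-scoring pose
        radius /= grid                              # shrink the search window
    return best

true_pose = np.array([12.3, -4.7, 0.0])             # (x, y, yaw)
z_img = encode_pose(true_pose)[0]                   # pretend image maps here
print(localize(z_img, center=[0.0, 0.0, 0.0], radius=50.0))  # ~[12.3 -4.7 0.]
```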

Unsupervised Mismatch Localization in Cross-Modal Sequential Data

  • Authors: Wei Wei, Huang Hengguan, Gu Xiangming, Wang Hao, Wang Ye
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2205.02670
  • Pdf link: https://arxiv.org/pdf/2205.02670
  • Abstract Content mismatch usually occurs when data from one modality is translated to another, e.g. language learners producing mispronunciations (errors in speech) when reading a sentence (the target text) aloud. However, most existing alignment algorithms assume that the content involved in the two modalities is perfectly matched, and thus have difficulty locating such mismatches between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, especially for speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, named the mismatch localization variational autoencoder (ML-VAE), that decomposes the generative process of the speech into hierarchically structured latent variables, indicating the relationship between the two modalities. Training such a model is very challenging due to the discrete latent variables with complex dependencies involved. We propose a novel and effective training procedure that estimates the hard assignments of the discrete latent variables over a specifically designed lattice and updates the parameters of the neural networks alternately. Our experimental results show that ML-VAE successfully locates the mismatch between text and speech, without the need for human annotations for model training.
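
The ML-VAE itself is too involved for a snippet, but the task it addresses, localizing where a spoken sequence deviates from the target text, has a classical baseline worth keeping in mind: an edit-distance alignment with backtracking. The sketch below is that baseline, not the authors' model.

```python
# Classical stand-in for mismatch localization: align two symbol sequences
# with edit distance and backtrack to mark substitutions/insertions/deletions.
def locate_mismatches(target, spoken):
    n, m = len(target), len(spoken)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if target[i - 1] == spoken[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # backtrack to recover where the two sequences disagree
    i, j, errs = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + \
                (0 if target[i - 1] == spoken[j - 1] else 1):
            if target[i - 1] != spoken[j - 1]:
                errs.append(("sub", i - 1, target[i - 1], spoken[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            errs.append(("del", i - 1, target[i - 1], None))
            i -= 1
        else:
            errs.append(("ins", i, None, spoken[j - 1]))
            j -= 1
    return list(reversed(errs))

print(locate_mismatches("sIt", "sEt"))   # substitution localized at the vowel
```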
