New submissions for Tue, 2 Aug 22

Keyword: SLAM

Visual-Inertial SLAM with Tightly-Coupled Dropout-Tolerant GPS Fusion

  • Authors: Simon Boche, Xingxing Zuo, Simon Schaefer, Stefan Leutenegger
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2208.00709
  • Pdf link: https://arxiv.org/pdf/2208.00709
  • Abstract Robotic applications are continuously striving towards higher levels of autonomy. To achieve that goal, a highly robust and accurate state estimation is indispensable. Combining visual and inertial sensor modalities has proven to yield accurate and locally consistent results in short-term applications. Unfortunately, visual-inertial state estimators suffer from the accumulation of drift for long-term trajectories. To eliminate this drift, global measurements can be fused into the state estimation pipeline. The most known and widely available source of global measurements is the Global Positioning System (GPS). In this paper, we propose a novel approach that fully combines stereo Visual-Inertial Simultaneous Localisation and Mapping (SLAM), including visual loop closures, with the fusion of global sensor modalities in a tightly-coupled and optimisation-based framework. Incorporating measurement uncertainties, we provide a robust criterion to solve the global reference frame initialisation problem. Furthermore, we propose a loop-closure-like optimisation scheme to compensate drift accumulated during outages in receiving GPS signals. Experimental validation on datasets and in a real-world experiment demonstrates the robustness of our approach to GPS dropouts as well as its capability to estimate highly accurate and globally consistent trajectories compared to existing state-of-the-art methods.
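
In the tightly-coupled formulation sketched in the abstract, GPS measurements enter the same nonlinear least-squares problem as the visual and inertial terms. Below is a minimal, illustrative Python sketch of what one such GPS position residual could look like; the variable names, lever-arm handling, and whitening convention are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def gps_position_residual(T_WB, r_BA, p_gps_W, cov_gps):
    """Whitened residual between the predicted antenna position and one GPS fix.

    T_WB:    4x4 homogeneous body-to-world transform (the pose being estimated).
    r_BA:    GNSS antenna lever arm expressed in the body frame (assumed known).
    p_gps_W: 3-vector GPS position measurement, already expressed in the world frame.
    cov_gps: 3x3 covariance of the GPS measurement.
    """
    # Predicted antenna position in the world frame.
    p_antenna_W = T_WB[:3, :3] @ r_BA + T_WB[:3, 3]
    error = p_antenna_W - p_gps_W
    # Whiten with the square root of the information matrix so this term can be
    # stacked with the visual and inertial residuals in one least-squares problem.
    info_sqrt = np.linalg.cholesky(np.linalg.inv(cov_gps))
    return info_sqrt.T @ error
```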

Keyword: odometry

Benchmarking Visual-Inertial Deep Multimodal Fusion for Relative Pose Regression and Odometry-aided Absolute Pose Regression

  • Authors: Felix Ott, Nisha Lakshmana Raichur, David Rügamer, Tobias Feigl, Heiko Neumann, Bernd Bischl, Christopher Mutschler
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00919
  • Pdf link: https://arxiv.org/pdf/2208.00919
  • Abstract Visual-inertial localization is a key problem in computer vision and robotics applications such as virtual reality, self-driving cars, and aerial vehicles. The goal is to estimate an accurate pose of an object when either the environment or the dynamics are known. Recent methods directly regress the pose using convolutional and spatio-temporal networks. Absolute pose regression (APR) techniques predict the absolute camera pose from an image input in a known scene. Odometry methods perform relative pose regression (RPR) that predicts the relative pose from a known object dynamic (visual or inertial inputs). The localization task can be improved by retrieving information of both data sources for a cross-modal setup, which is a challenging problem due to contradictory tasks. In this work, we conduct a benchmark to evaluate deep multimodal fusion based on PGO and attention networks. Auxiliary and Bayesian learning are integrated for the APR task. We show accuracy improvements for the RPR-aided APR task and for the RPR-RPR task for aerial vehicles and hand-held devices. We conduct experiments on the EuRoC MAV and PennCOSYVIO datasets, and record a novel industry dataset.

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

PolarMix: A General Data Augmentation Technique for LiDAR Point Clouds

  • Authors: Aoran Xiao, Jiaxing Huang, Dayan Guan, Kaiwen Cui, Shijian Lu, Ling Shao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.00223
  • Pdf link: https://arxiv.org/pdf/2208.00223
  • Abstract LiDAR point clouds, which are usually scanned by rotating LiDAR sensors continuously, capture precise geometry of the surrounding environment and are crucial to many autonomous detection and navigation tasks. Though many 3D deep architectures have been developed, efficient collection and annotation of large amounts of point clouds remain a major challenge in the analysis and understanding of point cloud data. This paper presents PolarMix, a point cloud augmentation technique that is simple and generic but can mitigate the data constraint effectively across different perception tasks and scenarios. PolarMix enriches point cloud distributions and preserves point cloud fidelity via two cross-scan augmentation strategies that cut, edit, and mix point clouds along the scanning direction. The first is scene-level swapping, which exchanges point cloud sectors of two LiDAR scans that are cut along the azimuth axis. The second is instance-level rotation and paste, which crops point instances from one LiDAR scan, rotates them by multiple angles (to create multiple copies), and pastes the rotated point instances into other scans. Extensive experiments show that PolarMix achieves superior performance consistently across different perception tasks and scenarios. In addition, it works as a plug-and-play module for various 3D deep architectures and also performs well for unsupervised domain adaptation.
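
As a concrete illustration of the scene-level swapping strategy described above, the sketch below exchanges an azimuth sector between two labelled LiDAR scans. The (x, y, z, intensity) point layout, the per-point label arrays, and the sector handling are assumptions; the paper's implementation (and its instance-level rotate-and-paste branch) may differ in detail.

```python
import numpy as np

def scene_level_swap(scan_a, labels_a, scan_b, labels_b, start_angle, end_angle):
    """Exchange the azimuth sector [start_angle, end_angle) between two LiDAR scans.

    scan_a, scan_b: (N, 4) arrays of (x, y, z, intensity); labels_*: (N,) arrays.
    Angles are in radians within [-pi, pi).
    """
    def in_sector(scan):
        azimuth = np.arctan2(scan[:, 1], scan[:, 0])   # angle around the sensor
        return (azimuth >= start_angle) & (azimuth < end_angle)

    mask_a, mask_b = in_sector(scan_a), in_sector(scan_b)
    # Scan A keeps its points outside the sector and receives B's points inside it.
    new_a = np.concatenate([scan_a[~mask_a], scan_b[mask_b]], axis=0)
    new_la = np.concatenate([labels_a[~mask_a], labels_b[mask_b]], axis=0)
    # Scan B gets the symmetric exchange.
    new_b = np.concatenate([scan_b[~mask_b], scan_a[mask_a]], axis=0)
    new_lb = np.concatenate([labels_b[~mask_b], labels_a[mask_a]], axis=0)
    return (new_a, new_la), (new_b, new_lb)
```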

Keyword: loop detection

There is no result

Keyword: nerf

Distilled Low Rank Neural Radiance Field with Quantization for Light Field Compression

  • Authors: Jinglei Shi, Christine Guillemot
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2208.00164
  • Pdf link: https://arxiv.org/pdf/2208.00164
  • Abstract In this paper, we propose a novel light field compression method based on a Quantized Distilled Low Rank Neural Radiance Field (QDLR-NeRF) representation. While existing compression methods encode the set of light field sub-aperture images, our proposed method instead learns an implicit scene representation in the form of a Neural Radiance Field (NeRF), which also enables view synthesis. For reducing its size, the model is first learned under a Low Rank (LR) constraint using a Tensor Train (TT) decomposition in an Alternating Direction Method of Multipliers (ADMM) optimization framework. To further reduce the model size, the components of the tensor train decomposition need to be quantized. However, performing the optimization of the NeRF model by simultaneously taking the low rank constraint and the rate-constrained weight quantization into consideration is challenging. To deal with this difficulty, we introduce a network distillation operation that separates the low rank approximation and the weight quantization in the network training. The information from the initial LR constrained NeRF (LR-NeRF) is distilled to a model of a much smaller dimension (DLR-NeRF) based on the TT decomposition of the LR-NeRF. An optimized global codebook is then learned to quantize all TT components, producing the final QDLR-NeRF. Experimental results show that our proposed method yields better compression efficiency compared with state-of-the-art methods, and it additionally has the advantage of allowing the synthesis of any light field view with a high quality.
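
The final quantization step above learns one global codebook shared by all tensor-train components. The sketch below shows one way such a shared codebook could be built and applied, using k-means over all weights; treating k-means (via scikit-learn) as the codebook learner and the codebook size are assumptions, not the paper's optimisation procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_with_global_codebook(tt_cores, codebook_size=256):
    """Quantize every tensor-train core against a single shared codebook."""
    # Pool all weights from all cores into one 1-D sample set.
    flat = np.concatenate([core.ravel() for core in tt_cores])[:, None]
    km = KMeans(n_clusters=codebook_size, n_init=4).fit(flat)
    codebook = km.cluster_centers_.ravel()
    # Replace every weight by its nearest codebook entry.
    quantized = [codebook[km.predict(core.ravel()[:, None])].reshape(core.shape)
                 for core in tt_cores]
    return quantized, codebook

# Example: three random cores quantized against a 256-entry codebook.
# cores, cb = quantize_with_global_codebook([np.random.randn(16, 8, 16) for _ in range(3)])
```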

MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures

  • Authors: Zhiqin Chen, Thomas Funkhouser, Peter Hedman, Andrea Tagliasacchi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.00277
  • Pdf link: https://arxiv.org/pdf/2208.00277
  • Abstract Neural Radiance Fields (NeRFs) have demonstrated amazing ability to synthesize images of 3D scenes from novel views. However, they rely upon specialized volumetric rendering algorithms based on ray marching that are mismatched to the capabilities of widely deployed graphics hardware. This paper introduces a new NeRF representation based on textured polygons that can synthesize novel images efficiently with standard rendering pipelines. The NeRF is represented as a set of polygons with textures representing binary opacities and feature vectors. Traditional rendering of the polygons with a z-buffer yields an image with features at every pixel, which are interpreted by a small, view-dependent MLP running in a fragment shader to produce a final pixel color. This approach enables NeRFs to be rendered with the traditional polygon rasterization pipeline, which provides massive pixel-level parallelism, achieving interactive frame rates on a wide range of compute platforms, including mobile phones.
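
To make the two-stage rendering idea concrete: the rasterizer's z-buffered pass writes a small feature vector per pixel, and a tiny view-dependent MLP (run inside a fragment shader in the paper) turns that feature plus the viewing direction into a colour. The PyTorch sketch below mimics only this second stage; the 8-dimensional features and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ViewDependentShader(nn.Module):
    """Tiny MLP that converts per-pixel features + view direction into RGB."""

    def __init__(self, feat_dim=8, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, feature_buffer, view_dirs):
        # feature_buffer: (H, W, feat_dim) written by the polygon rasterizer
        # view_dirs:      (H, W, 3) unit viewing directions per pixel
        x = torch.cat([feature_buffer, view_dirs], dim=-1)
        return self.mlp(x)                         # (H, W, 3) final pixel colours

# rgb = ViewDependentShader()(torch.rand(480, 640, 8), torch.rand(480, 640, 3))
```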

DoF-NeRF: Depth-of-Field Meets Neural Radiance Fields

  • Authors: Zijin Wu, Xingyi Li, Juewen Peng, Hao Lu, Zhiguo Cao, Weicai Zhong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00945
  • Pdf link: https://arxiv.org/pdf/2208.00945
  • Abstract Neural Radiance Field (NeRF) and its variants have exhibited great success on representing 3D scenes and synthesizing photo-realistic novel views. However, they are generally based on the pinhole camera model and assume all-in-focus inputs. This limits their applicability as images captured from the real world often have finite depth-of-field (DoF). To mitigate this issue, we introduce DoF-NeRF, a novel neural rendering approach that can deal with shallow DoF inputs and can simulate DoF effect. In particular, it extends NeRF to simulate the aperture of lens following the principles of geometric optics. Such a physical guarantee allows DoF-NeRF to operate views with different focus configurations. Benefiting from explicit aperture modeling, DoF-NeRF also enables direct manipulation of DoF effect by adjusting virtual aperture and focus parameters. It is plug-and-play and can be inserted into NeRF-based frameworks. Experiments on synthetic and real-world datasets show that, DoF-NeRF not only performs comparably with NeRF in the all-in-focus setting, but also can synthesize all-in-focus novel views conditioned on shallow DoF inputs. An interesting application of DoF-NeRF to DoF rendering is also demonstrated. The source code will be made available at https://github.com/zijinwuzijin/DoF-NeRF.
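
The aperture modelling follows geometric optics: rays are sampled across a finite lens aperture and constrained to converge on the plane of focus, and the rendered colours are averaged. Below is a generic thin-lens ray-sampling sketch of that idea; it is not DoF-NeRF's exact formulation, and the disk sampling and basis construction (which assumes the ray is not parallel to the z-axis) are assumptions.

```python
import numpy as np

def thin_lens_rays(ray_o, ray_d, aperture_radius, focus_distance, n_samples, rng):
    """Sample rays through a thin-lens aperture that converge on the focal plane."""
    ray_o, ray_d = np.asarray(ray_o, float), np.asarray(ray_d, float)
    focus_point = ray_o + focus_distance * ray_d          # point on the plane of focus
    # Uniform samples on the aperture disk.
    r = aperture_radius * np.sqrt(rng.uniform(size=n_samples))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)
    # Orthonormal basis (u, v) spanning the lens plane (assumes ray_d is not +/- z).
    u = np.cross(ray_d, [0.0, 0.0, 1.0]); u /= np.linalg.norm(u)
    v = np.cross(ray_d, u)
    offsets = r[:, None] * (np.cos(theta)[:, None] * u + np.sin(theta)[:, None] * v)
    origins = ray_o + offsets                              # points on the lens
    dirs = focus_point - origins
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    return origins, dirs   # render each ray and average the colours for the DoF effect

# rng = np.random.default_rng(0)
# o, d = thin_lens_rays([0, 0, 0], [1, 0, 0], aperture_radius=0.02,
#                       focus_distance=2.0, n_samples=32, rng=rng)
```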

Keyword: mapping

On infrastructure for facilitation of inner source in small development teams

  • Authors: Johan Linåker, Maria Krantz, Martin Höst
  • Subjects: Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2208.00037
  • Pdf link: https://arxiv.org/pdf/2208.00037
  • Abstract The phenomenon of adopting open source software development practices in a corporate environment is known by many names, one being inner source. The objective of this study is to investigate how an organization consisting of small development teams can benefit from adopting inner source and assess the level of applicability. The research has been conducted as a case study at a software development company. Data collection was carried out through interviews and a series of focus group meetings, and then analyzed by mapping it to an available framework. The analysis shows that the organization possesses potential, and also identifies a number of challenges and benefits of special importance to the case company. To address these challenges, the case study synthesized the organizational and infrastructural needs of the organization in a requirements specification describing a technical infrastructure, also known as a software forge, with an adapted organizational context and work process.

Neural Correspondence Field for Object Pose Estimation

  • Authors: Lin Huang, Tomas Hodan, Lingni Ma, Linguang Zhang, Luan Tran, Christopher Twigg, Po-Chen Wu, Junsong Yuan, Cem Keskin, Robert Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2208.00113
  • Pdf link: https://arxiv.org/pdf/2208.00113
  • Abstract We propose a method for estimating the 6DoF pose of a rigid object with an available 3D model from a single RGB image. Unlike classical correspondence-based methods which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum. The move from pixels to 3D points, which is inspired by recent PIFu-style methods for 3D reconstruction, enables reasoning about the whole object, including its (self-)occluded parts. For a 3D query point associated with a pixel-aligned image feature, we train a fully-connected neural network to predict: (i) the corresponding 3D object coordinates, and (ii) the signed distance to the object surface, with the first defined only for query points in the surface vicinity. We call the mapping realized by this network the Neural Correspondence Field. The object pose is then robustly estimated from the predicted 3D-3D correspondences by the Kabsch-RANSAC algorithm. The proposed method achieves state-of-the-art results on three BOP datasets and is shown to be superior especially in challenging cases with occlusion. The project website is at: linhuang17.github.io/NCF.
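
The last step above, "Kabsch-RANSAC", fits a rigid transform to the predicted 3D-3D correspondences inside a RANSAC loop. A standard Kabsch (orthogonal Procrustes) solver, the piece that would sit inside that loop, is sketched below; the RANSAC wrapper and any correspondence weighting used in the paper are omitted.

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rotation R and translation t with R @ p + t ~= q for rows of P, Q."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)    # both (N, 3)
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))               # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t
```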

Learning Feature Decomposition for Domain Adaptive Monocular Depth Estimation

  • Authors: Shao-Yuan Lo, Wei Wang, Jim Thomas, Jingjing Zheng, Vishal M. Patel, Cheng-Hao Kuo
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00160
  • Pdf link: https://arxiv.org/pdf/2208.00160
  • Abstract Monocular depth estimation (MDE) has attracted intense study due to its low cost and critical functions for robotic tasks such as localization, mapping and obstacle detection. Supervised approaches have led to great success with the advance of deep learning, but they rely on large quantities of ground-truth depth annotations that are expensive to acquire. Unsupervised domain adaptation (UDA) transfers knowledge from labeled source data to unlabeled target data, so as to relax the constraint of supervised learning. However, existing UDA approaches may not completely align the domain gap across different datasets because of the domain shift problem. We believe better domain alignment can be achieved via well-designed feature decomposition. In this paper, we propose a novel UDA method for MDE, referred to as Learning Feature Decomposition for Adaptation (LFDA), which learns to decompose the feature space into content and style components. LFDA only attempts to align the content component since it has a smaller domain gap. Meanwhile, it excludes the style component which is specific to the source domain from training the primary task. Furthermore, LFDA uses separate feature distribution estimations to further bridge the domain gap. Extensive experiments on three domain-adaptive MDE scenarios show that the proposed method achieves superior accuracy and lower computational cost compared to the state-of-the-art approaches.

Efficient Compilation and Mapping of Fixed Function Combinational Logic onto Digital Signal Processors Targeting Neural Network Inference and Utilizing High-level Synthesis

  • Authors: Soheil Nazar Shahsavani, Arash Fayyazi, Mahdi Nazemi, Massoud Pedram
  • Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.00302
  • Pdf link: https://arxiv.org/pdf/2208.00302
  • Abstract Recent efforts for improving the performance of neural network (NN) accelerators that meet today's application requirements have given rise to a new trend of logic-based NN inference relying on fixed function combinational logic. Mapping such large Boolean functions with many input variables and product terms to digital signal processors (DSPs) on field-programmable gate arrays (FPGAs) needs a novel framework considering the structure and the reconfigurability of DSP blocks during this process. The proposed methodology in this paper maps the fixed function combinational logic blocks to a set of Boolean functions where Boolean operations corresponding to each function are mapped to DSP devices rather than look-up tables (LUTs) on the FPGAs to take advantage of the high performance, low latency, and parallelism of DSP blocks. This paper also presents an innovative design and optimization methodology for compilation and mapping of NNs, utilizing fixed function combinational logic to DSPs on FPGAs employing high-level synthesis flow. Our experimental evaluations across several datasets and selected NNs demonstrate the comparable performance of our framework in terms of the inference latency and output accuracy compared to prior art FPGA-based NN accelerators employing DSPs.

GitHub Marketplace for Practitioners and Researchers to Date: A Systematic Analysis of the Knowledge Mobilization Gap in Open Source Software Automation

  • Authors: Sk Golam Saroar, Waseefa Ahmed, Maleknaz Nayebi
  • Subjects: Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2208.00332
  • Pdf link: https://arxiv.org/pdf/2208.00332
  • Abstract Marketplaces for distributing software products and services have been getting increasing popularity. GitHub, which is most known for its version control functionality through Git, launched its own marketplace in 2017. GitHub Marketplace hosts third party apps and actions to automate workflows in software teams. Currently, this marketplace hosts 440 Apps and 7,878 Actions across 32 different categories. Overall, 419 Third party developers released their apps on this platform which 111 distinct customers adopted. The popularity and accessibility of GitHub projects have made this platform and the projects hosted on it one of the most frequent subjects for experimentation in the software engineering research. A simple Google Scholar search shows that 24,100 Research papers have discussed GitHub within the Software Engineering field since 2017, but none have looked into the marketplace. The GitHub Marketplace provides a unique source of information on the tools used by the practitioners in the Open Source Software (OSS) ecosystem for automating their project's workflow. In this study, we (i) mine and provide a descriptive overview of the GitHub Marketplace, (ii) perform a systematic mapping of research studies in automation for open source software, and (iii) compare the state of the art with the state of the practice on the automation tools. We conclude the paper by discussing the potential of GitHub Marketplace for knowledge mobilization and collaboration within the field. This is the first study on the GitHub Marketplace in the field.

An Experimental Study on Learning Correlated Equilibrium in Routing Games

  • Authors: Yixian Zhu, Ketan Savla
  • Subjects: Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2208.00391
  • Pdf link: https://arxiv.org/pdf/2208.00391
  • Abstract We study route choice in a repeated routing game where an uncertain state of nature determines link latency functions, and agents receive private route recommendation. The state is sampled in an i.i.d. manner in every round from a publicly known distribution, and the recommendations are generated by a randomization policy whose mapping from the state is known publicly. In a one-shot setting, the agents are said to obey recommendation if it gives the smallest travel time in a posteriori expectation. A plausible extension to repeated setting is that the likelihood of following recommendation in a round is related to regret from previous rounds. If the regret is of satisficing type with respect to a default choice and is averaged over past rounds and over all agents, then the asymptotic outcome under an obedient recommendation policy coincides with the one-shot outcome. We report findings from an experiment with one participant at a time engaged in repeated route choice decision on computer. In every round, the participant is shown travel time distribution for each route, a route recommendation generated by an obedient policy, and a rating suggestive of average experience of previous participants with the quality of recommendation. Upon entering route choice, the actual travel times are revealed. The participant evaluates the quality of recommendation by submitting a review. This is combined with historical reviews to update rating for the next round. Data analysis from 33 participants each with 100 rounds suggests moderate negative correlation between the display rating and the average regret, and a strong positive correlation between the rating and the likelihood of following recommendation. Overall, under obedient recommendation policy, the rating converges close to its maximum value by the end of the experiments in conjunction with very high frequency of following recommendations.

Intelligent decision-making method of TBM operating parameters based on multiple constraints and objective optimization

  • Authors: Bin Liu, Jiwen Wang, Ruirui Wang, Yaxu Wang, Guangzu Zhao
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.00404
  • Pdf link: https://arxiv.org/pdf/2208.00404
  • Abstract The decision-making of TBM operating parameters has an important guiding significance for TBM safe and efficient construction, and it has been one of the research hotspots in the field of TBM tunneling. For this purpose, this paper introduces rock-breaking rules into a machine learning method, and a rock-machine mapping dual-driven by physical rules and data mining is established with high accuracy. This dual-driven mapping is subsequently used as the objective function and constraints to build a decision-making method for TBM operating parameters. By searching the revolution per minute and penetration corresponding to the extremum of the objective function subject to the constraints, the optimal operating parameters can be obtained. This method is verified in the field at the Second Water Source Channel of Hangzhou, China, resulting in the average penetration rate increasing by 11.3% and the total cost decreasing by 10.0%, which proves the practicability and effectiveness of the developed decision-making model.

Accurate Polygonal Mapping of Buildings in Satellite Imagery

  • Authors: Bowen Xu, Jiakun Xu, Nan Xue, Gui-Song Xia
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00609
  • Pdf link: https://arxiv.org/pdf/2208.00609
  • Abstract This paper studies the problem of polygonal mapping of buildings by tackling the issue of mask reversibility that leads to a notable performance gap between the predicted masks and polygons from the learning-based methods. We addressed such an issue by exploiting the hierarchical supervision (of bottom-level vertices, mid-level line segments and the high-level regional masks) and proposed a novel interaction mechanism of feature embedding sourced from different levels of supervision signals to obtain reversible building masks for polygonal mapping of buildings. As a result, we show that the learned reversible building masks take all the merits of the advances of deep convolutional neural networks for high-performing polygonal mapping of buildings. In the experiments, we evaluated our method on the two public benchmarks of AICrowd and Inria. On the AICrowd dataset, our proposed method obtains unanimous improvements on the metrics of AP, APboundary and PoLiS. For the Inria dataset, our proposed method also obtains very competitive results on the metrics of IoU and Accuracy. The models and source code are available at https://github.com/SarahwXU.

Iterative shaping of optical potentials for one-dimensional Bose-Einstein condensates

  • Authors: Andreas Deutschmann-Olek, Mohammadamin Tajik, Martino Calzavara, Jörg Schmiedmayer, Tommaso Calarco, Andreas Kugi
  • Subjects: Systems and Control (eess.SY); Quantum Gases (cond-mat.quant-gas); Optics (physics.optics); Quantum Physics (quant-ph)
  • Arxiv link: https://arxiv.org/abs/2208.00706
  • Pdf link: https://arxiv.org/pdf/2208.00706
  • Abstract The ability to manipulate clouds of ultra-cold atoms is crucial for modern experiments on quantum many-body systems and quantum thermodynamics as well as future metrological applications of Bose-Einstein condensates. While optical manipulation offers almost arbitrary flexibility, the precise control of the resulting dipole potentials and the mitigation of unwanted disturbances is quite involved, and only heuristic algorithms with rather slow convergence rates have been available up to now. This paper thus suggests the application of iterative learning control (ILC) methods to generate fine-tuned effective potentials in the presence of uncertainties and external disturbances. Therefore, the given problem is reformulated to obtain a one-dimensional tracking problem by using a quasicontinuous input mapping which can be treated by established ILC methods. Finally, the performance of the proposed concept is illustrated in a simulation scenario.

Visual-Inertial SLAM with Tightly-Coupled Dropout-Tolerant GPS Fusion

  • Authors: Simon Boche, Xingxing Zuo, Simon Schaefer, Stefan Leutenegger
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2208.00709
  • Pdf link: https://arxiv.org/pdf/2208.00709
  • Abstract Robotic applications are continuously striving towards higher levels of autonomy. To achieve that goal, a highly robust and accurate state estimation is indispensable. Combining visual and inertial sensor modalities has proven to yield accurate and locally consistent results in short-term applications. Unfortunately, visual-inertial state estimators suffer from the accumulation of drift for long-term trajectories. To eliminate this drift, global measurements can be fused into the state estimation pipeline. The most known and widely available source of global measurements is the Global Positioning System (GPS). In this paper, we propose a novel approach that fully combines stereo Visual-Inertial Simultaneous Localisation and Mapping (SLAM), including visual loop closures, with the fusion of global sensor modalities in a tightly-coupled and optimisation-based framework. Incorporating measurement uncertainties, we provide a robust criterion to solve the global reference frame initialisation problem. Furthermore, we propose a loop-closure-like optimisation scheme to compensate drift accumulated during outages in receiving GPS signals. Experimental validation on datasets and in a real-world experiment demonstrates the robustness of our approach to GPS dropouts as well as its capability to estimate highly accurate and globally consistent trajectories compared to existing state-of-the-art methods.

Off-Policy Correction for Actor-Critic Algorithms in Deep Reinforcement Learning

  • Authors: Baturay Saglam, Dogan C. Cicek, Furkan B. Mutlu, Suleyman S. Kozat
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.00755
  • Pdf link: https://arxiv.org/pdf/2208.00755
  • Abstract Compared to on-policy policy gradient techniques, off-policy model-free deep reinforcement learning (RL) approaches that use previously gathered data can improve sampling efficiency. However, off-policy learning becomes challenging when the discrepancy between the distributions of the policy of interest and the policies that collected the data increases. Although the well-studied importance sampling and off-policy policy gradient techniques were proposed to compensate for this discrepancy, they usually require a collection of long trajectories that increases the computational complexity and induce additional problems such as vanishing or exploding gradients. Moreover, their generalization to continuous action domains is strictly limited as they require action probabilities, which is unsuitable for deterministic policies. To overcome these limitations, we introduce an alternative off-policy correction algorithm for continuous action spaces, Actor-Critic Off-Policy Correction (AC-Off-POC), to mitigate the potential drawbacks introduced by the previously collected data. Through a novel discrepancy measure computed by the agent's most recent action decisions on the states of the randomly sampled batch of transitions, the approach does not require actual or estimated action probabilities for any policy and offers an adequate one-step importance sampling. Theoretical results show that the introduced approach can achieve a contraction mapping with a fixed unique point, which allows a "safe" off-policy learning. Our empirical results suggest that AC-Off-POC consistently improves the state-of-the-art and attains higher returns in fewer steps than the competing methods by efficiently scheduling the learning rate in Q-learning and policy optimization.

Learning to Navigate using Visual Sensor Networks

  • Authors: Jan Blumenkamp, Qingbiao Li, Binyu Wang, Zhe Liu, Amanda Prorok
  • Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2208.00759
  • Pdf link: https://arxiv.org/pdf/2208.00759
  • Abstract We consider the problem of navigating a mobile robot towards a target in an unknown environment that is endowed with visual sensors, where neither the robot nor the sensors have access to global positioning information and only use first-person-view images. While prior work in sensor network based navigation uses explicit mapping and planning techniques, and is often aided by external positioning systems, we propose a vision-only based learning approach that leverages a Graph Neural Network (GNN) to encode and communicate relevant viewpoint information to the mobile robot. During navigation, the robot is guided by a model that we train through imitation learning to approximate optimal motion primitives, thereby predicting the effective cost-to-go (to the target). In our experiments, we first demonstrate generalizability to previously unseen environments with various sensor layouts. Simulation results show that by utilizing communication among the sensors and robot, we can achieve an 18.1% improvement in success rate while decreasing path detour mean by 29.3% and variability by 48.4%. This is done without requiring a global map, positioning data, or pre-calibration of the sensor network. Second, we perform a zero-shot transfer of our model from simulation to the real world. To this end, we train a 'translator' model that translates between latent encodings of real and simulated images so that the navigation policy (which is trained entirely in simulation) can be used directly on the real robot, without additional fine-tuning. Physical experiments demonstrate our effectiveness in various cluttered environments.

Underwater autonomous mapping and characterization of marine debris in urban water bodies

  • Authors: Trygve Olav Fossum, Øystein Sture, Petter Norgren-Aamot, Ingrid Myrnes Hansen, Bjørn Christian Kvisvik, Anne Christine Knag
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2208.00802
  • Pdf link: https://arxiv.org/pdf/2208.00802
  • Abstract Marine debris originating from human activity has been accumulating in underwater environments such as oceans, lakes, and rivers for decades. The extent, type, and amount of waste are hard to assess as the exact mechanisms for spread are not understood, yielding unknown consequences for the marine environment and human health. Methods for detecting and mapping marine debris are therefore vital in order to gain insight into pollution dynamics, which in turn can be used to effectively plan and execute physical removal. Using an autonomous underwater vehicle (AUV), equipped with an underwater hyperspectral imager (UHI) and stereo-camera, marine debris was autonomously detected, mapped and quantified in the sheltered bay Store Lungegaardsvann in Bergen, Norway.

Keyword: localization

Learning Feature Decomposition for Domain Adaptive Monocular Depth Estimation

  • Authors: Shao-Yuan Lo, Wei Wang, Jim Thomas, Jingjing Zheng, Vishal M. Patel, Cheng-Hao Kuo
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00160
  • Pdf link: https://arxiv.org/pdf/2208.00160
  • Abstract Monocular depth estimation (MDE) has attracted intense study due to its low cost and critical functions for robotic tasks such as localization, mapping and obstacle detection. Supervised approaches have led to great success with the advance of deep learning, but they rely on large quantities of ground-truth depth annotations that are expensive to acquire. Unsupervised domain adaptation (UDA) transfers knowledge from labeled source data to unlabeled target data, so as to relax the constraint of supervised learning. However, existing UDA approaches may not completely align the domain gap across different datasets because of the domain shift problem. We believe better domain alignment can be achieved via well-designed feature decomposition. In this paper, we propose a novel UDA method for MDE, referred to as Learning Feature Decomposition for Adaptation (LFDA), which learns to decompose the feature space into content and style components. LFDA only attempts to align the content component since it has a smaller domain gap. Meanwhile, it excludes the style component which is specific to the source domain from training the primary task. Furthermore, LFDA uses separate feature distribution estimations to further bridge the domain gap. Extensive experiments on three domain-adaptive MDE scenarios show that the proposed method achieves superior accuracy and lower computational cost compared to the state-of-the-art approaches.

One-Shot Medical Landmark Localization by Edge-Guided Transform and Noisy Landmark Refinement

  • Authors: Zihao Yin, Ping Gong, Chunyu Wang, Yizhou Yu, Yizhou Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00453
  • Pdf link: https://arxiv.org/pdf/2208.00453
  • Abstract As an important upstream task for many medical applications, supervised landmark localization still requires non-negligible annotation costs to achieve desirable performance. Besides, due to cumbersome collection procedures, the limited size of medical landmark datasets impacts the effectiveness of large-scale self-supervised pre-training methods. To address these challenges, we propose a two-stage framework for one-shot medical landmark localization, which first infers landmarks by unsupervised registration from the labeled exemplar to unlabeled targets, and then utilizes these noisy pseudo labels to train robust detectors. To handle the significant structure variations, we learn an end-to-end cascade of global alignment and local deformations, under the guidance of novel loss functions which incorporate edge information. In stage II, we explore self-consistency for selecting reliable pseudo labels and cross-consistency for semi-supervised learning. Our method achieves state-of-the-art performances on public datasets of different body parts, which demonstrates its general applicability.

Modeling Human Response to Robot Errors for Timely Error Detection

  • Authors: Maia Stiber, Russell Taylor, Chien-Ming Huang
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2208.00565
  • Pdf link: https://arxiv.org/pdf/2208.00565
  • Abstract In human-robot collaboration, robot errors are inevitable -- damaging user trust, willingness to work together, and task performance. Prior work has shown that people naturally respond to robot errors socially and that in social interactions it is possible to use human responses to detect errors. However, there is little exploration in the domain of non-social, physical human-robot collaboration such as assembly and tool retrieval. In this work, we investigate how people's organic, social responses to robot errors may be used to enable timely automatic detection of errors in physical human-robot interactions. We conducted a data collection study to obtain facial responses to train a real-time detection algorithm and a case study to explore the generalizability of our method with different task settings and errors. Our results show that natural social responses are effective signals for timely detection and localization of robot errors even in non-social contexts and that our method is robust across a variety of task contexts, robot errors, and user responses. This work contributes to robust error detection without detailed task specifications.

Similarity-based web element localization for robust test automation

  • Authors: Michel Nass, Emil Alégroth, Robert Feldt, Maurizio Leotta, Filippo Ricca
  • Subjects: Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2208.00677
  • Pdf link: https://arxiv.org/pdf/2208.00677
  • Abstract Non-robust (fragile) test execution is a commonly reported challenge in GUI-based test automation, despite much research and several proposed solutions. A test script needs to be resilient to (minor) changes in the tested application but, at the same time, fail when detecting potential issues that require investigation. Test script fragility is a multi-faceted problem, but one crucial challenge is reliably identifying and locating the correct target web elements when the website evolves between releases or otherwise fails and reports an issue. This paper proposes and evaluates a novel approach called similarity-based web element localization (Similo), which leverages information from multiple web element locator parameters to identify a target element using a weighted similarity score. The experimental study compares Similo to a baseline approach for web element localization. To get an extensive empirical basis, we target 40 of the most popular websites on the Internet in our evaluation. Robustness is considered by counting the number of web elements found in a recent website version compared to how many of these existed in an older version. Results of the experiment show that Similo outperforms the baseline representing the current state-of-the-art; it failed to locate the correct target web element in 72 out of 598 considered cases compared to 146 failed cases for the baseline approach. This study presents evidence that quantifying the similarity between multiple attributes of web elements when trying to locate them, as in our proposed Similo approach, is beneficial. With acceptable efficiency, Similo gives significantly higher effectiveness (i.e., robustness) than the baseline web element localization approach.
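
The core idea, a weighted similarity score over several locator attributes of each candidate web element, can be sketched as follows. The attribute set, the weights, and the string-similarity measure are illustrative assumptions, not the configuration evaluated in the paper.

```python
from difflib import SequenceMatcher

# Hypothetical attribute weights; Similo uses its own tuned set of locator parameters.
WEIGHTS = {"tag": 1.0, "id": 1.5, "name": 1.0, "text": 1.0, "xpath": 0.5}

def attr_similarity(a, b):
    """Similarity of two attribute values in [0, 1]."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, str(a or ""), str(b or "")).ratio()

def similarity_score(target, candidate):
    """Weighted sum of per-attribute similarities between two elements (dicts)."""
    return sum(w * attr_similarity(target.get(k), candidate.get(k))
               for k, w in WEIGHTS.items())

def locate(target, candidates):
    """Pick the candidate on the evolved page most similar to the recorded target."""
    return max(candidates, key=lambda c: similarity_score(target, c))
```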

Design Guidelines for Apache Kafka Driven Data Management and Distribution in Smart Cities

  • Authors: Theofanis P. Raptis, Claudio Cicconetti, Manolis Falelakis, Tassos Kanellos, Tomás Pariente Lobo
  • Subjects: Networking and Internet Architecture (cs.NI); Emerging Technologies (cs.ET)
  • Arxiv link: https://arxiv.org/abs/2208.00786
  • Pdf link: https://arxiv.org/pdf/2208.00786
  • Abstract Smart city management is going through a remarkable transition, in terms of quality and diversity of services provided to the end-users. The stakeholders that deliver pervasive applications are now able to address fundamental challenges in the big data value chain, from data acquisition, data analysis and processing, data storage and curation, and data visualisation in real scenarios. Industry 4.0 is pushing this trend forward, demanding for servitization of products and data, also for the smart cities sector where humans, sensors and devices are operating in strict collaboration. The data produced by the ubiquitous devices must be processed quickly to allow the implementation of reactive services such as situational awareness, video surveillance and geo-localization, while always ensuring the safety and privacy of involved citizens. This paper proposes a modular architecture to (i) leverage innovative technologies for data acquisition, management and distribution (such as Apache Kafka and Apache NiFi), (ii) develop a multi-layer engineering solution for revealing valuable and hidden societal knowledge in smart cities environment, and (iii) tackle the main issues in tasks involving complex data flows and provide general guidelines to solve them. We derived some guidelines from an experimental setting performed together with leading industrial technical departments to accomplish an efficient system for monitoring and servitization of smart city assets, with a scalable platform that confirms its usefulness in numerous smart city use cases with different needs.

DSLA: Dynamic smooth label assignment for efficient anchor-free object detection

  • Authors: Hu Su, Yonghao He, Jiabin Zhang, Wei Zou, Bin Fan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00817
  • Pdf link: https://arxiv.org/pdf/2208.00817
  • Abstract Anchor-free detectors basically formulate object detection as dense classification and regression. For popular anchor-free detectors, it is common to introduce an individual prediction branch to estimate the quality of localization. The following inconsistencies are observed when we delve into the practices of classification and quality estimation. Firstly, for some adjacent samples which are assigned completely different labels, the trained model would produce similar classification scores. This violates the training objective and leads to performance degradation. Secondly, it is found that detected bounding boxes with higher confidences contrarily have smaller overlaps with the corresponding ground-truth. Accurately localized bounding boxes would be suppressed by less accurate ones in the Non-Maximum Suppression (NMS) procedure. To address the inconsistency problems, the Dynamic Smooth Label Assignment (DSLA) method is proposed. Based on the concept of centerness originally developed in FCOS, a smooth assignment strategy is proposed. The label is smoothed to a continuous value in [0, 1] to make a steady transition between positive and negative samples. Intersection-over-Union (IoU) is predicted dynamically during training and is coupled with the smoothed label. The dynamic smooth label is assigned to supervise the classification branch. Under such supervision, the quality estimation branch is naturally merged into the classification branch, which simplifies the architecture of the anchor-free detector. Comprehensive experiments are conducted on the MS COCO benchmark. It is demonstrated that DSLA can significantly boost the detection accuracy by alleviating the above inconsistencies for anchor-free detectors. Our codes are released at https://github.com/YonghaoHe/DSLA.
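
A rough sketch of the smooth label construction described above: an FCOS-style centerness prior gives a continuous target in [0, 1], which is then coupled with the dynamically predicted IoU. The multiplicative coupling used here is an assumption for illustration, not necessarily DSLA's exact rule.

```python
import numpy as np

def centerness(l, t, r, b):
    """FCOS centerness from a location's distances to the box's left/top/right/bottom edges."""
    # Assumes a positive sample, i.e. all four distances are > 0.
    return np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def dynamic_smooth_label(l, t, r, b, predicted_iou):
    """Continuous classification target in [0, 1] coupling geometry with predicted IoU."""
    return centerness(l, t, r, b) * float(np.clip(predicted_iou, 0.0, 1.0))

# A location near the box centre with a well-localized prediction gets a label near 1;
# off-centre locations or poorly localized boxes get smoothly smaller targets.
```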

Benchmarking Visual-Inertial Deep Multimodal Fusion for Relative Pose Regression and Odometry-aided Absolute Pose Regression

  • Authors: Felix Ott, Nisha Lakshmana Raichur, David Rügamer, Tobias Feigl, Heiko Neumann, Bernd Bischl, Christopher Mutschler
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00919
  • Pdf link: https://arxiv.org/pdf/2208.00919
  • Abstract Visual-inertial localization is a key problem in computer vision and robotics applications such as virtual reality, self-driving cars, and aerial vehicles. The goal is to estimate an accurate pose of an object when either the environment or the dynamics are known. Recent methods directly regress the pose using convolutional and spatio-temporal networks. Absolute pose regression (APR) techniques predict the absolute camera pose from an image input in a known scene. Odometry methods perform relative pose regression (RPR) that predicts the relative pose from a known object dynamic (visual or inertial inputs). The localization task can be improved by retrieving information of both data sources for a cross-modal setup, which is a challenging problem due to contradictory tasks. In this work, we conduct a benchmark to evaluate deep multimodal fusion based on PGO and attention networks. Auxiliary and Bayesian learning are integrated for the APR task. We show accuracy improvements for the RPR-aided APR task and for the RPR-RPR task for aerial vehicles and hand-held devices. We conduct experiments on the EuRoC MAV and PennCOSYVIO datasets, and record a novel industry dataset.

Information-Aware Guidance for Magnetic Anomaly based Navigation

  • Authors: J. Humberto Ramos, Jaejeong Shin, Kyle Volle, Paul Buzaud, Kevin Brink, Prashant Ganesh
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2208.00988
  • Pdf link: https://arxiv.org/pdf/2208.00988
  • Abstract In the absence of an absolute positioning system, such as GPS, autonomous vehicles are subject to accumulation of positional error which can interfere with reliable performance. Improved navigational accuracy without GPS enables vehicles to achieve a higher degree of autonomy and reliability, both in terms of decision making and safety. This paper details the use of two navigation systems for autonomous agents using magnetic field anomalies to localize themselves within a map; both techniques use the information content in the environment in distinct ways and are aimed at reducing the localization uncertainty. The first method is based on a nonlinear observability metric of the vehicle model, while the second is an information theory based technique which minimizes the expected entropy of the system. These conditions are used to design guidance laws that minimize the localization uncertainty; they are verified in simulation, and hardware experiments are presented for the observability approach.

Robust Change Detection Based on Neural Descriptor Fields

  • Authors: Jiahui Fu, Yilun Du, Kurran Singh, Joshua B. Tenenbaum, John J. Leonard
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.01014
  • Pdf link: https://arxiv.org/pdf/2208.01014
  • Abstract The ability to reason about changes in the environment is crucial for robots operating over extended periods of time. Agents are expected to capture changes during operation so that actions can be followed to ensure a smooth progression of the working session. However, varying viewing angles and accumulated localization errors make it easy for robots to falsely detect changes in the surrounding world due to low observation overlap and drifted object associations. In this paper, based on the recently proposed category-level Neural Descriptor Fields (NDFs), we develop an object-level online change detection approach that is robust to partially overlapping observations and noisy localization results. Utilizing the shape completion capability and SE(3)-equivariance of NDFs, we represent objects with compact shape codes encoding full object shapes from partial observations. The objects are then organized in a spatial tree structure based on object centers recovered from NDFs for fast queries of object neighborhoods. By associating objects via shape code similarity and comparing local object-neighbor spatial layout, our proposed approach demonstrates robustness to low observation overlap and localization noises. We conduct experiments on both synthetic and real-world sequences and achieve improved change detection results compared to multiple baseline methods. Project webpage: https://yilundu.github.io/ndf_change

Keyword: transformer

Dynamically Retrieving Knowledge via Query Generation for informative dialogue response

  • Authors: Zhongtian Hu, Yangqi Chen, Yushuang Liu, Lifang Wang
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.00128
  • Pdf link: https://arxiv.org/pdf/2208.00128
  • Abstract Knowledge-driven dialogue generation has recently made remarkable breakthroughs. Compared with general dialogue systems, superior knowledge-driven dialogue systems can generate more informative and knowledgeable responses with pre-provided knowledge. However, in practical applications, the dialogue system cannot be provided with corresponding knowledge in advance. In order to solve the problem, we design a knowledge-driven dialogue system named DRKQG (Dynamically Retrieving Knowledge via Query Generation for informative dialogue response). Specifically, the system can be divided into two modules: a query generation module and a dialogue generation module. First, a time-aware mechanism is utilized to capture context information and a query can be generated for retrieving knowledge. Then, we integrate the copy mechanism and Transformers, which allows the response generation module to produce responses derived from the context and retrieved knowledge. Experimental results at LIC2022, the Language and Intelligence Technology Competition, show that our module outperforms the baseline model by a large margin on automatic evaluation metrics, while human evaluation by the Baidu Linguistics team shows that our system achieves impressive results in Factually Correct and Knowledgeable.

Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding

  • Authors: Hao Wen, Yunze Liu, Jingwei Huang, Bo Duan, Li Yi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00281
  • Pdf link: https://arxiv.org/pdf/2208.00281
  • Abstract This paper proposes a 4D backbone for long-term point cloud video understanding. A typical way to capture spatial-temporal context is using 4Dconv or transformer without hierarchy. However, those methods are neither effective nor efficient enough due to camera motion, scene changes, sampling patterns, and the complexity of 4D data. To address those issues, we leverage the primitive plane as a mid-level representation to capture the long-term spatial-temporal context in 4D point cloud videos and propose a novel hierarchical backbone named Point Primitive Transformer (PPTr), which is mainly composed of intra-primitive point transformers and primitive transformers. Extensive experiments show that PPTr outperforms the previous state of the art on different tasks.

Doubly Deformable Aggregation of Covariance Matrices for Few-shot Segmentation

  • Authors: Zhitong Xiong, Haopeng Li, Xiao Xiang Zhu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00306
  • Pdf link: https://arxiv.org/pdf/2208.00306
  • Abstract Training semantic segmentation models with few annotated samples has great potential in various real-world applications. For the few-shot segmentation task, the main challenge is how to accurately measure the semantic correspondence between the support and query samples with limited training data. To address this problem, we propose to aggregate the learnable covariance matrices with a deformable 4D Transformer to effectively predict the segmentation map. Specifically, in this work, we first devise a novel hard example mining mechanism to learn covariance kernels for the Gaussian process. The learned covariance kernel functions have great advantages over existing cosine similarity-based methods in correspondence measurement. Based on the learned covariance kernels, an efficient doubly deformable 4D Transformer module is designed to adaptively aggregate feature similarity maps into segmentation results. By combining these two designs, the proposed method can not only set new state-of-the-art performance on public benchmarks, but also converge extremely faster than existing methods. Experiments on three public datasets have demonstrated the effectiveness of our method.

One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning

  • Authors: Zhipeng Zhang, Zhimin Wei, Zhongzhen Huang, Rui Niu, Peng Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00361
  • Pdf link: https://arxiv.org/pdf/2208.00361
  • Abstract Referring Expression Comprehension (REC) is one of the most important tasks in visual reasoning that requires a model to detect the target object referred by a natural language expression. Among the proposed pipelines, the one-stage Referring Expression Comprehension (OSREC) has become the dominant trend since it merges the region proposal and selection stages. Many state-of-the-art OSREC models adopt a multi-hop reasoning strategy because a sequence of objects is frequently mentioned in a single expression which needs multi-hop reasoning to analyze the semantic relation. However, one unsolved issue of these models is that the number of reasoning steps needs to be pre-defined and fixed before inference, ignoring the varying complexity of expressions. In this paper, we propose a Dynamic Multi-step Reasoning Network, which allows the reasoning steps to be dynamically adjusted based on the reasoning state and expression complexity. Specifically, we adopt a Transformer module to memorize & process the reasoning state and a Reinforcement Learning strategy to dynamically infer the reasoning steps. The work achieves the state-of-the-art performance or significant improvements on several REC datasets, ranging from RefCOCO (+, g) with short expressions, to Ref-Reasoning, a dataset with long and complex compositional expressions.

Less is More: Consistent Video Depth Estimation with Masked Frames Modeling

  • Authors: Yiran Wang, Zhiyu Pan, Xingyi Li, Zhiguo Cao, Ke Xian, Jianming Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00380
  • Pdf link: https://arxiv.org/pdf/2208.00380
  • Abstract Temporal consistency is the key challenge of video depth estimation. Previous works are based on additional optical flow or camera poses, which is time-consuming. By contrast, we derive consistency with less information. Since videos inherently exist with heavy temporal redundancy, a missing frame could be recovered from neighboring ones. Inspired by this, we propose the frame masking network (FMNet), a spatial-temporal transformer network predicting the depth of masked frames based on their neighboring frames. By reconstructing masked temporal features, the FMNet can learn intrinsic inter-frame correlations, which leads to consistency. Compared with prior arts, experimental results demonstrate that our approach achieves comparable spatial accuracy and higher temporal consistency without any additional information. Our work provides a new perspective on consistent video depth estimation.
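
A tiny sketch of the masked-frame training signal: a subset of per-frame feature tokens is replaced by a learned mask token and a temporal transformer reconstructs them from their neighbours. The layer sizes, mask ratio, and plain MSE reconstruction loss are placeholders, not FMNet's actual architecture or objective.

```python
import torch
import torch.nn as nn

class MaskedFrameModel(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mask_token = nn.Parameter(torch.zeros(d_model))

    def forward(self, frame_feats, mask_ratio=0.25):
        # frame_feats: (B, T, d_model) per-frame features from a spatial backbone.
        B, T, D = frame_feats.shape
        mask = torch.rand(B, T, device=frame_feats.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, T, D), frame_feats)
        recon = self.encoder(x)
        # Reconstruct only the masked frames from their (unmasked) temporal neighbours.
        loss = ((recon - frame_feats) ** 2)[mask].mean()
        return recon, loss

# model = MaskedFrameModel()
# recon, loss = model(torch.randn(2, 16, 256))
```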

STrajNet: Occupancy Flow Prediction via Multi-modal Swin Transformer

  • Authors: Haochen Liu, Zhiyu Huang, Chen Lv
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.00394
  • Pdf link: https://arxiv.org/pdf/2208.00394
  • Abstract Making an accurate prediction of occupancy and flow is essential to enable better safety and interaction for autonomous vehicles under complex traffic scenarios. This work proposes STrajNet: a multi-modal Swin Transformer-based framework for effective scene occupancy and flow predictions. We employ Swin Transformer to encode the image and interaction-aware motion representations and propose a cross-attention module to inject motion awareness into grid cells across different time steps. Flow and occupancy predictions are then decoded through temporal-sharing pyramid decoders. The proposed method shows competitive prediction accuracy and other evaluation metrics in the Waymo Open Dataset benchmark.

Neural Knowledge Bank for Pretrained Transformers

  • Authors: Damai Dai, Wenbin Jiang, Qingxiu Dong, Yajuan Lyu, Qiaoqiao She, Zhifang Sui
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.00399
  • Pdf link: https://arxiv.org/pdf/2208.00399
  • Abstract The ability of pretrained Transformers to remember factual knowledge is essential for knowledge-intense downstream tasks such as closed-book question answering. Existing work has shown that pretrained Transformers can recall or leverage factual knowledge that appears in the pretraining corpus to some degree. However, due to the limit of the model capacity, the ability of pretrained models to remember factual knowledge is also limited. Dai et al. (2022) find that the Feed-Forward Networks (FFNs) in pretrained Transformers store factual knowledge in a memory-like manner. Inspired by this finding, we propose a Neural Knowledge Bank (NKB) to store extra factual knowledge for pretrained Transformers. To be specific, we also regard FFNs as key-value memories, and extend them with additional memory slots. During knowledge injection, we fix the original model and inject factual knowledge into the extended memory slots, so there will be no catastrophic forgetting for the pretrained model. In addition, the view of FFNs as key-value memories makes the NKB highly interpretable. We use three closed-book question answering datasets to show our strong ability to store extra factual knowledge. Also, we prove that the NKB will not degrade the general language generation ability of pretrained models through two representative generation tasks, summarization and machine translation. Further, we thoroughly analyze the NKB to reveal its working mechanism and present the meaning of its keys and values in a human-readable way. On top of it, we perform a preliminary attempt to directly update the factual knowledge in the NKB without any additional training.
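
To make the "extended key-value memory" idea concrete, the sketch below appends trainable key and value slots to a frozen pretrained feed-forward layer and adds their retrieved contribution to its output. The dimensions, initialization, and ReLU activation are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FFNWithKnowledgeBank(nn.Module):
    """A pretrained FFN, kept frozen, extended with extra key-value memory slots."""

    def __init__(self, ffn: nn.Module, d_model: int, extra_slots: int):
        super().__init__()
        self.ffn = ffn
        for p in self.ffn.parameters():        # no catastrophic forgetting: freeze it
            p.requires_grad = False
        self.extra_keys = nn.Parameter(torch.randn(extra_slots, d_model) * 0.02)
        self.extra_values = nn.Parameter(torch.randn(extra_slots, d_model) * 0.02)

    def forward(self, x):                      # x: (..., d_model)
        base = self.ffn(x)                     # output of the original (frozen) FFN
        scores = torch.relu(x @ self.extra_keys.t())       # activations over new slots
        return base + scores @ self.extra_values           # add the retrieved knowledge

# ffn = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256))
# layer = FFNWithKnowledgeBank(ffn, d_model=256, extra_slots=64)
# y = layer(torch.randn(2, 10, 256))
```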

Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition

  • Authors: Xudong Xie, Ling Fu, Zhifei Zhang, Zhaowen Wang, Xiang Bai
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00438
  • Pdf link: https://arxiv.org/pdf/2208.00438
  • Abstract Artistic text recognition is an extremely challenging task with a wide range of applications. However, current scene text recognition methods mainly focus on irregular text and have not specifically explored artistic text. The challenges of artistic text recognition include the varied appearance caused by specially designed fonts and effects, the complex connections and overlaps between characters, and the severe interference from background patterns. To alleviate these problems, we propose to recognize artistic text at three levels. First, corner points are used to guide the extraction of local features inside characters, given the robustness of corner structures to appearance and shape. In this way, the discreteness of the corner points cuts off the connections between characters, and their sparsity improves robustness to background interference. Second, we design a character contrastive loss to model character-level features, improving the feature representation for character classification. Third, we utilize a Transformer to learn image-level global features and model the global relationships of the corner points, with the assistance of a corner-query cross-attention mechanism. In addition, we provide an artistic text dataset to benchmark performance. Experimental results verify the significant superiority of our proposed method on artistic text recognition, and it also achieves state-of-the-art performance on several blurred and perspective datasets.

Aggretriever: A Simple Approach to Aggregate Textual Representation for Robust Dense Passage Retrieval

  • Authors: Sheng-Chieh Lin, Minghan Li, Jimmy Lin
  • Subjects: Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/2208.00511
  • Pdf link: https://arxiv.org/pdf/2208.00511
  • Abstract Pre-trained transformers have achieved success in many NLP tasks. One thread of work focuses on training bi-encoder models (i.e., dense retrievers) to effectively encode sentences or passages into single dense vectors for efficient approximate nearest neighbor (ANN) search. However, recent work has demonstrated that transformers pre-trained with masked language modeling (MLM) are not capable of effectively aggregating text information into a single dense vector, due to the mismatch between pre-training and fine-tuning. Therefore, computationally expensive techniques have been adopted to train dense retrievers, such as large batch sizes, knowledge distillation, or post pre-training. In this work, we present a simple approach to effectively aggregate the textual representation of a pre-trained transformer into a dense vector. Extensive experiments show that our approach improves the robustness of the single-vector approach under both in-domain and zero-shot evaluations without any computationally expensive training techniques. Our work demonstrates that MLM pre-trained transformers can be used to effectively encode text information into a single vector for dense retrieval. Code is available at: https://github.com/castorini/dhr

CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning

  • Authors: Mahdi Saleh, Yige Wang, Nassir Navab, Benjamin Busam, Federico Tombari
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.00524
  • Pdf link: https://arxiv.org/pdf/2208.00524
  • Abstract Processing 3D data efficiently has always been a challenge. Spatial operations on large-scale point clouds, stored as sparse data, incur extra cost. Attracted by the success of transformers, researchers are using multi-head attention for vision tasks. However, attention calculations in transformers come with quadratic complexity in the number of inputs and lack spatial intuition on sets such as point clouds. In this work, we redesign set transformers and incorporate them into a hierarchical framework for shape classification and part and scene segmentation. We propose a local attention unit that captures features in a spatial neighborhood. We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration. Finally, to mitigate the non-uniformity of point clouds, we propose an efficient Multi-Scale Tokenization (MST), which extracts scale-invariant tokens for attention operations. The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods while requiring significantly fewer computations. Our proposed architecture predicts segmentation labels with around half the latency and parameter count of the previous most efficient method with comparable performance. The code is available at https://github.com/YigeWang-WHU/CloudAttention.
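
The local attention unit can be sketched roughly as below: a hypothetical k-nearest-neighbor variant in which each point attends only within its spatial neighborhood. The grouping scheme, dimensions, and tokenization details are assumptions and do not reproduce the paper's sampling/grouping or MST components.

```python
import torch
import torch.nn as nn

def knn_group(xyz, k=16):
    # xyz: (B, N, 3); return indices of the k nearest neighbors per point.
    dists = torch.cdist(xyz, xyz)                         # (B, N, N)
    return dists.topk(k, largest=False).indices           # (B, N, k)

class LocalAttention(nn.Module):
    """Each point attends to its k nearest neighbors only."""
    def __init__(self, dim=64, heads=4, k=16):
        super().__init__()
        self.k = k
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, xyz):
        B, N, C = feats.shape
        idx = knn_group(xyz, self.k)                                  # (B, N, k)
        gathered = feats.gather(1, idx.reshape(B, -1, 1).expand(-1, -1, C))
        neigh = gathered.reshape(B * N, self.k, C)                    # neighborhood tokens
        query = feats.reshape(B * N, 1, C)                            # each point as query
        out, _ = self.attn(query, neigh, neigh)
        return out.reshape(B, N, C)

pts, f = torch.randn(2, 1024, 3), torch.randn(2, 1024, 64)
out = LocalAttention()(f, pts)                                        # (2, 1024, 64)
```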

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

  • Authors: Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang
  • Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
  • Arxiv link: https://arxiv.org/abs/2208.00579
  • Pdf link: https://arxiv.org/pdf/2208.00579
  • Abstract Transformers have achieved remarkable success in sequence modeling and beyond, but suffer from quadratic computational and memory complexity with respect to the length of the input sequence. Leveraging techniques such as sparse and linear attention and hashing tricks, efficient transformers have been proposed to reduce this quadratic complexity, but they significantly degrade accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the \emph{momentum transformer}, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexity. Furthermore, we develop an adaptive strategy to compute the momentum value for our model based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrate that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy.
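
To make the idea concrete, here is a rough causal linear-attention recurrence with a heavy-ball momentum term added to the key-value state update. The feature map (elu + 1), the fixed momentum coefficient, and the exact place where momentum enters are assumptions for illustration; the paper derives its adaptive momentum differently.

```python
import torch
import torch.nn.functional as F

def momentum_linear_attention(q, k, v, gamma=0.9, eps=1e-6):
    # q, k, v: (B, T, D). Plain causal linear attention accumulates
    # S_t = S_{t-1} + phi(k_t) v_t^T; here the increment gets heavy-ball momentum.
    phi = lambda x: F.elu(x) + 1.0
    q, k = phi(q), phi(k)
    B, T, D = q.shape
    S = q.new_zeros(B, D, v.shape[-1])     # running key-value state
    M = torch.zeros_like(S)                # momentum of the state update
    z = q.new_zeros(B, D)                  # running normalizer (kept momentum-free here)
    outs = []
    for t in range(T):
        M = gamma * M + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
        S = S + M
        z = z + k[:, t]
        num = torch.einsum('bd,bdv->bv', q[:, t], S)
        den = (q[:, t] * z).sum(-1, keepdim=True) + eps
        outs.append(num / den)
    return torch.stack(outs, dim=1)        # (B, T, Dv)

out = momentum_linear_attention(torch.randn(2, 16, 32), torch.randn(2, 16, 32), torch.randn(2, 16, 32))
```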

SiamixFormer: A Siamese Transformer Network For Building Detection And Change Detection From Bi-Temporal Remote Sensing Images

  • Authors: Amir mohammadian, Foad Ghaderi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00657
  • Pdf link: https://arxiv.org/pdf/2208.00657
  • Abstract Building detection and change detection using remote sensing images can help urban and rescue planning. Moreover, they can be used for building damage assessment after natural disasters. Currently, most existing models for building detection use only one image (the pre-disaster image) to detect buildings, based on the idea that post-disaster images reduce the model's performance because of the presence of destroyed buildings. In this paper, we propose a siamese model, called SiamixFormer, which uses pre- and post-disaster images as input. Our model has two encoders with a hierarchical transformer architecture. The output of each stage in both encoders is given to a temporal transformer for feature fusion, in such a way that the query is generated from the pre-disaster image and the (key, value) pair from the post-disaster image. In this way, temporal features are also considered in feature fusion. Another advantage of using temporal transformers in feature fusion is that they can better maintain the large receptive fields generated by transformer encoders compared with CNNs. Finally, the output of the temporal transformer is given to a simple MLP decoder at each stage. SiamixFormer is evaluated on the xBD and WHU datasets for building detection and on the LEVIR-CD and CDD datasets for change detection, and outperforms the state of the art.
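
The query/key-value assignment in the temporal fusion can be sketched with standard cross-attention (token count, dimensions, the residual connection, and the normalization are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Queries from the pre-disaster stream, keys/values from the post-disaster stream."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pre_tokens, post_tokens):
        fused, _ = self.cross_attn(query=pre_tokens, key=post_tokens, value=post_tokens)
        return self.norm(pre_tokens + fused)

pre = torch.randn(1, 196, 256)     # tokens from one pre-disaster encoder stage
post = torch.randn(1, 196, 256)    # tokens from the matching post-disaster stage
out = TemporalFusion()(pre, post)  # (1, 196, 256), passed on to the MLP decoder
```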

Local Perception-Aware Transformer for Aerial Tracking

  • Authors: Changhong Fu, Weiyu Peng, Sihang Li, Junjie Ye, Ziang Cao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00662
  • Pdf link: https://arxiv.org/pdf/2208.00662
  • Abstract Transformer-based visual object tracking has been widely adopted. However, the Transformer structure lacks sufficient inductive bias. In addition, focusing only on encoding global features harms the modeling of local details, which restricts tracking capability for aerial robots. Specifically, with a local-modeling-to-global-search mechanism, the proposed tracker replaces the global encoder with a novel local-recognition encoder. In this encoder, a local-recognition attention and a local element correction network are carefully designed to reduce interference from redundant global information and to increase local inductive bias. Meanwhile, the latter can precisely model local object details under aerial views through a detail-inquiry network. The proposed method achieves competitive accuracy and robustness on several authoritative aerial benchmarks with 316 sequences in total. The proposed tracker's practicability and efficiency have been validated by real-world tests.

Efficient Long-Text Understanding with Short-Text Models

  • Authors: Maor Ivgi, Uri Shaham, Jonathan Berant
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.00748
  • Pdf link: https://arxiv.org/pdf/2208.00748
  • Abstract Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles and long documents, due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. Specifically, we partition the input into overlapping chunks, encode each with a short-text LM encoder and use the pretrained decoder to fuse information across chunks (fusion-in-decoder). We illustrate through controlled experiments that SLED offers a viable strategy for long text understanding and evaluate our approach on SCROLLS, a benchmark with seven datasets across a wide range of language understanding tasks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.
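
A minimal sketch of the sliding-encoder step, with a toy embedding standing in for the short-text encoder. The chunk length, stride, and the omission of SLED's context-padding details are simplifying assumptions; the decoder would then cross-attend over the concatenated chunk encodings (fusion-in-decoder).

```python
import torch
import torch.nn as nn

def sled_style_encode(input_ids, encoder, chunk_len=256, stride=192):
    # Split a long sequence into overlapping chunks, encode each chunk
    # independently, and concatenate the encodings for the decoder.
    chunks = []
    for start in range(0, input_ids.size(1), stride):
        chunk = input_ids[:, start:start + chunk_len]
        chunks.append(encoder(chunk))                  # (B, <=chunk_len, D) per chunk
        if start + chunk_len >= input_ids.size(1):
            break
    return torch.cat(chunks, dim=1)

# Toy stand-in encoder so the sketch runs end to end.
embed = nn.Embedding(30522, 64)
long_ids = torch.randint(0, 30522, (1, 1000))
memory = sled_style_encode(long_ids, embed)            # (1, 1256, 64) with overlaps
```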

$\textrm{D}^3\textrm{Former}$: Debiased Dual Distilled Transformer for Incremental Learning

  • Authors: Abdelrahman Mohamed, Rushali Grandhe, KJ Joseph, Salman Khan, Fahad Khan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.00777
  • Pdf link: https://arxiv.org/pdf/2208.00777
  • Abstract Class incremental learning (CIL) involves learning a classification model where groups of new classes are encountered in every learning phase. The goal is to learn a unified model performant on all the classes observed so far. Given the recent popularity of Vision Transformers (ViTs) in conventional classification settings, an interesting question is to study their continual learning behaviour. In this work, we develop a Debiased Dual Distilled Transformer for CIL dubbed $\textrm{D}^3\textrm{Former}$. The proposed model leverages a hybrid nested ViT design to ensure data efficiency and scalability to small as well as large datasets. In contrast to a recent ViT-based CIL approach, our $\textrm{D}^3\textrm{Former}$ does not dynamically expand its architecture when new tasks are learned and remains suitable for a large number of incremental tasks. The improved CIL behaviour of $\textrm{D}^3\textrm{Former}$ is due to two fundamental changes to the ViT design. First, we treat incremental learning as a long-tail classification problem, where the majority samples from new classes vastly outnumber the limited exemplars available for old classes. To avoid bias against the minority old classes, we propose to dynamically adjust logits to emphasize retaining the representations relevant to old tasks. Second, we propose to preserve the configuration of spatial attention maps as learning progresses across tasks. This helps reduce catastrophic forgetting by constraining the model to retain attention on the most discriminative regions. $\textrm{D}^3\textrm{Former}$ obtains favorable results on incremental versions of the CIFAR-100, MNIST, SVHN, and ImageNet datasets.
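
The long-tail view of the logit adjustment can be illustrated with the standard class-prior correction; this is a generic recipe, and the paper's exact debiasing term and temperature are assumptions here.

```python
import torch

def debiased_logits(logits, class_counts, tau=1.0):
    # Subtracting the log class prior boosts minority (old) classes relative
    # to the over-represented new classes; tau is an illustrative temperature.
    prior = class_counts.float() / class_counts.sum()
    return logits - tau * prior.log()

counts = torch.tensor([20] * 50 + [500] * 10)    # 50 old classes (few exemplars) + 10 new classes
adjusted = debiased_logits(torch.randn(4, 60), counts)
```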

MAFW: A Large-scale, Multi-modal, Compound Affective Database for Dynamic Facial Expression Recognition in the Wild

  • Authors: Yuanyuan Liu, Wei Dai, Chuanxu Feng, Wenbin Wang, Guanghao Yin, Jiabei Zeng, Shiguang Shan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00847
  • Pdf link: https://arxiv.org/pdf/2208.00847
  • Abstract Dynamic facial expression recognition (FER) databases provide important data support for affective computing and its applications. However, most FER databases are annotated with several basic, mutually exclusive emotional categories and contain only one modality, e.g., videos. Such limited labels and modalities cannot accurately reflect human emotions or support real-world applications. In this paper, we propose MAFW, a large-scale multi-modal compound affective database with 10,045 video-audio clips in the wild. Each clip is annotated with a compound emotional category and a couple of sentences that describe the subjects' affective behaviors in the clip. For the compound emotion annotation, each clip is categorized into one or more of 11 widely used emotions, i.e., anger, disgust, fear, happiness, neutral, sadness, surprise, contempt, anxiety, helplessness, and disappointment. To ensure high label quality, we filter out unreliable annotations with an Expectation Maximization (EM) algorithm and obtain 11 single-label emotion categories and 32 multi-label emotion categories. To the best of our knowledge, MAFW is the first in-the-wild multi-modal database annotated with compound emotion annotations and emotion-related captions. Additionally, we propose a novel Transformer-based expression snippet feature learning method to recognize compound emotions by leveraging the expression-change relations among different emotions and modalities. Extensive experiments on the MAFW database show the advantages of the proposed method over other state-of-the-art methods for both uni- and multi-modal FER. Our MAFW database is publicly available at https://mafw-database.github.io/MAFW.

Learning from flowsheets: A generative transformer model for autocompletion of flowsheets

  • Authors: Gabriel Vogel, Lukas Schulze Balhorn, Artur M. Schweidtmann
  • Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.00859
  • Pdf link: https://arxiv.org/pdf/2208.00859
  • Abstract We propose a novel method enabling the autocompletion of chemical flowsheets, inspired by the autocompletion of text. We represent flowsheets as strings using the text-based SFILES 2.0 notation and learn the grammatical structure of the SFILES 2.0 language and common patterns in flowsheets using a transformer-based language model. We pre-train our model on synthetically generated flowsheets to learn the flowsheet language grammar. Then, we fine-tune our model in a transfer learning step on real flowsheet topologies. Finally, we use the trained model for causal language modeling to autocomplete flowsheets. Ultimately, the proposed method can provide chemical engineers with recommendations during interactive flowsheet synthesis. The results demonstrate the high potential of this approach for future AI-assisted process synthesis.
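
Autocompletion itself is ordinary causal-LM decoding over SFILES 2.0 tokens. A greedy sketch follows; the model interface, token ids, and stopping criterion are assumptions, and the toy model only makes the example runnable.

```python
import torch

@torch.no_grad()
def autocomplete_sfiles(model, prefix_ids, eos_id, max_new_tokens=64):
    # `model` maps a (B, T) token tensor to (B, T, vocab) logits.
    ids = prefix_ids
    for _ in range(max_new_tokens):
        next_id = model(ids)[:, -1].argmax(dim=-1, keepdim=True)   # greedy next token
        ids = torch.cat([ids, next_id], dim=1)
        if (next_id == eos_id).all():
            break
    return ids   # prefix + suggested flowsheet completion as SFILES 2.0 tokens

vocab = 100
toy_lm = torch.nn.Sequential(torch.nn.Embedding(vocab, 32), torch.nn.Linear(32, vocab))
completed = autocomplete_sfiles(toy_lm, torch.randint(0, vocab, (1, 10)), eos_id=0)
```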

Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem

  • Authors: Zheng Wang, Wenjie Ruan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.00906
  • Pdf link: https://arxiv.org/pdf/2208.00906
  • Abstract Recent research on the robustness of deep learning has shown that Vision Transformers (ViTs) surpass Convolutional Neural Networks (CNNs) under some perturbations, e.g., natural corruption and adversarial attacks. Some papers argue that the superior robustness of ViTs comes from the segmentation of their input images; others say that Multi-head Self-Attention (MSA) is the key to preserving robustness. In this paper, we introduce a principled and unified theoretical framework to investigate such arguments about ViT robustness. We first theoretically prove that, unlike Transformers in Natural Language Processing, ViTs are Lipschitz continuous. We then analyze the adversarial robustness of ViTs from the perspective of the Cauchy Problem, via which we can quantify how robustness propagates through the layers. We demonstrate that the first and last layers are the critical factors affecting the robustness of ViTs. Furthermore, based on our theory, we empirically show that, unlike the claims of existing research, MSA only contributes to the adversarial robustness of ViTs under weak adversarial attacks, e.g., FGSM, and, surprisingly, MSA actually compromises the model's adversarial robustness under stronger attacks, e.g., PGD attacks.

BabelBERT: Massively Multilingual Transformers Meet a Massively Multilingual Lexical Resource

  • Authors: Tommaso Green, Simone Paolo Ponzetto, Goran Glavaš
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.01018
  • Pdf link: https://arxiv.org/pdf/2208.01018
  • Abstract While pretrained language models (PLMs) primarily serve as general-purpose text encoders that can be fine-tuned for a wide variety of downstream tasks, recent work has shown that they can also be rewired to produce high-quality word representations (i.e., static word embeddings) and yield good performance in type-level lexical tasks. While existing work primarily focused on lexical specialization of PLMs in monolingual and bilingual settings, in this work we expose massively multilingual transformers (MMTs, e.g., mBERT or XLM-R) to multilingual lexical knowledge at scale, leveraging BabelNet as a readily available, rich source of multilingual and cross-lingual type-level lexical knowledge. Concretely, we leverage BabelNet's multilingual synsets to create synonym pairs across $50$ languages and then subject the MMTs (mBERT and XLM-R) to a lexical specialization procedure guided by a contrastive objective. We show that such massively multilingual lexical specialization brings massive gains in two standard cross-lingual lexical tasks, bilingual lexicon induction and cross-lingual word similarity, as well as in cross-lingual sentence retrieval. Crucially, we observe gains for languages unseen during specialization, indicating that multilingual lexical specialization enables generalization to languages with no lexical constraints. In a series of subsequent controlled experiments, we demonstrate that the pretraining quality of word representations in the MMT for the languages involved in specialization has a much larger effect on performance than the linguistic diversity of the set of constraints. Encouragingly, this suggests that lexical tasks involving low-resource languages benefit the most from the lexical knowledge of resource-rich languages, which is generally much more widely available.
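
The contrastive specialization step can be sketched as an in-batch InfoNCE loss over synonym pairs (the temperature, normalization, and single-direction form are assumptions; the paper's exact objective may differ):

```python
import torch
import torch.nn.functional as F

def synonym_infonce(anchor_emb, positive_emb, temperature=0.07):
    # Embeddings of cross-lingual synonym pairs are pulled together,
    # while in-batch non-synonyms act as negatives.
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # the diagonal holds the true pairs
    return F.cross_entropy(logits, targets)

loss = synonym_infonce(torch.randn(32, 768), torch.randn(32, 768))
```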

On the Limitations of Sociodemographic Adaptation with Transformers

  • Authors: Chia-Chien Hung, Anne Lauscher, Dirk Hovy, Simone Paolo Ponzetto, Goran Glavaš
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.01029
  • Pdf link: https://arxiv.org/pdf/2208.01029
  • Abstract Sociodemographic factors (e.g., gender or age) shape our language. Previous work showed that incorporating specific sociodemographic factors can consistently improve performance for various NLP tasks in traditional NLP models. We investigate whether these previous findings still hold with state-of-the-art pretrained Transformers. We use three common specialization methods proven effective for incorporating external knowledge into pretrained Transformers (e.g., domain-specific or geographic knowledge). We adapt the language representations for the sociodemographic dimensions of gender and age, using continuous language modeling and dynamic multi-task learning for adaptation, where we couple language modeling with the prediction of a sociodemographic class. Our results when employing a multilingual model show substantial performance gains across four languages (English, German, French, and Danish). These findings are in line with the results of previous work and hold promise for successful sociodemographic specialization. However, controlling for confounding factors like domain and language shows that, while sociodemographic adaptation does improve downstream performance, the gains do not always solely stem from sociodemographic knowledge. Our results indicate that sociodemographic specialization, while very important, is still an unresolved problem in NLP.
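
The coupling of language modeling with sociodemographic class prediction amounts to a weighted multi-task loss. A minimal, hypothetical head follows; the hidden size, vocabulary, weighting, and the use of the first token for classification are assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Language-modeling loss plus sociodemographic (e.g., age/gender) classification."""
    def __init__(self, hidden=768, vocab=30522, num_classes=2, lam=0.5):
        super().__init__()
        self.lm_head = nn.Linear(hidden, vocab)
        self.cls_head = nn.Linear(hidden, num_classes)
        self.lam = lam

    def forward(self, hidden_states, lm_labels, cls_labels):
        lm_loss = nn.functional.cross_entropy(
            self.lm_head(hidden_states).flatten(0, 1), lm_labels.flatten())
        cls_loss = nn.functional.cross_entropy(
            self.cls_head(hidden_states[:, 0]), cls_labels)   # first-token representation
        return lm_loss + self.lam * cls_loss

head = MultiTaskHead()
h = torch.randn(4, 12, 768)
loss = head(h, torch.randint(0, 30522, (4, 12)), torch.randint(0, 2, (4,)))
```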

Keyword: autonomous driving

Robust Trajectory Prediction against Adversarial Attacks

  • Authors: Yulong Cao, Danfei Xu, Xinshuo Weng, Zhuoqing Mao, Anima Anandkumar, Chaowei Xiao, Marco Pavone
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.00094
  • Pdf link: https://arxiv.org/pdf/2208.00094
  • Abstract Trajectory prediction using deep neural networks (DNNs) is an essential component of autonomous driving (AD) systems. However, these methods are vulnerable to adversarial attacks, leading to serious consequences such as collisions. In this work, we identify two key ingredients for defending trajectory prediction models against adversarial attacks: (1) designing effective adversarial training methods and (2) adding domain-specific data augmentation to mitigate the performance degradation on clean data. We demonstrate that our method improves performance on adversarial data by 46% at the cost of only a 3% performance degradation on clean data, compared to the model trained with clean data. Additionally, compared to existing robust methods, our method improves performance by 21% on adversarial examples and 9% on clean data. Our robust model is evaluated with a planner to study its downstream impacts. We demonstrate that our model can significantly reduce severe accident rates (e.g., collisions and off-road driving).
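
The adversarial-training ingredient can be illustrated with a PGD-style perturbation of the observed history; the step sizes, bound, MSE objective, and toy model are assumptions and do not reproduce the paper's attack or augmentation.

```python
import torch

def adversarial_history(model, past_traj, future_traj, eps=0.5, steps=5, alpha=0.2):
    # Iteratively perturb the observed trajectory to maximize prediction error,
    # then train on a mix of clean and perturbed histories.
    delta = torch.zeros_like(past_traj, requires_grad=True)
    for _ in range(steps):
        pred = model(past_traj + delta)
        loss = torch.nn.functional.mse_loss(pred, future_traj)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += alpha * grad.sign()
            delta.clamp_(-eps, eps)
    return (past_traj + delta).detach()

toy = torch.nn.Linear(20, 30)            # maps a flattened history to a flattened future
past, future = torch.randn(4, 20), torch.randn(4, 30)
adv_past = adversarial_history(toy, past, future)
```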

Perspectives on the System-level Design of a Safe Autonomous Driving Stack

  • Authors: Majd Hawasly, Jonathan Sadeghi, Morris Antonello, Stefano V. Albrecht, John Redford, Subramanian Ramamoorthy
  • Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
  • Arxiv link: https://arxiv.org/abs/2208.00096
  • Pdf link: https://arxiv.org/pdf/2208.00096
  • Abstract Achieving safe and robust autonomy is the key bottleneck on the path towards broader adoption of autonomous vehicle technology. This motivates going beyond extrinsic metrics such as miles between disengagements and calls for approaches that embody safety by design. In this paper, we address some aspects of this challenge, with emphasis on issues of motion planning and prediction. We do this by describing novel approaches taken to solving selected sub-problems within an autonomous driving stack, in the process introducing the design philosophy adopted within Five. This includes safe-by-design planning, interpretable as well as verifiable prediction, and modelling of perception errors to enable effective sim-to-real and real-to-sim transfer within the testing pipeline of a realistic autonomous system.

enpheeph: A Fault Injection Framework for Spiking and Compressed Deep Neural Networks

  • Authors: Alessio Colucci, Andreas Steininger, Muhammad Shafique
  • Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.00328
  • Pdf link: https://arxiv.org/pdf/2208.00328
  • Abstract Research on Deep Neural Networks (DNNs) has focused on improving performance and accuracy for real-world deployments, leading to new models, such as Spiking Neural Networks (SNNs), and optimization techniques, e.g., quantization and pruning for compressed networks. However, the deployment of these innovative models and optimization techniques introduces possible reliability issues, and reliability is a pillar for DNNs to be widely used in safety-critical applications, e.g., autonomous driving. Moreover, scaled technology nodes carry the associated risk of multiple faults happening at the same time, a possibility not addressed in state-of-the-art resiliency analyses. Towards better reliability analysis for DNNs, we present enpheeph, a Fault Injection Framework for Spiking and Compressed DNNs. The enpheeph framework enables optimized execution on specialized hardware devices, e.g., GPUs, while providing complete customizability to investigate different fault models, emulating various reliability constraints and use cases. Hence, faults can be injected into SNNs as well as compressed networks with minimal to no modifications to the underlying code, a feat not achievable with other state-of-the-art tools. To evaluate our enpheeph framework, we analyze the resiliency of different DNN and SNN models with different compression techniques. By injecting a random and increasing number of faults, we show that DNNs can exhibit an accuracy drop of more than 40% at a fault rate as low as 7 x 10^-7 faults per parameter. The run-time overhead when executing enpheeph is less than 20% of the baseline execution time when executing 100,000 faults concurrently, at least 10x lower than state-of-the-art frameworks, making enpheeph future-proof for complex fault injection scenarios. We release enpheeph at https://github.com/Alexei95/enpheeph.
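
As a generic illustration of this kind of experiment (this is not enpheeph's API), the sketch below corrupts a random subset of parameters at a given fault rate; the model is then re-evaluated to measure the accuracy drop. Stuck-at-zero faults and the rate are assumptions.

```python
import torch

def inject_weight_faults(model, fault_rate=1e-6, stuck_value=0.0):
    # Corrupt each parameter independently with probability `fault_rate`,
    # emulating stuck-at faults in the stored weights.
    with torch.no_grad():
        for p in model.parameters():
            mask = torch.rand_like(p) < fault_rate
            p[mask] = stuck_value
    return model

faulty = inject_weight_faults(torch.nn.Linear(10, 10), fault_rate=1e-3)
# Re-run the evaluation loop on `faulty` and sweep fault_rate to chart resiliency.
```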

Learning an Interpretable Model for Driver Behavior Prediction with Inductive Biases

  • Authors: Salar Arbabi, Davide Tavernini, Saber Fallah, Richard Bowden
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2208.00516
  • Pdf link: https://arxiv.org/pdf/2208.00516
  • Abstract To plan safe maneuvers and act with foresight, autonomous vehicles must be capable of accurately predicting the uncertain future. In the context of autonomous driving, deep neural networks have been successfully applied to learning predictive models of human driving behavior from data. However, the predictions suffer from cascading errors, resulting in large inaccuracies over long time horizons. Furthermore, the learned models are black boxes, and thus it is often unclear how they arrive at their predictions. In contrast, rule-based models, which are informed by human experts, maintain long-term coherence in their predictions and are human-interpretable. However, such models often lack the sufficient expressiveness needed to capture complex real-world dynamics. In this work, we begin to close this gap by embedding the Intelligent Driver Model, a popular hand-crafted driver model, into deep neural networks. Our model's transparency can offer considerable advantages, e.g., in debugging the model and more easily interpreting its predictions. We evaluate our approach on a simulated merging scenario, showing that it yields a robust model that is end-to-end trainable and provides greater transparency at no cost to the model's predictive accuracy.
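
For reference, the hand-crafted Intelligent Driver Model that the paper embeds is the standard car-following law below; the parameter values are common defaults, not those used in the paper.

```python
import math

def idm_acceleration(v, v_lead, gap, v0=30.0, T=1.5, a_max=1.4, b=2.0, s0=2.0, delta=4):
    # Standard IDM: free-driving term minus an interaction term based on the
    # desired gap s* (minimum gap + time headway + braking-related term).
    dv = v - v_lead                                              # closing speed
    s_star = s0 + v * T + v * dv / (2 * math.sqrt(a_max * b))
    return a_max * (1 - (v / v0) ** delta - (s_star / gap) ** 2)

a = idm_acceleration(v=25.0, v_lead=20.0, gap=80.0)              # ~ -0.57 m/s^2: mild braking
```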

Guidance on the Safety Assurance of Autonomous Systems in Complex Environments (SACE)

  • Authors: Richard Hawkins, Matt Osborne, Mike Parsons, Mark Nicholson, John McDermid, Ibrahim Habli
  • Subjects: Software Engineering (cs.SE); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2208.00853
  • Pdf link: https://arxiv.org/pdf/2208.00853
  • Abstract Autonomous systems (AS) are systems that have the capability to take decisions free from direct human control. AS are increasingly being considered for adoption for applications where their behaviour may cause harm, such as when used for autonomous driving, medical applications or in domestic environments. For such applications, being able to ensure and demonstrate (assure) the safety of the operation of the AS is crucial for their adoption. This can be particularly challenging where AS operate in complex and changing real-world environments. Establishing justified confidence in the safety of AS requires the creation of a compelling safety case. This document introduces a methodology for the Safety Assurance of Autonomous Systems in Complex Environments (SACE). SACE comprises a set of safety case patterns and a process for (1) systematically integrating safety assurance into the development of the AS and (2) for generating the evidence base for explicitly justifying the acceptable safety of the AS.
