New submissions for Thu, 30 Jun 22

Keyword: SLAM

There is no result

Keyword: odometry

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: nerf

There is no result

Keyword: mapping

SRCN3D: Sparse R-CNN 3D Surround-View Camera Object Detection and Tracking for Autonomous Driving

  • Authors: Yining Shi, Jingyan Shen, Yifan Sun, Yunlong Wang, Jiaxin Li, Shiqi Sun, Kun Jiang, Diange Yang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2206.14451
  • Pdf link: https://arxiv.org/pdf/2206.14451
  • Abstract Detection And Tracking of Moving Objects (DATMO) is an essential component in environmental perception for autonomous driving. While 3D detectors using surround-view cameras are just flourishing, there is a growing tendency of using different transformer-based methods to learn queries in 3D space from 2D feature maps of perspective view. This paper proposes Sparse R-CNN 3D (SRCN3D), a novel two-stage fully-convolutional mapping pipeline for surround-view camera detection and tracking. SRCN3D adopts a cascade structure with twin-track update of both fixed number of proposal boxes and proposal latent features. Proposal boxes are projected to perspective view so as to aggregate Region of Interest (RoI) local features. Based on that, proposal features are refined via a dynamic instance interactive head, which then generates classification and the offsets applied to original bounding boxes. Compared to prior arts, our sparse feature sampling module only utilizes local 2D features for adjustment of each corresponding 3D proposal box, leading to a complete sparse paradigm. The proposal features and appearance features are both taken in data association process in a multi-hypotheses 3D multi-object tracking approach. Extensive experiments on nuScenes dataset demonstrate the effectiveness of our proposed SRCN3D detector and tracker. Code is available at https://github.com/synsin0/SRCN3D.

Enhancing Security of Memristor Computing System Through Secure Weight Mapping

  • Authors: Minhui Zou, Junlong Zhou, Xiaotong Cui, Wei Wang, Shahar Kvatinsky
  • Subjects: Emerging Technologies (cs.ET)
  • Arxiv link: https://arxiv.org/abs/2206.14498
  • Pdf link: https://arxiv.org/pdf/2206.14498
  • Abstract Emerging memristor computing systems have demonstrated great promise in improving the energy efficiency of neural network (NN) algorithms. The NN weights stored in memristor crossbars, however, may face potential theft attacks due to the nonvolatility of the memristor devices. In this paper, we propose to protect the NN weights by mapping selected columns of them in the form of 1's complements and leaving the other columns in their original form, preventing the adversary from knowing the exact representation of each weight. The results show that compared with prior work, our method achieves effectiveness comparable to the best of them and reduces the hardware overhead by more than 18X.
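
For illustration only, a minimal sketch (assuming integer-quantized weights and a secret set of protected columns; all function names are hypothetical) of how storing selected columns as 1's complements hides the true weight values while remaining exactly invertible:

```python
import numpy as np

def protect_weights(weights, protected_cols, n_bits=8):
    """Hypothetical sketch: store selected columns as 1's complements.

    `weights` is assumed to be an integer-quantized matrix in [0, 2**n_bits - 1];
    an adversary reading the crossbar cannot tell which columns are complemented.
    """
    mask = (1 << n_bits) - 1
    protected = weights.copy()
    protected[:, protected_cols] = mask - protected[:, protected_cols]  # 1's complement
    return protected

def recover_weights(protected, protected_cols, n_bits=8):
    """Undo the complement for legitimate use (the column set acts as the secret key)."""
    return protect_weights(protected, protected_cols, n_bits)

# Example: quantized 4x4 weight matrix, columns 1 and 3 stored complemented.
w = np.random.randint(0, 256, size=(4, 4))
stored = protect_weights(w, [1, 3])
assert np.array_equal(recover_weights(stored, [1, 3]), w)
```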

Continuous Switch Model and Heuristics for Mixed-Integer Problems in Power Systems

  • Authors: Aayushya Agarwal, Amritanshu Pandey, Marko Jereminov, Larry Pillegi
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2206.14510
  • Pdf link: https://arxiv.org/pdf/2206.14510
  • Abstract Power systems operation and planning analyses often determine dispatch and planning decisions by solving a mixed-integer nonlinear problem (MINLP) with binary variables that model the connection of devices to the grid. Binary variables, in addition to nonlinear AC network constraints, make this NP-hard problem computationally challenging for gradient-based methods, which struggle to converge to a local minimum due to the discontinuous solution space. In this work, we form an equivalent-circuit model of the MINLP decision problem by representing binary decisions as a relaxed switch model. The relaxed switch model is characterized by a continuous function that preserves the physical behavior of a real-world switch. This effectively transforms the MINLP problem into an NLP problem by mapping the integrality constraints into a relaxed indicator function. To solve the transformed NLP problem, we develop homotopy and damping methods that rely on knowledge of the physical behavior of the switch. With these methods, we empirically demonstrate robust convergence for large realistic systems (>70,000 buses) while satisfying the non-relaxed AC behavior of the system. The methodology is generalizable to various operation and planning problems, including unit commitment, transmission line switching, and shunt placement. Examples demonstrate superior convergence against state-of-the-art commercial tools and other binary-relaxation methods.
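
As a rough illustration of the relaxation idea (a generic sketch, not the authors' specific equivalent-circuit model), a binary switch can be replaced by a smooth sigmoid-like indicator whose sharpness is increased along a homotopy path:

```python
import numpy as np

def relaxed_switch(u, sharpness):
    """Smooth surrogate for a binary on/off decision.

    A sigmoid-like continuous function of a latent variable `u`; as `sharpness`
    grows along the homotopy path, the output is pushed toward {0, 1}.
    Generic relaxation sketch, not the paper's specific switch model.
    """
    return 1.0 / (1.0 + np.exp(-sharpness * u))

# Homotopy loop sketch: solve a sequence of relaxed NLPs with increasing
# sharpness, warm-starting each solve from the previous solution.
u = np.zeros(5)                        # latent switch variables (toy problem size)
for sharpness in [1.0, 5.0, 25.0, 125.0]:
    s = relaxed_switch(u, sharpness)   # near-binary switch states fed to the solver
    # ... solve the relaxed NLP given s, then update u from the solution (elided) ...
print(relaxed_switch(u, 125.0))        # switch states after the (elided) solves
```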

Challenges of mapping Vulnerabilities and Exposures to Open-Source Packages

  • Authors: Tobias Dam, Sebastian Neumaier
  • Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2206.14527
  • Pdf link: https://arxiv.org/pdf/2206.14527
  • Abstract Much of the current software depends on open-source components, which in turn have complex dependencies on other open-source libraries. Vulnerabilities in open source therefore have potentially huge impacts. The goal of this work is to get a quantitative overview of the frequency and evolution of existing vulnerabilities in popular software repositories and package managers. To this end, we provide an up-to-date overview of the open-source landscape and its most popular package managers. We discuss approaches to map entries of the Common Vulnerabilities and Exposures (CVE) list to open-source libraries. Based on these mapping approaches, we show the frequency and distribution of CVE entries with respect to popular programming languages.
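
To make the mapping problem concrete, a naive name-based matcher is sketched below; the field names and the CVE identifier are hypothetical, and real mappings must handle vendor/product aliases and version ranges:

```python
import re

def candidate_packages(cve_entry, package_index):
    """Naive name-based matching of a CVE record to package-manager entries.

    `cve_entry` is assumed to carry a free-text description and a list of CPE
    product names; this is only an illustrative baseline.
    """
    tokens = set(re.findall(r"[a-z0-9_\-]+", cve_entry["description"].lower()))
    tokens.update(p.lower() for p in cve_entry.get("cpe_products", []))
    return [pkg for pkg in package_index if pkg.lower() in tokens]

cve = {
    "id": "CVE-XXXX-YYYY",  # placeholder identifier
    "description": "A flaw in the lodash merge function allows prototype pollution.",
    "cpe_products": ["lodash"],
}
print(candidate_packages(cve, ["lodash", "express", "left-pad"]))  # -> ['lodash']
```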

Technical Report for CVPR 2022 LOVEU AQTC Challenge

  • Authors: Hyeonyu Kim, Jongeun Kim, Jeonghun Kang, Sanguk Park, Dongchan Park, Taehwan Kim
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2206.14555
  • Pdf link: https://arxiv.org/pdf/2206.14555
  • Abstract This technical report presents the 2nd winning model for AQTC, a task newly introduced in the CVPR 2022 LOng-form VidEo Understanding (LOVEU) challenges. This challenge involves multi-step answers, multi-modal inputs, and diverse and changing button representations in video. We address this problem by proposing a new context ground module attention mechanism for more effective feature mapping. In addition, we analyze the effect of the number of buttons and perform an ablation study of different step networks and video features. As a result, we achieved overall 2nd place in LOVEU competition track 3, and specifically 1st place in two out of four evaluation metrics. Our code is available at https://github.com/jaykim9870/CVPR-22_LOVEU_unipyler.

The differential spectrum and boomerang spectrum of a class of locally-APN functions

  • Authors: Zhao Hu, Nian Li, Linjie Xu, Xiangyong Zeng, Xiaohu Tang
  • Subjects: Information Theory (cs.IT)
  • Arxiv link: https://arxiv.org/abs/2206.14613
  • Pdf link: https://arxiv.org/pdf/2206.14613
  • Abstract In this paper, we study the boomerang spectrum of the power mapping $F(x)=x^{k(q-1)}$ over ${\mathbb F}_{q^2}$, where $q=p^m$, $p$ is a prime, $m$ is a positive integer and $\gcd(k,q+1)=1$. We first determine the differential spectrum of $F(x)$ and show that $F(x)$ is locally-APN. This extends a result of [IEEE Trans. Inf. Theory 57(12):8127-8137, 2011] from $(p,k)=(2,1)$ to general $(p,k)$. We then determine the boomerang spectrum of $F(x)$ by making use of its differential spectrum, which shows that the boomerang uniformity of $F(x)$ is 4 if $p=2$ and $m$ is odd and otherwise it is 2. Our results not only generalize the results in [Des. Codes Cryptogr. 89:2627-2636, 2021] and [arXiv:2203.00485, 2022] but also extend the example $x^{45}$ over ${\mathbb F}_{2^8}$ in [Des. Codes Cryptogr. 89:2627-2636, 2021] into an infinite class of power mappings with boomerang uniformity 2.
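
For readers unfamiliar with the terminology, the standard definitions behind the differential spectrum of a power mapping are (background only, not the paper's specific results):

```latex
% For F : \mathbb{F}_q \to \mathbb{F}_q and a, b \in \mathbb{F}_q with a \neq 0:
\[
  \delta_F(a,b) \;=\; \#\{\, x \in \mathbb{F}_q : F(x+a) - F(x) = b \,\},
  \qquad
  \Delta_F \;=\; \max_{a \neq 0,\; b} \delta_F(a,b).
\]
% For a power mapping F(x) = x^d it suffices to consider a = 1, and the
% differential spectrum collects the value counts
\[
  \omega_i \;=\; \#\{\, b \in \mathbb{F}_q : \delta_F(1,b) = i \,\}, \qquad 0 \le i \le \Delta_F .
\]
```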

Multi-scale Physical Representations for Approximating PDE Solutions with Graph Neural Operators

  • Authors: Léon Migus, Yuan Yin, Jocelyn Ahmed Mazari, Patrick Gallinari
  • Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2206.14687
  • Pdf link: https://arxiv.org/pdf/2206.14687
  • Abstract Representing physical signals at different scales is among the most challenging problems in engineering. Several multi-scale modeling tools have been developed to describe physical systems governed by \emph{Partial Differential Equations} (PDEs). These tools are at the crossroads of principled physical models and numerical schemes. Recently, data-driven models have been introduced to speed up the approximation of PDE solutions compared to numerical solvers. Among these recent data-driven methods, neural integral operators are a class that learns a mapping between function spaces. These functions are discretized on graphs (meshes), which are appropriate for modeling interactions in physical phenomena. In this work, we study three multi-resolution schemes with integral kernel operators that can be approximated with \emph{Message Passing Graph Neural Networks} (MPGNNs). To validate our study, we conduct extensive MPGNN experiments with well-chosen metrics, considering steady and unsteady PDEs.
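
As a minimal sketch of the kernel-integral-operator view (an illustrative simplification, not the paper's multi-resolution schemes), one message-passing layer on a mesh graph can be written as:

```python
import numpy as np

def gno_layer(x, edges, edge_feat, kernel_mlp):
    """One message-passing step approximating a kernel integral operator,
    (Ku)(i) ~ mean over neighbors j of kappa(e_ij) u(j).

    x:          (n, d) node features (discretized function values)
    edges:      iterable of (src, dst) pairs on the mesh graph
    edge_feat:  (num_edges, e) edge features, e.g. relative coordinates
    kernel_mlp: callable mapping an edge feature vector to a (d, d) matrix
    """
    n, d = x.shape
    agg = np.zeros((n, d))
    deg = np.zeros(n)
    for (src, dst), ef in zip(edges, edge_feat):
        agg[dst] += kernel_mlp(ef) @ x[src]   # kernel-weighted message
        deg[dst] += 1.0
    deg[deg == 0] = 1.0
    return np.maximum(agg / deg[:, None], 0.0)  # mean aggregation + ReLU

# Tiny usage example with a random linear "kernel network" (hypothetical sizes).
d, e = 4, 2
W = np.random.randn(e, d * d) * 0.1
kernel_mlp = lambda ef: (ef @ W).reshape(d, d)
x = np.random.randn(3, d)
edges = [(0, 1), (1, 2), (2, 0)]
edge_feat = np.random.randn(len(edges), e)
print(gno_layer(x, edges, edge_feat, kernel_mlp).shape)  # (3, 4)
```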

Keyword: localization

Stronger Together: Air-Ground Robotic Collaboration Using Semantics

  • Authors: Ian D. Miller, Fernando Cladera, Trey Smith, Camillo Jose Taylor, Vijay Kumar
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2206.14289
  • Pdf link: https://arxiv.org/pdf/2206.14289
  • Abstract In this work, we present an end-to-end heterogeneous multi-robot system framework where ground robots are able to localize, plan, and navigate in a semantic map created in real time by a high-altitude quadrotor. The ground robots choose and deconflict their targets independently, without any external intervention. Moreover, they perform cross-view localization by matching their local maps with the overhead map using semantics. The communication backbone is opportunistic and distributed, allowing the entire system to operate with no external infrastructure aside from GPS for the quadrotor. We extensively tested our system by performing different missions on top of our framework over multiple experiments in different environments. Our ground robots travelled over 6 km autonomously with minimal intervention in the real world and over 96 km in simulation without interventions.

Spectral-Loc: Indoor Localization using Light Spectral Information

  • Authors: Yanxiang Wang, Jiawei Hu, Hong Jia, Wen Hu, Mahbub Hassan, Ashraf Uddin, Brano Kusy, Moustafa Youssef
  • Subjects: Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2206.14401
  • Pdf link: https://arxiv.org/pdf/2206.14401
  • Abstract For indoor settings, we investigate the impact of location on the spectral distribution of the received light, i.e., the intensity of light for different wavelengths. Our investigations confirm that even under the same light source, different locations exhibit slightly different spectral distribution due to reflections from their localised environment containing different materials or colours. By exploiting this observation, we propose Spectral-Loc, a novel indoor localization system that uses light spectral information to identify the location of the device. With spectral sensors finding their way in latest products and applications, such as white balancing in smartphone photography, Spectral-Loc can be readily deployed without requiring any additional hardware or infrastructure. We prototype Spectral-Loc using a commercial-off-the-shelf light spectral sensor, AS7265x, which can measure light intensity over 18 different wavelength sub-bands. We benchmark the localisation accuracy of Spectral-Loc against the conventional light intensity sensors that provide only a single intensity value. Our evaluations over two different indoor spaces, a meeting room and a large office space, demonstrate that use of light spectral information significantly reduces the localization error for the different percentiles.
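
A toy nearest-neighbor fingerprinting baseline over 18-channel spectral readings, shown only to make the idea concrete (the channel count follows the AS7265x mentioned in the abstract; everything else is an assumption, not the paper's pipeline):

```python
import numpy as np

def locate(reading, fingerprints):
    """Nearest-neighbor fingerprinting over multi-channel spectral readings.

    reading:      (18,) vector of per-sub-band light intensities
    fingerprints: dict mapping location label -> (18,) calibration vector
    Normalizing each vector makes matching robust to overall brightness changes.
    """
    r = reading / (np.linalg.norm(reading) + 1e-9)
    best, best_dist = None, np.inf
    for label, fp in fingerprints.items():
        f = fp / (np.linalg.norm(fp) + 1e-9)
        dist = np.linalg.norm(r - f)
        if dist < best_dist:
            best, best_dist = label, dist
    return best

fingerprints = {"desk": np.random.rand(18), "window": np.random.rand(18)}
print(locate(fingerprints["desk"] * 1.7, fingerprints))  # -> "desk" (scaled copy)
```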

Keyword: transformer

Masked World Models for Visual Control

  • Authors: Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, Pieter Abbeel
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2206.14244
  • Pdf link: https://arxiv.org/pdf/2206.14244
  • Abstract Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient robot learning from visual observations. Yet the current approaches typically train a single model end-to-end for learning both visual representations and dynamics, making it difficult to accurately model the interaction between robots and small objects. In this work, we introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. Specifically, we train an autoencoder with convolutional layers and vision transformers (ViT) to reconstruct pixels given masked convolutional features, and learn a latent dynamics model that operates on the representations from the autoencoder. Moreover, to encode task-relevant information, we introduce an auxiliary reward prediction objective for the autoencoder. We continually update both autoencoder and dynamics model using online samples collected from environment interaction. We demonstrate that our decoupling approach achieves state-of-the-art performance on a variety of visual robotic tasks from Meta-world and RLBench, e.g., we achieve 81.7% success rate on 50 visual robotic manipulation tasks from Meta-world, while the baseline achieves 67.9%. Code is available on the project website: https://sites.google.com/view/mwm-rl.

ZoDIAC: Zoneout Dropout Injection Attention Calculation

  • Authors: Zanyar Zohourianshahzadi, Jugal Kalita
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2206.14263
  • Pdf link: https://arxiv.org/pdf/2206.14263
  • Abstract Recently, the use of self-attention has yielded state-of-the-art results in vision-language tasks such as image captioning, in natural language understanding and generation (NLU and NLG) tasks, and in computer vision tasks such as image classification. This is because self-attention maps the internal interactions among the elements of the input source and target sequences. Although self-attention successfully calculates attention values and maps the relationships among the elements of the input source and target sequences, there is no mechanism to control the intensity of attention. In the real world, when communicating with each other face to face or vocally, we tend to express different visual and linguistic context with various amounts of intensity. Some words might carry (be spoken with) more stress and weight, indicating the importance of that word in the context of the whole sentence. Based on this intuition, we propose Zoneout Dropout Injection Attention Calculation (ZoDIAC), in which the intensities of attention values for the elements of the input sequence are calculated with respect to the context of the elements of the input sequence. The results of our experiments reveal that employing ZoDIAC leads to better performance in comparison with the self-attention module in the Transformer model. The ultimate goal is to find out whether we could modify the self-attention module in the Transformer model with a method that is potentially extensible to other models that leverage self-attention at their core. Our findings suggest that this particular goal deserves further attention and investigation by the research community. The code for ZoDIAC is available at www.github.com/zanyarz/zodiac.

Bottleneck Low-rank Transformers for Low-resource Spoken Language Understanding

  • Authors: Pu Wang, Hugo Van hamme
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2206.14318
  • Pdf link: https://arxiv.org/pdf/2206.14318
  • Abstract End-to-end spoken language understanding (SLU) systems benefit from pretraining on large corpora, followed by fine-tuning on application-specific data. The resulting models are too large for on-edge applications. For instance, BERT-based systems contain over 110M parameters. Observing that the model is overparameterized, we propose a lean transformer structure where the dimension of the attention mechanism is automatically reduced using group sparsity. We propose a variant where the learned attention subspace is transferred to an attention bottleneck layer. In a low-resource setting and without pre-training, the resulting compact SLU model achieves accuracies competitive with pre-trained large models.

Deformable Graph Transformer

  • Authors: Jinyoung Park, Seongjun Yun, Hyeonjin Park, Jaewoo Kang, Jisu Jeong, Kyung-Min Kim, Jung-woo Ha, Hyunwoo J. Kim
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2206.14337
  • Pdf link: https://arxiv.org/pdf/2206.14337
  • Abstract Transformer-based models have been widely used and have achieved state-of-the-art performance in various domains such as natural language processing and computer vision. Recent works show that Transformers can also be generalized to graph-structured data. However, their success has been limited to small-scale graphs due to technical challenges such as the quadratic complexity with respect to the number of nodes and non-local aggregation that often leads to inferior generalization performance compared to conventional graph neural networks. In this paper, to address these issues, we propose Deformable Graph Transformer (DGT) that performs sparse attention with dynamically sampled key and value pairs. Specifically, our framework first constructs multiple node sequences with various criteria to consider both structural and semantic proximity. Then, sparse attention is applied to the node sequences for learning node representations with a reduced computational cost. We also design simple and effective positional encodings to capture structural similarity and distance between nodes. Experiments demonstrate that our novel graph Transformer consistently outperforms existing Transformer-based models and shows competitive performance compared to state-of-the-art models on 8 graph benchmark datasets including large-scale graphs.
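
A minimal sketch of attention restricted to a sampled subset of key/value nodes, which is the generic idea behind such sparse graph attention (the sampling criterion below is a random stand-in, not DGT's learned one):

```python
import numpy as np

def sparse_node_attention(q, x, key_idx, Wq, Wk, Wv):
    """Attention for one query node over a small sampled set of key/value nodes.

    q:       (d,) feature of the query node
    x:       (n, d) all node features
    key_idx: indices of the sampled nodes this query attends to
    """
    keys = x[key_idx] @ Wk                       # (k, d_h)
    vals = x[key_idx] @ Wv                       # (k, d_h)
    query = q @ Wq                               # (d_h,)
    scores = keys @ query / np.sqrt(query.size)  # (k,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vals                        # (d_h,) attended output

d, d_h, n = 8, 4, 100
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d_h)) for _ in range(3))
x = rng.normal(size=(n, d))
sampled = rng.choice(n, size=10, replace=False)  # stand-in for the learned sampling
print(sparse_node_attention(x[0], x, sampled, Wq, Wk, Wv).shape)  # (4,)
```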

What Can Secondary Predictions Tell Us? An Exploration on Question-Answering with SQuAD-v2.0

  • Authors: Michael Kamfonas, Gabriel Alon
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2206.14348
  • Pdf link: https://arxiv.org/pdf/2206.14348
  • Abstract Performance in natural language processing, and specifically for the question-answer task, is typically measured by comparing a model's most confident (primary) prediction to golden answers (the ground truth). We are making the case that it is also useful to quantify how close a model came to predicting a correct answer even for examples that failed. We define the Golden Rank (GR) of an example as the rank of its most confident prediction that exactly matches a ground truth, and show why such a match always exists. For the 16 transformer models we analyzed, the majority of exactly matched golden answers in secondary prediction space hover very close to the top rank. We refer to secondary predictions as those ranking above 0 in descending confidence probability order. We demonstrate how the GR can be used to classify questions and visualize their spectrum of difficulty, from persistent near successes to persistent extreme failures. We derive a new aggregate statistic over entire test sets, named the Golden Rank Interpolated Median (GRIM) that quantifies the proximity of failed predictions to the top choice made by the model. To develop some intuition and explore the applicability of these metrics we use the Stanford Question Answering Dataset (SQuAD-2) and a few popular transformer models from the Hugging Face hub. We first demonstrate that the GRIM is not directly correlated with the F1 and exact match (EM) scores. We then calculate and visualize these scores for various transformer architectures, probe their applicability in error analysis by clustering failed predictions, and compare how they relate to other training diagnostics such as the EM and F1 scores. We finally suggest various research goals, such as broadening data collection for these metrics and their possible use in adversarial training.
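
A small sketch of the Golden Rank and a median-based stand-in for the GRIM, following only the abstract's description (the paper's interpolated median may differ):

```python
import numpy as np

def golden_rank(ranked_predictions, gold_answers):
    """Rank (0-based) of the highest-confidence prediction that exactly matches
    a ground-truth answer, per the abstract's description of the GR."""
    for rank, pred in enumerate(ranked_predictions):
        if pred in gold_answers:
            return rank
    return len(ranked_predictions)  # no match among the returned candidates

def grim(examples):
    """Median golden rank over *failed* examples (top prediction missed);
    a plain median stands in for the paper's interpolated variant."""
    ranks = [golden_rank(p, g) for p, g in examples if golden_rank(p, g) > 0]
    return float(np.median(ranks)) if ranks else 0.0

# Example: each item is (predictions sorted by confidence, set of gold answers).
data = [
    (["Paris", "in Paris", "France"], {"Paris"}),   # primary prediction correct
    (["in 1990", "1990", "the 1990s"], {"1990"}),   # correct answer at rank 1
    (["no answer", "unknown", "42"], {"42"}),       # correct answer at rank 2
]
print(grim(data))  # -> 1.5 (median of the failed examples' golden ranks 1 and 2)
```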

Knowledge Distillation of Transformer-based Language Models Revisited

  • Authors: Chengqiang Lu, Jianwei Zhang, Yunfei Chu, Zhengyu Chen, Jingren Zhou, Fei Wu, Haiqing Chen, Hongxia Yang
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2206.14366
  • Pdf link: https://arxiv.org/pdf/2206.14366
  • Abstract In the past few years, transformer-based pre-trained language models have achieved astounding success in both industry and academia. However, the large model size and high run-time latency are serious impediments to applying them in practice, especially on mobile phones and Internet of Things (IoT) devices. To compress such models, considerable literature has grown up around the theme of knowledge distillation (KD) recently. Nevertheless, how KD works in transformer-based models is still unclear. We tease apart the components of KD and propose a unified KD framework. Through the framework, systematic and extensive experiments that spent over 23,000 GPU hours render a comprehensive analysis from the perspectives of knowledge types, matching strategies, width-depth trade-off, initialization, model size, etc. Our empirical results shed light on distillation in pre-trained language models and show a relatively significant improvement over the previous state of the art (SOTA). Finally, we provide a best-practice guideline for KD in transformer-based models.
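
For context, the classic logit-matching KD objective that such a framework would include as a baseline knowledge type (a generic sketch, not the paper's unified framework):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style knowledge distillation: soft-target cross-entropy at
    temperature T, mixed with the usual hard-label cross-entropy."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kd = -(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean() * (T ** 2)
    onehot = np.eye(student_logits.shape[-1])[labels]
    ce = -(onehot * np.log(softmax(student_logits) + 1e-12)).sum(axis=-1).mean()
    return alpha * kd + (1 - alpha) * ce

logits_t = np.random.randn(8, 10)   # teacher outputs for a batch of 8, 10 classes
logits_s = np.random.randn(8, 10)   # student outputs
labels = np.random.randint(0, 10, size=8)
print(distillation_loss(logits_s, logits_t, labels))
```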

Framing Algorithmic Recourse for Anomaly Detection

  • Authors: Debanjan Datta, Feng Chen, Naren Ramakrishnan
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
  • Arxiv link: https://arxiv.org/abs/2206.14384
  • Pdf link: https://arxiv.org/pdf/2206.14384
  • Abstract The problem of algorithmic recourse has been explored for supervised machine learning models, to provide more interpretable, transparent and robust outcomes from decision support systems. An unexplored area is that of algorithmic recourse for anomaly detection, specifically for tabular data with only discrete feature values. Here the problem is to present a set of counterfactuals that are deemed normal by the underlying anomaly detection model so that applications can utilize this information for explanation purposes or to recommend countermeasures. We present an approach -- Context preserving Algorithmic Recourse for Anomalies in Tabular data (CARAT), that is effective, scalable, and agnostic to the underlying anomaly detection model. CARAT uses a transformer based encoder-decoder model to explain an anomaly by finding features with low likelihood. Subsequently semantically coherent counterfactuals are generated by modifying the highlighted features, using the overall context of features in the anomalous instance(s). Extensive experiments help demonstrate the efficacy of CARAT.

C2FTrans: Coarse-to-Fine Transformers for Medical Image Segmentation

  • Authors: Xian Lin, Zengqiang Yan, Li Yu, Kwang-Ting Cheng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2206.14409
  • Pdf link: https://arxiv.org/pdf/2206.14409
  • Abstract Convolutional neural networks (CNN), the most prevailing architecture for deep-learning based medical image analysis, are still functionally limited by their intrinsic inductive biases and inadequate receptive fields. Transformer, born to address this issue, has drawn explosive attention in natural language processing and computer vision due to its remarkable ability in capturing long-range dependency. However, most recent transformer-based methods for medical image segmentation directly apply vanilla transformers as an auxiliary module in CNN-based methods, resulting in severe detail loss due to the rigid patch partitioning scheme in transformers. To address this problem, we propose C2FTrans, a novel multi-scale architecture that formulates medical image segmentation as a coarse-to-fine procedure. C2FTrans mainly consists of a cross-scale global transformer (CGT) which addresses local contextual similarity in CNN and a boundary-aware local transformer (BLT) which overcomes boundary uncertainty brought by rigid patch partitioning in transformers. Specifically, CGT builds global dependency across three different small-scale feature maps to obtain rich global semantic features with an acceptable computational cost, while BLT captures mid-range dependency by adaptively generating windows around boundaries under the guidance of entropy to reduce computational complexity and minimize detail loss based on large-scale feature maps. Extensive experimental results on three public datasets demonstrate the superior performance of C2FTrans against state-of-the-art CNN-based and transformer-based methods with fewer parameters and lower FLOPs. We believe the design of C2FTrans would further inspire future work on developing efficient and lightweight transformers for medical image segmentation. The source code of this paper is publicly available at https://github.com/xianlin7/C2FTrans.

The Lighter The Better: Rethinking Transformers in Medical Image Segmentation Through Adaptive Pruning

  • Authors: Xian Lin, Li Yu, Kwang-Ting Cheng, Zengqiang Yan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2206.14413
  • Pdf link: https://arxiv.org/pdf/2206.14413
  • Abstract Vision transformers have recently set off a new wave in the field of medical image analysis due to their remarkable performance on various computer vision tasks. However, recent hybrid-/transformer-based approaches mainly focus on the benefits of transformers in capturing long-range dependency while ignoring the issues of their daunting computational complexity, high training costs, and redundant dependency. In this paper, we propose to employ adaptive pruning to transformers for medical image segmentation and propose a lightweight and effective hybrid network APFormer. To our best knowledge, this is the first work on transformer pruning for medical image analysis tasks. The key features of APFormer mainly are self-supervised self-attention (SSA) to improve the convergence of dependency establishment, Gaussian-prior relative position embedding (GRPE) to foster the learning of position information, and adaptive pruning to eliminate redundant computations and perception information. Specifically, SSA and GRPE consider the well-converged dependency distribution and the Gaussian heatmap distribution separately as the prior knowledge of self-attention and position embedding to ease the training of transformers and lay a solid foundation for the following pruning operation. Then, adaptive transformer pruning, both query-wise and dependency-wise, is performed by adjusting the gate control parameters for both complexity reduction and performance improvement. Extensive experiments on two widely-used datasets demonstrate the prominent segmentation performance of APFormer against the state-of-the-art methods with much fewer parameters and lower GFLOPs. More importantly, we prove, through ablation studies, that adaptive pruning can work as a plug-n-play module for performance improvement on other hybrid-/transformer-based methods. Code is available at https://github.com/xianlin7/APFormer.

Conditioned Human Trajectory Prediction using Iterative Attention Blocks

  • Authors: Aleksey Postnikov, Aleksander Gamayunov, Gonzalo Ferrer
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2206.14442
  • Pdf link: https://arxiv.org/pdf/2206.14442
  • Abstract Human motion prediction is key to understanding social environments, with direct applications in robotics, surveillance, etc. We present a simple yet effective pedestrian trajectory prediction model aimed at predicting pedestrian positions in urban-like environments, conditioned on the environment: the map and surrounding agents. Our model is a neural-based architecture that runs several layers of attention blocks and transformers in an iterative sequential fashion, allowing it to capture the important features in the environment that improve prediction. We show that, without explicit introduction of social masks, dynamical models, social pooling layers, or complicated graph-like structures, it is possible to produce results on par with SoTA models, which makes our approach easily extendable and configurable, depending on the data available. We report results performing similarly to SoTA models on publicly available and extensively used datasets with the unimodal prediction metrics ADE and FDE.

SRCN3D: Sparse R-CNN 3D Surround-View Camera Object Detection and Tracking for Autonomous Driving

  • Authors: Yining Shi, Jingyan Shen, Yifan Sun, Yunlong Wang, Jiaxin Li, Shiqi Sun, Kun Jiang, Diange Yang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2206.14451
  • Pdf link: https://arxiv.org/pdf/2206.14451
  • Abstract Detection And Tracking of Moving Objects (DATMO) is an essential component in environmental perception for autonomous driving. While 3D detectors using surround-view cameras are just flourishing, there is a growing tendency to use different transformer-based methods to learn queries in 3D space from 2D feature maps of the perspective view. This paper proposes Sparse R-CNN 3D (SRCN3D), a novel two-stage fully-convolutional mapping pipeline for surround-view camera detection and tracking. SRCN3D adopts a cascade structure with a twin-track update of both a fixed number of proposal boxes and proposal latent features. Proposal boxes are projected to the perspective view so as to aggregate Region of Interest (RoI) local features. Based on that, proposal features are refined via a dynamic instance interactive head, which then generates classification scores and the offsets applied to the original bounding boxes. Compared to prior art, our sparse feature sampling module only utilizes local 2D features to adjust each corresponding 3D proposal box, leading to a completely sparse paradigm. The proposal features and appearance features are both used in the data association process of a multi-hypothesis 3D multi-object tracking approach. Extensive experiments on the nuScenes dataset demonstrate the effectiveness of our proposed SRCN3D detector and tracker. Code is available at https://github.com/synsin0/SRCN3D.

SALO: An Efficient Spatial Accelerator Enabling Hybrid Sparse Attention Mechanisms for Long Sequences

  • Authors: Guan Shen, Jieru Zhao, Quan Chen, Jingwen Leng, Chao Li, Minyi Guo
  • Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2206.14550
  • Pdf link: https://arxiv.org/pdf/2206.14550
  • Abstract The attention mechanisms of transformers effectively extract pertinent information from the input sequence. However, the quadratic complexity of self-attention w.r.t the sequence length incurs heavy computational and memory burdens, especially for tasks with long sequences. Existing accelerators face performance degradation in these tasks. To this end, we propose SALO to enable hybrid sparse attention mechanisms for long sequences. SALO contains a data scheduler to map hybrid sparse attention patterns onto hardware and a spatial accelerator to perform the efficient attention computation. We show that SALO achieves 17.66x and 89.33x speedup on average compared to GPU and CPU implementations, respectively, on typical workloads, i.e., Longformer and ViL.

Improving Deliberation by Text-Only and Semi-Supervised Training

  • Authors: Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang
  • Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2206.14716
  • Pdf link: https://arxiv.org/pdf/2206.14716
  • Abstract Text-only and semi-supervised training based on audio-only data has gained popularity recently due to the wide availability of unlabeled text and speech data. In this work, we propose incorporating text-only and semi-supervised training into an attention-based deliberation model. By incorporating text-only data in training a bidirectional encoder representation from transformer (BERT) for the deliberation text encoder, and large-scale text-to-speech and audio-only utterances using joint acoustic and text decoder (JATD) and semi-supervised training, we achieved 4%-12% WER reduction for various tasks compared to the baseline deliberation. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces the Google Voice Search WER by 11% relative. We show that the deliberation model also achieves a positive human side-by-side evaluation compared to the state-of-the-art LM rescorer with reasonable endpointer latencies.

LViT: Language meets Vision Transformer in Medical Image Segmentation

  • Authors: Zihan Li, Yunxiang Li, Qingde Li, You Zhang, Puyang Wang, Dazhou Guo, Le Lu, Dakai Jin, Qingqi Hong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2206.14718
  • Pdf link: https://arxiv.org/pdf/2206.14718
  • Abstract Deep learning has been widely used in medical image segmentation and other aspects. However, the performance of existing medical image segmentation models has been limited by the challenge of obtaining a sufficient amount of high-quality data, given the high cost of data annotation. To overcome this limitation, we propose a new vision-language medical image segmentation model LViT (Language meets Vision Transformer). In our model, medical text annotation is introduced to compensate for the quality deficiency in image data. In addition, the text information can guide the generation of pseudo labels to a certain extent and further guarantee the quality of pseudo labels in semi-supervised learning. We also propose the Exponential Pseudo label Iteration mechanism (EPI) to help extend the semi-supervised version of LViT and the Pixel-Level Attention Module (PLAM) to preserve local features of images. In our model, the LV (Language-Vision) loss is designed to supervise the training of unlabeled images using text information directly. To validate the performance of LViT, we construct multimodal medical segmentation datasets (image + text) containing pathological images, X-rays, etc. Experimental results show that our proposed LViT has better segmentation performance in both fully and semi-supervised conditions. Code and datasets are available at https://github.com/HUANGLIZI/LViT.

TweetNLP: Cutting-Edge Natural Language Processing for Social Media

  • Authors: Jose Camacho-Collados, Kiamehr Rezaee, Talayeh Riahi, Asahi Ushio, Daniel Loureiro, Dimosthenis Antypas, Joanne Boisson, Luis Espinosa-Anke, Fangyu Liu, Eugenio Martínez-Cámara, Gonzalo Medina, Thomas Buhrmann, Leonardo Neves, Francesco Barbieri
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2206.14774
  • Pdf link: https://arxiv.org/pdf/2206.14774
  • Abstract In this paper we present TweetNLP, an integrated platform for Natural Language Processing (NLP) in social media. TweetNLP supports a diverse set of NLP tasks, including generic focus areas such as sentiment analysis and named entity recognition, as well as social media-specific tasks such as emoji prediction and offensive language identification. Task-specific systems are powered by reasonably-sized Transformer-based language models specialized on social media text (in particular, Twitter) which can be run without the need for dedicated hardware or cloud services. The main contributions of TweetNLP are: (1) an integrated Python library for a modern toolkit supporting social media analysis using our various task-specific models adapted to the social domain; (2) an interactive online demo for codeless experimentation using our models; and (3) a tutorial covering a wide variety of typical social media applications.
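
A minimal usage sketch; the call names below are assumptions based on the paper's description of the Python toolkit and may differ from the released package (see the repository and demo linked from the paper):

```python
# pip install tweetnlp
# API names (`load_model`, `sentiment`) are assumptions from the paper's
# description of the toolkit, not verified against the released package.
import tweetnlp

model = tweetnlp.load_model("sentiment")            # task-specific Twitter-tuned model
print(model.sentiment("I love this new library!"))  # e.g. a positive/neutral/negative label
```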

Keyword: autonomous driving

Non-local Evasive Overtaking of Downstream Incidents in Distributed Behavior Planning of Connected Vehicles

  • Authors: Abdul Rahman Kreidieh, Yashar Farid, Kentaro Oguchi
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2206.14391
  • Pdf link: https://arxiv.org/pdf/2206.14391
  • Abstract The prevalence of high-speed vehicle-to-everything (V2X) communication will likely significantly influence the future of vehicle autonomy. In several autonomous driving applications, however, the role such systems will play is seldom understood. In this paper, we explore the role of communication signals in enhancing the performance of lane change assistance systems in situations where downstream bottlenecks restrict the mobility of a few lanes. Building on prior work on modeling lane change incentives, we design a controller that 1) encourages automated vehicles to subvert lanes in which distant downstream delays are likely to occur, while also 2) ignoring greedy local incentives when such delays are needed to maintain a specific route. Numerical results on different traffic conditions and penetration rates suggest that the model successfully subverts a significant portion of delays brought about by downstream bottlenecks, both globally and from the perspective of the controlled vehicles.

SRCN3D: Sparse R-CNN 3D Surround-View Camera Object Detection and Tracking for Autonomous Driving

  • Authors: Yining Shi, Jingyan Shen, Yifan Sun, Yunlong Wang, Jiaxin Li, Shiqi Sun, Kun Jiang, Diange Yang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2206.14451
  • Pdf link: https://arxiv.org/pdf/2206.14451
  • Abstract Detection And Tracking of Moving Objects (DATMO) is an essential component in environmental perception for autonomous driving. While 3D detectors using surround-view cameras are just flourishing, there is a growing tendency to use different transformer-based methods to learn queries in 3D space from 2D feature maps of the perspective view. This paper proposes Sparse R-CNN 3D (SRCN3D), a novel two-stage fully-convolutional mapping pipeline for surround-view camera detection and tracking. SRCN3D adopts a cascade structure with a twin-track update of both a fixed number of proposal boxes and proposal latent features. Proposal boxes are projected to the perspective view so as to aggregate Region of Interest (RoI) local features. Based on that, proposal features are refined via a dynamic instance interactive head, which then generates classification scores and the offsets applied to the original bounding boxes. Compared to prior art, our sparse feature sampling module only utilizes local 2D features to adjust each corresponding 3D proposal box, leading to a completely sparse paradigm. The proposal features and appearance features are both used in the data association process of a multi-hypothesis 3D multi-object tracking approach. Extensive experiments on the nuScenes dataset demonstrate the effectiveness of our proposed SRCN3D detector and tracker. Code is available at https://github.com/synsin0/SRCN3D.
