New submissions for Mon, 13 Jun 22

Keyword: SLAM

Experimental Evaluation of Visual-Inertial Odometry Systems for Arable Farming

  • Authors: Javier Cremona, Román Comelli, Taihú Pire
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2206.05066
  • Pdf link: https://arxiv.org/pdf/2206.05066
  • Abstract The farming industry constantly seeks to automate different processes involved in agricultural production, such as sowing, harvesting and weed control. The use of mobile autonomous robots to perform those tasks is of great interest. Arable lands present hard challenges for Simultaneous Localization and Mapping (SLAM) systems, which are key for mobile robotics, given the visual difficulty caused by the highly repetitive scenes and the movement of crop leaves in the wind. In recent years, several Visual-Inertial Odometry (VIO) and SLAM systems have been developed. They have proved to be robust and capable of achieving high accuracy in indoor and outdoor urban environments. However, they have not been properly assessed in agricultural fields. In this work, we assess the most relevant state-of-the-art VIO systems in terms of accuracy and processing time on arable lands in order to better understand how they behave in these environments. In particular, the evaluation is carried out on a collection of sensor data recorded by our wheeled robot in a soybean field, which was publicly released as the Rosario Dataset. The evaluation shows that the highly repetitive appearance of the environment, the strong vibration produced by the rough terrain and the movement of the leaves caused by the wind expose the limitations of the current state-of-the-art VIO and SLAM systems. We analyze the systems' failures and highlight the observed drawbacks, including initialization failures, tracking loss and sensitivity to IMU saturation. Finally, we conclude that even though certain systems such as ORB-SLAM3 and S-MSCKF perform well relative to the others, further improvements are needed to make them reliable in agricultural fields for applications such as soil tillage of crop rows and pesticide spraying.
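
The abstract does not name the exact accuracy metric, but a common choice in this kind of VIO/SLAM evaluation is the Absolute Trajectory Error (ATE) after rigidly aligning the estimated trajectory to the ground truth. The following NumPy sketch is only an illustration of that generic metric, not the paper's evaluation code; the Kabsch alignment and the RMSE reduction are my assumptions.

```python
import numpy as np

def ate_rmse(est, gt):
    """RMSE of the Absolute Trajectory Error after rigid alignment.

    est, gt: (N, 3) arrays of time-associated estimated and
    ground-truth positions.
    """
    est_c = est - est.mean(axis=0)
    gt_c = gt - gt.mean(axis=0)
    # Kabsch: rotation R minimizing ||gt_c - est_c @ R.T||.
    H = est_c.T @ gt_c
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    err = np.linalg.norm(est_c @ R.T - gt_c, axis=1)
    return np.sqrt(np.mean(err ** 2))

# Toy usage: a rotated, shifted, slightly noisy copy of a short trajectory.
gt = np.cumsum(np.random.randn(100, 3) * 0.1, axis=0)
th = 0.3
Rz = np.array([[np.cos(th), -np.sin(th), 0.0],
               [np.sin(th),  np.cos(th), 0.0],
               [0.0, 0.0, 1.0]])
est = gt @ Rz.T + np.random.randn(100, 3) * 0.01 + np.array([1.0, 2.0, 0.5])
print("ATE RMSE [m]:", ate_rmse(est, gt))
```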

Keyword: odometry

Experimental Evaluation of Visual-Inertial Odometry Systems for Arable Farming

  • Authors: Javier Cremona, Román Comelli, Taihú Pire
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2206.05066
  • Pdf link: https://arxiv.org/pdf/2206.05066
  • Abstract: identical to the abstract listed under "Keyword: SLAM" above.

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: nerf

NeRF-In: Free-Form NeRF Inpainting with RGB-D Priors

  • Authors: Hao-Kang Liu, I-Chao Shen, Bing-Yu Chen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
  • Arxiv link: https://arxiv.org/abs/2206.04901
  • Pdf link: https://arxiv.org/pdf/2206.04901
  • Abstract Though Neural Radiance Field (NeRF) demonstrates compelling novel view synthesis results, it is still unintuitive to edit a pre-trained NeRF because the neural network's parameters and the scene geometry/appearance are often not explicitly associated. In this paper, we introduce the first framework that enables users to remove unwanted objects or retouch undesired regions in a 3D scene represented by a pre-trained NeRF without any category-specific data and training. The user first draws a free-form mask to specify a region containing unwanted objects over a rendered view from the pre-trained NeRF. Our framework then transfers the user-provided mask to other rendered views and estimates guiding color and depth images within these transferred masked regions. Next, we formulate an optimization problem that jointly inpaints the image content in all masked regions across multiple views by updating the NeRF model's parameters. We demonstrate our framework on diverse scenes and show that it obtains visually plausible and structurally consistent results across multiple views, while requiring less time and less manual user effort.
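
The abstract's optimization step, updating the pre-trained NeRF so that its renderings match the inpainted color and depth guides inside the transferred masks, can be pictured as a masked reconstruction loss summed over views. The sketch below is one plausible form of that objective under my own naming and weighting assumptions; it is not the paper's actual loss or API.

```python
import torch

def masked_inpaint_loss(rend_rgb, rend_depth, guide_rgb, guide_depth, mask,
                        depth_weight=0.1):
    """Color/depth guidance loss over the masked region of one rendered view.

    rend_* : differentiable NeRF renderings for this view
    guide_*: estimated guide (inpainted) images for the same view
    mask   : 1 inside the transferred free-form mask, 0 outside
    """
    rgb_term = (mask[..., None] * (rend_rgb - guide_rgb).abs()).mean()
    depth_term = (mask * (rend_depth - guide_depth).abs()).mean()
    return rgb_term + depth_weight * depth_term

# The full objective would sum this over all selected views and be minimized
# with respect to the NeRF parameters, e.g. with Adam.
H, W = 64, 64
print(masked_inpaint_loss(torch.rand(H, W, 3), torch.rand(H, W),
                          torch.rand(H, W, 3), torch.rand(H, W),
                          (torch.rand(H, W) > 0.8).float()))
```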

Improved Direct Voxel Grid Optimization for Radiance Fields Reconstruction

  • Authors: Cheng Sun, Min Sun, Hwann-Tzong Chen
  • Subjects: Graphics (cs.GR)
  • Arxiv link: https://arxiv.org/abs/2206.05085
  • Pdf link: https://arxiv.org/pdf/2206.05085
  • Abstract In this technical report, we improve the DVGO framework (called DVGOv2), which is based on PyTorch and uses the simplest dense grid representation. First, we re-implement part of the PyTorch operations in CUDA, achieving a 2-3x speedup. The CUDA extension is compiled automatically, just in time. Second, we extend DVGO to support Forward-facing and Unbounded Inward-facing capturing. Third, we improve the space-time complexity of the distortion loss proposed by mip-NeRF 360 from O(N^2) to O(N). The distortion loss improves our quality and training speed. Our efficient implementation should allow more future work to benefit from this loss.
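
The O(N^2) to O(N) claim refers to the distortion loss of mip-NeRF 360, L_dist = sum_{i,j} w_i w_j |m_i - m_j| + (1/3) sum_i w_i^2 (s_{i+1} - s_i), where the m_i are midpoints of the sorted ray intervals. Because the midpoints are sorted, the pairwise term can be computed with prefix sums. A NumPy sketch of both forms follows; the variable names are mine and the actual DVGOv2 CUDA implementation may differ.

```python
import numpy as np

def distortion_naive(w, s):
    """O(N^2) distortion loss for one ray.
    w: (N,) per-interval weights; s: (N+1,) sorted interval edges."""
    m = 0.5 * (s[:-1] + s[1:])                       # interval midpoints
    pair = np.sum(w[:, None] * w[None, :] * np.abs(m[:, None] - m[None, :]))
    intra = np.sum(w ** 2 * (s[1:] - s[:-1])) / 3.0
    return pair + intra

def distortion_fast(w, s):
    """O(N) version: rewrite the pairwise term with prefix sums."""
    m = 0.5 * (s[:-1] + s[1:])
    w_before = np.cumsum(w) - w            # sum of w_j for j < i
    wm_before = np.cumsum(w * m) - w * m   # sum of w_j * m_j for j < i
    pair = 2.0 * np.sum(w * (m * w_before - wm_before))
    intra = np.sum(w ** 2 * (s[1:] - s[:-1])) / 3.0
    return pair + intra

s = np.sort(np.random.rand(65))
w = np.random.rand(64)
w /= w.sum()
assert np.isclose(distortion_naive(w, s), distortion_fast(w, s))
print(distortion_fast(w, s))
```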

Keyword: mapping

Communication Efficient Distributed Learning for Kernelized Contextual Bandits

  • Authors: Chuanhao Li, Huazheng Wang, Mengdi Wang, Hongning Wang
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2206.04835
  • Pdf link: https://arxiv.org/pdf/2206.04835
  • Abstract We tackle the communication efficiency challenge of learning kernelized contextual bandits in a distributed setting. Despite the recent advances in communication-efficient distributed bandit learning, existing solutions are restricted to simple models like multi-armed bandits and linear bandits, which hampers their practical utility. In this paper, instead of assuming the existence of a linear reward mapping from the features to the expected rewards, we consider non-linear reward mappings, by letting agents collaboratively search in a reproducing kernel Hilbert space (RKHS). This introduces significant challenges in communication efficiency, as distributed kernel learning requires the transfer of raw data, leading to a communication cost that grows linearly w.r.t. time horizon $T$. We address this issue by equipping all agents to communicate via a common Nyström embedding that gets updated adaptively as more data points are collected. We rigorously prove that our algorithm can attain sub-linear rates in both regret and communication cost.
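
The Nyström embedding mentioned here is what lets agents avoid exchanging raw data: all agents share a common set of inducing (landmark) points and communicate statistics in the resulting finite-dimensional feature space, whose inner products approximate the kernel. A NumPy sketch of such an embedding is below; the landmark selection and the adaptive update schedule from the paper are not modeled, and all names are illustrative.

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    """Gaussian (RBF) kernel matrix between row sets X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_embedding(Z, gamma=0.5, eps=1e-8):
    """Return phi such that phi(x) . phi(y) ~= k(x, y), built from landmarks Z."""
    Kzz = rbf(Z, Z, gamma) + eps * np.eye(len(Z))
    vals, vecs = np.linalg.eigh(Kzz)                 # symmetric inverse sqrt
    Kzz_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return lambda X: rbf(X, Z, gamma) @ Kzz_inv_sqrt

# Agents exchange quantities of dimension len(Z) instead of raw points,
# so communication no longer has to grow linearly with the horizon T.
Z = np.random.randn(30, 5)                           # shared landmarks
phi = nystrom_embedding(Z)
X = np.random.randn(100, 5)
print("max |K - phi phi^T|:", np.abs(rbf(X, X) - phi(X) @ phi(X).T).max())
```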

Out of Sight, Out of Mind: A Source-View-Wise Feature Aggregation for Multi-View Image-Based Rendering

  • Authors: Geonho Cha, Chaehun Shin, Sungroh Yoon, Dongyoon Wee
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2206.04906
  • Pdf link: https://arxiv.org/pdf/2206.04906
  • Abstract To estimate the volume density and color of a 3D point in multi-view image-based rendering, a common approach is to inspect whether a consensus exists among the given source image features, which is one of the informative cues for the estimation procedure. To this end, most of the previous methods utilize equally-weighted aggregation features. However, this can make it hard to check for consensus when outliers, which frequently occur due to occlusions, are included in the source image feature set. In this paper, we propose a novel source-view-wise feature aggregation method, which helps us find the consensus robustly by leveraging local structures in the feature set. For the proposed aggregation, we first calculate the source-view-wise distance distribution for each source feature. The distance distribution is then converted to several similarity distributions with the proposed learnable similarity mapping functions. Finally, for each element in the feature set, the aggregation features are extracted by calculating the weighted means and variances, where the weights are derived from the similarity distributions. In experiments, we validate the proposed method on various benchmark datasets, including synthetic and real image scenes. The experimental results demonstrate that incorporating the proposed features improves performance by a large margin, resulting in state-of-the-art performance.

Density-optimized Intersection-free Mapping and Matrix Multiplication for Join-Project Operations (extended version)

  • Authors: Zichun Huang, Shimin Chen
  • Subjects: Databases (cs.DB)
  • Arxiv link: https://arxiv.org/abs/2206.04995
  • Pdf link: https://arxiv.org/pdf/2206.04995
  • Abstract A Join-Project operation is a join operation followed by a duplicate eliminating projection operation. It is used in a large variety of applications, including entity matching, set analytics, and graph analytics. Previous work proposes a hybrid design that exploits the classical solution (i.e., join and deduplication), and MM (matrix multiplication) to process the sparse and the dense portions of the input data, respectively. However, we observe three problems in the state-of-the-art solution: 1) The outputs of the sparse and dense portions overlap, requiring an extra deduplication step; 2) Its table-to-matrix transformation makes an over-simplified assumption of the attribute values; and 3) There is a mismatch between the employed MM in BLAS packages and the characteristics of the Join-Project operation. In this paper, we propose DIM3, an optimized algorithm for the Join-Project operation. To address 1), we propose an intersection-free partition method to completely remove the final deduplication step. For 2), we develop an optimized design for mapping attribute values to natural numbers. For 3), we propose DenseEC and SparseBMM algorithms to exploit the structure of Join-Project for better efficiency. Moreover, we extend DIM3 to consider partial result caching and support Join-op queries, including Join-Aggregate and MJP (Multi-way Joins with Projection). Experimental results using both real-world and synthetic data sets show that DIM3 outperforms previous Join-Project solutions by a factor of 2.3x-18x. Compared to RDBMSs, DIM3 achieves orders of magnitude speedups.
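
The link between Join-Project and matrix multiplication that DIM3 exploits is easy to see in a toy example: encode R(A,B) and S(B,C) as 0/1 matrices, and the nonzero pattern of their product is exactly the duplicate-free projection pi_{A,C}(R join S). The NumPy snippet below is only meant to illustrate that correspondence; mapping attribute values to dense indices is the step DIM3 optimizes, and here it is just a dictionary.

```python
import numpy as np

R = [("a1", "b1"), ("a1", "b2"), ("a2", "b2")]   # R(A, B)
S = [("b1", "c1"), ("b2", "c1"), ("b2", "c2")]   # S(B, C)

# Map attribute values to natural numbers (the step DIM3 optimizes).
A = {v: i for i, v in enumerate(sorted({a for a, _ in R}))}
B = {v: i for i, v in enumerate(sorted({b for _, b in R} | {b for b, _ in S}))}
C = {v: i for i, v in enumerate(sorted({c for _, c in S}))}

MR = np.zeros((len(A), len(B)), dtype=np.int64)
MS = np.zeros((len(B), len(C)), dtype=np.int64)
for a, b in R:
    MR[A[a], B[b]] = 1
for b, c in S:
    MS[B[b], C[c]] = 1

# Nonzeros of MR @ MS are pi_{A,C}(R join S); no deduplication step is needed.
prod = MR @ MS
inv_A = {i: v for v, i in A.items()}
inv_C = {i: v for v, i in C.items()}
print(sorted((inv_A[i], inv_C[j]) for i, j in zip(*np.nonzero(prod))))
# [('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a2', 'c2')]
```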

Experimental Evaluation of Visual-Inertial Odometry Systems for Arable Farming

  • Authors: Javier Cremona, Román Comelli, Taihú Pire
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2206.05066
  • Pdf link: https://arxiv.org/pdf/2206.05066
  • Abstract: identical to the abstract listed under "Keyword: SLAM" above.

Keyword: localization

Deep learning-enhanced ensemble-based data assimilation for high-dimensional nonlinear dynamical systems

  • Authors: Ashesh Chattopadhyay, Ebrahim Nabizadeh, Eviatar Bach, Pedram Hassanzadeh
  • Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an); Fluid Dynamics (physics.flu-dyn); Geophysics (physics.geo-ph)
  • Arxiv link: https://arxiv.org/abs/2206.04811
  • Pdf link: https://arxiv.org/pdf/2206.04811
  • Abstract Data assimilation (DA) is a key component of many forecasting models in science and engineering. DA allows one to estimate better initial conditions using an imperfect dynamical model of the system and noisy/sparse observations available from the system. The ensemble Kalman filter (EnKF) is a DA algorithm that is widely used in applications involving high-dimensional nonlinear dynamical systems. However, EnKF requires evolving large ensembles of forecasts using the dynamical model of the system. This often becomes computationally intractable, especially when the number of states of the system is very large, e.g., for weather prediction. With small ensembles, the estimated background error covariance matrix in the EnKF algorithm suffers from sampling error, leading to an erroneous estimate of the analysis state (the initial condition for the next forecast cycle). In this work, we propose the hybrid ensemble Kalman filter (H-EnKF), which is applied to a two-layer quasi-geostrophic flow system as a test case. This framework utilizes a pre-trained deep learning-based data-driven surrogate that inexpensively generates and evolves a large data-driven ensemble of the states of the system to accurately compute the background error covariance matrix with less sampling error. The H-EnKF framework estimates a better initial condition without the need for any ad-hoc localization strategies. H-EnKF can be extended to any ensemble-based DA algorithm, e.g., particle filters, which are currently difficult to use for high-dimensional systems.
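
For context, the background error covariance the abstract refers to is estimated from ensemble anomalies and then enters the standard EnKF analysis step. The sketch below is generic stochastic EnKF in NumPy, not the paper's implementation; in H-EnKF the ensemble Xb would be the large surrogate-generated one, which is what reduces sampling error in Pb.

```python
import numpy as np

def enkf_analysis(Xb, y, H, R):
    """One stochastic EnKF analysis step.

    Xb: (n, N) background ensemble (n state dims, N members)
    y : (m,) observations, H: (m, n) observation operator,
    R : (m, m) observation error covariance.
    """
    n, N = Xb.shape
    Xp = Xb - Xb.mean(axis=1, keepdims=True)           # ensemble anomalies
    Pb = Xp @ Xp.T / (N - 1)                           # sampled background covariance
    K = Pb @ H.T @ np.linalg.inv(H @ Pb @ H.T + R)     # Kalman gain
    Y = y[:, None] + np.linalg.cholesky(R) @ np.random.randn(len(y), N)
    return Xb + K @ (Y - H @ Xb)                       # analysis ensemble

# Toy usage; a larger N (cheap with a learned surrogate) lowers sampling error
# in Pb and reduces the need for ad-hoc localization.
n, N, m = 10, 200, 4
Xa = enkf_analysis(np.random.randn(n, N), np.random.randn(m),
                   np.eye(m, n), 0.1 * np.eye(m))
print(Xa.shape)   # (10, 200)
```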

Experimental Evaluation of Visual-Inertial Odometry Systems for Arable Farming

  • Authors: Javier Cremona, Román Comelli, Taihú Pire
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2206.05066
  • Pdf link: https://arxiv.org/pdf/2206.05066
  • Abstract: identical to the abstract listed under "Keyword: SLAM" above.

Spectral analysis and fast methods for large matrices arising from PDE approximation

  • Authors: Ryma Imene Rahla
  • Subjects: Numerical Analysis (math.NA)
  • Arxiv link: https://arxiv.org/abs/2206.05171
  • Pdf link: https://arxiv.org/pdf/2206.05171
  • Abstract The main goal of this thesis is to show the crucial role that the symbol plays in analysing the spectrum of the sequence of matrices resulting from PDE approximation and in designing a fast method to solve the associated linear problem. In the first part, we study the spectral properties of the matrices arising from the $\mathbb{P}_k$ Lagrangian Finite Element approximation of a second order elliptic differential problem with Dirichlet boundary conditions, where the operator is $\mathrm{div} \left(-a(\mathbf{x}) \nabla\cdot\right)$, with $a$ continuous and positive over $\overline \Omega$, $\Omega$ being an open and bounded subset of $\mathbb{R}^d$, $d\ge 1$. We investigate the spectral distribution in the Weyl sense, with a concise overview of localization, clustering, extremal eigenvalues, and asymptotic conditioning. We study in detail the case of constant coefficients on $\Omega=(0,1)^2$ and give a brief account of the case of variable coefficients and more general domains. In the second part, we design a fast method of multigrid type for the resolution of linear systems arising from the $\mathbb{Q}_k$ Finite Element approximation of the same problem in one and higher dimensions. The analysis is performed in one dimension, while the numerics are also carried out in higher dimensions $d\ge 2$, demonstrating optimal behavior in terms of the dependency on the matrix size and robustness with respect to the dimensionality $d$ and the polynomial degree $k$.

Keyword: transformer

Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation

  • Authors: Jinman Park, Kimathi Kaai, Saad Hossain, Norikatsu Sumi, Sirisha Rambhatla, Paul Fieguth
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2206.04785
  • Pdf link: https://arxiv.org/pdf/2206.04785
  • Abstract Egocentric 3D human pose estimation (HPE) from images is challenging due to severe self-occlusions and the strong distortion introduced by the fish-eye view from the head-mounted camera. Although existing works use intermediate heatmap-based representations to counter distortion with some success, addressing self-occlusion remains an open problem. In this work, we leverage information from past frames to guide our self-attention-based 3D HPE procedure -- Ego-STAN. Specifically, we build a spatio-temporal Transformer model that attends to semantically rich convolutional neural network-based feature maps. We also propose feature map tokens: a new set of learnable parameters to attend to these feature maps. Finally, we demonstrate Ego-STAN's superior performance on the xR-EgoPose dataset, where it achieves a 30.6% improvement on the overall mean per-joint position error, while leading to a 22% drop in parameters compared to the state-of-the-art.

Syntactic Inductive Biases for Deep Learning Methods

  • Authors: Yikang Shen
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2206.04806
  • Pdf link: https://arxiv.org/pdf/2206.04806
  • Abstract In this thesis, we try to build a connection between the two schools by introducing syntactic inductive biases for deep learning models. We propose two families of inductive biases, one for constituency structure and another for dependency structure. The constituency inductive bias encourages deep learning models to use different units (or neurons) to separately process long-term and short-term information. This separation provides a way for deep learning models to build latent hierarchical representations from sequential inputs, such that a higher-level representation is composed of, and can be decomposed into, a series of lower-level representations. For example, without knowing the ground-truth structure, our proposed model learns to process logical expressions by composing representations of variables and operators into representations of expressions according to their syntactic structure. On the other hand, the dependency inductive bias encourages models to find the latent relations between entities in the input sequence. For natural language, the latent relations are usually modeled as a directed dependency graph, where a word has exactly one parent node and zero or several child nodes. After applying this constraint to a Transformer-like model, we find that the model is capable of inducing directed graphs that are close to human expert annotations, and it also outperforms the standard Transformer model on different tasks. We believe that these experimental results demonstrate an interesting alternative for the future development of deep learning models.

Transformer-Graph Neural Network with Global-Local Attention for Multimodal Rumour Detection with Knowledge Distillation

  • Authors: Tsun-hin Cheung, Kin-man Lam
  • Subjects: Multimedia (cs.MM); Social and Information Networks (cs.SI)
  • Arxiv link: https://arxiv.org/abs/2206.04832
  • Pdf link: https://arxiv.org/pdf/2206.04832
  • Abstract Misinformation spreading has become a critical issue in online conversations. Detecting rumours is an important research topic in social media analysis. Most existing methods, based on Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs), do not make use of the relationship between the global and local information of a conversation for detection. In this paper, we propose a Transformer-Graph Neural Network (TGNN) to fuse the local information with the global representation through an attention mechanism. Then, we extend the proposed TGNN to multimodal rumour detection, by considering the latent relationship between the multimodal feature and the node feature to form a more comprehensive graph representation. To verify the effectiveness of our proposed method for multimodal rumour detection, we extend the existing PHEME-2016, PHEME-2018, and Weibo data sets by collecting available and relevant images for training the proposed framework. To improve the performance of single-modal rumour detection, i.e., based on text input only, a teacher-student framework is employed to distil the knowledge from the multimodal model to the single-modal model. Experimental results show that our proposed TGNN achieves state-of-the-art performance and generalization ability on the PHEME-2016, PHEME-2018, and Weibo data sets.
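
The teacher-student step at the end is standard knowledge distillation: the text-only student is trained on ground-truth labels plus a soft-target term that matches the multimodal teacher's temperature-scaled output distribution. A short PyTorch sketch of such a loss follows; the temperature and weighting are illustrative values, not the paper's.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hard-label cross-entropy plus soft-target KL divergence to the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale as in Hinton et al.
    return (1.0 - alpha) * hard + alpha * soft

# Toy usage: 8 conversations, 2 classes (rumour / non-rumour).
student = torch.randn(8, 2, requires_grad=True)
teacher = torch.randn(8, 2)           # frozen multimodal teacher outputs
labels = torch.randint(0, 2, (8,))
print(distillation_loss(student, teacher, labels))
```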

Machop: an End-to-End Generalized Entity Matching Framework

  • Authors: Jin Wang, Yuliang Li, Wataru Hirota, Eser Kandogan
  • Subjects: Databases (cs.DB)
  • Arxiv link: https://arxiv.org/abs/2206.04853
  • Pdf link: https://arxiv.org/pdf/2206.04853
  • Abstract Real-world applications frequently seek to solve a general form of the Entity Matching (EM) problem to find associated entities. Such scenarios include matching jobs to candidates in job targeting, matching students with courses in online education, matching products with user reviews on e-commerce websites, and beyond. These tasks impose new requirements, such as matching data entries with diverse formats or having a flexible and semantics-rich matching definition, which are beyond the current EM task formulation or approaches. In this paper, we introduce the problem of Generalized Entity Matching (GEM) that satisfies these practical requirements and present an end-to-end pipeline, Machop, as the solution. Machop allows end-users to define new matching tasks from scratch and apply them to new domains in a step-by-step manner. Machop casts the GEM problem as sequence pair classification so as to utilize the language understanding capability of Transformer-based language models (LMs) such as BERT. Moreover, it features a novel external knowledge injection approach with structure-aware pooling methods that allow domain experts to guide the LM to focus on the key matching information, thus further contributing to the overall performance. Our experiments and case studies on real-world datasets from a popular recruiting platform show a significant 17.1% gain in F1 score against state-of-the-art methods, along with meaningful matching results that are human-understandable.

NAGphormer: Neighborhood Aggregation Graph Transformer for Node Classification in Large Graphs

  • Authors: Jinsong Chen, Kaiyuan Gao, Gaichao Li, Kun He
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2206.04910
  • Pdf link: https://arxiv.org/pdf/2206.04910
  • Abstract Graph Transformers have demonstrated superiority on various graph learning tasks in recent years. However, the complexity of existing Graph Transformers scales quadratically with the number of nodes, making it hard to scale to graphs with thousands of nodes. To this end, we propose a Neighborhood Aggregation Graph Transformer (NAGphormer) that is scalable to large graphs with millions of nodes. Before feeding the node features into the Transformer model, NAGphormer constructs tokens for each node by a neighborhood aggregation module called Hop2Token. For each node, Hop2Token aggregates neighborhood features from each hop into a representation, thereby producing a sequence of token vectors. Subsequently, the resulting sequence of different hop information serves as input to the Transformer model. By treating each node as a sequence, NAGphormer can be trained in a mini-batch manner and thus scales to large graphs. NAGphormer further develops an attention-based readout function so as to learn the importance of each hop adaptively. We conduct extensive experiments on various popular benchmarks, including six small datasets and three large datasets. The results demonstrate that NAGphormer consistently outperforms existing Graph Transformers and mainstream Graph Neural Networks.
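
Hop2Token, as described, turns every node into a short sequence whose k-th token aggregates the node's k-hop neighborhood, for instance by applying powers of a normalized adjacency matrix to the feature matrix. A dense NumPy sketch of that preprocessing is below; real graphs would use sparse matrices, and NAGphormer's exact normalization may differ.

```python
import numpy as np

def hop2token(A, X, K):
    """Per-node token sequences from K hops of neighborhood aggregation.

    A: (n, n) adjacency, X: (n, d) node features.
    Returns (n, K + 1, d); tokens[:, 0] are the raw node features.
    """
    n = A.shape[0]
    A_hat = A + np.eye(n)                                        # self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
    tokens, H = [X], X
    for _ in range(K):
        H = A_norm @ H                                           # one more hop
        tokens.append(H)
    return np.stack(tokens, axis=1)

# Each node's length-(K+1) token sequence is then fed to the Transformer,
# which is what makes mini-batching over nodes possible.
A = (np.random.rand(8, 8) < 0.3).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 0.0)
print(hop2token(A, np.random.randn(8, 5), K=3).shape)   # (8, 4, 5)
```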

MAREO: Memory- and Attention-based visual REasOning

  • Authors: Mohit Vaishnav, Thomas Serre
  • Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC)
  • Arxiv link: https://arxiv.org/abs/2206.04928
  • Pdf link: https://arxiv.org/pdf/2206.04928
  • Abstract Humans continue to vastly outperform modern AI systems in their ability to parse and understand complex visual scenes flexibly. Attention and memory are two systems known to play a critical role in our ability to selectively maintain and manipulate behaviorally-relevant visual information to solve some of the most challenging visual reasoning tasks. Here, we present a novel architecture for visual reasoning inspired by the cognitive-science literature on visual reasoning, the Memory- and Attention-based (visual) REasOning (MAREO) architecture. MAREO instantiates an active-vision theory, which posits that the brain solves complex visual reasoning problems compositionally by learning to combine previously-learned elementary visual operations to form more complex visual routines. MAREO learns to solve visual reasoning tasks via sequences of attention shifts to route and maintain task-relevant visual information into a memory bank via a multi-head transformer module. Visual routines are then deployed by a dedicated reasoning module trained to judge various relations between objects in the scenes. Experiments on four types of reasoning tasks demonstrate MAREO's ability to learn visual routines in a robust and sample-efficient manner.

Borrowing or Codeswitching? Annotating for Finer-Grained Distinctions in Language Mixing

  • Authors: Elena Alvarez Mellado, Constantine Lignos
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2206.04973
  • Pdf link: https://arxiv.org/pdf/2206.04973
  • Abstract We present a new corpus of Twitter data annotated for codeswitching and borrowing between Spanish and English. The corpus contains 9,500 tweets annotated at the token level with codeswitches, borrowings, and named entities. This corpus differs from prior corpora of codeswitching in that we attempt to clearly define and annotate the boundary between codeswitching and borrowing and do not treat common "internet-speak" ('lol', etc.) as codeswitching when used in an otherwise monolingual context. The result is a corpus that enables the study and modeling of Spanish-English borrowing and codeswitching on Twitter in one dataset. We present baseline scores for modeling the labels of this corpus using Transformer-based language models. The annotation itself is released with a CC BY 4.0 license, while the text it applies to is distributed in compliance with the Twitter terms of service.

NR-DFERNet: Noise-Robust Network for Dynamic Facial Expression Recognition

  • Authors: Hanting Li, Mingzhe Sui, Zhaoqing Zhu, Feng Zhao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2206.04975
  • Pdf link: https://arxiv.org/pdf/2206.04975
  • Abstract Dynamic facial expression recognition (DFER) in the wild is an extremely challenging task, due to the large number of noisy frames in the video sequences. Previous works focus on extracting more discriminative features, but ignore distinguishing the key frames from the noisy frames. To tackle this problem, we propose a noise-robust dynamic facial expression recognition network (NR-DFERNet), which can effectively reduce the interference of noisy frames on the DFER task. Specifically, at the spatial stage, we devise a dynamic-static fusion module (DSF) that introduces dynamic features to static features for learning more discriminative spatial features. To suppress the impact of target-irrelevant frames, we introduce a novel dynamic class token (DCT) for the transformer at the temporal stage. Moreover, we design a snippet-based filter (SF) at the decision stage to reduce the effect of too many neutral frames on non-neutral sequence classification. Extensive experimental results demonstrate that our NR-DFERNet outperforms the state-of-the-art methods on both the DFEW and AFEW benchmarks.

Position Labels for Self-Supervised Vision Transformer

  • Authors: Zhemin Zhang, Xun Gong, Jinyi Wu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2206.04981
  • Pdf link: https://arxiv.org/pdf/2206.04981
  • Abstract Position encoding is important for the vision transformer (ViT) to capture the spatial structure of the input image, and its general efficacy has been proven in ViT. In our work, we propose to train ViT to recognize the 2D position encoding of patches of the input image; this apparently simple task yields a meaningful self-supervisory signal. Based on previous work on ViT position encoding, we propose two position labels dedicated to 2D images: absolute position and relative position. Our position labels can be easily plugged into transformers and combined with the various current ViT variants. They can work in two ways: 1. As an auxiliary training target for vanilla ViT (e.g., ViT-B and Swin-B) to improve model performance. 2. Combined with self-supervised ViT (e.g., MAE) to provide a more powerful self-supervised signal for semantic feature learning. Experiments demonstrate that, solely due to the proposed self-supervised methods, Swin-B and ViT-B obtained improvements of 1.9% (top-1 Acc) and 5.6% (top-1 Acc) on Mini-ImageNet, respectively.
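
The absolute-position label described here amounts to a small classification head that must recover each patch token's grid index, with the resulting cross-entropy added to the main training objective. The PyTorch sketch below is one minimal form of that auxiliary task; the head design, loss weight, and how it is combined with MAE-style training are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AbsolutePositionHead(nn.Module):
    """Auxiliary head: predict the grid index of every patch token."""
    def __init__(self, embed_dim, num_patches):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_patches)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, embed_dim)
        b, n, _ = patch_tokens.shape
        logits = self.classifier(patch_tokens)               # (b, n, num_patches)
        target = torch.arange(n, device=patch_tokens.device).expand(b, n)
        return F.cross_entropy(logits.reshape(b * n, -1), target.reshape(-1))

# Used as: total_loss = main_loss + lambda_pos * pos_head(patch_tokens)
pos_head = AbsolutePositionHead(embed_dim=192, num_patches=196)
print(pos_head(torch.randn(2, 196, 192)))
```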

Saccade Mechanisms for Image Classification, Object Detection and Tracking

  • Authors: Saurabh Farkya, Zachary Daniels, Aswin Nadamuni Raghavan, David Zhang, Michael Piacentino
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
  • Arxiv link: https://arxiv.org/abs/2206.05102
  • Pdf link: https://arxiv.org/pdf/2206.05102
  • Abstract We examine how the saccade mechanism from biological vision can be used to make deep neural networks more efficient for classification and object detection problems. Our proposed approach is based on the ideas of attention-driven visual processing and saccades, miniature eye movements influenced by attention. We conduct experiments by analyzing: i) the robustness of different deep neural network (DNN) feature extractors to partially-sensed images for image classification and object detection, and ii) the utility of saccades in masking image patches for image classification and object tracking. Experiments with convolutional nets (ResNet-18) and transformer-based models (ViT, DETR, TransTrack) are conducted on several datasets (CIFAR-10, DAVSOD, MSCOCO, and MOT17). Our experiments show that learning to mimic human saccades yields intelligent data reduction when used in conjunction with state-of-the-art DNNs for classification, detection, and tracking tasks. We observed a minimal drop in performance for the classification and detection tasks while using only about 30% of the original sensor data. We discuss how the saccade mechanism can inform hardware design via "in-pixel" processing.
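
One way to picture the "about 30% of the original sensor data" result: rank image patches by a saliency or attention score, keep only the top fraction (the fixation locations), and blank the rest before the downstream network sees the image. The NumPy toy below uses a fixed top-k rule; the saccade policy in the paper is learned, so this is only an illustration.

```python
import numpy as np

def select_patches(image, saliency, patch=16, keep_frac=0.3):
    """Zero out all but the most salient patches of an image.

    image:    (H, W, C), with H and W divisible by `patch`
    saliency: (H, W) attention / saliency map used to rank patches
    """
    H, W, _ = image.shape
    gh, gw = H // patch, W // patch
    scores = saliency.reshape(gh, patch, gw, patch).mean(axis=(1, 3)).ravel()
    k = max(1, int(keep_frac * gh * gw))
    mask = np.zeros(gh * gw, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True                 # keep top-k patches
    mask = mask.reshape(gh, 1, gw, 1).repeat(patch, 1).repeat(patch, 3)
    return image * mask.reshape(H, W, 1)

img, sal = np.random.rand(224, 224, 3), np.random.rand(224, 224)
kept = select_patches(img, sal)
print("fraction of pixels kept:", kept.any(axis=-1).mean())
```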

Exploring Feature Self-relation for Self-supervised Transformer

  • Authors: Zhong-Yu Li, Shanghua Gao, Ming-Ming Cheng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2206.05184
  • Pdf link: https://arxiv.org/pdf/2206.05184
  • Abstract Learning representations with self-supervision for convolutional networks (CNN) has proven effective for vision tasks. As an alternative to CNNs, vision transformers (ViTs) exhibit strong representation ability with pixel-level self-attention and channel-level feed-forward networks. Recent works reveal that self-supervised learning helps unleash the great potential of ViTs. Still, most works follow self-supervised strategies designed for CNNs, e.g., instance-level discrimination of samples, and ignore the unique properties of ViTs. We observe that modeling relations among pixels and channels distinguishes ViTs from other networks. To enforce this property, we explore the feature self-relations for training self-supervised ViTs. Specifically, instead of conducting self-supervised learning solely on feature embeddings from multiple views, we utilize the feature self-relations, i.e., pixel/channel-level self-relations, for self-supervised learning. Self-relation-based learning further enhances the relation modeling ability of ViTs, resulting in strong representations that stably improve performance on multiple downstream tasks. Our source code will be made publicly available.

StructCoder: Structure-Aware Transformer for Code Generation

  • Authors: Sindhu Tipirneni, Ming Zhu, Chandan K. Reddy
  • Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2206.05239
  • Pdf link: https://arxiv.org/pdf/2206.05239
  • Abstract There has been a recent surge of interest in automating software engineering tasks using deep learning. This work addresses the problem of code generation where the goal is to generate target code given source code in a different language or a natural language description. Most of the state-of-the-art deep learning models for code generation use training strategies that are primarily designed for natural language. However, understanding and generating code requires a more rigorous comprehension of the code syntax and semantics. With this motivation, we develop an encoder-decoder Transformer model where both the encoder and decoder are trained to recognize the syntax and data flow in the source and target codes, respectively. We not only make the encoder structure-aware by leveraging the source code's syntax tree and data flow graph, but we also ensure that our decoder preserves the syntax and data flow of the target code by introducing two auxiliary tasks: AST (Abstract Syntax Tree) paths prediction and data flow prediction. To the best of our knowledge, this is the first work to introduce a structure-aware Transformer decoder to enhance the quality of generated code by modeling target syntax and data flow. The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks in the CodeXGLUE benchmark.

Keyword: autonomous driving

R4D: Utilizing Reference Objects for Long-Range Distance Estimation

  • Authors: Yingwei Li, Tiffany Chen, Maya Kabkab, Ruichi Yu, Longlong Jing, Yurong You, Hang Zhao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2206.04831
  • Pdf link: https://arxiv.org/pdf/2206.04831
  • Abstract Estimating the distance of objects is a safety-critical task for autonomous driving. Focusing on short-range objects, existing methods and datasets neglect the equally important long-range objects. In this paper, we introduce a challenging and under-explored task, which we refer to as Long-Range Distance Estimation, as well as two datasets to validate new methods developed for this task. We then propose R4D, the first framework to accurately estimate the distance of long-range objects by using references with known distances in the scene. Drawing inspiration from human perception, R4D builds a graph by connecting a target object to all references. An edge in the graph encodes the relative distance information between a pair of target and reference objects. An attention module is then used to weigh the importance of reference objects and combine them into one target object distance prediction. Experiments on the two proposed datasets demonstrate the effectiveness and robustness of R4D by showing significant improvements compared to existing baselines. We are looking to make the proposed dataset, Waymo Open Dataset - Long-Range Labels, available publicly at waymo.com/open/download.
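
The combination step described (one edge per reference, an attention module that weighs the references, and a single fused distance) can be pictured as attention-weighted fusion of per-reference estimates. The NumPy sketch below uses made-up scores and offsets; in R4D the edge encodings and attention weights come from learned networks, so treat this purely as an illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_reference_distances(ref_distances, ref_offsets, edge_scores):
    """Fuse per-reference predictions into one target-object distance.

    ref_distances: (R,) known distances of the reference objects
    ref_offsets:   (R,) predicted target-minus-reference distance per edge
    edge_scores:   (R,) unnormalized attention scores per edge
    """
    per_ref_estimate = ref_distances + ref_offsets   # one estimate per reference
    weights = softmax(edge_scores)                    # attention over references
    return float(weights @ per_ref_estimate)

# Toy example: three references at known ranges; the attention module
# has learned to trust the second reference the most.
print(fuse_reference_distances(
    ref_distances=np.array([40.0, 120.0, 85.0]),
    ref_offsets=np.array([65.0, -12.0, 20.0]),
    edge_scores=np.array([0.2, 2.0, 0.5]),
))
```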

Unsupervised Foggy Scene Understanding via Self Spatial-Temporal Label Diffusion

  • Authors: Liang Liao, Wenyi Chen, Jing Xiao, Zheng Wang, Chia-Wen Lin, Shin'ichi Satoh
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2206.04879
  • Pdf link: https://arxiv.org/pdf/2206.04879
  • Abstract Understanding foggy image sequences in driving scenes is critical for autonomous driving, but it remains a challenging task due to the difficulty in collecting and annotating real-world images of adverse weather. Recently, the self-training strategy has been considered a powerful solution for unsupervised domain adaptation, which iteratively adapts the model from the source domain to the target domain by generating target pseudo labels and re-training the model. However, the selection of confident pseudo labels inevitably suffers from the conflict between sparsity and accuracy, both of which lead to suboptimal models. To tackle this problem, we exploit the characteristics of the foggy image sequences of driving scenes to densify the confident pseudo labels. Specifically, based on the two discoveries of local spatial similarity and adjacent temporal correspondence of the sequential image data, we propose a novel Target-Domain driven pseudo label Diffusion (TDo-Dif) scheme. It employs superpixels and optical flows to identify the spatial similarity and temporal correspondence, respectively, and then diffuses the confident but sparse pseudo labels within a superpixel or a temporally corresponding pair linked by the flow. Moreover, to ensure the feature similarity of the diffused pixels, we introduce a local spatial similarity loss and a temporal contrastive loss in the model re-training stage. Experimental results show that our TDo-Dif scheme helps the adaptive model achieve 51.92% and 53.84% mean intersection-over-union (mIoU) on two publicly available natural foggy datasets (Foggy Zurich and Foggy Driving), which exceeds the state-of-the-art unsupervised domain-adaptive semantic segmentation methods. Models and data can be found at https://github.com/velor2012/TDo-Dif.
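
The spatial half of the diffusion step, spreading a sparse but confident pseudo label to the rest of its superpixel, can be sketched in a few lines. The rule below (majority vote among the confident pixels of each superpixel) is my simplification; the paper's exact criterion and the optical-flow-based temporal diffusion are not shown.

```python
import numpy as np

def diffuse_in_superpixels(pseudo_labels, confident, superpixels, num_classes):
    """Spread confident pseudo labels to every pixel of their superpixel.

    pseudo_labels: (H, W) int class map
    confident:     (H, W) bool mask of trusted pixels
    superpixels:   (H, W) int superpixel id per pixel
    Returns an (H, W) map where -1 marks still-unlabeled pixels.
    """
    out = np.full(pseudo_labels.shape, -1, dtype=np.int64)
    for sp in np.unique(superpixels):
        region = superpixels == sp
        trusted = pseudo_labels[region & confident]
        if trusted.size == 0:
            continue                                  # nothing confident here
        out[region] = np.bincount(trusted, minlength=num_classes).argmax()
    return out

labels = np.random.randint(0, 3, (4, 6))
conf = np.random.rand(4, 6) > 0.7
sp_ids = np.repeat(np.arange(4), 6).reshape(4, 6) // 2   # crude 2-row superpixels
print(diffuse_in_superpixels(labels, conf, sp_ids, num_classes=3))
```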
