Paper-Daily-Notice
New submissions for Fri, 24 Jun 22
Keyword: SLAM
Algorithms for 2-connected network design and flexible Steiner trees with a constant number of terminals
- Authors: Ishan Bansal, Joe Cheriyan, Logan Grout, Sharat Ibrahimpur
- Subjects: Data Structures and Algorithms (cs.DS)
- Arxiv link: https://arxiv.org/abs/2206.11807
- Pdf link: https://arxiv.org/pdf/2206.11807
- Abstract The $k$-Steiner-2NCS problem is as follows: Given a constant $k$, an undirected connected graph $G = (V,E)$, non-negative costs $c$ on $E$, and a partition $(T, V-T)$ of $V$ into a set of terminals, $T$, and a set of non-terminals (or, Steiner nodes), where $|T|=k$, find a minimum-cost two-node connected subgraph that contains the terminals. We present a randomized polynomial-time algorithm for the unweighted problem, and a randomized PTAS for the weighted problem. We obtain similar results for the $k$-Steiner-2ECS problem, where the input is the same, and the algorithmic goal is to find a minimum-cost two-edge connected subgraph that contains the terminals. Our methods build on results by Björklund, Husfeldt, and Taslaman (ACM-SIAM SODA 2012) that give a randomized polynomial-time algorithm for the unweighted $k$-Steiner-cycle problem; this problem has the same inputs as the unweighted $k$-Steiner-2NCS problem, and the algorithmic goal is to find a minimum-size simple cycle $C$ that contains the terminals ($C$ may contain any number of Steiner nodes).
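For intuition about the problem statement (this is not the authors' randomized algorithm), here is a brute-force sketch on a made-up toy instance using networkx:

```python
# Brute-force illustration of the k-Steiner-2NCS objective: enumerate edge
# subsets and keep the cheapest subgraph that contains all terminals and is
# 2-node-connected. Exponential time -- for toy instances only.
import itertools
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 1), ("b", "c", 1), ("c", "d", 1),
    ("d", "a", 1), ("a", "c", 5), ("b", "d", 2),
])
terminals = {"a", "c"}  # |T| = k = 2

best_cost, best_edges = float("inf"), None
edges = list(G.edges(data="weight"))
for r in range(1, len(edges) + 1):
    for subset in itertools.combinations(edges, r):
        H = nx.Graph()
        H.add_weighted_edges_from(subset)
        cost = sum(w for _, _, w in subset)
        if terminals <= set(H) and nx.is_biconnected(H) and cost < best_cost:
            best_cost, best_edges = cost, subset

print(best_cost)  # 4: the 4-cycle a-b-c-d contains both terminals
```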
Keyword: odometry
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
LidarMultiNet: Unifying LiDAR Semantic Segmentation, 3D Object Detection, and Panoptic Segmentation in a Single Multi-task Network
- Authors: Dongqiangzi Ye, Weijia Chen, Zixiang Zhou, Yufei Xie, Yu Wang, Panqu Wang, Hassan Foroosh
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11428
- Pdf link: https://arxiv.org/pdf/2206.11428
- Abstract This technical report presents the 1st place winning solution for the Waymo Open Dataset 3D semantic segmentation challenge 2022. Our network, termed LidarMultiNet, unifies the major LiDAR perception tasks such as 3D semantic segmentation, object detection, and panoptic segmentation in a single framework. At the core of LidarMultiNet is a strong 3D voxel-based encoder-decoder network with a novel Global Context Pooling (GCP) module extracting global contextual features from a LiDAR frame to complement its local features. An optional second stage is proposed to refine the first-stage segmentation or generate accurate panoptic segmentation results. Our solution achieves a mIoU of 71.13 and is the best for most of the 22 classes on the Waymo 3D semantic segmentation test set, outperforming all the other 3D semantic segmentation methods on the official leaderboard. We demonstrate for the first time that major LiDAR perception tasks can be unified in a single strong network that can be trained end-to-end.
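As a rough illustration of what a global-context module of this kind might look like, here is a hedged PyTorch sketch; the module name, shapes, and the pool-project-broadcast design are our assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class GlobalContextPooling(nn.Module):
    """Pool a dense feature map globally and broadcast the context back,
    complementing local features (our reading of the GCP idea)."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features from the voxel encoder-decoder
        ctx = x.mean(dim=(2, 3))                 # (B, C) global context
        ctx = self.proj(ctx)[:, :, None, None]   # (B, C, 1, 1)
        return x + ctx.expand_as(x)              # add context at every location

feats = torch.randn(2, 64, 128, 128)
print(GlobalContextPooling(64)(feats).shape)  # torch.Size([2, 64, 128, 128])
```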
Keyword: loop detection
There is no result
Keyword: nerf
EventNeRF: Neural Radiance Fields from a Single Colour Event Camera
- Authors: Viktor Rudnev, Mohamed Elgharib, Christian Theobalt, Vladislav Golyanik
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11896
- Pdf link: https://arxiv.org/pdf/2206.11896
- Abstract Learning coordinate-based volumetric 3D scene representations such as neural radiance fields (NeRF) has so far been studied assuming RGB or RGB-D images as inputs. At the same time, it is known from the neuroscience literature that the human visual system (HVS) is tailored to process asynchronous brightness changes rather than synchronous RGB images, in order to build and continuously update mental 3D representations of the surroundings for navigation and survival. Event cameras are visual sensors inspired by HVS principles; events are thus sparse and asynchronous per-pixel brightness (or colour-channel) change signals. In contrast to existing works on neural 3D scene representation learning, this paper approaches the problem from a new perspective. We demonstrate that it is possible to learn NeRF suitable for novel-view synthesis in the RGB space from asynchronous event streams. Our models achieve high visual accuracy of the rendered novel views of challenging scenes in the RGB space, even though they are trained with substantially less data (i.e., event streams from a single event camera moving around the object) and more efficiently (due to the inherent sparsity of event streams) than the existing NeRF models trained with RGB images. We will release our datasets and the source code; see https://4dqv.mpi-inf.mpg.de/EventNeRF/.
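To make the supervision signal concrete, here is a hedged sketch of the kind of event-based photometric loss such a model could use, assuming an idealized event model with a fixed contrast threshold; this is our simplification, not the paper's exact objective:

```python
import torch

def event_loss(render_t0, render_t1, event_count, contrast_c=0.1, eps=1e-6):
    """render_t0/render_t1: rendered intensities (H, W) at two timestamps;
    event_count: signed per-pixel event sums between t0 and t1. Integrated
    events approximate the log-brightness difference (idealized model)."""
    pred_diff = torch.log(render_t1 + eps) - torch.log(render_t0 + eps)
    target_diff = contrast_c * event_count  # each event = one contrast step
    return torch.mean((pred_diff - target_diff) ** 2)

h = w = 32
loss = event_loss(torch.rand(h, w), torch.rand(h, w),
                  torch.randint(-3, 4, (h, w)).float())
print(loss.item())
```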
Keyword: mapping
Metric Optimization in Penner Coordinates
- Authors: Ryan Capouellez, Denis Zorin
- Subjects: Computational Geometry (cs.CG); Graphics (cs.GR)
- Arxiv link: https://arxiv.org/abs/2206.11456
- Pdf link: https://arxiv.org/pdf/2206.11456
- Abstract Many parametrization and mapping-related problems in geometry processing can be viewed as metric optimization problems, i.e., computing a metric minimizing a functional and satisfying a set of constraints, such as flatness. Penner coordinates are global coordinates on the space of metrics on meshes with a fixed vertex set and topology but varying connectivity. They make this space homeomorphic to the Euclidean space of dimension equal to the number of edges in the mesh, with no additional constraints imposed, and they reduce to logarithms of edge lengths when restricted to a fixed connectivity. These coordinates play an important role in the theory of discrete conformal maps, enabling the recent development of highly robust algorithms with convergence and solution-existence guarantees for computing such maps. We demonstrate how Penner coordinates can be used to solve a general class of problems involving metrics, including optimization and interpolation, while retaining the key guarantees available for conformal maps.
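A tiny numeric illustration of the "logarithms of edge lengths" remark, under the convention common in the discrete conformal mapping literature ($\lambda_e = 2\log \ell_e$; our assumption as to the exact scaling):

```python
import numpy as np

edge_lengths = np.array([1.0, 1.2, 0.8, 2.0, 1.5])  # one entry per mesh edge
penner = 2.0 * np.log(edge_lengths)                 # log coordinates: unconstrained
recovered = np.exp(penner / 2.0)                    # any real vector maps back to
print(np.allclose(recovered, edge_lengths))         # valid positive lengths: True
```

The point of the coordinates is visible even in this toy: the log space is all of Euclidean space, so optimizers need no positivity constraints.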
Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation
- Authors: Pihe Hu, Yu Chen, Longbo Huang
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11489
- Pdf link: https://arxiv.org/pdf/2206.11489
- Abstract We study reinforcement learning with linear function approximation, where the transition probability and reward functions are linear with respect to a feature mapping $\boldsymbol{\phi}(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computationally efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper-confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials, and a refined analysis of the correction term. To the best of our knowledge, this is the first minimax optimal algorithm for linear MDPs up to logarithmic factors, which closes the $\sqrt{Hd}$ gap between the best known upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ (Jin et al., 2020) and the lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
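A hedged sketch of the two ingredients the abstract names, variance-weighted ridge regression and an elliptical-potential exploration bonus; this is our simplification, not the paper's full LSVI-UCB$^+$ pseudocode:

```python
import numpy as np

class WeightedRidgeUCB:
    """Variance-weighted ridge regression with an elliptical bonus (sketch)."""
    def __init__(self, d: int, lam: float = 1.0, beta: float = 1.0):
        self.Sigma = lam * np.eye(d)   # weighted Gram matrix
        self.b = np.zeros(d)
        self.beta = beta               # bonus scale (assumed constant here)

    def update(self, phi, target, sigma2):
        w = 1.0 / sigma2               # Bernstein-style weight: trust low-variance samples more
        self.Sigma += w * np.outer(phi, phi)
        self.b += w * target * phi

    def q_with_bonus(self, phi):
        theta = np.linalg.solve(self.Sigma, self.b)
        bonus = self.beta * np.sqrt(phi @ np.linalg.solve(self.Sigma, phi))  # ||phi||_{Sigma^-1}
        return phi @ theta + bonus     # optimistic value estimate

rng = np.random.default_rng(0)
est = WeightedRidgeUCB(d=4)
for _ in range(100):
    phi = rng.normal(size=4)
    est.update(phi, target=phi.sum() + rng.normal(0, 0.1), sigma2=0.01 + rng.random())
print(est.q_with_bonus(rng.normal(size=4)))
```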
The role of open data in transforming the society to Society 5.0: a resource or a tool for SDG-compliant Smart Living?
- Authors: Anastasija Nikiforova, Miguel Angel Alor, Miltiadis D. Lytras
- Subjects: Computers and Society (cs.CY); Computational Engineering, Finance, and Science (cs.CE)
- Arxiv link: https://arxiv.org/abs/2206.11784
- Pdf link: https://arxiv.org/pdf/2206.11784
- Abstract Open data are characterized by a number of economic, technological, innovative, and social benefits. They are seen as a significant contributor to a city's transformation into a Smart City. This is all the more so when society is on the verge of Society 5.0, i.e., the shift from the information society to a super-smart society, or society of imagination. However, the question constantly asked by open data experts is: what key factors must be met and satisfied in order to achieve the promised benefits? The current trend of openness suggests that the principle of openness should be followed not only by data but also by research, education, software, standards, hardware, etc.; it should become a philosophy to be followed at different levels and in different domains. This should ensure greater transparency, eliminate inequalities, and promote and achieve sustainable development goals. Therefore, many agendas now have openness as a prerequisite. This chapter deals with the concepts of open (government) data and Society 5.0, pointing to their common objectives, providing some success stories of open data use in smart cities or in the transformation of cities towards smart cities, and mapping them to the features of Society 5.0. We believe that this trend develops a new form of society, which we refer to as the "open data-driven society". It forms a bridge from Society 4.0 to Society 5.0. This chapter attempts to identify the role of openness in promoting a human-centric Smart Society, Smart City, and Smart Living.
Provably Efficient Model-Free Constrained RL with Linear Function Approximation
- Authors: Arnob Ghosh, Xingyu Zhou, Ness Shroff
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2206.11889
- Pdf link: https://arxiv.org/pdf/2206.11889
- Abstract We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied by a "simulator", we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve zero constraint violation while still maintaining the same order with respect to $T$.
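A toy sketch of the two modifications the abstract highlights, a soft-max policy over the Lagrangian-combined value and a projected dual update; this is our tabular stand-in, not the paper's linear-MDP algorithm:

```python
import numpy as np

def softmax_policy(q_r, q_g, dual_var, temperature=1.0):
    """Distribution over actions; replaces greedy argmax to gain smoothness."""
    logits = (q_r + dual_var * q_g) / temperature  # reward + dual * utility
    logits -= logits.max()                         # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def dual_update(dual_var, utility_estimate, threshold, lr=0.1, cap=10.0):
    """Projected gradient step on the Lagrange multiplier: the multiplier
    grows when the utility constraint (utility >= threshold) is violated."""
    return float(np.clip(dual_var - lr * (utility_estimate - threshold), 0.0, cap))

q_r, q_g = np.array([1.0, 2.0, 0.5]), np.array([0.2, -1.0, 0.8])
dual = 1.0
print(softmax_policy(q_r, q_g, dual))
print(dual_update(dual, utility_estimate=0.1, threshold=0.5))  # dual increases
```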
Keyword: localization
Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization
- Authors: Kun Xia, Le Wang, Sanping Zhou, Nanning Zheng, Wei Tang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11493
- Pdf link: https://arxiv.org/pdf/2206.11493
- Abstract The main challenge of Temporal Action Localization is to retrieve subtle human actions from various co-occurring ingredients, e.g., context and background, in an untrimmed video. While prior approaches have achieved substantial progress through devising advanced action detectors, they still suffer from these co-occurring ingredients, which often dominate the actual action content in videos. In this paper, we explore two orthogonal but complementary aspects of a video snippet, i.e., the action features and the co-occurrence features. In particular, we develop a novel auxiliary task by decoupling these two types of features within a video snippet and recombining them to generate a new feature representation with more salient action information for accurate action localization. We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features, and then synthesizes a new action-dominated video representation. Extensive experimental results and ablation studies on THUMOS14 and ActivityNet v1.3 demonstrate that our new representation, combined with a simple action detector, can significantly improve the action localization performance.
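A minimal PyTorch sketch of the decouple-and-recombine idea as we read it from the abstract; the module name, heads, and mixing weight are our assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class FeatureRefactor(nn.Module):
    """Split a snippet feature into action and co-occurrence parts, then
    recombine with the co-occurrence part down-weighted (illustrative only)."""
    def __init__(self, dim: int, cooccur_weight: float = 0.2):
        super().__init__()
        self.action_head = nn.Linear(dim, dim)
        self.cooccur_head = nn.Linear(dim, dim)
        self.cooccur_weight = cooccur_weight

    def forward(self, snippet):          # snippet: (B, T, dim)
        action = self.action_head(snippet)
        cooccur = self.cooccur_head(snippet)
        refactored = action + self.cooccur_weight * cooccur  # action-dominated
        return refactored, action, cooccur

x = torch.randn(2, 16, 256)
out, _, _ = FeatureRefactor(256)(x)
print(out.shape)  # torch.Size([2, 16, 256])
```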
A Neuromorphic Vision-Based Measurement for Robust Relative Localization in Future Space Exploration Missions
- Authors: Mohammed Salah, Mohammed Chehadah, Muhammed Humais, Mohammed Wahbah, Abdulla Ayyad, Rana Azzam, Lakmal Senevirante, Yahya Zweiri
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11541
- Pdf link: https://arxiv.org/pdf/2206.11541
- Abstract Space exploration has witnessed revolutionary changes with the landing of the Perseverance Rover on the Martian surface and the first flight beyond Earth by the Mars helicopter, Ingenuity. During their mission on Mars, the Perseverance Rover and Ingenuity collaboratively explore the Martian surface, with Ingenuity scouting terrain information for the rover's safe traversability. Hence, determining the relative poses between the two platforms is of paramount importance for the success of this mission. Driven by this necessity, this work proposes a robust relative localization system based on a fusion of neuromorphic vision-based measurements (NVBMs) and inertial measurements. The emergence of neuromorphic vision triggered a paradigm shift in the computer vision community, due to its unique working principle: asynchronous events triggered by variations of light intensity in the scene. This implies that observations cannot be acquired in static scenes due to illumination invariance. To circumvent this limitation, high-frequency active landmarks are inserted in the scene to guarantee consistent event firing. These landmarks are adopted as salient features to facilitate relative localization. A novel event-based landmark identification algorithm using Gaussian Mixture Models (GMMs) is developed for matching landmark correspondences, forming our NVBMs. The NVBMs are fused with inertial measurements in two proposed state estimators, a landmark-tracking Kalman filter (LTKF) and a translation-decoupled Kalman filter (TDKF), for landmark tracking and relative localization, respectively. The proposed system was tested in a variety of experiments and outperformed state-of-the-art approaches in accuracy and range.
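As a sketch of the GMM-based landmark identification step (our reading of the abstract; the landmark layout and noise parameters are made up), one can cluster the (x, y) coordinates of recent events so that each Gaussian component tracks one blinking landmark:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
landmarks = np.array([[20.0, 20.0], [80.0, 30.0], [50.0, 90.0]])  # true LED positions
# Simulate event bursts around each blinking landmark.
events = np.vstack([c + rng.normal(0, 1.5, size=(300, 2)) for c in landmarks])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(events)
est = gmm.means_[np.argsort(gmm.means_[:, 0])]  # order by x for comparison
print(est)  # recovered landmark centroids, close to the true positions
```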
Keyword: transformer
GACT: Activation Compressed Training for General Architectures
- Authors: Xiaoxuan Liu, Lianmin Zheng, Dequan Wang, Yukuo Cen, Weize Chen, Xu Han, Jianfei Chen, Zhiyuan Liu, Jie Tang, Joey Gonzalez, Michael Mahoney, Alvin Cheung
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11357
- Pdf link: https://arxiv.org/pdf/2206.11357
- Abstract Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce training memory footprint. This paper presents GACT, an ACT framework to support a broad range of machine learning tasks for generic NN architectures with limited domain knowledge. By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of GACT without prior knowledge on operator type or model architecture. To make training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time. We implement GACT as a PyTorch library that readily applies to any NN architecture. GACT reduces the activation memory for convolutional NNs, transformers, and graph NNs by up to 8.1x, enabling training with a 4.2x to 24.7x larger batch size, with negligible accuracy loss.
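A hedged toy sketch of the run-time idea the abstract describes, giving more bits to the tensors whose quantization error hurts the gradient most; this is our illustration, not the GACT library API:

```python
import numpy as np

def quantize(x, bits):
    """Uniform min-max quantization of an activation tensor."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    q = np.round((x - lo) / (hi - lo + 1e-12) * levels)
    return lo + q / levels * (hi - lo)

def sensitivity(x, grad_out, bits):
    """Proxy for the gradient error caused by compressing x to `bits` bits."""
    return np.linalg.norm(grad_out * (x - quantize(x, bits)))

rng = np.random.default_rng(0)
acts = {name: rng.normal(size=1000) * s for name, s in
        [("conv1", 1.0), ("attn", 5.0), ("mlp", 0.2)]}
grads = {name: rng.normal(size=1000) for name in acts}

budget_bits = {name: 2 for name in acts}
for _ in range(3):  # spend 3 extra bits greedily on the most sensitive tensors
    worst = max(acts, key=lambda n: sensitivity(acts[n], grads[n], budget_bits[n]))
    budget_bits[worst] += 1
print(budget_bits)  # wide-range tensors end up with higher bit-widths
```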
ICOS Protein Expression Segmentation: Can Transformer Networks Give Better Results?
- Authors: Vivek Kumar Singh, Paul O Reilly, Jacqueline James, Manuel Salto Tellez, Perry Maxwell
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2206.11520
- Pdf link: https://arxiv.org/pdf/2206.11520
- Abstract Biomarkers identify a patient's response to treatment. Despite recent advances in artificial intelligence based on Transformer networks, only limited research has been done to measure their performance on challenging histopathology images. In this paper, we investigate the efficacy of numerous state-of-the-art Transformer networks for segmentation of the immune-checkpoint biomarker, Inducible T-cell COStimulator (ICOS), protein cells in colon cancer from immunohistochemistry (IHC) slides. Extensive and comprehensive experimental results confirm that MiSSFormer achieved the highest Dice score of 74.85%, outperforming the other evaluated Transformer and Efficient U-Net methods.
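For reference, the Dice score used to rank these methods is the standard overlap metric; a minimal computation for a binary mask (generic definition, not code from the paper):

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
gt   = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
print(round(dice(pred, gt), 3))  # 0.667
```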
Towards Green ASR: Lossless 4-bit Quantization of a Hybrid TDNN System on the 300-hr Switchboard Corpus
- Authors: Junhao Xu, Shoukang Hu, Xunying Liu, Helen Meng
- Subjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2206.11643
- Pdf link: https://arxiv.org/pdf/2206.11643
- Abstract State-of-the-art automatic speech recognition (ASR) systems are becoming increasingly complex and expensive for practical applications. This paper presents the development of a high-performance, low-footprint, 4-bit quantized LF-MMI trained factored time delay neural network (TDNN) based ASR system on the 300-hr Switchboard corpus. A key feature of the overall system design is to account for the fine-grained, varying performance sensitivity of different model components to quantization errors. To this end, a set of neural architectural compression and mixed precision quantization approaches were used to facilitate hidden-layer-level auto-configuration of optimal factored TDNN weight matrix subspace dimensionality and quantization bit-widths. The proposed techniques were also used to produce 2-bit mixed precision quantized Transformer language models. Experiments conducted on the Switchboard data suggest that the proposed neural architectural compression and mixed precision quantization techniques consistently outperform the uniform-precision quantized baseline systems of comparable bit-widths in terms of word error rate (WER). An overall "lossless" compression ratio of 13.6 was obtained over the baseline full-precision system, including both the TDNN and Transformer components, while incurring no statistically significant WER increase.
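To ground the bit-width discussion, here is a minimal symmetric uniform quantizer at configurable precision (our simplified sketch; the paper learns per-layer bit-widths rather than applying one quantizer everywhere):

```python
import numpy as np

def quantize_weights(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization of a weight matrix to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1             # signed integer range
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
for bits in (8, 4, 2):
    err = np.linalg.norm(w - quantize_weights(w, bits)) / np.linalg.norm(w)
    print(bits, round(float(err), 4))      # relative error grows as bits shrink
```

The print-out makes the paper's motivation visible: different components tolerate very different bit-widths before the error becomes damaging.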
Graph Neural Networks for Temperature-Dependent Activity Coefficient Prediction of Solutes in Ionic Liquids
- Authors: Jan G. Rittig, Karim Ben Hicham, Artur M. Schweidtmann, Manuel Dahmen, Alexander Mitsos
- Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
- Arxiv link: https://arxiv.org/abs/2206.11776
- Pdf link: https://arxiv.org/pdf/2206.11776
- Abstract Ionic liquids (ILs) are important solvents for sustainable processes and predicting activity coefficients (ACs) of solutes in ILs is needed. Recently, matrix completion methods (MCMs), transformers, and graph neural networks (GNNs) have shown high accuracy in predicting ACs of binary mixtures, superior to well-established models, e.g., COSMO-RS and UNIFAC. GNNs are particularly promising here as they learn a molecular graph-to-property relationship without pretraining, typically required for transformers, and are, unlike MCMs, applicable to molecules not included in training. For ILs, however, GNN applications are currently missing. Herein, we present a GNN to predict temperature-dependent infinite dilution ACs of solutes in ILs. We train the GNN on a database including more than 40,000 AC values and compare it to a state-of-the-art MCM. The GNN and MCM achieve similar high prediction performance, with the GNN additionally enabling high-quality predictions for ACs of solutions that contain ILs and solutes not considered during training.
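A toy PyTorch sketch of a plausible setup for temperature-dependent AC prediction; the single message-passing round, mean pooling, and temperature concatenation are our assumptions of a typical design, not the paper's model:

```python
import torch
import torch.nn as nn

class TinyGNN(nn.Module):
    def __init__(self, node_dim=16, hidden=32):
        super().__init__()
        self.msg = nn.Linear(node_dim, node_dim)
        self.head = nn.Sequential(nn.Linear(2 * node_dim + 1, hidden),
                                  nn.ReLU(), nn.Linear(hidden, 1))

    def embed(self, x, adj):                    # x: (N, node_dim), adj: (N, N)
        x = torch.relu(x + adj @ self.msg(x))   # one message-passing round
        return x.mean(dim=0)                    # mean-pool to molecule embedding

    def forward(self, solute, adj_s, solvent, adj_v, temperature):
        h = torch.cat([self.embed(solute, adj_s), self.embed(solvent, adj_v),
                       temperature.view(1)])    # append T to the joint embedding
        return self.head(h)                     # predicted infinite-dilution AC

model = TinyGNN()
x1, a1 = torch.randn(5, 16), torch.eye(5)       # stand-in solute graph
x2, a2 = torch.randn(7, 16), torch.eye(7)       # stand-in IL graph
print(model(x1, a1, x2, a2, torch.tensor(298.15)).item())
```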
Toward Clinically Assisted Colorectal Polyp Recognition via Structured Cross-modal Representation Consistency
- Authors: Weijie Ma, Ye Zhu, Ruimao Zhang, Jie Yang, Yiwen Hu, Zhen Li, Li Xiang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11826
- Pdf link: https://arxiv.org/pdf/2206.11826
- Abstract Colorectal polyp classification is a critical clinical examination. To improve the classification accuracy, most computer-aided diagnosis algorithms recognize colorectal polyps by adopting Narrow-Band Imaging (NBI). However, NBI usually goes underutilized in real clinical scenarios, since acquiring this specific image requires manually switching the light mode after polyps have been detected using White-Light (WL) images. To avoid the above situation, we propose a novel method to directly achieve accurate white-light colonoscopy image classification by enforcing structured cross-modal representation consistency. In practice, a pair of multi-modal images, i.e., NBI and WL, are fed into a shared Transformer to extract hierarchical feature representations. Then a newly designed Spatial Attention Module (SAM) is adopted to calculate the similarities between the class token and patch tokens at multiple levels for a specific modality image. By aligning the class tokens and spatial attention maps of paired NBI and WL images at different levels, the Transformer achieves the ability to keep both global and local representation consistency for the above two modalities. Extensive experimental results illustrate that the proposed method outperforms recent studies by a clear margin, realizing multi-modal prediction with a single Transformer while greatly improving the classification accuracy when only WL images are available.
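A sketch of what a structured cross-modal consistency objective could look like, per our reading of the abstract; the specific losses (MSE for class tokens, KL for attention maps) and names are our assumptions:

```python
import torch
import torch.nn.functional as F

def consistency_loss(cls_nbi, cls_wl, attn_nbi, attn_wl):
    """cls_*: lists of (B, D) class tokens, one per Transformer level;
    attn_*: lists of (B, N) attention scores over patch tokens per level."""
    loss = torch.zeros(())
    for c1, c2, a1, a2 in zip(cls_nbi, cls_wl, attn_nbi, attn_wl):
        loss = loss + F.mse_loss(c1, c2)                    # global alignment
        loss = loss + F.kl_div(a1.log_softmax(-1),          # local alignment of
                               a2.softmax(-1),              # spatial attention
                               reduction="batchmean")
    return loss

levels, B, D, N = 3, 2, 64, 49
cls_a = [torch.randn(B, D) for _ in range(levels)]
cls_b = [torch.randn(B, D) for _ in range(levels)]
at_a = [torch.randn(B, N) for _ in range(levels)]
at_b = [torch.randn(B, N) for _ in range(levels)]
print(consistency_loss(cls_a, cls_b, at_a, at_b).item())
```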
Romantic-Computing
- Authors: Elizabeth Horishny
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2206.11864
- Pdf link: https://arxiv.org/pdf/2206.11864
- Abstract In this paper we compare various text generation models' ability to write poetry in the style of early English Romanticism. These models include: Character-Level Recurrent Neural Networks with Long Short-Term Memory, Hugging Face's GPT-2, OpenAI's GPT-3, and EleutherAI's GPT-NEO. Quality was measured based on syllable count and on coherence with the automatic evaluation metric GRUEN. Character-Level Recurrent Neural Networks performed far worse than the transformer models, and as parameter size increased, the quality of the transformer models' poems improved. These models are typically not compared in a creative context, and we are happy to contribute.
A Multi-Policy Framework for Deep Learning-Based Fake News Detection
- Authors: João Vitorino, Tiago Dias, Tiago Fonseca, Nuno Oliveira, Isabel Praça
- Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11866
- Pdf link: https://arxiv.org/pdf/2206.11866
- Abstract Connectivity plays an ever-increasing role in modern society, with people all around the world having easy access to rapidly disseminated information. However, a more interconnected society enables the spread of intentionally false information. To mitigate the negative impacts of fake news, it is essential to improve detection methodologies. This work introduces the Multi-Policy Statement Checker (MPSC), a framework that automates fake news detection by using deep learning techniques to analyze a statement itself and its related news articles, predicting whether it is seemingly credible or suspicious. The proposed framework was evaluated using four merged datasets containing real and fake news. Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU) and Bidirectional Encoder Representations from Transformers (BERT) models were trained to utilize both lexical and syntactic features, and their performance was evaluated. The obtained results demonstrate that a multi-policy analysis reliably identifies suspicious statements, which can be advantageous for fake news detection.
Lifelong Learning Natural Language Processing Approach for Multilingual Data Classification
- Authors: Jędrzej Kozal, Michał Leś, Paweł Zyblewski, Paweł Ksieniewicz, Michał Woźniak
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11867
- Pdf link: https://arxiv.org/pdf/2206.11867
- Abstract The abundance of information in digital media, which in today's world is the main source of knowledge about current events for the masses, makes it possible to spread disinformation on a larger scale than ever before. Consequently, there is a need to develop novel fake news detection approaches capable of adapting to changing factual contexts and generalizing previously or concurrently acquired knowledge. To deal with this problem, we propose a lifelong learning-inspired approach, which allows for fake news detection in multiple languages and the mutual transfer of knowledge acquired in each of them. Both classical feature extractors, such as Term Frequency-Inverse Document Frequency or Latent Dirichlet Allocation, and integrated deep NLP (Natural Language Processing) BERT (Bidirectional Encoder Representations from Transformers) models paired with an MLP (Multilayer Perceptron) classifier were employed. The results of experiments conducted on two datasets dedicated to the fake news classification task (in English and Spanish, respectively), supported by statistical analysis, confirmed that utilizing additional languages can improve the performance of traditional methods. In some cases, supplementing the deep learning method with classical ones can also positively impact the obtained results. The ability of the models to generalize the knowledge acquired across the analyzed languages was also observed.
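For the classical branch the abstract lists, here is a minimal scikit-learn stand-in (toy data and labels are ours; the paper's multilingual lifelong-learning pipeline is considerably richer):

```python
# TF-IDF features feeding an MLP classifier for toy fake-news labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

texts = ["the moon landing was staged", "parliament passed the budget today",
         "aliens run the central bank", "new rail line opens next spring"]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = real (illustrative toy data)

clf = make_pipeline(TfidfVectorizer(),
                    MLPClassifier(hidden_layer_sizes=(16,),
                                  max_iter=500, random_state=0))
clf.fit(texts, labels)
print(clf.predict(["the bank is secretly run by aliens"]))
```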
On the Parameterization and Initialization of Diagonal State Space Models
- Authors: Albert Gu, Ankit Gupta, Karan Goel, Christopher Ré
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11893
- Pdf link: https://arxiv.org/pdf/2206.11893
- Abstract State space models (SSMs) have recently been shown to be very effective as a deep learning layer, providing a promising alternative to sequence models such as RNNs, CNNs, or Transformers. The first version to show this potential was the S4 model, which is particularly effective on tasks involving long-range dependencies by using a prescribed state matrix called the HiPPO matrix. While this has an interpretable mathematical mechanism for modeling long dependencies, it introduces a custom representation and algorithm that can be difficult to implement. On the other hand, a recent variant of S4 called DSS showed that restricting the state matrix to be fully diagonal can still preserve the performance of the original model when using a specific initialization based on approximating S4's matrix. This work seeks to systematically understand how to parameterize and initialize such diagonal state space models. While it follows from classical results that almost all SSMs have an equivalent diagonal form, we show that the initialization is critical for performance. We explain why DSS works mathematically, by showing that the diagonal restriction of S4's matrix surprisingly recovers the same kernel in the limit of infinite state dimension. We also systematically describe various design choices in parameterizing and computing diagonal SSMs, and perform a controlled empirical study ablating the effects of these choices. Our final model, S4D, is a simple diagonal version of S4 whose kernel computation requires just 2 lines of code; it performs comparably to S4 in almost all settings, with state-of-the-art results for image, audio, and medical time-series domains, and averages 85% on the Long Range Arena benchmark.
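The "2 lines of code" claim invites a concrete sketch. Below is our hedged reconstruction of a diagonal SSM convolution kernel; the ZOH-style discretization, shapes, and the initialization are assumptions based on the abstract, not the authors' exact code:

```python
import numpy as np

def s4d_kernel(A, B, C, dt, L):
    """A: (N,) complex diagonal state matrix; B, C: (N,); returns real (L,).
    The kernel is K_l = sum_n C_n * Bbar_n * exp(dt*A_n)^l, a Vandermonde
    product -- this is the 'two lines' at the heart of the computation."""
    dtA = A * dt
    K = (C * B * (np.exp(dtA) - 1) / A) @ np.exp(dtA[:, None] * np.arange(L))
    return 2 * K.real  # assume conjugate-pair states; keep the real part

N, L = 16, 32
A = -0.5 + 1j * np.pi * np.arange(N)   # a simple diagonal initialization (assumed)
K = s4d_kernel(A, np.ones(N), np.random.randn(N) + 0j, dt=0.01, L=L)
print(K.shape)  # (32,): convolve this kernel with the input sequence
```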
MaskViT: Masked Visual Pre-Training for Video Prediction
- Authors: Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2206.11894
- Pdf link: https://arxiv.org/pdf/2206.11894
- Abstract The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
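A minimal sketch of iterative decoding with a mask scheduling function, in the spirit of the abstract; we use a cosine schedule and a random token filler as plausible stand-ins for the model's actual sampler:

```python
import math
import numpy as np

def mask_schedule(step, total_steps):
    """Fraction of tokens still masked after `step` refinement iterations."""
    return math.cos(math.pi / 2 * (step + 1) / total_steps)

rng = np.random.default_rng(0)
num_tokens, steps = 256, 8
tokens = np.full(num_tokens, -1)                  # -1 = masked
for step in range(steps):
    keep_masked = int(mask_schedule(step, steps) * num_tokens)
    masked_idx = np.flatnonzero(tokens == -1)
    # Fill enough masked positions to match the schedule for this step.
    fill = rng.choice(masked_idx, size=len(masked_idx) - keep_masked, replace=False)
    tokens[fill] = rng.integers(0, 1024, size=fill.size)  # stand-in for model samples
print((tokens == -1).sum())  # 0: all tokens generated after the final step
```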
Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space
- Authors: Jinghuan Shang, Srijan Das, Michael S. Ryoo
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11895
- Pdf link: https://arxiv.org/pdf/2206.11895
- Abstract Humans are remarkably flexible in understanding viewpoint changes, owing to the visual cortex's support for the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, neither these Transformers nor 2D convolutional networks perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix that impose geometric transformations on the tokens. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks, including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html
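A hedged PyTorch sketch of the idea as described, estimating a pseudo-depth per token, back-projecting with a learned camera, and using the recovered 3D positions as a positional signal; all names, shapes, and the additive embedding are our assumptions:

```python
import torch
import torch.nn as nn

class Token3DLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.depth = nn.Linear(dim, 1)       # pseudo-depth estimator per token
        self.camera = nn.Linear(dim, 12)     # learned 3x4 camera per image
        self.pos_embed = nn.Linear(3, dim)   # embed recovered 3D positions

    def forward(self, tokens, uv):           # tokens: (B, N, D), uv: (B, N, 2)
        d = self.depth(tokens)                                   # (B, N, 1)
        cam = self.camera(tokens.mean(1)).view(-1, 3, 4)         # (B, 3, 4)
        homog = torch.cat([uv * d, d, torch.ones_like(d)], -1)   # (B, N, 4)
        xyz = torch.einsum("bij,bnj->bni", cam, homog)           # 3D positions
        return tokens + self.pos_embed(xyz)

tok, uv = torch.randn(2, 49, 64), torch.rand(2, 49, 2)
print(Token3DLayer(64)(tok, uv).shape)  # torch.Size([2, 49, 64])
```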
Keyword: autonomous driving
A Novel Algorithm for Exact Concave Hull Extraction
- Authors: Kevin Christopher VanHorn, Murat Can Çobanoğlu
- Subjects: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11481
- Pdf link: https://arxiv.org/pdf/2206.11481
- Abstract Region extraction is necessary in a wide range of applications, from object detection in autonomous driving to analysis of subcellular morphology in cell biology. There are two main approaches: convex hull extraction, for which exact and efficient algorithms exist, and concave hull extraction, which better captures real-world shapes but does not have a single solution. Especially in the context of a uniform grid, concave hull algorithms are largely approximate, sacrificing region integrity for spatial and temporal efficiency. In this study, we present a novel algorithm that can provide vertex-minimized concave hulls with maximal (i.e., pixel-perfect) resolution and is tunable for speed-efficiency tradeoffs. Our method provides advantages in multiple downstream applications, including data compression, retrieval, visualization, and analysis. To demonstrate the practical utility of our approach, we focus on image compression. We demonstrate significant improvements through context-dependent compression of disparate regions within a single image (entropy encoding for noisy regions and predictive encoding for structured regions). We show that these improvements hold from biomedical images to natural images. Beyond image compression, our algorithm can be applied more broadly to aid in a wide range of practical applications for data retrieval, visualization, and analysis.
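For contrast with the concave setting, the convex hull side of the dichotomy is a one-call exact computation in standard SciPy (the paper's concave algorithm is its own method and is not reproduced here):

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.random((100, 2))
hull = ConvexHull(points)
print(hull.vertices)          # indices of hull vertices, counter-clockwise
print(round(hull.volume, 3))  # for 2-D inputs, `volume` is the enclosed area
```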