Paper-Daily-Notice
New submissions for Fri, 24 Jun 22
Keyword: SLAM
Algorithms for 2-connected network design and flexible Steiner trees with a constant number of terminals
- Authors: Ishan Bansal, Joe Cheriyan, Logan Grout, Sharat Ibrahimpur
- Subjects: Data Structures and Algorithms (cs.DS)
- Arxiv link: https://arxiv.org/abs/2206.11807
- Pdf link: https://arxiv.org/pdf/2206.11807
- Abstract The $k$-Steiner-2NCS problem is as follows: Given a constant $k$, an undirected connected graph $G = (V,E)$, non-negative costs $c$ on $E$, and a partition $(T, V-T)$ of $V$ into a set of terminals, $T$, and a set of non-terminals (or, Steiner nodes), where $|T|=k$, find a minimum-cost two-node connected subgraph that contains the terminals. We present a randomized polynomial-time algorithm for the unweighted problem, and a randomized PTAS for the weighted problem. We obtain similar results for the $k$-Steiner-2ECS problem, where the input is the same, and the algorithmic goal is to find a minimum-cost two-edge connected subgraph that contains the terminals. Our methods build on results by Björklund, Husfeldt, and Taslaman (ACM-SIAM SODA 2012) that give a randomized polynomial-time algorithm for the unweighted $k$-Steiner-cycle problem; this problem has the same inputs as the unweighted $k$-Steiner-2NCS problem, and the algorithmic goal is to find a minimum-size simple cycle $C$ that contains the terminals ($C$ may contain any number of Steiner nodes).
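For intuition about the problem statement (this is not the authors' randomized algorithm), here is a brute-force sketch on a made-up toy instance using networkx:

```python
# Brute-force illustration of the k-Steiner-2NCS objective: enumerate edge
# subsets and keep the cheapest subgraph that contains all terminals and is
# 2-node-connected. Exponential time -- for toy instances only.
import itertools
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 1), ("b", "c", 1), ("c", "d", 1),
    ("d", "a", 1), ("a", "c", 5), ("b", "d", 2),
])
terminals = {"a", "c"}  # |T| = k = 2

best_cost, best_edges = float("inf"), None
edges = list(G.edges(data="weight"))
for r in range(1, len(edges) + 1):
    for subset in itertools.combinations(edges, r):
        H = nx.Graph()
        H.add_weighted_edges_from(subset)
        cost = sum(w for _, _, w in subset)
        if terminals <= set(H) and nx.is_biconnected(H) and cost < best_cost:
            best_cost, best_edges = cost, subset

print(best_cost)  # 4: the 4-cycle a-b-c-d contains both terminals
```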
Keyword: odometry
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
LidarMultiNet: Unifying LiDAR Semantic Segmentation, 3D Object Detection, and Panoptic Segmentation in a Single Multi-task Network
- Authors: Dongqiangzi Ye, Weijia Chen, Zixiang Zhou, Yufei Xie, Yu Wang, Panqu Wang, Hassan Foroosh
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11428
- Pdf link: https://arxiv.org/pdf/2206.11428
- Abstract This technical report presents the 1st place winning solution for the Waymo Open Dataset 3D semantic segmentation challenge 2022. Our network, termed LidarMultiNet, unifies the major LiDAR perception tasks such as 3D semantic segmentation, object detection, and panoptic segmentation in a single framework. At the core of LidarMultiNet is a strong 3D voxel-based encoder-decoder network with a novel Global Context Pooling (GCP) module extracting global contextual features from a LiDAR frame to complement its local features. An optional second stage is proposed to refine the first-stage segmentation or generate accurate panoptic segmentation results. Our solution achieves a mIoU of 71.13 and is the best for most of the 22 classes on the Waymo 3D semantic segmentation test set, outperforming all the other 3D semantic segmentation methods on the official leaderboard. We demonstrate for the first time that major LiDAR perception tasks can be unified in a single strong network that can be trained end-to-end.
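As a rough illustration of what a global-context module of this kind might look like, here is a hedged PyTorch sketch; the module name, shapes, and the pool-project-broadcast design are our assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class GlobalContextPooling(nn.Module):
    """Pool a dense feature map globally and broadcast the context back,
    complementing local features (our reading of the GCP idea)."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features from the voxel encoder-decoder
        ctx = x.mean(dim=(2, 3))                 # (B, C) global context
        ctx = self.proj(ctx)[:, :, None, None]   # (B, C, 1, 1)
        return x + ctx.expand_as(x)              # add context at every location

feats = torch.randn(2, 64, 128, 128)
print(GlobalContextPooling(64)(feats).shape)  # torch.Size([2, 64, 128, 128])
```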
Keyword: loop detection
There is no result
Keyword: nerf
EventNeRF: Neural Radiance Fields from a Single Colour Event Camera
- Authors: Viktor Rudnev, Mohamed Elgharib, Christian Theobalt, Vladislav Golyanik
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11896
- Pdf link: https://arxiv.org/pdf/2206.11896
- Abstract Learning coordinate-based volumetric 3D scene representations such as neural radiance fields (NeRF) has so far been studied assuming RGB or RGB-D images as inputs. At the same time, it is known from the neuroscience literature that the human visual system (HVS) is tailored to process asynchronous brightness changes rather than synchronous RGB images, in order to build and continuously update mental 3D representations of the surroundings for navigation and survival. Event cameras are visual sensors inspired by HVS principles; events are thus sparse and asynchronous per-pixel brightness (or colour-channel) change signals. In contrast to existing works on neural 3D scene representation learning, this paper approaches the problem from a new perspective. We demonstrate that it is possible to learn NeRF suitable for novel-view synthesis in the RGB space from asynchronous event streams. Our models achieve high visual accuracy of the rendered novel views of challenging scenes in the RGB space, even though they are trained with substantially less data (i.e., event streams from a single event camera moving around the object) and more efficiently (due to the inherent sparsity of event streams) than the existing NeRF models trained with RGB images. We will release our datasets and the source code; see https://4dqv.mpi-inf.mpg.de/EventNeRF/.
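To make the supervision signal concrete, here is a hedged sketch of the kind of event-based photometric loss such a model could use, assuming an idealized event model with a fixed contrast threshold; this is our simplification, not the paper's exact objective:

```python
import torch

def event_loss(render_t0, render_t1, event_count, contrast_c=0.1, eps=1e-6):
    """render_t0/render_t1: rendered intensities (H, W) at two timestamps;
    event_count: signed per-pixel event sums between t0 and t1. Integrated
    events approximate the log-brightness difference (idealized model)."""
    pred_diff = torch.log(render_t1 + eps) - torch.log(render_t0 + eps)
    target_diff = contrast_c * event_count  # each event = one contrast step
    return torch.mean((pred_diff - target_diff) ** 2)

h = w = 32
loss = event_loss(torch.rand(h, w), torch.rand(h, w),
                  torch.randint(-3, 4, (h, w)).float())
print(loss.item())
```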
Keyword: mapping
Metric Optimization in Penner Coordinates
- Authors: Ryan Capouellez, Denis Zorin
- Subjects: Computational Geometry (cs.CG); Graphics (cs.GR)
- Arxiv link: https://arxiv.org/abs/2206.11456
- Pdf link: https://arxiv.org/pdf/2206.11456
- Abstract Many parametrization and mapping-related problems in geometry processing can be viewed as metric optimization problems, i.e., computing a metric minimizing a functional and satisfying a set of constraints, such as flatness. Penner coordinates are global coordinates on the space of metrics on meshes with a fixed vertex set and topology but varying connectivity. They make this space homeomorphic to the Euclidean space of dimension equal to the number of edges in the mesh, with no additional constraints imposed, and they reduce to logarithms of edge lengths when restricted to a fixed connectivity. These coordinates play an important role in the theory of discrete conformal maps, enabling the recent development of highly robust algorithms with convergence and solution-existence guarantees for computing such maps. We demonstrate how Penner coordinates can be used to solve a general class of problems involving metrics, including optimization and interpolation, while retaining the key guarantees available for conformal maps.
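A tiny numeric illustration of the "logarithms of edge lengths" remark, under the convention common in the discrete conformal mapping literature ($\lambda_e = 2\log \ell_e$; our assumption as to the exact scaling):

```python
import numpy as np

edge_lengths = np.array([1.0, 1.2, 0.8, 2.0, 1.5])  # one entry per mesh edge
penner = 2.0 * np.log(edge_lengths)                 # log coordinates: unconstrained
recovered = np.exp(penner / 2.0)                    # any real vector maps back to
print(np.allclose(recovered, edge_lengths))         # valid positive lengths: True
```

The point of the coordinates is visible even in this toy: the log space is all of Euclidean space, so optimizers need no positivity constraints.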
Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation
- Authors: Pihe Hu, Yu Chen, Longbo Huang
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11489
- Pdf link: https://arxiv.org/pdf/2206.11489
- Abstract We study reinforcement learning with linear function approximation, where the transition probability and reward functions are linear with respect to a feature mapping $\boldsymbol{\phi}(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computationally efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper-confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials, and a refined analysis of the correction term. To the best of our knowledge, this is the first minimax optimal algorithm for linear MDPs up to logarithmic factors, which closes the $\sqrt{Hd}$ gap between the best known upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ (Jin et al., 2020) and the lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
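A hedged sketch of the two ingredients the abstract names, variance-weighted ridge regression and an elliptical-potential exploration bonus; this is our simplification, not the paper's full LSVI-UCB$^+$ pseudocode:

```python
import numpy as np

class WeightedRidgeUCB:
    """Variance-weighted ridge regression with an elliptical bonus (sketch)."""
    def __init__(self, d: int, lam: float = 1.0, beta: float = 1.0):
        self.Sigma = lam * np.eye(d)   # weighted Gram matrix
        self.b = np.zeros(d)
        self.beta = beta               # bonus scale (assumed constant here)

    def update(self, phi, target, sigma2):
        w = 1.0 / sigma2               # Bernstein-style weight: trust low-variance samples more
        self.Sigma += w * np.outer(phi, phi)
        self.b += w * target * phi

    def q_with_bonus(self, phi):
        theta = np.linalg.solve(self.Sigma, self.b)
        bonus = self.beta * np.sqrt(phi @ np.linalg.solve(self.Sigma, phi))  # ||phi||_{Sigma^-1}
        return phi @ theta + bonus     # optimistic value estimate

rng = np.random.default_rng(0)
est = WeightedRidgeUCB(d=4)
for _ in range(100):
    phi = rng.normal(size=4)
    est.update(phi, target=phi.sum() + rng.normal(0, 0.1), sigma2=0.01 + rng.random())
print(est.q_with_bonus(rng.normal(size=4)))
```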
The role of open data in transforming the society to Society 5.0: a resource or a tool for SDG-compliant Smart Living?
- Authors: Anastasija Nikiforova, Miguel Angel Alor, Miltiadis D. Lytras
- Subjects: Computers and Society (cs.CY); Computational Engineering, Finance, and Science (cs.CE)
- Arxiv link: https://arxiv.org/abs/2206.11784
- Pdf link: https://arxiv.org/pdf/2206.11784
- Abstract Open data are characterized by a number of economic, technological, innovative, and social benefits. They are seen as a significant contributor to a city's transformation into a Smart City. This is all the more so when society is on the verge of Society 5.0, i.e., the shift from the information society to a super-smart society, or society of imagination. However, the question constantly asked by open data experts is: what key factors must be met and satisfied in order to achieve the promised benefits? The current trend of openness suggests that the principle of openness should be followed not only by data but also by research, education, software, standards, hardware, etc.; it should become a philosophy to be followed at different levels and in different domains. This should ensure greater transparency, eliminate inequalities, and promote and achieve sustainable development goals. Therefore, many agendas now have openness as a prerequisite. This chapter deals with the concepts of open (government) data and Society 5.0, pointing to their common objectives, providing some success stories of open data use in smart cities or in the transformation of cities towards smart cities, and mapping them to the features of Society 5.0. We believe that this trend develops a new form of society, which we refer to as the "open data-driven society". It forms a bridge from Society 4.0 to Society 5.0. This chapter attempts to identify the role of openness in promoting a human-centric Smart Society, Smart City, and Smart Living.
Provably Efficient Model-Free Constrained RL with Linear Function Approximation
- Authors: Arnob Ghosh, Xingyu Zhou, Ness Shroff
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2206.11889
- Pdf link: https://arxiv.org/pdf/2206.11889
- Abstract We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied by a "simulator", we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve zero constraint violation while still maintaining the same order with respect to $T$.
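A toy sketch of the two modifications the abstract highlights, a soft-max policy over the Lagrangian-combined value and a projected dual update; this is our tabular stand-in, not the paper's linear-MDP algorithm:

```python
import numpy as np

def softmax_policy(q_r, q_g, dual_var, temperature=1.0):
    """Distribution over actions; replaces greedy argmax to gain smoothness."""
    logits = (q_r + dual_var * q_g) / temperature  # reward + dual * utility
    logits -= logits.max()                         # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def dual_update(dual_var, utility_estimate, threshold, lr=0.1, cap=10.0):
    """Projected gradient step on the Lagrange multiplier: the multiplier
    grows when the utility constraint (utility >= threshold) is violated."""
    return float(np.clip(dual_var - lr * (utility_estimate - threshold), 0.0, cap))

q_r, q_g = np.array([1.0, 2.0, 0.5]), np.array([0.2, -1.0, 0.8])
dual = 1.0
print(softmax_policy(q_r, q_g, dual))
print(dual_update(dual, utility_estimate=0.1, threshold=0.5))  # dual increases
```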
Keyword: localization
Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization
- Authors: Kun Xia, Le Wang, Sanping Zhou, Nanning Zheng, Wei Tang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11493
- Pdf link: https://arxiv.org/pdf/2206.11493
- Abstract The main challenge of Temporal Action Localization is to retrieve subtle human actions from various co-occurring ingredients, e.g., context and background, in an untrimmed video. While prior approaches have achieved substantial progress through devising advanced action detectors, they still suffer from these co-occurring ingredients, which often dominate the actual action content in videos. In this paper, we explore two orthogonal but complementary aspects of a video snippet, i.e., the action features and the co-occurrence features. In particular, we develop a novel auxiliary task by decoupling these two types of features within a video snippet and recombining them to generate a new feature representation with more salient action information for accurate action localization. We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features, and then synthesizes a new action-dominated video representation. Extensive experimental results and ablation studies on THUMOS14 and ActivityNet v1.3 demonstrate that our new representation, combined with a simple action detector, can significantly improve the action localization performance.
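A minimal PyTorch sketch of the decouple-and-recombine idea as we read it from the abstract; the module name, heads, and mixing weight are our assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class FeatureRefactor(nn.Module):
    """Split a snippet feature into action and co-occurrence parts, then
    recombine with the co-occurrence part down-weighted (illustrative only)."""
    def __init__(self, dim: int, cooccur_weight: float = 0.2):
        super().__init__()
        self.action_head = nn.Linear(dim, dim)
        self.cooccur_head = nn.Linear(dim, dim)
        self.cooccur_weight = cooccur_weight

    def forward(self, snippet):          # snippet: (B, T, dim)
        action = self.action_head(snippet)
        cooccur = self.cooccur_head(snippet)
        refactored = action + self.cooccur_weight * cooccur  # action-dominated
        return refactored, action, cooccur

x = torch.randn(2, 16, 256)
out, _, _ = FeatureRefactor(256)(x)
print(out.shape)  # torch.Size([2, 16, 256])
```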
A Neuromorphic Vision-Based Measurement for Robust Relative Localization in Future Space Exploration Missions
- Authors: Mohammed Salah, Mohammed Chehadah, Muhammed Humais, Mohammed Wahbah, Abdulla Ayyad, Rana Azzam, Lakmal Senevirante, Yahya Zweiri
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11541
- Pdf link: https://arxiv.org/pdf/2206.11541
- Abstract Space exploration has witnessed revolutionary changes with the landing of the Perseverance Rover on the Martian surface and the first flight beyond Earth by the Mars helicopter, Ingenuity. During their mission on Mars, the Perseverance Rover and Ingenuity collaboratively explore the Martian surface, with Ingenuity scouting terrain information for the rover's safe traversability. Hence, determining the relative poses between the two platforms is of paramount importance for the success of this mission. Driven by this necessity, this work proposes a robust relative localization system based on a fusion of neuromorphic vision-based measurements (NVBMs) and inertial measurements. The emergence of neuromorphic vision triggered a paradigm shift in the computer vision community, due to its unique working principle: asynchronous events triggered by variations of light intensity in the scene. This implies that observations cannot be acquired in static scenes due to illumination invariance. To circumvent this limitation, high-frequency active landmarks are inserted in the scene to guarantee consistent event firing. These landmarks are adopted as salient features to facilitate relative localization. A novel event-based landmark identification algorithm using Gaussian Mixture Models (GMMs) is developed for matching landmark correspondences, forming our NVBMs. The NVBMs are fused with inertial measurements in two proposed state estimators, a landmark-tracking Kalman filter (LTKF) and a translation-decoupled Kalman filter (TDKF), for landmark tracking and relative localization, respectively. The proposed system was tested in a variety of experiments and outperformed state-of-the-art approaches in accuracy and range.
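As a sketch of the GMM-based landmark identification step (our reading of the abstract; the landmark layout and noise parameters are made up), one can cluster the (x, y) coordinates of recent events so that each Gaussian component tracks one blinking landmark:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
landmarks = np.array([[20.0, 20.0], [80.0, 30.0], [50.0, 90.0]])  # true LED positions
# Simulate event bursts around each blinking landmark.
events = np.vstack([c + rng.normal(0, 1.5, size=(300, 2)) for c in landmarks])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(events)
est = gmm.means_[np.argsort(gmm.means_[:, 0])]  # order by x for comparison
print(est)  # recovered landmark centroids, close to the true positions
```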
Keyword: transformer
GACT: Activation Compressed Training for General Architectures
- Authors: Xiaoxuan Liu, Lianmin Zheng, Dequan Wang, Yukuo Cen, Weize Chen, Xu Han, Jianfei Chen, Zhiyuan Liu, Jie Tang, Joey Gonzalez, Michael Mahoney, Alvin Cheung
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11357
- Pdf link: https://arxiv.org/pdf/2206.11357
- Abstract Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce training memory footprint. This paper presents GACT, an ACT framework to support a broad range of machine learning tasks for generic NN architectures with limited domain knowledge. By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of GACT without prior knowledge on operator type or model architecture. To make training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time. We implement GACT as a PyTorch library that readily applies to any NN architecture. GACT reduces the activation memory for convolutional NNs, transformers, and graph NNs by up to 8.1x, enabling training with a 4.2x to 24.7x larger batch size, with negligible accuracy loss.
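A hedged toy sketch of the run-time idea the abstract describes, giving more bits to the tensors whose quantization error hurts the gradient most; this is our illustration, not the GACT library API:

```python
import numpy as np

def quantize(x, bits):
    """Uniform min-max quantization of an activation tensor."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    q = np.round((x - lo) / (hi - lo + 1e-12) * levels)
    return lo + q / levels * (hi - lo)

def sensitivity(x, grad_out, bits):
    """Proxy for the gradient error caused by compressing x to `bits` bits."""
    return np.linalg.norm(grad_out * (x - quantize(x, bits)))

rng = np.random.default_rng(0)
acts = {name: rng.normal(size=1000) * s for name, s in
        [("conv1", 1.0), ("attn", 5.0), ("mlp", 0.2)]}
grads = {name: rng.normal(size=1000) for name in acts}

budget_bits = {name: 2 for name in acts}
for _ in range(3):  # spend 3 extra bits greedily on the most sensitive tensors
    worst = max(acts, key=lambda n: sensitivity(acts[n], grads[n], budget_bits[n]))
    budget_bits[worst] += 1
print(budget_bits)  # wide-range tensors end up with higher bit-widths
```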
ICOS Protein Expression Segmentation: Can Transformer Networks Give Better Results?
- Authors: Vivek Kumar Singh, Paul O Reilly, Jacqueline James, Manuel Salto Tellez, Perry Maxwell
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2206.11520
- Pdf link: https://arxiv.org/pdf/2206.11520
- Abstract Biomarkers identify a patient's response to treatment. Despite recent advances in artificial intelligence based on Transformer networks, only limited research has been done to measure their performance on challenging histopathology images. In this paper, we investigate the efficacy of numerous state-of-the-art Transformer networks for segmentation of the immune-checkpoint biomarker, Inducible T-cell COStimulator (ICOS), protein cells in colon cancer from immunohistochemistry (IHC) slides. Extensive and comprehensive experimental results confirm that MiSSFormer achieved the highest Dice score of 74.85%, outperforming the other evaluated Transformer and Efficient U-Net methods.
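For reference, the Dice score used to rank these methods is the standard overlap metric; a minimal computation for a binary mask (generic definition, not code from the paper):

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
gt   = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
print(round(dice(pred, gt), 3))  # 0.667
```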
Towards Green ASR: Lossless 4-bit Quantization of a Hybrid TDNN System on the 300-hr Switchboard Corpus
- Authors: Junhao Xu, Shoukang Hu, Xunying Liu, Helen Meng
- Subjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2206.11643
- Pdf link: https://arxiv.org/pdf/2206.11643
- Abstract State-of-the-art automatic speech recognition (ASR) systems are becoming increasingly complex and expensive for practical applications. This paper presents the development of a high-performance, low-footprint, 4-bit quantized LF-MMI trained factored time delay neural network (TDNN) based ASR system on the 300-hr Switchboard corpus. A key feature of the overall system design is to account for the fine-grained, varying performance sensitivity of different model components to quantization errors. To this end, a set of neural architectural compression and mixed precision quantization approaches were used to facilitate hidden-layer-level auto-configuration of optimal factored TDNN weight matrix subspace dimensionality and quantization bit-widths. The proposed techniques were also used to produce 2-bit mixed precision quantized Transformer language models. Experiments conducted on the Switchboard data suggest that the proposed neural architectural compression and mixed precision quantization techniques consistently outperform the uniform-precision quantized baseline systems of comparable bit-widths in terms of word error rate (WER). An overall "lossless" compression ratio of 13.6 was obtained over the baseline full-precision system, including both the TDNN and Transformer components, while incurring no statistically significant WER increase.
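To ground the bit-width discussion, here is a minimal symmetric uniform quantizer at configurable precision (our simplified sketch; the paper learns per-layer bit-widths rather than applying one quantizer everywhere):

```python
import numpy as np

def quantize_weights(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization of a weight matrix to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1             # signed integer range
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
for bits in (8, 4, 2):
    err = np.linalg.norm(w - quantize_weights(w, bits)) / np.linalg.norm(w)
    print(bits, round(float(err), 4))      # relative error grows as bits shrink
```

The print-out makes the paper's motivation visible: different components tolerate very different bit-widths before the error becomes damaging.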
Graph Neural Networks for Temperature-Dependent Activity Coefficient Prediction of Solutes in Ionic Liquids
- Authors: Jan G. Rittig, Karim Ben Hicham, Artur M. Schweidtmann, Manuel Dahmen, Alexander Mitsos
- Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
- Arxiv link: https://arxiv.org/abs/2206.11776
- Pdf link: https://arxiv.org/pdf/2206.11776
- Abstract Ionic liquids (ILs) are important solvents for sustainable processes and predicting activity coefficients (ACs) of solutes in ILs is needed. Recently, matrix completion methods (MCMs), transformers, and graph neural networks (GNNs) have shown high accuracy in predicting ACs of binary mixtures, superior to well-established models, e.g., COSMO-RS and UNIFAC. GNNs are particularly promising here as they learn a molecular graph-to-property relationship without pretraining, typically required for transformers, and are, unlike MCMs, applicable to molecules not included in training. For ILs, however, GNN applications are currently missing. Herein, we present a GNN to predict temperature-dependent infinite dilution ACs of solutes in ILs. We train the GNN on a database including more than 40,000 AC values and compare it to a state-of-the-art MCM. The GNN and MCM achieve similar high prediction performance, with the GNN additionally enabling high-quality predictions for ACs of solutions that contain ILs and solutes not considered during training.
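A toy PyTorch sketch of a plausible setup for temperature-dependent AC prediction; the single message-passing round, mean pooling, and temperature concatenation are our assumptions of a typical design, not the paper's model:

```python
import torch
import torch.nn as nn

class TinyGNN(nn.Module):
    def __init__(self, node_dim=16, hidden=32):
        super().__init__()
        self.msg = nn.Linear(node_dim, node_dim)
        self.head = nn.Sequential(nn.Linear(2 * node_dim + 1, hidden),
                                  nn.ReLU(), nn.Linear(hidden, 1))

    def embed(self, x, adj):                    # x: (N, node_dim), adj: (N, N)
        x = torch.relu(x + adj @ self.msg(x))   # one message-passing round
        return x.mean(dim=0)                    # mean-pool to molecule embedding

    def forward(self, solute, adj_s, solvent, adj_v, temperature):
        h = torch.cat([self.embed(solute, adj_s), self.embed(solvent, adj_v),
                       temperature.view(1)])    # append T to the joint embedding
        return self.head(h)                     # predicted infinite-dilution AC

model = TinyGNN()
x1, a1 = torch.randn(5, 16), torch.eye(5)       # stand-in solute graph
x2, a2 = torch.randn(7, 16), torch.eye(7)       # stand-in IL graph
print(model(x1, a1, x2, a2, torch.tensor(298.15)).item())
```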
Toward Clinically Assisted Colorectal Polyp Recognition via Structured Cross-modal Representation Consistency
- Authors: Weijie Ma, Ye Zhu, Ruimao Zhang, Jie Yang, Yiwen Hu, Zhen Li, Li Xiang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11826
- Pdf link: https://arxiv.org/pdf/2206.11826
- Abstract Colorectal polyp classification is a critical clinical examination. To improve the classification accuracy, most computer-aided diagnosis algorithms recognize colorectal polyps by adopting Narrow-Band Imaging (NBI). However, NBI usually goes underutilized in real clinical scenarios, since acquiring this specific image requires manually switching the light mode after polyps have been detected using White-Light (WL) images. To avoid the above situation, we propose a novel method to directly achieve accurate white-light colonoscopy image classification by enforcing structured cross-modal representation consistency. In practice, a pair of multi-modal images, i.e., NBI and WL, are fed into a shared Transformer to extract hierarchical feature representations. Then a newly designed Spatial Attention Module (SAM) is adopted to calculate the similarities between the class token and patch tokens at multiple levels for a specific modality image. By aligning the class tokens and spatial attention maps of paired NBI and WL images at different levels, the Transformer achieves the ability to keep both global and local representation consistency for the above two modalities. Extensive experimental results illustrate that the proposed method outperforms recent studies by a clear margin, realizing multi-modal prediction with a single Transformer while greatly improving the classification accuracy when only WL images are available.
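A sketch of what a structured cross-modal consistency objective could look like, per our reading of the abstract; the specific losses (MSE for class tokens, KL for attention maps) and names are our assumptions:

```python
import torch
import torch.nn.functional as F

def consistency_loss(cls_nbi, cls_wl, attn_nbi, attn_wl):
    """cls_*: lists of (B, D) class tokens, one per Transformer level;
    attn_*: lists of (B, N) attention scores over patch tokens per level."""
    loss = torch.zeros(())
    for c1, c2, a1, a2 in zip(cls_nbi, cls_wl, attn_nbi, attn_wl):
        loss = loss + F.mse_loss(c1, c2)                    # global alignment
        loss = loss + F.kl_div(a1.log_softmax(-1),          # local alignment of
                               a2.softmax(-1),              # spatial attention
                               reduction="batchmean")
    return loss

levels, B, D, N = 3, 2, 64, 49
cls_a = [torch.randn(B, D) for _ in range(levels)]
cls_b = [torch.randn(B, D) for _ in range(levels)]
at_a = [torch.randn(B, N) for _ in range(levels)]
at_b = [torch.randn(B, N) for _ in range(levels)]
print(consistency_loss(cls_a, cls_b, at_a, at_b).item())
```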
Romantic-Computing
- Authors: Elizabeth Horishny
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2206.11864
- Pdf link: https://arxiv.org/pdf/2206.11864
- Abstract In this paper we compare various text generation models' ability to write poetry in the style of early English Romanticism. These models include: Character-Level Recurrent Neural Networks with Long Short-Term Memory, Hugging Face's GPT-2, OpenAI's GPT-3, and EleutherAI's GPT-NEO. Quality was measured based on syllable count and on coherence with the automatic evaluation metric GRUEN. Character-Level Recurrent Neural Networks performed far worse than the transformer models, and as parameter size increased, the quality of the transformer models' poems improved. These models are typically not compared in a creative context, and we are happy to contribute.
A Multi-Policy Framework for Deep Learning-Based Fake News Detection
- Authors: João Vitorino, Tiago Dias, Tiago Fonseca, Nuno Oliveira, Isabel Praça
- Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11866
- Pdf link: https://arxiv.org/pdf/2206.11866
- Abstract Connectivity plays an ever-increasing role in modern society, with people all around the world having easy access to rapidly disseminated information. However, a more interconnected society enables the spread of intentionally false information. To mitigate the negative impacts of fake news, it is essential to improve detection methodologies. This work introduces the Multi-Policy Statement Checker (MPSC), a framework that automates fake news detection by using deep learning techniques to analyze a statement itself and its related news articles, predicting whether it is seemingly credible or suspicious. The proposed framework was evaluated using four merged datasets containing real and fake news. Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU) and Bidirectional Encoder Representations from Transformers (BERT) models were trained to utilize both lexical and syntactic features, and their performance was evaluated. The obtained results demonstrate that a multi-policy analysis reliably identifies suspicious statements, which can be advantageous for fake news detection.
Lifelong Learning Natural Language Processing Approach for Multilingual Data Classification
- Authors: Jędrzej Kozal, Michał Leś, Paweł Zyblewski, Paweł Ksieniewicz, Michał Woźniak
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11867
- Pdf link: https://arxiv.org/pdf/2206.11867
- Abstract The abundance of information in digital media, which in today's world is the main source of knowledge about current events for the masses, makes it possible to spread disinformation on a larger scale than ever before. Consequently, there is a need to develop novel fake news detection approaches capable of adapting to changing factual contexts and generalizing previously or concurrently acquired knowledge. To deal with this problem, we propose a lifelong learning-inspired approach, which allows for fake news detection in multiple languages and the mutual transfer of knowledge acquired in each of them. Both classical feature extractors, such as Term Frequency-Inverse Document Frequency or Latent Dirichlet Allocation, and integrated deep NLP (Natural Language Processing) BERT (Bidirectional Encoder Representations from Transformers) models paired with an MLP (Multilayer Perceptron) classifier were employed. The results of experiments conducted on two datasets dedicated to the fake news classification task (in English and Spanish, respectively), supported by statistical analysis, confirmed that utilizing additional languages can improve the performance of traditional methods. In some cases, supplementing the deep learning method with classical ones can also positively impact the obtained results. The ability of the models to generalize the knowledge acquired across the analyzed languages was also observed.
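For the classical branch the abstract lists, here is a minimal scikit-learn stand-in (toy data and labels are ours; the paper's multilingual lifelong-learning pipeline is considerably richer):

```python
# TF-IDF features feeding an MLP classifier for toy fake-news labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

texts = ["the moon landing was staged", "parliament passed the budget today",
         "aliens run the central bank", "new rail line opens next spring"]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = real (illustrative toy data)

clf = make_pipeline(TfidfVectorizer(),
                    MLPClassifier(hidden_layer_sizes=(16,),
                                  max_iter=500, random_state=0))
clf.fit(texts, labels)
print(clf.predict(["the bank is secretly run by aliens"]))
```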
On the Parameterization and Initialization of Diagonal State Space Models
- Authors: Albert Gu, Ankit Gupta, Karan Goel, Christopher Ré
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11893
- Pdf link: https://arxiv.org/pdf/2206.11893
- Abstract State space models (SSMs) have recently been shown to be very effective as a deep learning layer, providing a promising alternative to sequence models such as RNNs, CNNs, or Transformers. The first version to show this potential was the S4 model, which is particularly effective on tasks involving long-range dependencies by using a prescribed state matrix called the HiPPO matrix. While this has an interpretable mathematical mechanism for modeling long dependencies, it introduces a custom representation and algorithm that can be difficult to implement. On the other hand, a recent variant of S4 called DSS showed that restricting the state matrix to be fully diagonal can still preserve the performance of the original model when using a specific initialization based on approximating S4's matrix. This work seeks to systematically understand how to parameterize and initialize such diagonal state space models. While it follows from classical results that almost all SSMs have an equivalent diagonal form, we show that the initialization is critical for performance. We explain why DSS works mathematically, by showing that the diagonal restriction of S4's matrix surprisingly recovers the same kernel in the limit of infinite state dimension. We also systematically describe various design choices in parameterizing and computing diagonal SSMs, and perform a controlled empirical study ablating the effects of these choices. Our final model, S4D, is a simple diagonal version of S4 whose kernel computation requires just 2 lines of code; it performs comparably to S4 in almost all settings, with state-of-the-art results for image, audio, and medical time-series domains, and averages 85% on the Long Range Arena benchmark.
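The "2 lines of code" claim invites a concrete sketch. Below is our hedged reconstruction of a diagonal SSM convolution kernel; the ZOH-style discretization, shapes, and the initialization are assumptions based on the abstract, not the authors' exact code:

```python
import numpy as np

def s4d_kernel(A, B, C, dt, L):
    """A: (N,) complex diagonal state matrix; B, C: (N,); returns real (L,).
    The kernel is K_l = sum_n C_n * Bbar_n * exp(dt*A_n)^l, a Vandermonde
    product -- this is the 'two lines' at the heart of the computation."""
    dtA = A * dt
    K = (C * B * (np.exp(dtA) - 1) / A) @ np.exp(dtA[:, None] * np.arange(L))
    return 2 * K.real  # assume conjugate-pair states; keep the real part

N, L = 16, 32
A = -0.5 + 1j * np.pi * np.arange(N)   # a simple diagonal initialization (assumed)
K = s4d_kernel(A, np.ones(N), np.random.randn(N) + 0j, dt=0.01, L=L)
print(K.shape)  # (32,): convolve this kernel with the input sequence
```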
MaskViT: Masked Visual Pre-Training for Video Prediction
- Authors: Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2206.11894
- Pdf link: https://arxiv.org/pdf/2206.11894
- Abstract The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
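A minimal sketch of iterative decoding with a mask scheduling function, in the spirit of the abstract; we use a cosine schedule and a random token filler as plausible stand-ins for the model's actual sampler:

```python
import math
import numpy as np

def mask_schedule(step, total_steps):
    """Fraction of tokens still masked after `step` refinement iterations."""
    return math.cos(math.pi / 2 * (step + 1) / total_steps)

rng = np.random.default_rng(0)
num_tokens, steps = 256, 8
tokens = np.full(num_tokens, -1)                  # -1 = masked
for step in range(steps):
    keep_masked = int(mask_schedule(step, steps) * num_tokens)
    masked_idx = np.flatnonzero(tokens == -1)
    # Fill enough masked positions to match the schedule for this step.
    fill = rng.choice(masked_idx, size=len(masked_idx) - keep_masked, replace=False)
    tokens[fill] = rng.integers(0, 1024, size=fill.size)  # stand-in for model samples
print((tokens == -1).sum())  # 0: all tokens generated after the final step
```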
Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space
- Authors: Jinghuan Shang, Srijan Das, Michael S. Ryoo
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11895
- Pdf link: https://arxiv.org/pdf/2206.11895
- Abstract Humans are remarkably flexible in understanding viewpoint changes, owing to the visual cortex's support for the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, neither these Transformers nor 2D convolutional networks perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix that impose geometric transformations on the tokens. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks, including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html
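A hedged PyTorch sketch of the idea as described, estimating a pseudo-depth per token, back-projecting with a learned camera, and using the recovered 3D positions as a positional signal; all names, shapes, and the additive embedding are our assumptions:

```python
import torch
import torch.nn as nn

class Token3DLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.depth = nn.Linear(dim, 1)       # pseudo-depth estimator per token
        self.camera = nn.Linear(dim, 12)     # learned 3x4 camera per image
        self.pos_embed = nn.Linear(3, dim)   # embed recovered 3D positions

    def forward(self, tokens, uv):           # tokens: (B, N, D), uv: (B, N, 2)
        d = self.depth(tokens)                                   # (B, N, 1)
        cam = self.camera(tokens.mean(1)).view(-1, 3, 4)         # (B, 3, 4)
        homog = torch.cat([uv * d, d, torch.ones_like(d)], -1)   # (B, N, 4)
        xyz = torch.einsum("bij,bnj->bni", cam, homog)           # 3D positions
        return tokens + self.pos_embed(xyz)

tok, uv = torch.randn(2, 49, 64), torch.rand(2, 49, 2)
print(Token3DLayer(64)(tok, uv).shape)  # torch.Size([2, 49, 64])
```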
Keyword: autonomous driving
A Novel Algorithm for Exact Concave Hull Extraction
- Authors: Kevin Christopher VanHorn, Murat Can Çobanoğlu
- Subjects: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11481
- Pdf link: https://arxiv.org/pdf/2206.11481
- Abstract Region extraction is necessary in a wide range of applications, from object detection in autonomous driving to analysis of subcellular morphology in cell biology. There are two main approaches: convex hull extraction, for which exact and efficient algorithms exist, and concave hull extraction, which better captures real-world shapes but does not have a single solution. Especially in the context of a uniform grid, concave hull algorithms are largely approximate, sacrificing region integrity for spatial and temporal efficiency. In this study, we present a novel algorithm that can provide vertex-minimized concave hulls with maximal (i.e., pixel-perfect) resolution and is tunable for speed-efficiency tradeoffs. Our method provides advantages in multiple downstream applications, including data compression, retrieval, visualization, and analysis. To demonstrate the practical utility of our approach, we focus on image compression. We demonstrate significant improvements through context-dependent compression of disparate regions within a single image (entropy encoding for noisy regions and predictive encoding for structured regions). We show that these improvements hold from biomedical images to natural images. Beyond image compression, our algorithm can be applied more broadly to aid in a wide range of practical applications for data retrieval, visualization, and analysis.
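For contrast with the concave setting, the convex hull side of the dichotomy is a one-call exact computation in standard SciPy (the paper's concave algorithm is its own method and is not reproduced here):

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.random((100, 2))
hull = ConvexHull(points)
print(hull.vertices)          # indices of hull vertices, counter-clockwise
print(round(hull.volume, 3))  # for 2-D inputs, `volume` is the enclosed area
```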