arxiv-daily
arxiv-daily copied to clipboard
New submissions for Wed, 5 Oct 22
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
Automated Medical Device Display Reading Using Deep Learning Object Detection
- Authors: Lucas P. Moreira
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.01325
- Pdf link: https://arxiv.org/pdf/2210.01325
- Abstract Telemedicine and mobile health applications, especially during the quarantine imposed by the covid-19 pandemic, led to an increase on the need of transferring health monitor readings from patients to specialists. Considering that most home medical devices use seven-segment displays, an automatic display reading algorithm should provide a more reliable tool for remote health care. This work proposes an end-to-end method for detection and reading seven-segment displays from medical devices based on deep learning object detection models. Two state of the art model families, EfficientDet and EfficientDet-lite, previously trained with the MS-COCO dataset, were fine-tuned on a dataset comprised by medical devices photos taken with mobile digital cameras, to simulate real case applications. Evaluation of the trained model show high efficiency, where all models achieved more than 98% of detection precision and more than 98% classification accuracy, with model EfficientDet-lite1 showing 100% detection precision and 100% correct digit classification for a test set of 104 images and 438 digits.
Bridged Transformer for Vision and Point Cloud 3D Object Detection
- Authors: Yikai Wang, TengQi Ye, Lele Cao, Wenbing Huang, Fuchun Sun, Fengxiang He, Dacheng Tao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.01391
- Pdf link: https://arxiv.org/pdf/2210.01391
- Abstract 3D object detection is a crucial research topic in computer vision, which usually uses 3D point clouds as input in conventional setups. Recently, there is a trend of leveraging multiple sources of input data, such as complementing the 3D point cloud with 2D images that often have richer color and fewer noises. However, due to the heterogeneous geometrics of the 2D and 3D representations, it prevents us from applying off-the-shelf neural networks to achieve multimodal fusion. To that end, we propose Bridged Transformer (BrT), an end-to-end architecture for 3D object detection. BrT is simple and effective, which learns to identify 3D and 2D object bounding boxes from both points and image patches. A key element of BrT lies in the utilization of object queries for bridging 3D and 2D spaces, which unifies different sources of data representations in Transformer. We adopt a form of feature aggregation realized by point-to-patch projections which further strengthen the correlations between images and points. Moreover, BrT works seamlessly for fusing the point cloud with multi-view images. We experimentally show that BrT surpasses state-of-the-art methods on SUN RGB-D and ScanNetV2 datasets.
Streaming Video Analytics On The Edge With Asynchronous Cloud Support
- Authors: Anurag Ghosh, Srinivasan Iyengar, Stephen Lee, Anuj Rathore, Venkat N Padmanabhan
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Multimedia (cs.MM)
- Arxiv link: https://arxiv.org/abs/2210.01402
- Pdf link: https://arxiv.org/pdf/2210.01402
- Abstract Emerging Internet of Things (IoT) and mobile computing applications are expected to support latency-sensitive deep neural network (DNN) workloads. To realize this vision, the Internet is evolving towards an edge-computing architecture, where computing infrastructure is located closer to the end device to help achieve low latency. However, edge computing may have limited resources compared to cloud environments and thus, cannot run large DNN models that often have high accuracy. In this work, we develop REACT, a framework that leverages cloud resources to execute large DNN models with higher accuracy to improve the accuracy of models running on edge devices. To do so, we propose a novel edge-cloud fusion algorithm that fuses edge and cloud predictions, achieving low latency and high accuracy. We extensively evaluate our approach and show that our approach can significantly improve the accuracy compared to baseline approaches. We focus specifically on object detection in videos (applicable in many video analytics scenarios) and show that the fused edge-cloud predictions can outperform the accuracy of edge-only and cloud-only scenarios by as much as 50%. We also show that REACT can achieve good performance across tradeoff points by choosing a wide range of system parameters to satisfy use-case specific constraints, such as limited network bandwidth or GPU cycles.
Long-Term Localization using Semantic Cues in Floor Plan Maps
- Authors: Nicky Zimmerman, Tiziano Guadagnino, Xieyuanli Chen, Jens Behley, Cyrill Stachniss
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2210.01456
- Pdf link: https://arxiv.org/pdf/2210.01456
- Abstract Lifelong localization in a given map is an essential capability for autonomous service robots. In this paper, we consider the task of long-term localization in a changing indoor environment given sparse CAD floor plans. The commonly used pre-built maps from the robot sensors may increase the cost and time of deployment. Furthermore, their detailed nature requires that they are updated when significant changes occur. We address the difficulty of localization when the correspondence between the map and the observations is low due to the sparsity of the CAD map and the changing environment. To overcome both challenges, we propose to exploit semantic cues that are commonly present in human-oriented spaces. These semantic cues can be detected using RGB cameras by utilizing object detection, and are matched against an easy-to-update, abstract semantic map. The semantic information is integrated into a Monte Carlo localization framework using a particle filter that operates on 2D LiDAR scans and camera data. We provide a long-term localization solution and a semantic map format, for environments that undergo changes to their interior structure and detailed geometric maps are not available. We evaluate our localization framework on multiple challenging indoor scenarios in an office environment, taken weeks apart. The experiments suggest that our approach is robust to structural changes and can run on an onboard computer. We released the open source implementation of our approach written in C++ together with a ROS wrapper.
Keyword: transformer
TPGNN: Learning High-order Information in Dynamic Graphs via Temporal Propagation
- Authors: Zehong Wang, Qi Li, Donghua Yu
- Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
- Arxiv link: https://arxiv.org/abs/2210.01171
- Pdf link: https://arxiv.org/pdf/2210.01171
- Abstract Temporal graph is an abstraction for modeling dynamic systems that consist of evolving interaction elements. In this paper, we aim to solve an important yet neglected problem -- how to learn information from high-order neighbors in temporal graphs? -- to enhance the informativeness and discriminativeness for the learned node representations. We argue that when learning high-order information from temporal graphs, we encounter two challenges, i.e., computational inefficiency and over-smoothing, that cannot be solved by conventional techniques applied on static graphs. To remedy these deficiencies, we propose a temporal propagation-based graph neural network, namely TPGNN. To be specific, the model consists of two distinct components, i.e., propagator and node-wise encoder. The propagator is leveraged to propagate messages from the anchor node to its temporal neighbors within $k$-hop, and then simultaneously update the state of neighborhoods, which enables efficient computation, especially for a deep model. In addition, to prevent over-smoothing, the model compels the messages from $n$-hop neighbors to update the $n$-hop memory vector preserved on the anchor. The node-wise encoder adopts transformer architecture to learn node representations by explicitly learning the importance of memory vectors preserved on the node itself, that is, implicitly modeling the importance of messages from neighbors at different layers, thus mitigating the over-smoothing. Since the encoding process will not query temporal neighbors, we can dramatically save time consumption in inference. Extensive experiments on temporal link prediction and node classification demonstrate the superiority of TPGNN over state-of-the-art baselines in efficiency and robustness.
Efficient Spiking Transformer Enabled By Partial Information
- Authors: Ziqing Wang, Yuetong Fang, Jiahang Cao, Zhongrui Wang, Renjing Xu
- Subjects: Neural and Evolutionary Computing (cs.NE)
- Arxiv link: https://arxiv.org/abs/2210.01208
- Pdf link: https://arxiv.org/pdf/2210.01208
- Abstract Spiking neural networks (SNNs) have received substantial attention in recent years due to their sparse and asynchronous communication nature, and thus can be deployed in neuromorphic hardware and achieve extremely high energy efficiency. However, SNNs currently can hardly realize a comparable performance to that of artificial neural networks (ANNs) because their limited scalability does not allow for large-scale networks. Especially for Transformer, as a model of ANNs that has accomplished remarkable performance in various machine learning tasks, its implementation in SNNs by conventional methods requires a large number of neurons, notably in the self-attention module. Inspired by the mechanisms in the nervous system, we propose an efficient spiking Transformer (EST) framework enabled by partial information to address the above problem. In this model, we not only implemented the self-attention module with a reasonable number of neurons, but also introduced partial-information self-attention (PSA), which utilizes only partial input signals, further reducing computational resources compared to conventional methods. The experimental results show that our EST can outperform the state-of-the-art SNN model in terms of accuracy and the number of time steps on both Cifar-10/100 and ImageNet datasets. In particular, the proposed EST model achieves 78.48% top-1 accuracy on the ImageNet dataset with only 16 time steps. In addition, our proposed PSA reduces flops by 49.8% with negligible performance loss compared to a self-attention module with full information.
Understanding Prior Bias and Choice Paralysis in Transformer-based Language Representation Models through Four Experimental Probes
- Authors: Ke Shen, Mayank Kejriwal
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2210.01258
- Pdf link: https://arxiv.org/pdf/2210.01258
- Abstract Recent work on transformer-based neural networks has led to impressive advances on multiple-choice natural language understanding (NLU) problems, such as Question Answering (QA) and abductive reasoning. Despite these advances, there is limited work still on understanding whether these models respond to perturbed multiple-choice instances in a sufficiently robust manner that would allow them to be trusted in real-world situations. We present four confusion probes, inspired by similar phenomena first identified in the behavioral science community, to test for problems such as prior bias and choice paralysis. Experimentally, we probe a widely used transformer-based multiple-choice NLU system using four established benchmark datasets. Here we show that the model exhibits significant prior bias and to a lesser, but still highly significant degree, choice paralysis, in addition to other problems. Our results suggest that stronger testing protocols and additional benchmarks may be necessary before the language models are used in front-facing systems or decision making with real world consequences.
A Fixed-Point Algorithm for the AC Power Flow Problem
- Authors: Liangjie Chen, John W. Simpson-Porco
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2210.01310
- Pdf link: https://arxiv.org/pdf/2210.01310
- Abstract This paper presents an algorithm that solves the AC power flow problem for balanced, three-phase transmission systems at steady state. The algorithm extends the "fixed-point power flow" algorithm in the literature to include transmission losses, phase-shifting transformers, and a distributed slack bus model. The algorithm is derived by vectorizing the component-wise AC power flow equations and manipulating them into a novel equivalent fixed-point form. Preliminary theoretical results guaranteeing convergence are reported for the case of a two-bus power system. We validate the algorithm through extensive simulations on test systems of various sizes under different loading levels, and compare its convergence behavior against those of classic power flow algorithms.
ImmFusion: Robust mmWave-RGB Fusion for 3D Human Body Reconstruction in All Weather Conditions
- Authors: Anjun Chen, Xiangyu Wang, Kun Shi, Shaohao Zhu, Yingfeng Chen, Bin Fang, Jiming Chen, Yuchi Huo, Qi Ye
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.01346
- Pdf link: https://arxiv.org/pdf/2210.01346
- Abstract 3D human reconstruction from RGB images achieves decent results in good weather conditions but degrades dramatically in rough weather. Complementary, mmWave radars have been employed to reconstruct 3D human joints and meshes in rough weather. However, combining RGB and mmWave signals for robust all-weather 3D human reconstruction is still an open challenge, given the sparse nature of mmWave and the vulnerability of RGB images. In this paper, we present ImmFusion, the first mmWave-RGB fusion solution to reconstruct 3D human bodies in all weather conditions robustly. Specifically, our ImmFusion consists of image and point backbones for token feature extraction and a Transformer module for token fusion. The image and point backbones refine global and local features from original data, and the Fusion Transformer Module aims for effective information fusion of two modalities by dynamically selecting informative tokens. Extensive experiments on a large-scale dataset, mmBody, captured in various environments demonstrate that ImmFusion can efficiently utilize the information of two modalities to achieve a robust 3D human body reconstruction in all weather conditions. In addition, our method's accuracy is significantly superior to that of state-of-the-art Transformer-based LiDAR-camera fusion methods.
Connecting Surrogate Safety Measures to Crash Probablity via Causal Probabilistic Time Series Prediction
- Authors: Jiajian Lu, Offer Grembek, Mark Hansen
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2210.01363
- Pdf link: https://arxiv.org/pdf/2210.01363
- Abstract Surrogate safety measures can provide fast and pro-active safety analysis and give insights on the pre-crash process and crash failure mechanism by studying near misses. However, validating surrogate safety measures by connecting them to crashes is still an open question. This paper proposed a method to connect surrogate safety measures to crash probability using probabilistic time series prediction. The method used sequences of speed, acceleration and time-to-collision to estimate the probability density functions of those variables with transformer masked autoregressive flow (transformer-MAF). The autoregressive structure mimicked the causal relationship between condition, action and crash outcome and the probability density functions are used to calculate the conditional action probability, crash probability and conditional crash probability. The predicted sequence is accurate and the estimated probability is reasonable under both traffic conflict context and normal interaction context and the conditional crash probability shows the effectiveness of evasive action to avoid crashes in a counterfactual experiment.
Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling
- Authors: Yunsung Lee, Gyuseong Lee, Kwangrok Ryoo, Hyojun Go, Jihye Park, Seungryong Kim
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.01370
- Pdf link: https://arxiv.org/pdf/2210.01370
- Abstract There are two de facto standard architectures in recent computer vision: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Strong inductive biases of convolutions help the model learn sample effectively, but such strong biases also limit the upper bound of CNNs when sufficient data are available. On the contrary, ViT is inferior to CNNs for small data but superior for sufficient data. Recent approaches attempt to combine the strengths of these two architectures. However, we show these approaches overlook that the optimal inductive bias also changes according to the target data scale changes by comparing various models' accuracy on subsets of sampled ImageNet at different ratios. In addition, through Fourier analysis of feature maps, the model's response patterns according to signal frequency changes, we observe which inductive bias is advantageous for each data scale. The more convolution-like inductive bias is included in the model, the smaller the data scale is required where the ViT-like model outperforms the ResNet performance. To obtain a model with flexible inductive bias on the data scale, we show reparameterization can interpolate inductive bias between convolution and self-attention. By adjusting the number of epochs the model stays in the convolution, we show that reparameterization from convolution to self-attention interpolates the Fourier analysis pattern between CNNs and ViTs. Adapting these findings, we propose Progressive Reparameterization Scheduling (PRS), in which reparameterization adjusts the required amount of convolution-like or self-attention-like inductive bias per layer. For small-scale datasets, our PRS performs reparameterization from convolution to self-attention linearly faster at the late stage layer. PRS outperformed previous studies on the small-scale dataset, e.g., CIFAR-100.
Bridged Transformer for Vision and Point Cloud 3D Object Detection
- Authors: Yikai Wang, TengQi Ye, Lele Cao, Wenbing Huang, Fuchun Sun, Fengxiang He, Dacheng Tao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.01391
- Pdf link: https://arxiv.org/pdf/2210.01391
- Abstract 3D object detection is a crucial research topic in computer vision, which usually uses 3D point clouds as input in conventional setups. Recently, there is a trend of leveraging multiple sources of input data, such as complementing the 3D point cloud with 2D images that often have richer color and fewer noises. However, due to the heterogeneous geometrics of the 2D and 3D representations, it prevents us from applying off-the-shelf neural networks to achieve multimodal fusion. To that end, we propose Bridged Transformer (BrT), an end-to-end architecture for 3D object detection. BrT is simple and effective, which learns to identify 3D and 2D object bounding boxes from both points and image patches. A key element of BrT lies in the utilization of object queries for bridging 3D and 2D spaces, which unifies different sources of data representations in Transformer. We adopt a form of feature aggregation realized by point-to-patch projections which further strengthen the correlations between images and points. Moreover, BrT works seamlessly for fusing the point cloud with multi-view images. We experimentally show that BrT surpasses state-of-the-art methods on SUN RGB-D and ScanNetV2 datasets.
Accurate Image Restoration with Attention Retractable Transformer
- Authors: Jiale Zhang, Yulun Zhang, Jinjin Gu, Yongbing Zhang, Linghe Kong, Xin Yuan
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.01427
- Pdf link: https://arxiv.org/pdf/2210.01427
- Abstract Recently, Transformer-based image restoration networks have achieved promising improvements over convolutional neural networks due to parameter-independent global interactions. To lower computational cost, existing works generally limit self-attention computation within non-overlapping windows. However, each group of tokens are always from a dense area of the image. This is considered as a dense attention strategy since the interactions of tokens are restrained in dense regions. Obviously, this strategy could result in restricted receptive fields. To address this issue, we propose Attention Retractable Transformer (ART) for image restoration, which presents both dense and sparse attention modules in the network. The sparse attention module allows tokens from sparse areas to interact and thus provides a wider receptive field. Furthermore, the alternating application of dense and sparse attention modules greatly enhances representation ability of Transformer while providing retractable attention on the input image.We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks. Experimental results validate that our proposed ART outperforms state-of-the-art methods on various benchmark datasets both quantitatively and visually. We also provide code and models at the website https://github.com/gladzhang/ART.
Transformer-based Subject Entity Detection in Wikipedia Listings
- Authors: Nicolas Heist, Heiko Paulheim
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2210.01482
- Pdf link: https://arxiv.org/pdf/2210.01482
- Abstract In tasks like question answering or text summarisation, it is essential to have background knowledge about the relevant entities. The information about entities - in particular, about long-tail or emerging entities - in publicly available knowledge graphs like DBpedia or CaLiGraph is far from complete. In this paper, we present an approach that exploits the semi-structured nature of listings (like enumerations and tables) to identify the main entities of the listing items (i.e., of entries and rows). These entities, which we call subject entities, can be used to increase the coverage of knowledge graphs. Our approach uses a transformer network to identify subject entities at the token-level and surpasses an existing approach in terms of performance while being bound by fewer limitations. Due to a flexible input format, it is applicable to any kind of listing and is, unlike prior work, not dependent on entity boundaries as input. We demonstrate our approach by applying it to the complete Wikipedia corpus and extracting 40 million mentions of subject entities with an estimated precision of 71% and recall of 77%. The results are incorporated in the most recent version of CaLiGraph.
Enhancing Spatiotemporal Prediction Model using Modular Design and Beyond
- Authors: Haoyu Pan, Hao Wu, Tan Yang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2210.01500
- Pdf link: https://arxiv.org/pdf/2210.01500
- Abstract Predictive learning uses a known state to generate a future state over a period of time. It is a challenging task to predict spatiotemporal sequence because the spatiotemporal sequence varies both in time and space. The mainstream method is to model spatial and temporal structures at the same time using RNN-based or transformer-based architecture, and then generates future data by using learned experience in the way of auto-regressive. The method of learning spatial and temporal features simultaneously brings a lot of parameters to the model, which makes the model difficult to be convergent. In this paper, a modular design is proposed, which decomposes spatiotemporal sequence model into two modules: a spatial encoder-decoder and a predictor. These two modules can extract spatial features and predict future data respectively. The spatial encoder-decoder maps the data into a latent embedding space and generates data from the latent space while the predictor forecasts future embedding from past. By applying the design to the current research and performing experiments on KTH-Action and MovingMNIST datasets, we both improve computational performance and obtain state-of-the-art results.
Knowledge Distillation based Contextual Relevance Matching for E-commerce Product Search
- Authors: Ziyang Liu, Chaokun Wang, Hao Feng, Lingfei Wu, Liqun Yang
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2210.01701
- Pdf link: https://arxiv.org/pdf/2210.01701
- Abstract Online relevance matching is an essential task of e-commerce product search to boost the utility of search engines and ensure a smooth user experience. Previous work adopts either classical relevance matching models or Transformer-style models to address it. However, they ignore the inherent bipartite graph structures that are ubiquitous in e-commerce product search logs and are too inefficient to deploy online. In this paper, we design an efficient knowledge distillation framework for e-commerce relevance matching to integrate the respective advantages of Transformer-style models and classical relevance matching models. Especially for the core student model of the framework, we propose a novel method using $k$-order relevance modeling. The experimental results on large-scale real-world data (the size is 6$\sim$174 million) show that the proposed method significantly improves the prediction accuracy in terms of human relevance judgment. We deploy our method to the anonymous online search platform. The A/B testing results show that our method significantly improves 5.7% of UV-value under price sort mode.
Improving Label-Deficient Keyword Spotting Using Self-Supervised Pretraining
- Authors: Holger Severin Bovbjerg (1), Zheng-Hua Tan (1) ((1) Aalborg University)
- Subjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2210.01703
- Pdf link: https://arxiv.org/pdf/2210.01703
- Abstract In recent years, the development of accurate deep keyword spotting (KWS) models has resulted in KWS technology being embedded in a number of technologies such as voice assistants. Many of these models rely on large amounts of labelled data to achieve good performance. As a result, their use is restricted to applications for which a large labelled speech data set can be obtained. Self-supervised learning seeks to mitigate the need for large labelled data sets by leveraging unlabelled data, which is easier to obtain in large amounts. However, most self-supervised methods have only been investigated for very large models, whereas KWS models are desired to be small. In this paper, we investigate the use of self-supervised pretraining for the smaller KWS models in a label-deficient scenario. We pretrain the Keyword Transformer model using the self-supervised framework Data2Vec and carry out experiments on a label-deficient setup of the Google Speech Commands data set. It is found that the pretrained models greatly outperform the models without pretraining, showing that Data2Vec pretraining can increase the performance of KWS models in label-deficient scenarios. The source code is made publicly available.
Dense Prediction Transformer for Scale Estimation in Monocular Visual Odometry
- Authors: André O. Françani, Marcos R. O. A. Maximo
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2210.01723
- Pdf link: https://arxiv.org/pdf/2210.01723
- Abstract Monocular visual odometry consists of the estimation of the position of an agent through images of a single camera, and it is applied in autonomous vehicles, medical robots, and augmented reality. However, monocular systems suffer from the scale ambiguity problem due to the lack of depth information in 2D frames. This paper contributes by showing an application of the dense prediction transformer model for scale estimation in monocular visual odometry systems. Experimental results show that the scale drift problem of monocular systems can be reduced through the accurate estimation of the depth map by this model, achieving competitive state-of-the-art performance on a visual odometry benchmark.
One Transformer Can Understand Both 2D & 3D Molecular Data
- Authors: Shengjie Luo, Tianlang Chen, Yixian Xu, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, Di He
- Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2210.01765
- Pdf link: https://arxiv.org/pdf/2210.01765
- Abstract Unlike vision and language data which usually has a unique format, molecules can naturally be characterized using different chemical formulations. One can view a molecule as a 2D graph or define it as a collection of atoms located in a 3D space. For molecular representation learning, most previous works designed neural networks only for a particular data format, making the learned models likely to fail for other data formats. We believe a general-purpose neural network model for chemistry should be able to handle molecular tasks across data modalities. To achieve this goal, in this work, we develop a novel Transformer-based Molecular model called Transformer-M, which can take molecular data of 2D or 3D formats as input and generate meaningful semantic representations. Using the standard Transformer as the backbone architecture, Transformer-M develops two separated channels to encode 2D and 3D structural information and incorporate them with the atom features in the network modules. When the input data is in a particular format, the corresponding channel will be activated, and the other will be disabled. By training on 2D and 3D molecular data with properly designed supervised signals, Transformer-M automatically learns to leverage knowledge from different data modalities and correctly capture the representations. We conducted extensive experiments for Transformer-M. All empirical results show that Transformer-M can simultaneously achieve strong performance on 2D and 3D tasks, suggesting its broad applicability. The code and models will be made publicly available at https://github.com/lsj2408/Transformer-M.
COPILOT: Human Collision Prediction and Localization from Multi-view Egocentric Videos
- Authors: Boxiao Pan, Bokui Shen, Davis Rempe, Despoina Paschalidou, Kaichun Mo, Yanchao Yang, Leonidas J. Guibas
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2210.01781
- Pdf link: https://arxiv.org/pdf/2210.01781
- Abstract To produce safe human motions, assistive wearable exoskeletons must be equipped with a perception system that enables anticipating potential collisions from egocentric observations. However, previous approaches to exoskeleton perception greatly simplify the problem to specific types of environments, limiting their scalability. In this paper, we propose the challenging and novel problem of predicting human-scene collisions for diverse environments from multi-view egocentric RGB videos captured from an exoskeleton. By classifying which body joints will collide with the environment and predicting a collision region heatmap that localizes potential collisions in the environment, we aim to develop an exoskeleton perception system that generalizes to complex real-world scenes and provides actionable outputs for downstream control. We propose COPILOT, a video transformer-based model that performs both collision prediction and localization simultaneously, leveraging multi-view video inputs via a proposed joint space-time-viewpoint attention operation. To train and evaluate the model, we build a synthetic data generation framework to simulate virtual humans moving in photo-realistic 3D environments. This framework is then used to establish a dataset consisting of 8.6M egocentric RGBD frames to enable future work on the problem. Extensive experiments suggest that our model achieves promising performance and generalizes to unseen scenes as well as real world. We apply COPILOT to a downstream collision avoidance task, and successfully reduce collision cases by 29% on unseen scenes using a simple closed-loop control algorithm.
Keyword: scene understanding
GaIA: Graphical Information Gain based Attention Network for Weakly Supervised Point Cloud Semantic Segmentation
- Authors: Min Seok Lee, Seok Woo Yang, Sung Won Han
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.01558
- Pdf link: https://arxiv.org/pdf/2210.01558
- Abstract While point cloud semantic segmentation is a significant task in 3D scene understanding, this task demands a time-consuming process of fully annotating labels. To address this problem, recent studies adopt a weakly supervised learning approach under the sparse annotation. Different from the existing studies, this study aims to reduce the epistemic uncertainty measured by the entropy for a precise semantic segmentation. We propose the graphical information gain based attention network called GaIA, which alleviates the entropy of each point based on the reliable information. The graphical information gain discriminates the reliable point by employing relative entropy between target point and its neighborhoods. We further introduce anchor-based additive angular margin loss, ArcPoint. The ArcPoint optimizes the unlabeled points containing high entropy towards semantically similar classes of the labeled points on hypersphere space. Experimental results on S3DIS and ScanNet-v2 datasets demonstrate our framework outperforms the existing weakly supervised methods. We have released GaIA at https://github.com/Karel911/GaIA.
FreDSNet: Joint Monocular Depth and Semantic Segmentation with Fast Fourier Convolutions
- Authors: Bruno Berenguel-Baeta, Jesus Bermudez-Cameo, Jose J. Guerrero
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.01595
- Pdf link: https://arxiv.org/pdf/2210.01595
- Abstract In this work we present FreDSNet, a deep learning solution which obtains semantic 3D understanding of indoor environments from single panoramas. Omnidirectional images reveal task-specific advantages when addressing scene understanding problems due to the 360-degree contextual information about the entire environment they provide. However, the inherent characteristics of the omnidirectional images add additional problems to obtain an accurate detection and segmentation of objects or a good depth estimation. To overcome these problems, we exploit convolutions in the frequential domain obtaining a wider receptive field in each convolutional layer. These convolutions allow to leverage the whole context information from omnidirectional images. FreDSNet is the first network that jointly provides monocular depth estimation and semantic segmentation from a single panoramic image exploiting fast Fourier convolutions. Our experiments show that FreDSNet has similar performance as specific state of the art methods for semantic segmentation and depth estimation. FreDSNet code is publicly available in https://github.com/Sbrunoberenguel/FreDSNet
Keyword: visual reasoning
Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
- Authors: Xu Yang, Hanwang Zhang, Chongyang Gao, Jianfei Cai
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2210.01338
- Pdf link: https://arxiv.org/pdf/2210.01338
- Abstract Humans tend to decompose a sentence into different parts like \textsc{sth do sth at someplace} and then fill each part with certain content. Inspired by this, we follow the \textit{principle of modular design} to propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM). Unlike the \re{widely used} neural module networks in VQA, where the language (\ie, question) is fully observable, \re{the task of collocating visual-linguistic modules is more challenging.} This is because the language is only partially observable, for which we need to dynamically collocate the modules during the process of image captioning. To sum up, we make the following technical contributions to design and train our CVLNM: 1) \textit{distinguishable module design} -- \re{four modules in the encoder} including one linguistic module for function words and three visual modules for different content words (\ie, noun, adjective, and verb) and another linguistic one in the decoder for commonsense reasoning, 2) a self-attention based \textit{module controller} for robustifying the visual reasoning, 3) a part-of-speech based \textit{syntax loss} imposed on the module controller for further regularizing the training of our CVLNM. Extensive experiments on the MS-COCO dataset show that our CVLNM is more effective, \eg, achieving a new state-of-the-art 129.5 CIDEr-D, and more robust, \eg, being less likely to overfit to dataset bias and suffering less when fewer training samples are available. Codes are available at \url{https://github.com/GCYZSL/CVLMN}