arxiv-daily New submissions for Fri, 21 Oct 22

New submissions for Fri, 21 Oct 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

MMRNet: Improving Reliability for Multimodal Computer Vision for Bin Picking via Multimodal Redundancy

Authors: Yuhao Chen, Hayden Gunraj, E. Zhixuan Zeng, Maximilian Gilles, Alexander Wong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2210.10842
Pdf link: https://arxiv.org/pdf/2210.10842
Abstract Recently, there has been tremendous interest in industry 4.0 infrastructure to address labor shortages in global supply chains. Deploying artificial intelligence-enabled robotic bin picking systems in real world has become particularly important for reducing labor demands and costs while increasing efficiency. To this end, artificial intelligence-enabled robotic bin picking systems may be used to automate bin picking, but may also cause expensive damage during an abnormal event such as a sensor failure. As such, reliability becomes a critical factor for translating artificial intelligence research to real world applications and products. In this paper, we propose a reliable vision system with MultiModal Redundancy (MMRNet) for tackling object detection and segmentation for robotic bin picking using data from different modalities. This is the first system that introduces the concept of multimodal redundancy to combat sensor failure issues during deployment. In particular, we realize the multimodal redundancy framework with a gate fusion module and dynamic ensemble learning. Finally, we present a new label-free multimodal consistency score that utilizes the output from all modalities to measure the overall system output reliability and uncertainty. Through experiments, we demonstrate that in an event of missing modality, our system provides a much more reliable performance compared to baseline models. We also demonstrate that our MC score is a more powerful reliability indicator for outputs during inference time where model generated confidence score are often over-confident.

Uni6Dv3: 5D Anchor Mechanism for 6D Pose Estimation

Authors: Jianqiu Chen, Mingshan Sun, Ye Zheng, Tianpeng Bao, Zhenyu He, Donghai Li, Guoqiang Jin, Rui Zhao, Liwei Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.10959
Pdf link: https://arxiv.org/pdf/2210.10959
Abstract Unlike indirect methods that usually require time-consuming post-processing, recent deep learning-based direct methods for 6D pose estimation try to predict the 3D rotation and 3D translation from RGB-D data directly. However, direct methods, regressing the absolute translation of the pose, suffer from diverse object translation distribution between training and test data, which is usually caused by expensive data collection and annotation in practice. To this end, we propose a 5D anchor mechanism by defining the anchor with 3D coordinates in the physical space and 2D coordinates in the image plane. Inspired by anchor-based object detection methods, 5D anchor regresses the offset between the target and anchor, which eliminates the distribution gap and transforms the regression target to a small range. But regressing offset leads to the mismatch between the absolute input and relative output. We build an anchor-based projection model by replacing the absolute input with the relative one, which further improves the performance. By plugging 5D anchor into the latest direct methods, Uni6Dv2 and ES6D obtain 38.7% and 3.5% improvement, respectively. Specifically, Uni6Dv2+5D anchor, dubbed Uni6Dv3, achieves state-of-the-art overall results on datasets including Occlusion LineMOD (79.3%), LineMOD (99.5%), and YCB-Video datasets (91.5%), and requires only 10% of training data to reach comparable performance as full data.

PSA-Det3D: Pillar Set Abstraction for 3D object Detection

Authors: Zhicong Huang, Jingwen Zhao, Zhijie Zheng, Dihu Chena, Haifeng Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2210.10983
Pdf link: https://arxiv.org/pdf/2210.10983
Abstract Small object detection for 3D point cloud is a challenging problem because of two limitations: (1) Perceiving small objects is much more diffcult than normal objects due to the lack of valid points. (2) Small objects are easily blocked which breaks the shape of their meshes in 3D point cloud. In this paper, we propose a pillar set abstraction (PSA) and foreground point compensation (FPC) and design a point-based detection network, PSA-Det3D, to improve the detection performance for small object. The PSA embeds a pillar query operation on the basis of set abstraction (SA) to expand its receptive field of the network, which can aggregate point-wise features effectively. To locate more occluded objects, we persent a proposal generation layer consisting of a foreground point segmentation and a FPC module. Both the foreground points and the estimated centers are finally fused together to generate the detection result. The experiments on the KITTI 3D detection benchmark show that our proposed PSA-Det3D outperforms other algorithms with high accuracy for small object detection.

Large-batch Optimization for Dense Visual Predictions

Authors: Zeyue Xue, Jianming Liang, Guanglu Song, Zhuofan Zong, Liang Chen, Yu Liu, Ping Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.11078
Pdf link: https://arxiv.org/pdf/2210.11078
Abstract Training a large-scale deep neural network in a large-scale dataset is challenging and time-consuming. The recent breakthrough of large-batch optimization is a promising way to tackle this challenge. However, although the current advanced algorithms such as LARS and LAMB succeed in classification models, the complicated pipelines of dense visual predictions such as object detection and segmentation still suffer from the heavy performance drop in the large-batch training regime. To address this challenge, we propose a simple yet effective algorithm, named Adaptive Gradient Variance Modulator (AGVM), which can train dense visual predictors with very large batch size, enabling several benefits more appealing than prior arts. Firstly, AGVM can align the gradient variances between different modules in the dense visual predictors, such as backbone, feature pyramid network (FPN), detection, and segmentation heads. We show that training with a large batch size can fail with the gradient variances misaligned among them, which is a phenomenon primarily overlooked in previous work. Secondly, AGVM is a plug-and-play module that generalizes well to many different architectures (e.g., CNNs and Transformers) and different tasks (e.g., object detection, instance segmentation, semantic segmentation, and panoptic segmentation). It is also compatible with different optimizers (e.g., SGD and AdamW). Thirdly, a theoretical analysis of AGVM is provided. Extensive experiments on the COCO and ADE20K datasets demonstrate the superiority of AGVM. For example, it can train Faster R-CNN+ResNet50 in 4 minutes without losing performance. AGVM enables training an object detector with one billion parameters in just 3.5 hours, reducing the training time by 20.9x, whilst achieving 62.2 mAP on COCO. The deliverables are released at https://github.com/Sense-X/AGVM.

Mathematical Justification of Hard Negative Mining via Isometric Approximation Theorem

Authors: Albert Xu, Jhih-Yi Hsieh, Bhaskar Vundurthy, Eliana Cohen, Howie Choset, Lu Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.11173
Pdf link: https://arxiv.org/pdf/2210.11173
Abstract In deep metric learning, the Triplet Loss has emerged as a popular method to learn many computer vision and natural language processing tasks such as facial recognition, object detection, and visual-semantic embeddings. One issue that plagues the Triplet Loss is network collapse, an undesirable phenomenon where the network projects the embeddings of all data onto a single point. Researchers predominately solve this problem by using triplet mining strategies. While hard negative mining is the most effective of these strategies, existing formulations lack strong theoretical justification for their empirical success. In this paper, we utilize the mathematical theory of isometric approximation to show an equivalence between the Triplet Loss sampled by hard negative mining and an optimization problem that minimizes a Hausdorff-like distance between the neural network and its ideal counterpart function. This provides the theoretical justifications for hard negative mining's empirical efficacy. In addition, our novel application of the isometric approximation theorem provides the groundwork for future forms of hard negative mining that avoid network collapse. Our theory can also be extended to analyze other Euclidean space-based metric learning methods like Ladder Loss or Contrastive Learning.

MixMask: Revisiting Masked Siamese Self-supervised Learning in Asymmetric Distance

Authors: Kirill Vishniakov, Eric Xing, Zhiqiang Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.11456
Pdf link: https://arxiv.org/pdf/2210.11456
Abstract Recent advances in self-supervised learning integrate Masked Modeling and Siamese Networks into a single framework to fully reap the advantages of both the two techniques. However, previous erasing-based masking scheme in masked image modeling is not originally designed for siamese networks. Existing approaches simply inherit the default loss design from previous siamese networks, and ignore the information loss and distance change after employing masking operation in the frameworks. In this paper, we propose a filling-based masking strategy called MixMask to prevent information loss due to the randomly erased areas of an image in vanilla masking method. We further introduce a dynamic loss function design with soft distance to adapt the integrated architecture and avoid mismatches between transformed input and objective in Masked Siamese ConvNets (MSCN). The dynamic loss distance is calculated according to the proposed mix-masking scheme. Extensive experiments are conducted on various datasets of CIFAR-100, Tiny-ImageNet and ImageNet-1K. The results demonstrate that the proposed framework can achieve better accuracy on linear probing, semi-supervised and {supervised finetuning}, which outperforms the state-of-the-art MSCN by a significant margin. We also show the superiority on downstream tasks of object detection and segmentation. Our source code is available at https://github.com/LightnessOfBeing/MixMask.

Self-Supervised Learning via Maximum Entropy Coding

Authors: Xin Liu, Zhongdao Wang, Yali Li, Shengjin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2210.11464
Pdf link: https://arxiv.org/pdf/2210.11464
Abstract A mainstream type of current self-supervised learning methods pursues a general-purpose representation that can be well transferred to downstream tasks, typically by optimizing on a given pretext task such as instance discrimination. In this work, we argue that existing pretext tasks inevitably introduce biases into the learned representation, which in turn leads to biased transfer performance on various downstream tasks. To cope with this issue, we propose Maximum Entropy Coding (MEC), a more principled objective that explicitly optimizes on the structure of the representation, so that the learned representation is less biased and thus generalizes better to unseen downstream tasks. Inspired by the principle of maximum entropy in information theory, we hypothesize that a generalizable representation should be the one that admits the maximum entropy among all plausible representations. To make the objective end-to-end trainable, we propose to leverage the minimal coding length in lossy data coding as a computationally tractable surrogate for the entropy, and further derive a scalable reformulation of the objective that allows fast computation. Extensive experiments demonstrate that MEC learns a more generalizable representation than previous methods based on specific pretext tasks. It achieves state-of-the-art performance consistently on various downstream tasks, including not only ImageNet linear probe, but also semi-supervised classification, object detection, instance segmentation, and object tracking. Interestingly, we show that existing batch-wise and feature-wise self-supervised objectives could be seen equivalent to low-order approximations of MEC. Code and pre-trained models are available at https://github.com/xinliu20/MEC.

Keyword: transformer

Grounded Video Situation Recognition

Authors: Zeeshan Khan, C.V. Jawahar, Makarand Tapaswi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.10828
Pdf link: https://arxiv.org/pdf/2210.10828
Abstract Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) is framed as a task for structured prediction of multiple events, their relationships, and actions and various verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, but also faces some challenges of evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. The second stage sees verb-role queries attend and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions to describe each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time.

Scene Text Recognition with Semantics

Authors: Joshua Cesare Placidi, Yishu Miao, Zixu Wang, Lucia Specia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.10836
Pdf link: https://arxiv.org/pdf/2210.10836
Abstract Scene Text Recognition (STR) models have achieved high performance in recent years on benchmark datasets where text images are presented with minimal noise. Traditional STR recognition pipelines take a cropped image as sole input and attempt to identify the characters present. This infrastructure can fail in instances where the input image is noisy or the text is partially obscured. This paper proposes using semantic information from the greater scene to contextualise predictions. We generate semantic vectors using object tags and fuse this information into a transformer-based architecture. The results demonstrate that our multimodal approach yields higher performance than traditional benchmark models, particularly on noisy instances.

Modeling Animal Vocalizations through Synthesizers

Authors: Masato Hagiwara, Maddie Cusimano, Jen-Yu Liu
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2210.10857
Pdf link: https://arxiv.org/pdf/2210.10857
Abstract Modeling real-world sound is a fundamental problem in the creative use of machine learning and many other fields, including human speech processing and bioacoustics. Transformer-based generative models and some prior work (e.g., DDSP) are known to produce realistic sound, although they have limited control and are hard to interpret. As an alternative, we aim to use modular synthesizers, i.e., compositional, parametric electronic musical instruments, for modeling non-music sounds. However, inferring synthesizer parameters given a target sound, i.e., the parameter inference task, is not trivial for general sounds, and past research has typically focused on musical sound. In this work, we optimize a differentiable synthesizer from TorchSynth in order to model, emulate, and creatively generate animal vocalizations. We compare an array of optimization methods, from gradient-based search to genetic algorithms, for inferring its parameters, and then demonstrate how one can control and interpret the parameters for modeling non-music sounds.

HT-Net: Hierarchical Transformer based Operator Learning Model for Multiscale PDEs

Authors: Xinliang Liu, Bo Xu, Lei Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2210.10890
Pdf link: https://arxiv.org/pdf/2210.10890
Abstract Complex nonlinear interplays of multiple scales give rise to many interesting physical phenomena and pose major difficulties for the computer simulation of multiscale PDE models in areas such as reservoir simulation, high frequency scattering and turbulence modeling. In this paper, we introduce a hierarchical transformer (HT) scheme to efficiently learn the solution operator for multiscale PDEs. We construct a hierarchical architecture with scale adaptive interaction range, such that the features can be computed in a nested manner and with a controllable linear cost. Self-attentions over a hierarchy of levels can be used to encode and decode the multiscale solution space over all scale ranges. In addition, we adopt an empirical $H^1$ loss function to counteract the spectral bias of the neural network approximation for multiscale functions. In the numerical experiments, we demonstrate the superior performance of the HT scheme compared with state-of-the-art (SOTA) methods for representative multiscale problems.

SSiT: Saliency-guided Self-supervised Image Transformer for Diabetic Retinopathy Grading

Authors: Yijin Huang, Junyan Lyu, Pujin Cheng, Roger Tam, Xiaoying Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.10969
Pdf link: https://arxiv.org/pdf/2210.10969
Abstract Self-supervised learning (SSL) has been widely applied to learn image representations through exploiting unlabeled images. However, it has not been fully explored in the medical image analysis field. In this work, we propose Saliency-guided Self-Supervised image Transformer (SSiT) for diabetic retinopathy (DR) grading from fundus images. We novelly introduce saliency maps into SSL, with a goal of guiding self-supervised pre-training with domain-specific prior knowledge. Specifically, two saliency-guided learning tasks are employed in SSiT: (1) We conduct saliency-guided contrastive learning based on the momentum contrast, wherein we utilize fundus images' saliency maps to remove trivial patches from the input sequences of the momentum-updated key encoder. And thus, the key encoder is constrained to provide target representations focusing on salient regions, guiding the query encoder to capture salient features. (2) We train the query encoder to predict the saliency segmentation, encouraging preservation of fine-grained information in the learned representations. Extensive experiments are conducted on four publicly-accessible fundus image datasets. The proposed SSiT significantly outperforms other representative state-of-the-art SSL methods on all datasets and under various evaluation settings, establishing the effectiveness of the learned representations from SSiT. The source code is available at https://github.com/YijinHuang/SSiT.

A Multimodal Sensor Fusion Framework Robust to Missing Modalities for Person Recognition

Authors: Vijay John, Yasutomo Kawanishi
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.10972
Pdf link: https://arxiv.org/pdf/2210.10972
Abstract Utilizing the sensor characteristics of the audio, visible camera, and thermal camera, the robustness of person recognition can be enhanced. Existing multimodal person recognition frameworks are primarily formulated assuming that multimodal data is always available. In this paper, we propose a novel trimodal sensor fusion framework using the audio, visible, and thermal camera, which addresses the missing modality problem. In the framework, a novel deep latent embedding framework, termed the AVTNet, is proposed to learn multiple latent embeddings. Also, a novel loss function, termed missing modality loss, accounts for possible missing modalities based on the triplet loss calculation while learning the individual latent embeddings. Additionally, a joint latent embedding utilizing the trimodal data is learnt using the multi-head attention transformer, which assigns attention weights to the different modalities. The different latent embeddings are subsequently used to train a deep neural network. The proposed framework is validated on the Speaking Faces dataset. A comparative analysis with baseline algorithms shows that the proposed framework significantly increases the person recognition accuracy while accounting for missing modalities.

SimpleClick: Interactive Image Segmentation with Simple Vision Transformers

Authors: Qin Liu, Zhenlin Xu, Gedas Bertasius, Marc Niethammer
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.11006
Pdf link: https://arxiv.org/pdf/2210.11006
Abstract Click-based interactive image segmentation aims at extracting objects with limited user clicking. Hierarchical backbone is the de-facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to be a foundation model that can be finetuned for the downstream task without redesigning a hierarchical backbone for pretraining. Although this design is simple and has been proven effective, it has not yet been explored for interactive segmentation. To fill this gap, we propose the first plain-backbone method, termed as SimpleClick due to its simplicity in architecture, for interactive segmentation. With the plain backbone pretrained as masked autoencoder (MAE), SimpleClick achieves state-of-the-art performance without bells and whistles. Remarkably, our method achieves 4.15 NoC@90 on SBD, improving 21.8% over previous best result. Extensive evaluation of medical images highlights the generalizability of our method. We also provide a detailed computation analysis for our method, highlighting its availability as a practical annotation tool.

How Does a Deep Learning Model Architecture Impact Its Privacy?

Authors: Guangsheng Zhang, Bo Liu, Huan Tian, Tianqing Zhu, Ming Ding, Wanlei Zhou
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2210.11049
Pdf link: https://arxiv.org/pdf/2210.11049
Abstract As a booming research area in the past decade, deep learning technologies have been driven by big data collected and processed on an unprecedented scale. However, the sensitive information in the collected training data raises privacy concerns. Recent research indicated that deep learning models are vulnerable to various privacy attacks, including membership inference attacks, attribute inference attacks, and gradient inversion attacks. It is noteworthy that the performance of the attacks varies from model to model. In this paper, we conduct empirical analyses to answer a fundamental question: Does model architecture affect model privacy? We investigate several representative model architectures from CNNs to Transformers, and show that Transformers are generally more vulnerable to privacy attacks than CNNs. We further demonstrate that the micro design of activation layers, stem layers, and bias parameters, are the major reasons why CNNs are more resilient to privacy attacks than Transformers. We also find that the presence of attention modules is another reason why Transformers are more vulnerable to privacy attacks. We hope our discovery can shed some new light on how to defend against the investigated privacy attacks and help the community build privacy-friendly model architectures.

Large-batch Optimization for Dense Visual Predictions

Authors: Zeyue Xue, Jianming Liang, Guanglu Song, Zhuofan Zong, Liang Chen, Yu Liu, Ping Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.11078
Pdf link: https://arxiv.org/pdf/2210.11078
Abstract Training a large-scale deep neural network in a large-scale dataset is challenging and time-consuming. The recent breakthrough of large-batch optimization is a promising way to tackle this challenge. However, although the current advanced algorithms such as LARS and LAMB succeed in classification models, the complicated pipelines of dense visual predictions such as object detection and segmentation still suffer from the heavy performance drop in the large-batch training regime. To address this challenge, we propose a simple yet effective algorithm, named Adaptive Gradient Variance Modulator (AGVM), which can train dense visual predictors with very large batch size, enabling several benefits more appealing than prior arts. Firstly, AGVM can align the gradient variances between different modules in the dense visual predictors, such as backbone, feature pyramid network (FPN), detection, and segmentation heads. We show that training with a large batch size can fail with the gradient variances misaligned among them, which is a phenomenon primarily overlooked in previous work. Secondly, AGVM is a plug-and-play module that generalizes well to many different architectures (e.g., CNNs and Transformers) and different tasks (e.g., object detection, instance segmentation, semantic segmentation, and panoptic segmentation). It is also compatible with different optimizers (e.g., SGD and AdamW). Thirdly, a theoretical analysis of AGVM is provided. Extensive experiments on the COCO and ADE20K datasets demonstrate the superiority of AGVM. For example, it can train Faster R-CNN+ResNet50 in 4 minutes without losing performance. AGVM enables training an object detector with one billion parameters in just 3.5 hours, reducing the training time by 20.9x, whilst achieving 62.2 mAP on COCO. The deliverables are released at https://github.com/Sense-X/AGVM.

General Image Descriptors for Open World Image Retrieval using ViT CLIP

Authors: Marcos V. Conde, Ivan Aerlic, Simon Jégou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2210.11141
Pdf link: https://arxiv.org/pdf/2210.11141
Abstract The Google Universal Image Embedding (GUIE) Challenge is one of the first competitions in multi-domain image representations in the wild, covering a wide distribution of objects: landmarks, artwork, food, etc. This is a fundamental computer vision problem with notable applications in image retrieval, search engines and e-commerce. In this work, we explain our 4th place solution to the GUIE Challenge, and our "bag of tricks" to fine-tune zero-shot Vision Transformers (ViT) pre-trained using CLIP.

Transformer-based Entity Typing in Knowledge Graphs

Authors: Zhiwei Hu, Víctor Gutiérrez-Basulto, Zhiliang Xiang, Ru Li, Jeff Z. Pan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2210.11151
Pdf link: https://arxiv.org/pdf/2210.11151
Abstract We investigate the knowledge graph entity typing task which aims at inferring plausible entity types. In this paper, we propose a novel Transformer-based Entity Typing (TET) approach, effectively encoding the content of neighbors of an entity. More precisely, TET is composed of three different mechanisms: a local transformer allowing to infer missing types of an entity by independently encoding the information provided by each of its neighbors; a global transformer aggregating the information of all neighbors of an entity into a single long sequence to reason about more complex entity types; and a context transformer integrating neighbors content based on their contribution to the type inference through information exchange between neighbor pairs. Furthermore, TET uses information about class membership of types to semantically strengthen the representation of an entity. Experiments on two real-world datasets demonstrate the superior performance of TET compared to the state-of-the-art.

Accurate Extrinsic Prediction of Physical Systems Using Transformers

Authors: Arnaud Pannatier, Kyle Matoba, François Fleuret
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2210.11269
Pdf link: https://arxiv.org/pdf/2210.11269
Abstract Accurate high-altitude wind forecasting is important for air traffic control. And the large volume of data available for this task makes deep neural network-based models a possibility. However, special methods are required because the data is measured only sparsely: along the main aircraft trajectories and arranged sparsely in space, namely along the main air corridors. Several deep learning approaches have been proposed, and in this work, we show that Transformers can fit this data efficiently and are able to extrapolate coherently from a context set. We show this by an extensive comparison of Transformers to numerous existing deep learning-based baselines in the literature. Besides high-altitude wind forecasting, we compare competing models on other dynamical physical systems, namely those modelled by partial differential equations, in particular the Poisson equation and Darcy Flow equation. For these experiments, in the case where the data is arranged non-regularly in space, Transformers outperform all the other evaluated methods. We also compared them in a more standard setup where the data is arranged on a grid and show that the Transformers are competitive with state-of-the-art methods, even though it does not require regular spacing. The code and datasets of the different experiments will be made publicly available at publication time.

VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors

Authors: Yifeng Zhu, Abhishek Joshi, Peter Stone, Yuke Zhu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2210.11339
Pdf link: https://arxiv.org/pdf/2210.11339
Abstract We introduce VIOLA, an object-centric imitation learning approach to learning closed-loop visuomotor policies for robot manipulation. Our approach constructs object-centric representations based on general object proposals from a pre-trained vision model. VIOLA uses a transformer-based policy to reason over these representations and attend to the task-relevant visual factors for action prediction. Such object-based structural priors improve deep imitation learning algorithm's robustness against object variations and environmental perturbations. We quantitatively evaluate VIOLA in simulation and on real robots. VIOLA outperforms the state-of-the-art imitation learning methods by $45.8%$ in success rate. It has also been deployed successfully on a physical robot to solve challenging long-horizon tasks, such as dining table arrangement and coffee making. More videos and model details can be found in supplementary material and the project website: https://ut-austin-rpl.github.io/VIOLA .

Transformer-based Global 3D Hand Pose Estimation in Two Hands Manipulating Objects Scenarios

Authors: Hoseong Cho, Donguk Kim, Chanwoo Kim, Seongyeong Lee, Seungryul Baek
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.11384
Pdf link: https://arxiv.org/pdf/2210.11384
Abstract This report describes our 1st place solution to ECCV 2022 challenge on Human Body, Hands, and Activities (HBHA) from Egocentric and Multi-view Cameras (hand pose estimation). In this challenge, we aim to estimate global 3D hand poses from the input image where two hands and an object are interacting on the egocentric viewpoint. Our proposed method performs end-to-end multi-hand pose estimation via transformer architecture. In particular, our method robustly estimates hand poses in a scenario where two hands interact. Additionally, we propose an algorithm that considers hand scales to robustly estimate the absolute depth. The proposed algorithm works well even when the hand sizes are various for each person. Our method attains 14.4 mm (left) and 15.9 mm (right) errors for each hand in the test set.

Transformer-based Action recognition in hand-object interacting scenarios

Authors: Hoseong Cho, Seungryul Baek
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.11387
Pdf link: https://arxiv.org/pdf/2210.11387
Abstract This report describes the 2nd place solution to the ECCV 2022 Human Body, Hands, and Activities (HBHA) from Egocentric and Multi-view Cameras Challenge: Action Recognition. This challenge aims to recognize hand-object interaction in an egocentric view. We propose a framework that estimates keypoints of two hands and an object with a Transformer-based keypoint estimator and recognizes actions based on the estimated keypoints. We achieved a top-1 accuracy of 87.19% on the testset.

Solving Reasoning Tasks with a Slot Transformer

Authors: Ryan Faulkner, Daniel Zoran
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.11394
Pdf link: https://arxiv.org/pdf/2210.11394
Abstract The ability to carve the world into useful abstractions in order to reason about time and space is a crucial component of intelligence. In order to successfully perceive and act effectively using senses we must parse and compress large amounts of information for further downstream reasoning to take place, allowing increasingly complex concepts to emerge. If there is any hope to scale representation learning methods to work with real world scenes and temporal dynamics then there must be a way to learn accurate, concise, and composable abstractions across time. We present the Slot Transformer, an architecture that leverages slot attention, transformers and iterative variational inference on video scene data to infer such representations. We evaluate the Slot Transformer on CLEVRER, Kinetics-600 and CATER datesets and demonstrate that the approach allows us to develop robust modeling and reasoning around complex behaviours as well as scores on these datasets that compare favourably to existing baselines. Finally we evaluate the effectiveness of key components of the architecture, the model's representational capacity and its ability to predict from incomplete input.

Self-Supervised Learning with Masked Image Modeling for Teeth Numbering, Detection of Dental Restorations, and Instance Segmentation in Dental Panoramic Radiographs

Authors: Amani Almalki, Longin Jan Latecki
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.11404
Pdf link: https://arxiv.org/pdf/2210.11404
Abstract The computer-assisted radiologic informative report is currently emerging in dental practice to facilitate dental care and reduce time consumption in manual panoramic radiographic interpretation. However, the amount of dental radiographs for training is very limited, particularly from the point of view of deep learning. This study aims to utilize recent self-supervised learning methods like SimMIM and UM-MAE to increase the model efficiency and understanding of the limited number of dental radiographs. We use the Swin Transformer for teeth numbering, detection of dental restorations, and instance segmentation tasks. To the best of our knowledge, this is the first study that applied self-supervised learning methods to Swin Transformer on dental panoramic radiographs. Our results show that the SimMIM method obtained the highest performance of 90.4% and 88.9% on detecting teeth and dental restorations and instance segmentation, respectively, increasing the average precision by 13.4 and 12.8 over the random initialization baseline. Moreover, we augment and correct the existing dataset of panoramic radiographs. The code and the dataset are available at https://github.com/AmaniHAlmalki/DentalMIM.

GPR-Net: Multi-view Layout Estimation via a Geometry-aware Panorama Registration Network

Authors: Jheng-Wei Su, Chi-Han Peng, Peter Wonka, Hung-Kuo Chu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.11419
Pdf link: https://arxiv.org/pdf/2210.11419
Abstract Reconstructing 3D layouts from multiple $360^{\circ}$ panoramas has received increasing attention recently as estimating a complete layout of a large-scale and complex room from a single panorama is very difficult. The state-of-the-art method, called PSMNet, introduces the first learning-based framework that jointly estimates the room layout and registration given a pair of panoramas. However, PSMNet relies on an approximate (i.e., "noisy") registration as input. Obtaining this input requires a solution for wide baseline registration which is a challenging problem. In this work, we present a complete multi-view panoramic layout estimation framework that jointly learns panorama registration and layout estimation given a pair of panoramas without relying on a pose prior. The major improvement over PSMNet comes from a novel Geometry-aware Panorama Registration Network or GPR-Net that effectively tackles the wide baseline registration problem by exploiting the layout geometry and computing fine-grained correspondences on the layout boundaries, instead of the global pixel-space. Our architecture consists of two parts. First, given two panoramas, we adopt a vision transformer to learn a set of 1D horizon features sampled on the panorama. These 1D horizon features encode the depths of individual layout boundary samples and the correspondence and covisibility maps between layout boundaries. We then exploit a non-linear registration module to convert these 1D horizon features into a set of corresponding 2D boundary points on the layout. Finally, we estimate the final relative camera pose via RANSAC and obtain the complete layout simply by taking the union of registered layouts. Experimental results indicate that our method achieves state-of-the-art performance in both panorama registration and layout estimation on a large-scale indoor panorama dataset ZInD.

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

Oct 21 '22 03:10 DongZhouGu

arxiv-daily arxiv-daily copied to clipboard

New submissions for Fri, 21 Oct 22

Keyword: human object interaction

Keyword: visual relation detection

Keyword: object detection

MMRNet: Improving Reliability for Multimodal Computer Vision for Bin Picking via Multimodal Redundancy

Uni6Dv3: 5D Anchor Mechanism for 6D Pose Estimation

PSA-Det3D: Pillar Set Abstraction for 3D object Detection

Large-batch Optimization for Dense Visual Predictions

Mathematical Justification of Hard Negative Mining via Isometric Approximation Theorem

MixMask: Revisiting Masked Siamese Self-supervised Learning in Asymmetric Distance

Self-Supervised Learning via Maximum Entropy Coding

Keyword: transformer

Grounded Video Situation Recognition

Scene Text Recognition with Semantics

Modeling Animal Vocalizations through Synthesizers

HT-Net: Hierarchical Transformer based Operator Learning Model for Multiscale PDEs

SSiT: Saliency-guided Self-supervised Image Transformer for Diabetic Retinopathy Grading

A Multimodal Sensor Fusion Framework Robust to Missing Modalities for Person Recognition

SimpleClick: Interactive Image Segmentation with Simple Vision Transformers

How Does a Deep Learning Model Architecture Impact Its Privacy?

Large-batch Optimization for Dense Visual Predictions

General Image Descriptors for Open World Image Retrieval using ViT CLIP

Transformer-based Entity Typing in Knowledge Graphs

Accurate Extrinsic Prediction of Physical Systems Using Transformers

VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors

Transformer-based Global 3D Hand Pose Estimation in Two Hands Manipulating Objects Scenarios

Transformer-based Action recognition in hand-object interacting scenarios

Solving Reasoning Tasks with a Slot Transformer

Self-Supervised Learning with Masked Image Modeling for Teeth Numbering, Detection of Dental Restorations, and Instance Segmentation in Dental Panoramic Radiographs

GPR-Net: Multi-view Layout Estimation via a Geometry-aware Panorama Registration Network

Keyword: scene understanding

Keyword: visual reasoning

arxiv-daily
arxiv-daily copied to clipboard