arxiv-daily New submissions for Tue, 6 Dec 22

New submissions for Tue, 6 Dec 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

PROB: Probabilistic Objectness for Open World Object Detection

Authors: Orr Zohar, Kuan-Chieh Wang, Serena Yeung
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2212.01424
Pdf link: https://arxiv.org/pdf/2212.01424
Abstract Open World Object Detection (OWOD) is a new and challenging computer vision task that bridges the gap between classic object detection (OD) benchmarks and object detection in the real world. In addition to detecting and classifying seen/labeled objects, OWOD algorithms are expected to detect novel/unknown objects - which can be classified and incrementally learned. In standard OD, object proposals not overlapping with a labeled object are automatically classified as background. Therefore, simply applying OD methods to OWOD fails as unknown objects would be predicted as background. The challenge of detecting unknown objects stems from the lack of supervision in distinguishing unknown objects and background object proposals. Previous OWOD methods have attempted to overcome this issue by generating supervision using pseudo-labeling - however, unknown object detection has remained low. Probabilistic/generative models may provide a solution for this challenge. Herein, we introduce a novel probabilistic framework for objectness estimation, where we alternate between probability distribution estimation and objectness likelihood maximization of known objects in the embedded feature space - ultimately allowing us to estimate the objectness probability of different proposals. The resulting Probabilistic Objectness transformer-based open-world detector, PROB, integrates our framework into traditional object detection models, adapting them for the open-world setting. Comprehensive experiments on OWOD benchmarks show that PROB outperforms all existing OWOD methods in both unknown object detection ($\sim 2\times$ unknown recall) and known object detection ($\sim 10%$ mAP). Our code will be made available upon publication at https://github.com/orrzohar/PROB.

IDMS: Instance Depth for Multi-scale Monocular 3D Object Detection

Authors: Chao Hu, Liqiang Zhu, Weibing Qiu, Weijie Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.01528
Pdf link: https://arxiv.org/pdf/2212.01528
Abstract Due to the lack of depth information of images and poor detection accuracy in monocular 3D object detection, we proposed the instance depth for multi-scale monocular 3D object detection method. Firstly, to enhance the model's processing ability for different scale targets, a multi-scale perception module based on dilated convolution is designed, and the depth features containing multi-scale information are re-refined from both spatial and channel directions considering the inconsistency between feature maps of different scales. Firstly, we designed a multi-scale perception module based on dilated convolution to enhance the model's processing ability for different scale targets. The depth features containing multi-scale information are re-refined from spatial and channel directions considering the inconsistency between feature maps of different scales. Secondly, so as to make the model obtain better 3D perception, this paper proposed to use the instance depth information as an auxiliary learning task to enhance the spatial depth feature of the 3D target and use the sparse instance depth to supervise the auxiliary task. Finally, by verifying the proposed algorithm on the KITTI test set and evaluation set, the experimental results show that compared with the baseline method, the proposed method improves by 5.27% in AP40 in the car category, effectively improving the detection performance of the monocular 3D object detection algorithm.

Onboard Real-Time Multi-Sensor Pose Estimation for Indoor Quadrotor Navigation with Intermittent Communication

Authors: Loizos Hadjiloizou, Kyriakos M. Deliparaschos, Evagoras Makridis, Themistoklis Charalambous
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2212.01599
Pdf link: https://arxiv.org/pdf/2212.01599
Abstract We propose a multisensor fusion framework for onboard real-time navigation of a quadrotor in an indoor environment, by integrating sensor readings from an Inertial Measurement Unit (IMU), a camera-based object detection algorithm, and an Ultra-WideBand (UWB) localization system. The sensor readings from the camera-based object detection algorithm and the UWB localization system arrive intermittently, since the measurements are not readily available. We design a Kalman filter that manages intermittent observations in order to handle and fuse the readings and estimate the pose of the quadrotor for tracking a predefined trajectory. The system is implemented via a Hardware-in-the-loop (HIL) simulation technique, in which the dynamic model of the quadrotor is simulated in an open-source 3D robotics simulator tool, and the whole navigation system is implemented on Artificial Intelligence (AI) enabled edge GPU. The simulation results show that our proposed framework offers low positioning and trajectory errors, while handling intermittent sensor measurements.

Synthesize Boundaries: A Boundary-aware Self-consistent Framework for Weakly Supervised Salient Object Detection

Authors: Binwei Xu, Haoran Liang, Ronghua Liang, Peng Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.01764
Pdf link: https://arxiv.org/pdf/2212.01764
Abstract Fully supervised salient object detection (SOD) has made considerable progress based on expensive and time-consuming data with pixel-wise annotations. Recently, to relieve the labeling burden while maintaining performance, some scribble-based SOD methods have been proposed. However, learning precise boundary details from scribble annotations that lack edge information is still difficult. In this paper, we propose to learn precise boundaries from our designed synthetic images and labels without introducing any extra auxiliary data. The synthetic image creates boundary information by inserting synthetic concave regions that simulate the real concave regions of salient objects. Furthermore, we propose a novel self-consistent framework that consists of a global integral branch (GIB) and a boundary-aware branch (BAB) to train a saliency detector. GIB aims to identify integral salient objects, whose input is the original image. BAB aims to help predict accurate boundaries, whose input is the synthetic image. These two branches are connected through a self-consistent loss to guide the saliency detector to predict precise boundaries while identifying salient objects. Experimental results on five benchmarks demonstrate that our method outperforms the state-of-the-art weakly supervised SOD methods and further narrows the gap with the fully supervised methods.

3D Object Aided Self-Supervised Monocular Depth Estimation

Authors: Songlin Wei, Guodong Chen, Wenzheng Chi, Zhenhua Wang, Lining Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.01768
Pdf link: https://arxiv.org/pdf/2212.01768
Abstract Monocular depth estimation has been actively studied in fields such as robot vision, autonomous driving, and 3D scene understanding. Given a sequence of color images, unsupervised learning methods based on the framework of Structure-From-Motion (SfM) simultaneously predict depth and camera relative pose. However, dynamically moving objects in the scene violate the static world assumption, resulting in inaccurate depths of dynamic objects. In this work, we propose a new method to address such dynamic object movements through monocular 3D object detection. Specifically, we first detect 3D objects in the images and build the per-pixel correspondence of the dynamic pixels with the detected object pose while leaving the static pixels corresponding to the rigid background to be modeled with camera motion. In this way, the depth of every pixel can be learned via a meaningful geometry model. Besides, objects are detected as cuboids with absolute scale, which is used to eliminate the scale ambiguity problem inherent in monocular vision. Experiments on the KITTI depth dataset show that our method achieves State-of-The-Art performance for depth estimation. Furthermore, joint training of depth, camera motion and object pose also improves monocular 3D object detection performance. To the best of our knowledge, this is the first work that allows a monocular 3D object detection network to be fine-tuned in a self-supervised manner.

DA-CIL: Towards Domain Adaptive Class-Incremental 3D Object Detection

Authors: Ziyuan Zhao, Mingxi Xu, Peisheng Qian, Ramanpreet Singh Pahwa, Richard Chang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2212.02057
Pdf link: https://arxiv.org/pdf/2212.02057
Abstract Deep learning has achieved notable success in 3D object detection with the advent of large-scale point cloud datasets. However, severe performance degradation in the past trained classes, i.e., catastrophic forgetting, still remains a critical issue for real-world deployment when the number of classes is unknown or may vary. Moreover, existing 3D class-incremental detection methods are developed for the single-domain scenario, which fail when encountering domain shift caused by different datasets, varying environments, etc. In this paper, we identify the unexplored yet valuable scenario, i.e., class-incremental learning under domain shift, and propose a novel 3D domain adaptive class-incremental object detection framework, DA-CIL, in which we design a novel dual-domain copy-paste augmentation method to construct multiple augmented domains for diversifying training distributions, thereby facilitating gradual domain adaptation. Then, multi-level consistency is explored to facilitate dual-teacher knowledge distillation from different domains for domain adaptive class-incremental learning. Extensive experiments on various datasets demonstrate the effectiveness of the proposed method over baselines in the domain adaptive class-incremental learning scenario.

DL-SLOT: Dynamic LiDAR SLAM and object tracking based on collaborative graph optimization

Authors: Xuebo Tian, Zhongyang Zhu, Junqiao Zhao, Gengxuan Tian, Chen Ye
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2212.02077
Pdf link: https://arxiv.org/pdf/2212.02077
Abstract Ego-pose estimation and dynamic object tracking are two critical problems for autonomous driving systems. The solutions to these problems are generally based on their respective assumptions, \ie{the static world assumption for simultaneous localization and mapping (SLAM) and the accurate ego-pose assumption for object tracking}. However, these assumptions are challenging to hold in dynamic road scenarios, where SLAM and object tracking become closely correlated. Therefore, we propose DL-SLOT, a dynamic LiDAR SLAM and object tracking method, to simultaneously address these two coupled problems. This method integrates the state estimations of both the autonomous vehicle and the stationary and dynamic objects in the environment into a unified optimization framework. First, we used object detection to identify all points belonging to potentially dynamic objects. Subsequently, a LiDAR odometry was conducted using the filtered point cloud. Simultaneously, we proposed a sliding window-based object association method that accurately associates objects according to the historical trajectories of tracked objects. The ego-states and those of the stationary and dynamic objects are integrated into the sliding window-based collaborative graph optimization. The stationary objects are subsequently restored from the potentially dynamic object set. Finally, a global pose-graph is implemented to eliminate the accumulated error. Experiments on KITTI datasets demonstrate that our method achieves better accuracy than SLAM and object tracking baseline methods. This confirms that solving SLAM and object tracking simultaneously is mutually advantageous, dramatically improving the robustness and accuracy of SLAM and object tracking in dynamic road scenarios.

YolOOD: Utilizing Object Detection Concepts for Out-of-Distribution Detection

Authors: Alon Zolfi, Guy Amit, Amit Baras, Satoru Koda, Ikuya Morikawa, Yuval Elovici, Asaf Shabtai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2212.02081
Pdf link: https://arxiv.org/pdf/2212.02081
Abstract Out-of-distribution (OOD) detection has attracted a large amount of attention from the machine learning research community in recent years due to its importance in deployed systems. Most of the previous studies focused on the detection of OOD samples in the multi-class classification task. However, OOD detection in the multi-label classification task remains an underexplored domain. In this research, we propose YolOOD - a method that utilizes concepts from the object detection domain to perform OOD detection in the multi-label classification task. Object detection models have an inherent ability to distinguish between objects of interest (in-distribution) and irrelevant objects (e.g., OOD objects) on images that contain multiple objects from different categories. These abilities allow us to convert a regular object detection model into an image classifier with inherent OOD detection capabilities with just minor changes. We compare our approach to state-of-the-art OOD detection methods and demonstrate YolOOD's ability to outperform these methods on a comprehensive suite of in-distribution and OOD benchmark datasets.

Perceive, Interact, Predict: Learning Dynamic and Static Clues for End-to-End Motion Prediction

Authors: Bo Jiang, Shaoyu Chen, Xinggang Wang, Bencheng Liao, Tianheng Cheng, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2212.02181
Pdf link: https://arxiv.org/pdf/2212.02181
Abstract Motion prediction is highly relevant to the perception of dynamic objects and static map elements in the scenarios of autonomous driving. In this work, we propose PIP, the first end-to-end Transformer-based framework which jointly and interactively performs online mapping, object detection and motion prediction. PIP leverages map queries, agent queries and mode queries to encode the instance-wise information of map elements, agents and motion intentions, respectively. Based on the unified query representation, a differentiable multi-task interaction scheme is proposed to exploit the correlation between perception and prediction. Even without human-annotated HD map or agent's historical tracking trajectory as guidance information, PIP realizes end-to-end multi-agent motion prediction and achieves better performance than tracking-based and HD-map-based methods. PIP provides comprehensive high-level information of the driving scene (vectorized static map and dynamic objects with motion information), and contributes to the downstream planning and control. Code and models will be released for facilitating further research.

Real Time Incremental Image Mosaicking Without Use of Any Camera Parameter

Authors: Suleyman Melih Portakal, Ahmet Alp Kindiroglu, Mahiye Uluyagmur Ozturk
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.02302
Pdf link: https://arxiv.org/pdf/2212.02302
Abstract Over the past decade, there has been a significant increase in the use of Unmanned Aerial Vehicles (UAVs) to support a wide variety of missions, such as remote surveillance, vehicle tracking, and object detection. For problems involving processing of areas larger than a single image, the mosaicking of UAV imagery is a necessary step. Real-time image mosaicking is used for missions that requires fast response like search and rescue missions. It typically requires information from additional sensors, such as Global Position System (GPS) and Inertial Measurement Unit (IMU), to facilitate direct orientation, or 3D reconstruction approaches to recover the camera poses. This paper proposes a UAV-based system for real-time creation of incremental mosaics which does not require either direct or indirect camera parameters such as orientation information. Inspired by previous approaches, in the mosaicking process, feature extraction from images, matching of similar key points between images, finding homography matrix to warp and align images, and blending images to obtain mosaics better looking, plays important roles in the achievement of the high quality result. Edge detection is used in the blending step as a novel approach. Experimental results show that real-time incremental image mosaicking process can be completed satisfactorily and without need for any additional camera parameters.

Keyword: transformer

PROB: Probabilistic Objectness for Open World Object Detection

Authors: Orr Zohar, Kuan-Chieh Wang, Serena Yeung
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2212.01424
Pdf link: https://arxiv.org/pdf/2212.01424
Abstract Open World Object Detection (OWOD) is a new and challenging computer vision task that bridges the gap between classic object detection (OD) benchmarks and object detection in the real world. In addition to detecting and classifying seen/labeled objects, OWOD algorithms are expected to detect novel/unknown objects - which can be classified and incrementally learned. In standard OD, object proposals not overlapping with a labeled object are automatically classified as background. Therefore, simply applying OD methods to OWOD fails as unknown objects would be predicted as background. The challenge of detecting unknown objects stems from the lack of supervision in distinguishing unknown objects and background object proposals. Previous OWOD methods have attempted to overcome this issue by generating supervision using pseudo-labeling - however, unknown object detection has remained low. Probabilistic/generative models may provide a solution for this challenge. Herein, we introduce a novel probabilistic framework for objectness estimation, where we alternate between probability distribution estimation and objectness likelihood maximization of known objects in the embedded feature space - ultimately allowing us to estimate the objectness probability of different proposals. The resulting Probabilistic Objectness transformer-based open-world detector, PROB, integrates our framework into traditional object detection models, adapting them for the open-world setting. Comprehensive experiments on OWOD benchmarks show that PROB outperforms all existing OWOD methods in both unknown object detection ($\sim 2\times$ unknown recall) and known object detection ($\sim 10%$ mAP). Our code will be made available upon publication at https://github.com/orrzohar/PROB.

Learning a Pedestrian Social Behavior Dictionary

Authors: Faith Johnson, Kristin Dana
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.01426
Pdf link: https://arxiv.org/pdf/2212.01426
Abstract Understanding pedestrian behavior patterns is a key component to building autonomous agents that can navigate among humans. We seek a learned dictionary of pedestrian behavior to obtain a semantic description of pedestrian trajectories. Supervised methods for dictionary learning are impractical since pedestrian behaviors may be unknown a priori and the process of manually generating behavior labels is prohibitively time consuming. We instead utilize a novel, unsupervised framework to create a taxonomy of pedestrian behavior observed in a specific space. First, we learn a trajectory latent space that enables unsupervised clustering to create an interpretable pedestrian behavior dictionary. We show the utility of this dictionary for building pedestrian behavior maps to visualize space usage patterns and for computing the distributions of behaviors. We demonstrate a simple but effective trajectory prediction by conditioning on these behavior labels. While many trajectory analysis methods rely on RNNs or transformers, we develop a lightweight, low-parameter approach and show results comparable to SOTA on the ETH and UCY datasets.

Compound Tokens: Channel Fusion for Vision-Language Representation Learning

Authors: Maxwell Mbabilla Aladago, AJ Piergiovanni
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2212.01447
Pdf link: https://arxiv.org/pdf/2212.01447
Abstract We present an effective method for fusing visual-and-language representations for several question answering tasks including visual question answering and visual entailment. In contrast to prior works that concatenate unimodal representations or use only cross-attention, we compose multimodal representations via channel fusion. By fusing on the channels, the model is able to more effectively align the tokens compared to standard methods. These multimodal representations, which we call compound tokens are generated with cross-attention transformer layers. First, vision tokens are used as queries to retrieve compatible text tokens through cross-attention. We then chain the vision tokens and the queried text tokens along the channel dimension. We call the resulting representations compound tokens. A second group of compound tokens are generated using an analogous process where the text tokens serve as queries to the cross-attention layer. We concatenate all the compound tokens for further processing with multimodal encoder. We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting. Compound Tokens achieve highly competitive performance across a range of question answering tasks including GQA, VQA2.0, and SNLI-VE.

Twitter Data Analysis: Izmir Earthquake Case

Authors: Özgür Agrali, Hakan Sökün, Enis Karaarslan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2212.01453
Pdf link: https://arxiv.org/pdf/2212.01453
Abstract T"urkiye is located on a fault line; earthquakes often occur on a large and small scale. There is a need for effective solutions for gathering current information during disasters. We can use social media to get insight into public opinion. This insight can be used in public relations and disaster management. In this study, Twitter posts on Izmir Earthquake that took place on October 2020 are analyzed. We question if this analysis can be used to make social inferences on time. Data mining and natural language processing (NLP) methods are used for this analysis. NLP is used for sentiment analysis and topic modelling. The latent Dirichlet Allocation (LDA) algorithm is used for topic modelling. We used the Bidirectional Encoder Representations from Transformers (BERT) model working with Transformers architecture for sentiment analysis. It is shown that the users shared their goodwill wishes and aimed to contribute to the initiated aid activities after the earthquake. The users desired to make their voices heard by competent institutions and organizations. The proposed methods work effectively. Future studies are also discussed.

Box2Mask: Box-supervised Instance Segmentation via Level-set Evolution

Authors: Wentong Li, Wenyu Liu, Jianke Zhu, Miaomiao Cui, Risheng Yu, Xiansheng Hua, Lei Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.01579
Pdf link: https://arxiv.org/pdf/2212.01579
Abstract In contrast to fully supervised methods using pixel-wise mask labels, box-supervised instance segmentation takes advantage of simple box annotations, which has recently attracted increasing research attention. This paper presents a novel single-shot instance segmentation approach, namely Box2Mask, which integrates the classical level-set evolution model into deep neural network learning to achieve accurate mask prediction with only bounding box supervision. Specifically, both the input image and its deep features are employed to evolve the level-set curves implicitly, and a local consistency module based on a pixel affinity kernel is used to mine the local context and spatial relations. Two types of single-stage frameworks, i.e., CNN-based and transformer-based frameworks, are developed to empower the level-set evolution for box-supervised instance segmentation, and each framework consists of three essential components: instance-aware decoder, box-level matching assignment and level-set evolution. By minimizing the level-set energy function, the mask map of each instance can be iteratively optimized within its bounding box annotation. The experimental results on five challenging testbeds, covering general scenes, remote sensing, medical and scene text images, demonstrate the outstanding performance of our proposed Box2Mask approach for box-supervised instance segmentation. In particular, with the Swin-Transformer large backbone, our Box2Mask obtains 42.4% mask AP on COCO, which is on par with the recently developed fully mask-supervised methods. The code is available at: https://github.com/LiWentomng/boxlevelset.

Exploring Stochastic Autoregressive Image Modeling for Visual Representation

Authors: Yu Qi, Fan Yang, Yousong Zhu, Yufei Liu, Liwei Wu, Rui Zhao, Wei Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2212.01610
Pdf link: https://arxiv.org/pdf/2212.01610
Abstract Autoregressive language modeling (ALM) have been successfully used in self-supervised pre-training in Natural language processing (NLP). However, this paradigm has not achieved comparable results with other self-supervised approach in computer vision (e.g., contrastive learning, mask image modeling). In this paper, we try to find the reason why autoregressive modeling does not work well on vision tasks. To tackle this problem, we fully analyze the limitation of visual autoregressive methods and proposed a novel stochastic autoregressive image modeling (named SAIM) by the two simple designs. First, we employ stochastic permutation strategy to generate effective and robust image context which is critical for vision tasks. Second, we create a parallel encoder-decoder training process in which the encoder serves a similar role to the standard vision transformer focus on learning the whole contextual information, and meanwhile the decoder predicts the content of the current position, so that the encoder and decoder can reinforce each other. By introducing stochastic prediction and the parallel encoder-decoder, SAIM significantly improve the performance of autoregressive image modeling. Our method achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data. Transfer performance in downstream tasks also show that our model achieves competitive performance.

Global memory transformer for processing long documents

Authors: Arij Al Adel
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2212.01650
Pdf link: https://arxiv.org/pdf/2212.01650
Abstract Transformer variants dominate the state-of-the-art in different natural language processing tasks such as translation, reading comprehension and summarization. Our paper is more directed to use general memory slots added to the inputs and studying the results of adding these slots. This paper is a go on study of general memory slots rule that were added to the input of the proposed model in previous work. We have two main tasks;1) pretraining task using masked language modeling and b) fine tuning task using HotpotQA . This study aims to verify the ability of the proposed model to handle chunks as if they were one chunk comparing with the base model. As baseline we used T5 transformer. We studied the rule of memory slots augmented to each input chunk and studied the model performance without selector. We found that adding memory to input chunks helped the proposed model to overcome the baseline on Masked language modeling task with specific training parameters. Ablation study reveals the ability of using the compressed input chunks with a degradation in performance.

A Survey on Medical Document Summarization

Authors: Raghav Jain, Anubhav Jangra, Sriparna Saha, Adam Jatowt
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2212.01669
Pdf link: https://arxiv.org/pdf/2212.01669
Abstract The internet has had a dramatic effect on the healthcare industry, allowing documents to be saved, shared, and managed digitally. This has made it easier to locate and share important data, improving patient care and providing more opportunities for medical studies. As there is so much data accessible to doctors and patients alike, summarizing it has become increasingly necessary - this has been supported through the introduction of deep learning and transformer-based networks, which have boosted the sector significantly in recent years. This paper gives a comprehensive survey of the current techniques and trends in medical summarization

Recognition and Prediction of Surgical Gestures and Trajectories Using Transformer Models in Robot-Assisted Surgery

Authors: Chang Shi, Yi Zheng, Ann Majewicz Fey
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2212.01683
Pdf link: https://arxiv.org/pdf/2212.01683
Abstract Surgical activity recognition and prediction can help provide important context in many Robot-Assisted Surgery (RAS) applications, for example, surgical progress monitoring and estimation, surgical skill evaluation, and shared control strategies during teleoperation. Transformer models were first developed for Natural Language Processing (NLP) to model word sequences and soon the method gained popularity for general sequence modeling tasks. In this paper, we propose the novel use of a Transformer model for three tasks: gesture recognition, gesture prediction, and trajectory prediction during RAS. We modify the original Transformer architecture to be able to generate the current gesture sequence, future gesture sequence, and future trajectory sequence estimations using only the current kinematic data of the surgical robot end-effectors. We evaluate our proposed models on the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) and use Leave-One-User-Out (LOUO) cross-validation to ensure the generalizability of our results. Our models achieve up to 89.3% gesture recognition accuracy, 84.6% gesture prediction accuracy (1 second ahead) and 2.71mm trajectory prediction error (1 second ahead). Our models are comparable to and able to outperform state-of-the-art methods while using only the kinematic data channel. This approach can enable near-real time surgical activity recognition and prediction.

Joint Self-Supervised Image-Volume Representation Learning with Intra-Inter Contrastive Clustering

Authors: Duy M. H. Nguyen, Hoang Nguyen, Mai T. N. Truong, Tri Cao, Binh T. Nguyen, Nhat Ho, Paul Swoboda, Shadi Albarqouni, Pengtao Xie, Daniel Sonntag
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.01893
Pdf link: https://arxiv.org/pdf/2212.01893
Abstract Collecting large-scale medical datasets with fully annotated samples for training of deep networks is prohibitively expensive, especially for 3D volume data. Recent breakthroughs in self-supervised learning (SSL) offer the ability to overcome the lack of labeled training samples by learning feature representations from unlabeled data. However, most current SSL techniques in the medical field have been designed for either 2D images or 3D volumes. In practice, this restricts the capability to fully leverage unlabeled data from numerous sources, which may include both 2D and 3D data. Additionally, the use of these pre-trained networks is constrained to downstream tasks with compatible data dimensions. In this paper, we propose a novel framework for unsupervised joint learning on 2D and 3D data modalities. Given a set of 2D images or 2D slices extracted from 3D volumes, we construct an SSL task based on a 2D contrastive clustering problem for distinct classes. The 3D volumes are exploited by computing vectored embedding at each slice and then assembling a holistic feature through deformable self-attention mechanisms in Transformer, allowing incorporating long-range dependencies between slices inside 3D volumes. These holistic features are further utilized to define a novel 3D clustering agreement-based SSL task and masking embedding prediction inspired by pre-trained language models. Experiments on downstream tasks, such as 3D brain segmentation, lung nodule detection, 3D heart structures segmentation, and abnormal chest X-ray detection, demonstrate the effectiveness of our joint 2D and 3D SSL approach. We improve plain 2D Deep-ClusterV2 and SwAV by a significant margin and also surpass various modern 2D and 3D SSL approaches.

Learning Imbalanced Data with Vision Transformers

Authors: Zhengzhuo Xu, Ruikang Liu, Shuo Yang, Zenghao Chai, Chun Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2212.02015
Pdf link: https://arxiv.org/pdf/2212.02015
Abstract The real-world data tends to be heavily imbalanced and severely skew the data-driven deep neural networks, which makes Long-Tailed Recognition (LTR) a massive challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) with Long-Tailed (LT) data, while the off-the-shelf pretrain weight of ViTs always leads to unfair comparisons. In this paper, we systematically investigate the ViTs' performance in LTR and propose LiVT to train ViTs from scratch only with LT data. With the observation that ViTs suffer more severe LTR problems, we conduct Masked Generative Pretraining (MGP) to learn generalized features. With ample and solid evidence, we show that MGP is more robust than supervised manners. In addition, Binary Cross Entropy (BCE) loss, which shows conspicuous performance with ViTs, encounters predicaments in LTR. We further propose the balanced BCE to ameliorate it with strong theoretical groundings. Specially, we derive the unbiased extension of Sigmoid and compensate extra logit margins to deploy it. Our Bal-BCE contributes to the quick convergence of ViTs in just a few epochs. Extensive experiments demonstrate that with MGP and Bal-BCE, LiVT successfully trains ViTs well without any additional data and outperforms comparable state-of-the-art methods significantly, e.g., our ViT-B achieves 81.0% Top-1 accuracy in iNaturalist 2018 without bells and whistles. Code is available at https://github.com/XuZhengzhuo/LiVT.

Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer

Authors: Zhengbao Jiang, Luyu Gao, Jun Araki, Haibo Ding, Zhiruo Wang, Jamie Callan, Graham Neubig
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2212.02027
Pdf link: https://arxiv.org/pdf/2212.02027
Abstract Systems for knowledge-intensive tasks such as open-domain question answering (QA) usually consist of two stages: efficient retrieval of relevant documents from a large corpus and detailed reading of the selected documents to generate answers. Retrievers and readers are usually modeled separately, which necessitates a cumbersome implementation and is hard to train and adapt in an end-to-end fashion. In this paper, we revisit this design and eschew the separate architecture and training in favor of a single Transformer that performs Retrieval as Attention (ReAtt), and end-to-end training solely based on supervision from the end QA task. We demonstrate for the first time that a single model trained end-to-end can achieve both competitive retrieval and QA performance, matching or slightly outperforming state-of-the-art separately trained retrievers and readers. Moreover, end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings, making our model a simple and adaptable solution for knowledge-intensive tasks. Code and models are available at https://github.com/jzbjyb/ReAtt.

Inspired by Norbert Wiener: FeedBack Loop Network Learning Incremental Knowledge for Driver Attention Prediction and Beyond

Authors: Yilong Chen, Zhixiong Nan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.02096
Pdf link: https://arxiv.org/pdf/2212.02096
Abstract The problem of predicting driver attention from the driving perspective is gaining the increasing research focuses due to its remarkable significance for autonomous driving and assisted driving systems. Driving experience is extremely important for driver attention prediction, a skilled driver is able to effortlessly predict oncoming danger (before it becomes salient) based on driving experience and quickly pay attention on the corresponding zones. However, the nonobjective driving experience is difficult to model, so a mechanism simulating driver experience accumulation procedure is absent in existing methods, and the existing methods usually follow the technique line of saliency prediction methods to predict driver attention. In this paper, we propose a FeedBack Loop Network (FBLNet), which attempts to model the driving experience accumulation procedure. By over-and-over iterations, FBLNet generates the incremental knowledge that carries rich historically-accumulative long-term temporal information. The incremental knowledge to our model is like the driving experience to humans. Under the guidance of the incremental knowledge, our model fuses the CNN feature and Transformer feature that are extracted from the input image to predict driver attention. Our model exhibits solid advantage over existing methods, achieving an average 10.3% performance improvement on three public datasets.

Learning to Learn Better for Video Object Segmentation

Authors: Meng Lan, Jing Zhang, Lefei Zhang, Dacheng Tao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.02112
Pdf link: https://arxiv.org/pdf/2212.02112
Abstract Recently, the joint learning framework (JOINT) integrates matching based transductive reasoning and online inductive learning to achieve accurate and robust semi-supervised video object segmentation (SVOS). However, using the mask embedding as the label to guide the generation of target features in the two branches may result in inadequate target representation and degrade the performance. Besides, how to reasonably fuse the target features in the two different branches rather than simply adding them together to avoid the adverse effect of one dominant branch has not been investigated. In this paper, we propose a novel framework that emphasizes Learning to Learn Better (LLB) target features for SVOS, termed LLB, where we design the discriminative label generation module (DLGM) and the adaptive fusion module to address these issues. Technically, the DLGM takes the background-filtered frame instead of the target mask as input and adopts a lightweight encoder to generate the target features, which serves as the label of the online few-shot learner and the value of the decoder in the transformer to guide the two branches to learn more discriminative target representation. The adaptive fusion module maintains a learnable gate for each branch, which reweighs the element-wise feature representation and allows an adaptive amount of target information in each branch flowing to the fused target feature, thus preventing one branch from being dominant and making the target feature more robust to distractor. Extensive experiments on public benchmarks show that our proposed LLB method achieves state-of-the-art performance.

Perceive, Interact, Predict: Learning Dynamic and Static Clues for End-to-End Motion Prediction

Authors: Bo Jiang, Shaoyu Chen, Xinggang Wang, Bencheng Liao, Tianheng Cheng, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2212.02181
Pdf link: https://arxiv.org/pdf/2212.02181
Abstract Motion prediction is highly relevant to the perception of dynamic objects and static map elements in the scenarios of autonomous driving. In this work, we propose PIP, the first end-to-end Transformer-based framework which jointly and interactively performs online mapping, object detection and motion prediction. PIP leverages map queries, agent queries and mode queries to encode the instance-wise information of map elements, agents and motion intentions, respectively. Based on the unified query representation, a differentiable multi-task interaction scheme is proposed to exploit the correlation between perception and prediction. Even without human-annotated HD map or agent's historical tracking trajectory as guidance information, PIP realizes end-to-end multi-agent motion prediction and achieves better performance than tracking-based and HD-map-based methods. PIP provides comprehensive high-level information of the driving scene (vectorized static map and dynamic objects with motion information), and contributes to the downstream planning and control. Code and models will be released for facilitating further research.

3D-LatentMapper: View Agnostic Single-View Reconstruction of 3D Shapes

Authors: Alara Dirik, Pinar Yanardag
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2212.02184
Pdf link: https://arxiv.org/pdf/2212.02184
Abstract Computer graphics, 3D computer vision and robotics communities have produced multiple approaches to represent and generate 3D shapes, as well as a vast number of use cases. However, single-view reconstruction remains a challenging topic that can unlock various interesting use cases such as interactive design. In this work, we propose a novel framework that leverages the intermediate latent spaces of Vision Transformer (ViT) and a joint image-text representational model, CLIP, for fast and efficient Single View Reconstruction (SVR). More specifically, we propose a novel mapping network architecture that learns a mapping between deep features extracted from ViT and CLIP, and the latent space of a base 3D generative model. Unlike previous work, our method enables view-agnostic reconstruction of 3D shapes, even in the presence of large occlusions. We use the ShapeNetV2 dataset and perform extensive experiments with comparisons to SOTA methods to demonstrate our method's effectiveness.

Generalizable Person Re-Identification via Viewpoint Alignment and Fusion

Authors: Bingliang Jiao, Lingqiao Liu, Liying Gao, Guosheng Lin, Ruiqi Wu, Shizhou Zhang, Peng Wang, Yanning Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.02398
Pdf link: https://arxiv.org/pdf/2212.02398
Abstract In the current person Re-identification (ReID) methods, most domain generalization works focus on dealing with style differences between domains while largely ignoring unpredictable camera view change, which we identify as another major factor leading to a poor generalization of ReID methods. To tackle the viewpoint change, this work proposes to use a 3D dense pose estimation model and a texture mapping module to map the pedestrian images to canonical view images. Due to the imperfection of the texture mapping module, the canonical view images may lose the discriminative detail clues from the original images, and thus directly using them for ReID will inevitably result in poor performance. To handle this issue, we propose to fuse the original image and canonical view image via a transformer-based module. The key insight of this design is that the cross-attention mechanism in the transformer could be an ideal solution to align the discriminative texture clues from the original image with the canonical view image, which could compensate for the low-quality texture information of the canonical view image. Through extensive experiments, we show that our method can lead to superior performance over the existing approaches in various evaluation settings.

Solving the Weather4cast Challenge via Visual Transformers for 3D Images

Authors: Yury Belousov, Sergey Polezhaev, Brian Pulfer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Arxiv link: https://arxiv.org/abs/2212.02456
Pdf link: https://arxiv.org/pdf/2212.02456
Abstract Accurately forecasting the weather is an important task, as many real-world processes and decisions depend on future meteorological conditions. The NeurIPS 2022 challenge entitled Weather4cast poses the problem of predicting rainfall events for the next eight hours given the preceding hour of satellite observations as a context. Motivated by the recent success of transformer-based architectures in computer vision, we implement and propose two methodologies based on this architecture to tackle this challenge. We find that ensembling different transformers with some baseline models achieves the best performance we could measure on the unseen test data. Our approach has been ranked 3rd in the competition.

Meta-Learning Fast Weight Language Models

Authors: Kevin Clark, Kelvin Guu, Ming-Wei Chang, Panupong Pasupat, Geoffrey Hinton, Mohammad Norouzi
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2212.02475
Pdf link: https://arxiv.org/pdf/2212.02475
Abstract Dynamic evaluation of language models (LMs) adapts model parameters at test time using gradient information from previous tokens and substantially improves LM performance. However, it requires over 3x more compute than standard inference. We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently by expressing gradient updates as linear attention. A key improvement over dynamic evaluation is that FWLs can also be applied at training time so the model learns to make good use of gradient updates. FWLs can easily be added on top of existing transformer models, require relatively little extra compute or memory to run, and significantly improve language modeling perplexity.

Keyword: scene understanding

Prediction of Scene Plausibility

Authors: Or Nachmias, Ohad Fried, Ariel Shamir
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.01470
Pdf link: https://arxiv.org/pdf/2212.01470
Abstract Understanding the 3D world from 2D images involves more than detection and segmentation of the objects within the scene. It also includes the interpretation of the structure and arrangement of the scene elements. Such understanding is often rooted in recognizing the physical world and its limitations, and in prior knowledge as to how similar typical scenes are arranged. In this research we pose a new challenge for neural network (or other) scene understanding algorithms - can they distinguish between plausible and implausible scenes? Plausibility can be defined both in terms of physical properties and in terms of functional and typical arrangements. Hence, we define plausibility as the probability of encountering a given scene in the real physical world. We build a dataset of synthetic images containing both plausible and implausible scenes, and test the success of various vision models in the task of recognizing and understanding plausibility.

3D Object Aided Self-Supervised Monocular Depth Estimation

Authors: Songlin Wei, Guodong Chen, Wenzheng Chi, Zhenhua Wang, Lining Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2212.01768
Pdf link: https://arxiv.org/pdf/2212.01768
Abstract Monocular depth estimation has been actively studied in fields such as robot vision, autonomous driving, and 3D scene understanding. Given a sequence of color images, unsupervised learning methods based on the framework of Structure-From-Motion (SfM) simultaneously predict depth and camera relative pose. However, dynamically moving objects in the scene violate the static world assumption, resulting in inaccurate depths of dynamic objects. In this work, we propose a new method to address such dynamic object movements through monocular 3D object detection. Specifically, we first detect 3D objects in the images and build the per-pixel correspondence of the dynamic pixels with the detected object pose while leaving the static pixels corresponding to the rigid background to be modeled with camera motion. In this way, the depth of every pixel can be learned via a meaningful geometry model. Besides, objects are detected as cuboids with absolute scale, which is used to eliminate the scale ambiguity problem inherent in monocular vision. Experiments on the KITTI depth dataset show that our method achieves State-of-The-Art performance for depth estimation. Furthermore, joint training of depth, camera motion and object pose also improves monocular 3D object detection performance. To the best of our knowledge, this is the first work that allows a monocular 3D object detection network to be fine-tuned in a self-supervised manner.

Review on 6D Object Pose Estimation with the focus on Indoor Scene Understanding

Authors: Negar Nejatishahidin, Pooya Fayyazsanavi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2212.01920
Pdf link: https://arxiv.org/pdf/2212.01920
Abstract 6D object pose estimation problem has been extensively studied in the field of Computer Vision and Robotics. It has wide range of applications such as robot manipulation, augmented reality, and 3D scene understanding. With the advent of Deep Learning, many breakthroughs have been made; however, approaches continue to struggle when they encounter unseen instances, new categories, or real-world challenges such as cluttered backgrounds and occlusions. In this study, we will explore the available methods based on input modality, problem formulation, and whether it is a category-level or instance-level approach. As a part of our discussion, we will focus on how 6D object pose estimation can be used for understanding 3D scenes.

Framework for 2D Ad placements in LinearTV

Authors: Divya Bhargavi, Karan Sindwani, Sia Gholami
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2212.02450
Pdf link: https://arxiv.org/pdf/2212.02450
Abstract Virtual Product placement(VPP) is the advertising technique of digitally placing a branded object into the scene of a movie or TV show. This type of advertising provides the ability for brands to reach consumers without interrupting the viewing experience with a commercial break, as the products are seen in the background or as props. Despite this being a billion-dollar industry, ad rendering technique is currently executed at post production stage, manually either with the help of VFx artists or through semi-automated solutions. In this paper, we demonstrate a fully automated framework to digitally place 2-D ads in linear TV cooking shows captured using single-view camera with small camera movements. Without access to full video or production camera configuration, this framework performs the following tasks (i) identifying empty space for 2-D ad placement (ii) kitchen scene understanding (iii) occlusion handling (iv) ambient lighting and (v) ad tracking.

Keyword: visual reasoning

There is no result

Dec 06 '22 02:12 DongZhouGu

arxiv-daily arxiv-daily copied to clipboard

New submissions for Tue, 6 Dec 22

Keyword: human object interaction

Keyword: visual relation detection

Keyword: object detection

PROB: Probabilistic Objectness for Open World Object Detection

IDMS: Instance Depth for Multi-scale Monocular 3D Object Detection

Onboard Real-Time Multi-Sensor Pose Estimation for Indoor Quadrotor Navigation with Intermittent Communication

Synthesize Boundaries: A Boundary-aware Self-consistent Framework for Weakly Supervised Salient Object Detection

3D Object Aided Self-Supervised Monocular Depth Estimation

DA-CIL: Towards Domain Adaptive Class-Incremental 3D Object Detection

DL-SLOT: Dynamic LiDAR SLAM and object tracking based on collaborative graph optimization

YolOOD: Utilizing Object Detection Concepts for Out-of-Distribution Detection

Perceive, Interact, Predict: Learning Dynamic and Static Clues for End-to-End Motion Prediction

Real Time Incremental Image Mosaicking Without Use of Any Camera Parameter

Keyword: transformer

PROB: Probabilistic Objectness for Open World Object Detection

Learning a Pedestrian Social Behavior Dictionary

Compound Tokens: Channel Fusion for Vision-Language Representation Learning

Twitter Data Analysis: Izmir Earthquake Case

Box2Mask: Box-supervised Instance Segmentation via Level-set Evolution

Exploring Stochastic Autoregressive Image Modeling for Visual Representation

Global memory transformer for processing long documents

A Survey on Medical Document Summarization

Recognition and Prediction of Surgical Gestures and Trajectories Using Transformer Models in Robot-Assisted Surgery

Joint Self-Supervised Image-Volume Representation Learning with Intra-Inter Contrastive Clustering

Learning Imbalanced Data with Vision Transformers

Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer

Inspired by Norbert Wiener: FeedBack Loop Network Learning Incremental Knowledge for Driver Attention Prediction and Beyond

Learning to Learn Better for Video Object Segmentation

Perceive, Interact, Predict: Learning Dynamic and Static Clues for End-to-End Motion Prediction

3D-LatentMapper: View Agnostic Single-View Reconstruction of 3D Shapes

Generalizable Person Re-Identification via Viewpoint Alignment and Fusion

Solving the Weather4cast Challenge via Visual Transformers for 3D Images

Meta-Learning Fast Weight Language Models

Keyword: scene understanding

Prediction of Scene Plausibility

3D Object Aided Self-Supervised Monocular Depth Estimation

Review on 6D Object Pose Estimation with the focus on Indoor Scene Understanding

Framework for 2D Ad placements in LinearTV

Keyword: visual reasoning

arxiv-daily
arxiv-daily copied to clipboard