New submissions for Tue, 13 Sep 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Large-Field Contextual Feature Learning for Glass Detection

  • Authors: Haiyang Mei, Xin Yang, Letian Yu, Qiang Zhang, Xiaopeng Wei, Rynson W.H. Lau
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2209.04639
  • Pdf link: https://arxiv.org/pdf/2209.04639
  • Abstract Glass is very common in our daily life. Existing computer vision systems neglect it and thus may have severe consequences, e.g., a robot may crash into a glass wall. However, sensing the presence of glass is not straightforward. The key challenge is that arbitrary objects/scenes can appear behind the glass. In this paper, we propose an important problem of detecting glass surfaces from a single RGB image. To address this problem, we construct the first large-scale glass detection dataset (GDD) and propose a novel glass detection network, called GDNet-B, which explores abundant contextual cues in a large field-of-view via a novel large-field contextual feature integration (LCFI) module and integrates both high-level and low-level boundary features with a boundary feature enhancement (BFE) module. Extensive experiments demonstrate that our GDNet-B achieves satisfying glass detection results on the images within and beyond the GDD testing set. We further validate the effectiveness and generalization capability of our proposed GDNet-B by applying it to other vision tasks, including mirror segmentation and salient object detection. Finally, we show the potential applications of glass detection and discuss possible future research directions.

IR-LPR: Large Scale of Iranian License Plate Recognition Dataset

  • Authors: Mahdi Rahmani, Melika Sabaghian, Seyyede Mahila Moghadami, Mohammad Mohsen Talaie, Mahdi Naghibi, Mohammad Ali Keyvanrad
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2209.04680
  • Pdf link: https://arxiv.org/pdf/2209.04680
  • Abstract Object detection has always been practical. There are so many things in our world that recognizing them can not only increase our automatic knowledge of the surroundings, but can also be lucrative for those interested in starting a new business. One of these attractive objects is the license plate (LP). In addition to the security uses that license plate detection can have, it can also be used to create creative businesses. With the development of object detection methods based on deep learning models, an appropriate and comprehensive dataset becomes doubly important. But due to the frequent commercial use of license plate datasets, there are limited datasets not only in Iran but also in the world. The largest Iranian dataset for detecting license plates has 1,466 images. Also, the largest Iranian dataset for recognizing the characters of a license plate has 5,000 images. We have prepared a complete dataset including 20,967 car images along with all the detection annotations of the whole license plate and its characters, which can be useful for various purposes. Also, the total number of license plate images for the character recognition application is 27,745.

Towards Sparsification of Graph Neural Networks

  • Authors: Hongwu Peng, Deniz Gurevin, Shaoyi Huang, Tong Geng, Weiwen Jiang, Omer Khan, Caiwen Ding
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2209.04766
  • Pdf link: https://arxiv.org/pdf/2209.04766
  • Abstract As real-world graphs expand in size, larger GNN models with billions of parameters are deployed. High parameter count in such models makes training and inference on graphs expensive and challenging. To reduce the computational and memory costs of GNNs, optimization methods such as pruning the redundant nodes and edges in input graphs have been commonly adopted. However, model compression, which directly targets the sparsification of model layers, has been mostly limited to traditional Deep Neural Networks (DNNs) used for tasks such as image classification and object detection. In this paper, we utilize two state-of-the-art model compression methods, (1) train and prune and (2) sparse training, for the sparsification of weight layers in GNNs. We evaluate and compare the efficiency of both methods in terms of accuracy, training sparsity, and training FLOPs on real-world graphs. Our experimental results show that on the ia-email, wiki-talk, and stackoverflow datasets for link prediction, sparse training with much lower training FLOPs achieves a comparable accuracy with the train and prune method. On the brain dataset for node classification, sparse training uses a lower number of FLOPs (less than 1/7 of the FLOPs of the train and prune method) and preserves a much better accuracy performance under extreme model sparsity.

Multiple Object Tracking in Recent Times: A Literature Review

  • Authors: Mk Bashar, Samia Islam, Kashifa Kawaakib Hussain, Md. Bakhtiar Hasan, A.B.M. Ashikur Rahman, Md. Hasanul Kabir
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2209.04796
  • Pdf link: https://arxiv.org/pdf/2209.04796
  • Abstract Multiple object tracking gained a lot of interest from researchers in recent years, and it has become one of the trending problems in computer vision, especially with the recent advancement of autonomous driving. MOT is one of the critical vision tasks for different issues like occlusion in crowded scenes, similar appearance, small object detection difficulty, ID switching, etc. To tackle these challenges, as researchers tried to utilize the attention mechanism of transformer, interrelation of tracklets with graph convolutional neural network, appearance similarity of objects in different frames with the siamese network, they also tried simple IOU matching based CNN network, motion prediction with LSTM. To take these scattered techniques under an umbrella, we have studied more than a hundred papers published over the last three years and have tried to extract the techniques that are more focused on by researchers in recent times to solve the problems of MOT. We have enlisted numerous applications, possibilities, and how MOT can be related to real life. Our review has tried to show the different perspectives of techniques that researchers used over time and give some future direction for the potential researchers. Moreover, we have included popular benchmark datasets and metrics in this review.

Multi-modal Streaming 3D Object Detection

  • Authors: Mazen Abdelfattah, Kaiwen Yuan, Z. Jane Wang, Rabab Ward
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2209.04966
  • Pdf link: https://arxiv.org/pdf/2209.04966
  • Abstract Modern autonomous vehicles rely heavily on mechanical LiDARs for perception. Current perception methods generally require 360° point clouds, collected sequentially as the LiDAR scans the azimuth and acquires consecutive wedge-shaped slices. The acquisition latency of a full scan (~100 ms) may lead to outdated perception which is detrimental to safe operation. Recent streaming perception works proposed directly processing LiDAR slices and compensating for the narrow field of view (FoV) of a slice by reusing features from preceding slices. These works, however, are all based on a single modality and require past information which may be outdated. Meanwhile, images from high-frequency cameras can support streaming models as they provide a larger FoV compared to a LiDAR slice. However, this difference in FoV complicates sensor fusion. To address this research gap, we propose an innovative camera-LiDAR streaming 3D object detection framework that uses camera images instead of past LiDAR slices to provide an up-to-date, dense, and wide context for streaming perception. The proposed method outperforms prior streaming models on the challenging NuScenes benchmark. It also outperforms powerful full-scan detectors while being much faster. Our method is shown to be robust to missing camera images, narrow LiDAR slices, and small camera-LiDAR miscalibration.

Keyword: transformer

Gluformer: Transformer-Based Personalized Glucose Forecasting with Uncertainty Quantification

  • Authors: Renat Sergazinov, Mohammadreza Armandpour, Irina Gaynanova
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2209.04526
  • Pdf link: https://arxiv.org/pdf/2209.04526
  • Abstract Deep learning models achieve state-of-the-art results in predicting blood glucose trajectories, with a wide range of architectures being proposed. However, the adaptation of such models in clinical practice is slow, largely due to the lack of uncertainty quantification of provided predictions. In this work, we propose to model the future glucose trajectory conditioned on the past as an infinite mixture of basis distributions (i.e., Gaussian, Laplace, etc.). This change allows us to learn the uncertainty and predict more accurately in the cases when the trajectory has a heterogeneous or multi-modal distribution. To estimate the parameters of the predictive distribution, we utilize the Transformer architecture. We empirically demonstrate the superiority of our method over existing state-of-the-art techniques both in terms of accuracy and uncertainty on the synthetic and benchmark glucose data sets.

Multiple Object Tracking in Recent Times: A Literature Review

  • Authors: Mk Bashar, Samia Islam, Kashifa Kawaakib Hussain, Md. Bakhtiar Hasan, A.B.M. Ashikur Rahman, Md. Hasanul Kabir
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2209.04796
  • Pdf link: https://arxiv.org/pdf/2209.04796
  • Abstract Multiple object tracking gained a lot of interest from researchers in recent years, and it has become one of the trending problems in computer vision, especially with the recent advancement of autonomous driving. MOT is one of the critical vision tasks for different issues like occlusion in crowded scenes, similar appearance, small object detection difficulty, ID switching, etc. To tackle these challenges, as researchers tried to utilize the attention mechanism of transformer, interrelation of tracklets with graph convolutional neural network, appearance similarity of objects in different frames with the siamese network, they also tried simple IOU matching based CNN network, motion prediction with LSTM. To take these scattered techniques under an umbrella, we have studied more than a hundred papers published over the last three years and have tried to extract the techniques that are more focused on by researchers in recent times to solve the problems of MOT. We have enlisted numerous applications, possibilities, and how MOT can be related to real life. Our review has tried to show the different perspectives of techniques that researchers used over time and give some future direction for the potential researchers. Moreover, we have included popular benchmark datasets and metrics in this review.

Doctors vs. Nurses: Understanding the Great Divide in Vaccine Hesitancy among Healthcare Workers

  • Authors: Sajid Hussain Rafi Ahamed, Shahid Shakil, Hanjia Lyu, Xinping Zhang, Jiebo Luo
  • Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY)
  • Arxiv link: https://arxiv.org/abs/2209.04874
  • Pdf link: https://arxiv.org/pdf/2209.04874
  • Abstract Healthcare workers such as doctors and nurses are expected to be trustworthy and credible sources of vaccine-related information. Their opinions toward the COVID-19 vaccines may influence the vaccination uptake among the general population. However, vaccine hesitancy is still an important issue even among the healthcare workers. Therefore, it is critical to understand their opinions to help reduce the level of vaccine hesitancy. There have been studies examining healthcare workers' viewpoints on COVID-19 vaccines using questionnaires. Reportedly, a considerably higher proportion of vaccine hesitancy is observed among nurses, compared to doctors. We intend to verify and study this phenomenon at a much larger scale and in fine grain using social media data, which has been effectively and efficiently leveraged by researchers to address real-world issues during the COVID-19 pandemic. More specifically, we use a keyword search to identify healthcare workers and further classify them into doctors and nurses from the profile descriptions of the corresponding Twitter users. Moreover, we apply a transformer-based language model to remove irrelevant tweets. Sentiment analysis and topic modeling are employed to analyze and compare the sentiment and thematic differences in the tweets posted by doctors and nurses. We find that doctors are overall more positive toward the COVID-19 vaccines. The focuses of doctors and nurses when they discuss vaccines in a negative way are in general different. Doctors are more concerned with the effectiveness of the vaccines over newer variants while nurses pay more attention to the potential side effects on children. Therefore, we suggest that more customized strategies should be deployed when communicating with different groups of healthcare workers.

On The Computational Complexity of Self-Attention

  • Authors: Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, Chinmay Hegde
  • Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC)
  • Arxiv link: https://arxiv.org/abs/2209.04881
  • Pdf link: https://arxiv.org/pdf/2209.04881
  • Abstract Transformer architectures have led to remarkable progress in many state-of-the-art applications. However, despite their successes, modern transformers rely on the self-attention mechanism, whose time- and space-complexity is quadratic in the length of the input. Several approaches have been proposed to speed up self-attention mechanisms to achieve sub-quadratic running time; however, the large majority of these works are not accompanied by rigorous error guarantees. In this work, we establish lower bounds on the computational complexity of self-attention in a number of scenarios. We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false. This argument holds even if the attention computation is performed only approximately, and for a variety of attention mechanisms. As a complement to our lower bounds, we show that it is indeed possible to approximate dot-product self-attention using finite Taylor series in linear time, at the cost of having an exponential dependence on the polynomial order.
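
To make the quadratic cost mentioned in the abstract above concrete, here is a minimal pure-Python sketch of standard softmax dot-product attention. It is an illustration only, not code from the paper; the function name `self_attention` and the returned score counter are assumptions introduced for this example. Each of the n queries is compared against all n keys, so n × n scores are computed.

```python
import math

def self_attention(queries, keys, values):
    """Softmax dot-product attention over lists of equal-length float vectors.

    Returns the attended outputs and the number of query-key scores computed
    (the counter is a hypothetical addition to expose the n * n cost).
    """
    d = len(keys[0])
    outputs = []
    score_count = 0
    for q in queries:
        # One score per key: this inner pass is what makes the total cost n * n.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        score_count += len(scores)
        # Numerically stable softmax over the scores for this query.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs, score_count
```

Doubling the number of tokens quadruples `score_count`, which is the behaviour the lower bounds in the abstract concern.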

Instruction-driven history-aware policies for robotic manipulations

  • Authors: Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2209.04899
  • Pdf link: https://arxiv.org/pdf/2209.04899
  • Abstract In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions. Yet, robotic manipulation is extremely challenging as it requires fine-grained motor control, long-term memory as well as generalization to previously unseen tasks and environments. To address these challenges, we propose a unified transformer-based approach that takes into account multiple inputs. In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations while (iii) keeping track of the full history of observations and actions. Such an approach enables learning dependencies between history and instructions and improves manipulation precision using multiple views. We evaluate our method on the challenging RLBench benchmark and on a real-world robot. Notably, our approach scales to 74 diverse RLBench tasks and outperforms the state of the art. We also address instruction-conditioned tasks and demonstrate excellent generalization to previously unseen variations.

Transfer Learning and Vision Transformer based State-of-Health prediction of Lithium-Ion Batteries

  • Authors: Pengyu Fu, Liang Chu, Zhuoran Hou, Jincheng Hu, Yanjun Huang, Yuanjian Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2209.05253
  • Pdf link: https://arxiv.org/pdf/2209.05253
  • Abstract In recent years, significant progress has been made in transportation electrification, and lithium-ion batteries (LIB), as the main energy storage devices, have received widespread attention. Accurately predicting the state of health (SOH) can not only ease the anxiety of users about the battery life but also provide important information for the management of the battery. This paper presents a prediction method for SOH based on the Vision Transformer (ViT) model. First, discrete charging data of a predefined voltage range is used as an input data matrix. Then, the cycle features of the battery are captured by the ViT, which can obtain the global features, and the SOH is obtained by combining the cycle features with the fully connected (FC) layer. At the same time, transfer learning (TL) is introduced, and the prediction model based on source task battery training is further fine-tuned according to the early cycle data of the target task battery to provide an accurate prediction. Experiments show that our method can obtain better feature expression compared with existing deep learning methods so that better prediction effect and transfer effect can be achieved.

Deep Convolutional Pooling Transformer for Deepfake Detection

  • Authors: Tianyi Wang, Harry Cheng, Kam Pui Chow, Liqiang Nie
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2209.05299
  • Pdf link: https://arxiv.org/pdf/2209.05299
  • Abstract Recently, Deepfake has drawn considerable public attention due to security and privacy concerns in social media digital forensics. As the wildly spreading Deepfake videos on the Internet become more realistic, traditional detection techniques have failed in distinguishing between the real and fake. Most existing deep learning methods mainly focus on local features and relations within the face image using convolutional neural networks as a backbone. However, local features and relations are insufficient for model training to learn enough general information for Deepfake detection. Therefore, the existing Deepfake detection methods have reached a bottleneck to further improving the detection performance. To address this issue, we propose a deep convolutional Transformer to incorporate the decisive image features both locally and globally. Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance the efficacy. Moreover, we employ the barely discussed image keyframes in model training for performance improvement and visualize the feature quantity gap between the key and normal image frames caused by video compression. We finally illustrate the transferability with extensive experiments on several Deepfake benchmark datasets. The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.

FP8 Formats for Deep Learning

  • Authors: Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, Hao Wu
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2209.05433
  • Pdf link: https://arxiv.org/pdf/2209.05433
  • Abstract FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models. We also examine FP8 post-training-quantization of language models trained using 16-bit formats that resisted fixed point int8 quantization.
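
The two encodings described in the abstract above can be illustrated with a small decoder. This is a hedged sketch, not reference code from the paper: the function name `decode_fp8` is hypothetical, and the exponent biases (7 for E4M3, 15 for E5M2) are not stated in the abstract and are assumed here from the usual description of these formats. The special-value handling follows the abstract: E5M2 keeps IEEE 754-style infinities and NaNs, while E4M3 has no infinities and a single NaN mantissa pattern.

```python
import math

def decode_fp8(byte, fmt="E4M3"):
    """Decode one FP8 byte (0..255) into a Python float.

    Assumed layouts: 1 sign bit, then exponent bits, then mantissa bits.
    Biases (E4M3: 7, E5M2: 15) are assumptions, not given in the abstract.
    """
    if fmt == "E4M3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "E5M2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError("fmt must be 'E4M3' or 'E5M2'")

    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_all_ones = (1 << exp_bits) - 1
    man_all_ones = (1 << man_bits) - 1

    if fmt == "E5M2" and exp == exp_all_ones:
        # IEEE 754-style: zero mantissa is infinity, anything else is NaN.
        return sign * math.inf if man == 0 else math.nan
    if fmt == "E4M3" and exp == exp_all_ones and man == man_all_ones:
        # The single NaN pattern; other all-ones-exponent codes are normal numbers.
        return math.nan
    if exp == 0:
        # Zero and subnormals: no implicit leading 1.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)
```

Under these assumptions, `decode_fp8(0x7E, "E4M3")` gives 448.0 and `decode_fp8(0x7B, "E5M2")` gives 57344.0, the largest finite value of each encoding, which shows how giving up infinities extends the range of E4M3.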

Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

  • Authors: Mohit Shridhar, Lucas Manuelli, Dieter Fox
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2209.05451
  • Pdf link: https://arxiv.org/pdf/2209.05451
  • Abstract Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can we still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action". Unlike frameworks that operate on 2D images, the voxelized observation and action space provides a strong structural prior for efficiently learning 6-DoF policies. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.

Large-scale Evaluation of Transformer-based Article Encoders on the Task of Citation Recommendation

  • Authors: Zoran Medić, Jan Šnajder
  • Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2209.05452
  • Pdf link: https://arxiv.org/pdf/2209.05452
  • Abstract Recently introduced transformer-based article encoders (TAEs) designed to produce similar vector representations for mutually related scientific articles have demonstrated strong performance on benchmark datasets for scientific article recommendation. However, the existing benchmark datasets are predominantly focused on single domains and, in some cases, contain easy negatives in small candidate pools. Evaluating representations on such benchmarks might obscure the realistic performance of TAEs in setups with thousands of articles in candidate pools. In this work, we evaluate TAEs on large benchmarks with more challenging candidate pools. We compare the performance of TAEs with a lexical retrieval baseline model BM25 on the task of citation recommendation, where the model produces a list of recommendations for citing in a given input article. We find out that BM25 is still very competitive with the state-of-the-art neural retrievers, a finding which is surprising given the strong performance of TAEs on small benchmarks. As a remedy for the limitations of the existing benchmarks, we propose a new benchmark dataset for evaluating scientific article representations: Multi-Domain Citation Recommendation dataset (MDCR), which covers different scientific fields and contains challenging candidate pools.
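
For readers unfamiliar with the lexical baseline named in the abstract above, here is a minimal sketch of BM25 scoring over tokenized documents. It is not the authors' setup: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant (with `+ 1` inside the logarithm) are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    k1 and b are common default values (an assumption, not from the abstract).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency; always positive.
            idf = math.log((n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturated by k1 and normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Ranking candidate articles by these scores gives a purely lexical retriever of the kind the abstract compares against neural encoders.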

Keyword: scene understanding

A Review on Visual-SLAM: Advancements from Geometric Modelling to Learning-based Semantic Scene Understanding

  • Authors: Tin Lai
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2209.05222
  • Pdf link: https://arxiv.org/pdf/2209.05222
  • Abstract Simultaneous Localisation and Mapping (SLAM) is one of the fundamental problems in autonomous mobile robots where a robot needs to reconstruct a previously unseen environment while simultaneously localising itself with respect to the map. In particular, Visual-SLAM uses various sensors from the mobile robot for collecting and sensing a representation of the map. Traditionally, geometric model-based techniques were used to tackle the SLAM problem, which tends to be error-prone under challenging environments. Recent advancements in computer vision, such as deep learning techniques, have provided a data-driven approach to tackle the Visual-SLAM problem. This review summarises recent advancements in the Visual-SLAM domain using various learning-based methods. We begin by providing a concise overview of the geometric model-based approaches, followed by technical reviews on the current paradigms in SLAM. Then, we present the various learning-based approaches to collecting sensory inputs from mobile robots and performing scene understanding. The current paradigms in deep-learning-based semantic understanding are discussed and placed under the context of Visual-SLAM. Finally, we discuss challenges and further opportunities in the direction of learning-based approaches in Visual-SLAM.

Holistic Segmentation

  • Authors: Stefano Gasperini, Frithjof Winkelmann, Alvaro Marcos-Ramiro, Micheal Schmidt, Nassir Navab, Benjamin Busam, Federico Tombari
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2209.05407
  • Pdf link: https://arxiv.org/pdf/2209.05407
  • Abstract As panoptic segmentation provides a prediction for every pixel in input, non-standard and unseen objects systematically lead to wrong outputs. However, in safety-critical settings, robustness against out-of-distribution samples and corner cases is crucial to avoid dangerous behaviors, such as ignoring an animal or a lost cargo on the road. Since driving datasets cannot contain enough data points to properly sample the long tail of the underlying distribution, a method must deal with unknown and unseen scenarios to be deployed safely. Previous methods targeted part of this issue, by re-identifying already seen unlabeled objects. In this work, we broaden the scope proposing holistic segmentation: a task to identify and separate unseen unknown objects into instances, without learning from unknowns, while performing panoptic segmentation of known classes. We tackle this new problem with U3HS, which first finds unknowns as highly uncertain regions, then clusters the corresponding instance-aware embeddings into individual objects. By doing so, for the first time in panoptic segmentation with unknown objects, our U3HS is not trained with unknown data, thus leaving the settings unconstrained with respect to the type of objects and allowing for a holistic scene understanding. Extensive experiments and comparisons on two public datasets, namely Cityscapes and Lost&Found as a transfer, demonstrate the effectiveness of U3HS in the challenging task of holistic segmentation, with competitive closed-set panoptic segmentation performance.

Keyword: visual reasoning

There is no result

DongZhouGu, Sep 13 '22 04:09