arxiv-daily
arxiv-daily copied to clipboard
New submissions for Fri, 9 Sep 22
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
SANIP: Shopping Assistant and Navigation for the visually impaired
- Authors: Shubham Deshmukh, Favin Fernandes, Amey Chavan, Monali Ahire, Devashri Borse, Jyoti Madake
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.03570
- Pdf link: https://arxiv.org/pdf/2209.03570
- Abstract The proposed shopping assistant model SANIP is going to help blind persons to detect hand held objects and also to get a video feedback of the information retrieved from the detected and recognized objects. The proposed model consists of three python models i.e. Custom Object Detection, Text Detection and Barcode detection. For object detection of the hand held object, we have created our own custom dataset that comprises daily goods such as Parle-G, Tide, and Lays. Other than that we have also collected images of Cart and Exit signs as it is essential for any person to use a cart and also notice the exit sign in case of emergency. For the other 2 models proposed the text and barcode information retrieved is converted from text to speech and relayed to the Blind person. The model was used to detect objects that were trained on and was successful in detecting and recognizing the desired output with a good accuracy and precision.
nVFNet-RDC: Replay and Non-Local Distillation Collaboration for Continual Object Detection
- Authors: Jinxiang Lai, Wenlong Liu, Jun Liu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.03603
- Pdf link: https://arxiv.org/pdf/2209.03603
- Abstract Continual Learning (CL) focuses on developing algorithms with the ability to adapt to new environments and learn new skills. This very challenging task has generated a lot of interest in recent years, with new solutions appearing rapidly. In this paper, we propose a nVFNet-RDC approach for continual object detection. Our nVFNet-RDC consists of teacher-student models, and adopts replay and feature distillation strategies. As the 1st place solutions, we achieve 55.94% and 54.65% average mAP on the 3rd CLVision Challenge Track 2 and Track 3, respectively.
Enabling Connectivity for Automated Mobility: A Novel MQTT-based Interface Evaluated in a 5G Case Study on Edge-Cloud Lidar Object Detection
- Authors: Lennart Reiher, Bastian Lampe, Timo Woopen, Raphael van Kempen, Till Beemelmanns, Lutz Eckstein
- Subjects: Robotics (cs.RO); Networking and Internet Architecture (cs.NI)
- Arxiv link: https://arxiv.org/abs/2209.03630
- Pdf link: https://arxiv.org/pdf/2209.03630
- Abstract Enabling secure and reliable high-bandwidth lowlatency connectivity between automated vehicles and external servers, intelligent infrastructure, and other road users is a central step in making fully automated driving possible. The availability of data interfaces, which allow this kind of connectivity, has the potential to distinguish artificial agents' capabilities in connected, cooperative, and automated mobility systems from the capabilities of human operators, who do not possess such interfaces. Connected agents can for example share data to build collective environment models, plan collective behavior, and learn collectively from the shared data that is centrally combined. This paper presents multiple solutions that allow connected entities to exchange data. In particular, we propose a new universal communication interface which uses the Message Queuing Telemetry Transport (MQTT) protocol to connect agents running the Robot Operating System (ROS). Our work integrates methods to assess the connection quality in the form of various key performance indicators in real-time. We compare a variety of approaches that provide the connectivity necessary for the exemplary use case of edge-cloud lidar object detection in a 5G network. We show that the mean latency between the availability of vehicle-based sensor measurements and the reception of a corresponding object list from the edge-cloud is below 87 ms. All implemented solutions are made open-source and free to use. Source code is available at https://github.com/ika-rwth-aachen/ros-v2x-benchmarking-suite.
Exploring Target Representations for Masked Autoencoders
- Authors: Xingbin Liu, Jinghao Zhou, Tao Kong, Xianming Lin, Rongrong Ji
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.03917
- Pdf link: https://arxiv.org/pdf/2209.03917
- Abstract Masked autoencoders have become popular training paradigms for self-supervised visual representation learning. These models randomly mask a portion of the input and reconstruct the masked portion according to the target representations. In this paper, we first show that a careful choice of the target representation is unnecessary for learning good representations, since different targets tend to derive similarly behaved models. Driven by this observation, we propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher, enabling us to effectively train high-capacity models without any efforts to carefully design target representations. Interestingly, we further explore using teachers of larger capacity, obtaining distilled students with remarkable transferring ability. On different tasks of classification, transfer learning, object detection, and semantic segmentation, the proposed method to perform masked knowledge distillation with bootstrapped teachers (dBOT) outperforms previous self-supervised methods by nontrivial margins. We hope our findings, as well as the proposed method, could motivate people to rethink the roles of target representations in pre-training masked autoencoders.
Keyword: transformer
Securing the Spike: On the Transferabilty and Security of Spiking Neural Networks to Adversarial Examples
- Authors: Nuo Xu, Kaleel Mahmood, Haowen Fang, Ethan Rathbun, Caiwen Ding, Wujie Wen
- Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.03358
- Pdf link: https://arxiv.org/pdf/2209.03358
- Abstract Spiking neural networks (SNNs) have attracted much attention for their high energy efficiency and for recent advances in their classification performance. However, unlike traditional deep learning approaches, the analysis and study of the robustness of SNNs to adversarial examples remains relatively underdeveloped. In this work we advance the field of adversarial machine learning through experimentation and analyses of three important SNN security attributes. First, we show that successful white-box adversarial attacks on SNNs are highly dependent on the underlying surrogate gradient technique. Second, we analyze the transferability of adversarial examples generated by SNNs and other state-of-the-art architectures like Vision Transformers and Big Transfer CNNs. We demonstrate that SNNs are not often deceived by adversarial examples generated by Vision Transformers and certain types of CNNs. Lastly, we develop a novel white-box attack that generates adversarial examples capable of fooling both SNN models and non-SNN models simultaneously. Our experiments and analyses are broad and rigorous covering two datasets (CIFAR-10 and CIFAR-100), five different white-box attacks and twelve different classifier models.
AILAB-Udine@SMM4H 22: Limits of Transformers and BERT Ensembles
- Authors: Beatrice Portelli, Simone Scaboro, Emmanuele Chersoni, Enrico Santus, Giuseppe Serra
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.03452
- Pdf link: https://arxiv.org/pdf/2209.03452
- Abstract This paper describes the models developed by the AILAB-Udine team for the SMM4H 22 Shared Task. We explored the limits of Transformer based models on text classification, entity extraction and entity normalization, tackling Tasks 1, 2, 5, 6 and 10. The main take-aways we got from participating in different tasks are: the overwhelming positive effects of combining different architectures when using ensemble learning, and the great potential of generative models for term normalization.
CLaCLab at SocialDisNER: Using Medical Gazetteers for Named-Entity Recognition of Disease Mentions in Spanish Tweets
- Authors: Harsh Verma, Parsa Bagherzadeh, Sabine Bergler
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.03528
- Pdf link: https://arxiv.org/pdf/2209.03528
- Abstract This paper summarizes the CLaC submission for SMM4H 2022 Task 10 which concerns the recognition of diseases mentioned in Spanish tweets. Before classifying each token, we encode each token with a transformer encoder using features from Multilingual RoBERTa Large, UMLS gazetteer, and DISTEMIST gazetteer, among others. We obtain a strict F1 score of 0.869, with competition mean of 0.675, standard deviation of 0.245, and median of 0.761.
Video Vision Transformers for Violence Detection
- Authors: Sanskar Singh, Shivaibhav Dewangan, Ghanta Sai Krishna, Vandit Tyagi, Sainath Reddy
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2209.03561
- Pdf link: https://arxiv.org/pdf/2209.03561
- Abstract Law enforcement and city safety are significantly impacted by detecting violent incidents in surveillance systems. Although modern (smart) cameras are widely available and affordable, such technological solutions are impotent in most instances. Furthermore, personnel monitoring CCTV recordings frequently show a belated reaction, resulting in the potential cause of catastrophe to people and property. Thus automated detection of violence for swift actions is very crucial. The proposed solution uses a novel end-to-end deep learning-based video vision transformer (ViViT) that can proficiently discern fights, hostile movements, and violent events in video sequences. The study presents utilizing a data augmentation strategy to overcome the downside of weaker inductive biasness while training vision transformers on a smaller training datasets. The evaluated results can be subsequently sent to local concerned authority, and the captured video can be analyzed. In comparison to state-of-theart (SOTA) approaches the proposed method achieved auspicious performance on some of the challenging benchmark datasets.
Multi-Granularity Prediction for Scene Text Recognition
- Authors: Peng Wang, Cheng Da, Cong Yao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.03592
- Pdf link: https://arxiv.org/pdf/2209.03592
- Abstract Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been successively proposed and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, i.e. , subword representations (BPE and WordPiece) widely-used in NLP are introduced into the output space, in addition to the conventional character level representation, while no independent language model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the performance envelop of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks. Code will be released soon.
Levenshtein OCR
- Authors: Cheng Da, Peng Wang, Cong Yao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.03594
- Pdf link: https://arxiv.org/pdf/2209.03594
- Abstract A novel scene text recognizer based on Vision-Language Transformer (VLT) is presented. Inspired by Levenshtein Transformer in the area of NLP, the proposed method (named Levenshtein OCR, and LevOCR for short) explores an alternative way for automatically transcribing textual content from cropped natural images. Specifically, we cast the problem of scene text recognition as an iterative sequence refinement process. The initial prediction sequence produced by a pure vision model is encoded and fed into a cross-modal transformer to interact and fuse with the visual features, to progressively approximate the ground truth. The refinement process is accomplished via two basic character-level operations: deletion and insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change and good interpretability. The quantitative experiments clearly demonstrate that LevOCR achieves state-of-the-art performances on standard benchmarks and the qualitative analyses verify the effectiveness and advantage of the proposed LevOCR algorithm. Code will be released soon.
Towards explainable evaluation of language models on the semantic similarity of visual concepts
- Authors: Maria Lymperaiou, George Manoliadis, Orfeas Menis Mastromichalakis, Edmund G. Dervakos, Giorgos Stamou
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2209.03723
- Pdf link: https://arxiv.org/pdf/2209.03723
- Abstract Recent breakthroughs in NLP research, such as the advent of Transformer models have indisputably contributed to major advancements in several tasks. However, few works research robustness and explainability issues of their evaluation strategies. In this work, we examine the behavior of high-performing pre-trained language models, focusing on the task of semantic similarity for visual vocabularies. First, we address the need for explainable evaluation metrics, necessary for understanding the conceptual quality of retrieved instances. Our proposed metrics provide valuable insights in local and global level, showcasing the inabilities of widely used approaches. Secondly, adversarial interventions on salient query semantics expose vulnerabilities of opaque metrics and highlight patterns in learned linguistic representations.
Prior Knowledge-Guided Attention in Self-Supervised Vision Transformers
- Authors: Kevin Miao, Akash Gokul, Raghav Singh, Suzanne Petryk, Joseph Gonzalez, Kurt Keutzer, Trevor Darrell, Colorado Reed
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.03745
- Pdf link: https://arxiv.org/pdf/2209.03745
- Abstract Recent trends in self-supervised representation learning have focused on removing inductive biases from training pipelines. However, inductive biases can be useful in settings when limited data are available or provide additional insight into the underlying data distribution. We present spatial prior attention (SPAN), a framework that takes advantage of consistent spatial and semantic structure in unlabeled image datasets to guide Vision Transformer attention. SPAN operates by regularizing attention masks from separate transformer heads to follow various priors over semantic regions. These priors can be derived from data statistics or a single labeled sample provided by a domain expert. We study SPAN through several detailed real-world scenarios, including medical image analysis and visual quality assurance. We find that the resulting attention masks are more interpretable than those derived from domain-agnostic pretraining. SPAN produces a 58.7 mAP improvement for lung and heart segmentation. We also find that our method yields a 2.2 mAUC improvement compared to domain-agnostic pretraining when transferring the pretrained model to a downstream chest disease classification task. Lastly, we show that SPAN pretraining leads to higher downstream classification performance in low-data regimes compared to domain-agnostic pretraining.
Automatic fetal fat quantification from MRI
- Authors: Netanell Avisdris, Aviad Rabinowich, Daniel Fridkin, Ayala Zilberman, Sapir Lazar, Jacky Herzlich, Zeev Hananis, Daphna Link-Sourani, Liat Ben-Sira, Liran Hiersch, Dafna Ben Bashat, Leo Joskowicz
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.03748
- Pdf link: https://arxiv.org/pdf/2209.03748
- Abstract Normal fetal adipose tissue (AT) development is essential for perinatal well-being. AT, or simply fat, stores energy in the form of lipids. Malnourishment may result in excessive or depleted adiposity. Although previous studies showed a correlation between the amount of AT and perinatal outcome, prenatal assessment of AT is limited by lacking quantitative methods. Using magnetic resonance imaging (MRI), 3D fat- and water-only images of the entire fetus can be obtained from two point Dixon images to enable AT lipid quantification. This paper is the first to present a methodology for developing a deep learning based method for fetal fat segmentation based on Dixon MRI. It optimizes radiologists' manual fetal fat delineation time to produce annotated training dataset. It consists of two steps: 1) model-based semi-automatic fetal fat segmentations, reviewed and corrected by a radiologist; 2) automatic fetal fat segmentation using DL networks trained on the resulting annotated dataset. Three DL networks were trained. We show a significant improvement in segmentation times (3:38 hours to < 1 hour) and observer variability (Dice of 0.738 to 0.906) compared to manual segmentation. Automatic segmentation of 24 test cases with the 3D Residual U-Net, nn-UNet and SWIN-UNetR transformer networks yields a mean Dice score of 0.863, 0.787 and 0.856, respectively. These results are better than the manual observer variability, and comparable to automatic adult and pediatric fat segmentation. A radiologist reviewed and corrected six new independent cases segmented using the best performing network, resulting in a Dice score of 0.961 and a significantly reduced correction time of 15:20 minutes. Using these novel segmentation methods and short MRI acquisition time, whole body subcutaneous lipids can be quantified for individual fetuses in the clinic and large-cohort research.
Applying Transformer-based Text Summarization for Keyphrase Generation
- Authors: Anna Glazkova, Dmitry Morozov
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.03791
- Pdf link: https://arxiv.org/pdf/2209.03791
- Abstract Keyphrases are crucial for searching and systematizing scholarly documents. Most current methods for keyphrase extraction are aimed at the extraction of the most significant words in the text. But in practice, the list of keyphrases often includes words that do not appear in the text explicitly. In this case, the list of keyphrases represents an abstractive summary of the source text. In this paper, we experiment with popular transformer-based models for abstractive text summarization using four benchmark datasets for keyphrase extraction. We compare the results obtained with the results of common unsupervised and supervised methods for keyphrase extraction. Our evaluation shows that summarization models are quite effective in generating keyphrases in the terms of the full-match F1-score and BERTScore. However, they produce a lot of words that are absent in the author's list of keyphrases, which makes summarization models ineffective in terms of ROUGE-1. We also investigate several ordering strategies to concatenate target keyphrases. The results showed that the choice of strategy affects the performance of keyphrase generation.
Pre-Training a Graph Recurrent Network for Language Representation
- Authors: Yile Wang, Linyi Yang, Zhiyang Teng, Ming Zhou, Yue Zhang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.03834
- Pdf link: https://arxiv.org/pdf/2209.03834
- Abstract Transformer-based pre-trained models have gained much advance in recent years, becoming one of the most important backbones in natural language processing. Recent work shows that the attention mechanism inside Transformer may not be necessary, both convolutional neural networks and multi-layer perceptron based models have also been investigated as Transformer alternatives. In this paper, we consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communications, together with a sentence-level representation decoupled from other tokens. The original model performs well in domain-specific text classification under supervised training, however, its potential in learning transfer knowledge by self-supervised way has not been fully exploited. We fill this gap by optimizing the architecture and verifying its effectiveness in more general language understanding tasks, for both English and Chinese languages. As for model efficiency, instead of the quadratic complexity in Transformer-based models, our model has linear complexity and performs more efficiently during inference. Moreover, we find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.
Transformer based Fingerprint Feature Extraction
- Authors: Saraansh Tandon, Anoop Namboodiri
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.03846
- Pdf link: https://arxiv.org/pdf/2209.03846
- Abstract Fingerprint feature extraction is a task that is solved using either a global or a local representation. State-of-the-art global approaches use heavy deep learning models to process the full fingerprint image at once, which makes the corresponding approach memory intensive. On the other hand, local approaches involve minutiae based patch extraction, multiple feature extraction steps and an expensive matching stage, which make the corresponding approach time intensive. However, both these approaches provide useful and sometimes exclusive insights for solving the problem. Using both approaches together for extracting fingerprint representations is semantically useful but quite inefficient. Our convolutional transformer based approach with an in-built minutiae extractor provides a time and memory efficient solution to extract a global as well as a local representation of the fingerprint. The use of these representations along with a smart matching process gives us state-of-the-art performance across multiple databases. The project page can be found at https://saraansh1999.github.io/global-plus-local-fp-transformer.
Transformer-based classification of premise in tweets related to COVID-19
- Authors: Vadim Porvatov, Natalia Semenova
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.03851
- Pdf link: https://arxiv.org/pdf/2209.03851
- Abstract Automation of social network data assessment is one of the classic challenges of natural language processing. During the COVID-19 pandemic, mining people's stances from public messages have become crucial regarding understanding attitudes towards health orders. In this paper, the authors propose the predictive model based on transformer architecture to classify the presence of premise in Twitter texts. This work is completed as part of the Social Media Mining for Health (SMM4H) Workshop 2022. We explored modern transformer-based classifiers in order to construct the pipeline efficiently capturing tweets semantics. Our experiments on a Twitter dataset showed that RoBERTa is superior to the other transformer models in the case of the premise prediction task. The model achieved competitive performance with respect to ROC AUC value 0.807, and 0.7648 for the F1 score.
W-Transformers : A Wavelet-based Transformer Framework for Univariate Time Series Forecasting
- Authors: Lena Sasal, Tanujit Chakraborty, Abdenour Hadid
- Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Signal Processing (eess.SP); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2209.03945
- Pdf link: https://arxiv.org/pdf/2209.03945
- Abstract Deep learning utilizing transformers has recently achieved a lot of success in many vital areas such as natural language processing, computer vision, anomaly detection, and recommendation systems, among many others. Among several merits of transformers, the ability to capture long-range temporal dependencies and interactions is desirable for time series forecasting, leading to its progress in various time series applications. In this paper, we build a transformer model for non-stationary time series. The problem is challenging yet crucially important. We present a novel framework for univariate time series representation learning based on the wavelet-based transformer encoder architecture and call it W-Transformer. The proposed W-Transformers utilize a maximal overlap discrete wavelet transformation (MODWT) to the time series data and build local transformers on the decomposed datasets to vividly capture the nonstationarity and long-range nonlinear dependencies in the time series. Evaluating our framework on several publicly available benchmark time series datasets from various domains and with diverse characteristics, we demonstrate that it performs, on average, significantly better than the baseline forecasters for short-term and long-term forecasting, even for datasets that consist of only a few hundred training samples.
Keyword: scene understanding
There is no result
Keyword: visual reasoning
There is no result