arxiv-daily
arxiv-daily copied to clipboard
New submissions for Mon, 26 Sep 22
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
TeST: Test-time Self-Training under Distribution Shift
- Authors: Samarth Sinha, Peter Gehler, Francesco Locatello, Bernt Schiele
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.11459
- Pdf link: https://arxiv.org/pdf/2209.11459
- Abstract Despite their recent success, deep neural networks continue to perform poorly when they encounter distribution shifts at test time. Many recently proposed approaches try to counter this by aligning the model to the new distribution prior to inference. With no labels available this requires unsupervised objectives to adapt the model on the observed test data. In this paper, we propose Test-Time Self-Training (TeST): a technique that takes as input a model trained on some source data and a novel data distribution at test time, and learns invariant and robust representations using a student-teacher framework. We find that models adapted using TeST significantly improve over baseline test-time adaptation algorithms. TeST achieves competitive performance to modern domain adaptation algorithms, while having access to 5-10x less data at time of adaption. We thoroughly evaluate a variety of baselines on two tasks: object detection and image segmentation and find that models adapted with TeST. We find that TeST sets the new state-of-the art for test-time domain adaptation algorithms.
UAV-miniUGV Hybrid System for Hidden Area Exploration and Manipulation
- Authors: Durgakant Pushp, Swapnil Kalhapure, Kaushik Das, Lantao Liu
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2209.11704
- Pdf link: https://arxiv.org/pdf/2209.11704
- Abstract We propose a novel hybrid system (both hardware and software) of an Unmanned Aerial Vehicle (UAV) carrying a miniature Unmanned Ground Vehicle (miniUGV) to perform a complex search and manipulation task. This system leverages heterogeneous robots to accomplish a task that cannot be done using a single robot system. It enables the UAV to explore a hidden space with a narrow opening through which the miniUGV can easily enter and escape. The hidden space is assumed to be navigable for the miniUGV. The miniUGV uses Infrared (IR) sensors and a monocular camera to search for an object in the hidden space. The proposed system takes advantage of a wider field of view (fov) of the camera as well as the stochastic nature of the object detection algorithms to guide the miniUGV in the hidden space to find the object. Upon finding the object the miniUGV grabs it using visual servoing and then returns back to its start point from where the UAV retracts it back and transports the object to a safe place. In case there is no object found in the hidden space, UAV continues the aerial search. The tethered miniUGV gives the UAV an ability to act beyond its reach and perform a search and manipulation task which was not possible before for any of the robots individually. The system has a wide range of applications and we have demonstrated its feasibility through repetitive experiments.
Keyword: transformer
XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages
- Authors: Shivprasad Sagare, Tushar Abhishek, Bhavyajeet Singh, Anubhav Sharma, Manish Gupta, Vasudeva Varma
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.11252
- Pdf link: https://arxiv.org/pdf/2209.11252
- Abstract Multiple business scenarios require an automated generation of descriptive human-readable text from structured input data. Hence, fact-to-text generation systems have been developed for various downstream tasks like generating soccer reports, weather and financial reports, medical reports, person biographies, etc. Unfortunately, previous work on fact-to-text (F2T) generation has focused primarily on English mainly due to the high availability of relevant datasets. Only recently, the problem of cross-lingual fact-to-text (XF2T) was proposed for generation across multiple languages alongwith a dataset, XALIGN for eight languages. However, there has been no rigorous work on the actual XF2T generation problem. We extend XALIGN dataset with annotated data for four more languages: Punjabi, Malayalam, Assamese and Oriya. We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset, which we call XALIGNV2. Further, we investigate the performance of different text generation strategies: multiple variations of pretraining, fact-aware embeddings and structure-aware input encoding. Our extensive experiments show that a multi-lingual mT5 model which uses fact-aware embeddings with structure-aware input encoding leads to best results on average across the twelve languages. We make our code, dataset and model publicly available, and hope that this will help advance further research in this critical area.
3DPCT: 3D Point Cloud Transformer with Dual Self-attention
- Authors: Dening Lu, Kyle Gao, Qian Xie, Linlin Xu, Jonathan Li
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.11255
- Pdf link: https://arxiv.org/pdf/2209.11255
- Abstract Transformers have resulted in remarkable achievements in the field of image processing. Inspired by this great success, the application of Transformers to 3D point cloud processing has drawn more and more attention. This paper presents a novel point cloud representational learning network, 3D Point Cloud Transformer with Dual Self-attention (3DPCT) and an encoder-decoder structure. Specifically, 3DPCT has a hierarchical encoder, which contains two local-global dual-attention modules for the classification task (three modules for the segmentation task), with each module consisting of a Local Feature Aggregation (LFA) block and a Global Feature Learning (GFL) block. The GFL block is dual self-attention, with both point-wise and channel-wise self-attention to improve feature extraction. Moreover, in LFA, to better leverage the local information extracted, a novel point-wise self-attention model, named as Point-Patch Self-Attention (PPSA), is designed. The performance is evaluated on both classification and segmentation datasets, containing both synthetic and real-world data. Extensive experiments demonstrate that the proposed method achieved state-of-the-art results on both classification and segmentation tasks.
Colonoscopy Landmark Detection using Vision Transformers
- Authors: Aniruddha Tamhane, Tse'ela Mida, Erez Posner, Moshe Bouhnik
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.11304
- Pdf link: https://arxiv.org/pdf/2209.11304
- Abstract Colonoscopy is a routine outpatient procedure used to examine the colon and rectum for any abnormalities including polyps, diverticula and narrowing of colon structures. A significant amount of the clinician's time is spent in post-processing snapshots taken during the colonoscopy procedure, for maintaining medical records or further investigation. Automating this step can save time and improve the efficiency of the process. In our work, we have collected a dataset of 120 colonoscopy videos and 2416 snapshots taken during the procedure, that have been annotated by experts. Further, we have developed a novel, vision-transformer based landmark detection algorithm that identifies key anatomical landmarks (the appendiceal orifice, ileocecal valve/cecum landmark and rectum retroflexion) from snapshots taken during colonoscopy. Our algorithm uses an adaptive gamma correction during preprocessing to maintain a consistent brightness for all images. We then use a vision transformer as the feature extraction backbone and a fully connected network based classifier head to categorize a given frame into four classes: the three landmarks or a non-landmark frame. We compare the vision transformer (ViT-B/16) backbone with ResNet-101 and ConvNext-B backbones that have been trained similarly. We report an accuracy of 82% with the vision transformer backbone on a test dataset of snapshots.
Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration
- Authors: Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2209.11345
- Pdf link: https://arxiv.org/pdf/2209.11345
- Abstract Compression plays an important role on the efficient transmission and storage of images and videos through band-limited systems such as streaming services, virtual reality or videogames. However, compression unavoidably leads to artifacts and the loss of the original information, which may severely degrade the visual quality. For these reasons, quality enhancement of compressed images has become a popular research topic. While most state-of-the-art image restoration methods are based on convolutional neural networks, other transformers-based methods such as SwinIR, show impressive performance on these tasks. In this paper, we explore the novel Swin Transformer V2, to improve SwinIR for image super-resolution, and in particular, the compressed input scenario. Using this method we can tackle the major issues in training transformer vision models, such as training instability, resolution gaps between pre-training and fine-tuning, and hunger on data. We conduct experiments on three representative tasks: JPEG compression artifacts removal, image super-resolution (classical and lightweight), and compressed image super-resolution. Experimental results demonstrate that our method, Swin2SR, can improve the training convergence and performance of SwinIR, and is a top-5 solution at the "AIM 2022 Challenge on Super-Resolution of Compressed Image and Video".
NasHD: Efficient ViT Architecture Performance Ranking using Hyperdimensional Computing
- Authors: Dongning Ma, Pengfei Zhao, Xun Jiao
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
- Arxiv link: https://arxiv.org/abs/2209.11356
- Pdf link: https://arxiv.org/pdf/2209.11356
- Abstract Neural Architecture Search (NAS) is an automated architecture engineering method for deep learning design automation, which serves as an alternative to the manual and error-prone process of model development, selection, evaluation and performance estimation. However, one major obstacle of NAS is the extremely demanding computation resource requirements and time-consuming iterations particularly when the dataset scales. In this paper, targeting at the emerging vision transformer (ViT), we present NasHD, a hyperdimensional computing based supervised learning model to rank the performance given the architectures and configurations. Different from other learning based methods, NasHD is faster thanks to the high parallel processing of HDC architecture. We also evaluated two HDC encoding schemes: Gram-based and Record-based of NasHD on their performance and efficiency. On the VIMER-UFO benchmark dataset of 8 applications from a diverse range of domains, NasHD Record can rank the performance of nearly 100K vision transformer models with about 1 minute while still achieving comparable results with sophisticated models.
A Robust and Explainable Data-Driven Anomaly Detection Approach For Power Electronics
- Authors: Alexander Beattie, Pavol Mulinka, Subham Sahoo, Ioannis T. Christou, Charalampos Kalalas, Daniel Gutierrez-Rojas, Pedro H. J. Nardelli
- Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.11427
- Pdf link: https://arxiv.org/pdf/2209.11427
- Abstract Timely and accurate detection of anomalies in power electronics is becoming increasingly critical for maintaining complex production systems. Robust and explainable strategies help decrease system downtime and preempt or mitigate infrastructure cyberattacks. This work begins by explaining the types of uncertainty present in current datasets and machine learning algorithm outputs. Three techniques for combating these uncertainties are then introduced and analyzed. We further present two anomaly detection and classification approaches, namely the Matrix Profile algorithm and anomaly transformer, which are applied in the context of a power electronic converter dataset. Specifically, the Matrix Profile algorithm is shown to be well suited as a generalizable approach for detecting real-time anomalies in streaming time-series data. The STUMPY python library implementation of the iterative Matrix Profile is used for the creation of the detector. A series of custom filters is created and added to the detector to tune its sensitivity, recall, and detection accuracy. Our numerical results show that, with simple parameter tuning, the detector provides high accuracy and performance in a variety of fault scenarios.
Lightweight Transformers for Human Activity Recognition on Mobile Devices
- Authors: Sannara EK, François Portet, Philippe Lalanda
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.11750
- Pdf link: https://arxiv.org/pdf/2209.11750
- Abstract Human Activity Recognition (HAR) on mobile devices has shown to be achievable with lightweight neural models learned from data generated by the user's inertial measurement units (IMUs). Most approaches for instanced-based HAR have used Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTMs), or a combination of the two to achieve state-of-the-art results with real-time performances. Recently, the Transformers architecture in the language processing domain and then in the vision domain has pushed further the state-of-the-art over classical architectures. However, such Transformers architecture is heavyweight in computing resources, which is not well suited for embedded applications of HAR that can be found in the pervasive computing domain. In this study, we present Human Activity Recognition Transformer (HART), a lightweight, sensor-wise transformer architecture that has been specifically adapted to the domain of the IMUs embedded on mobile devices. Our experiments on HAR tasks with several publicly available datasets show that HART uses fewer FLoating-point Operations Per Second (FLOPS) and parameters while outperforming current state-of-the-art results. Furthermore, we present evaluations across various architectures on their performances in heterogeneous environments and show that our models can better generalize on different sensing devices or on-body positions.
Keyword: scene understanding
There is no result
Keyword: visual reasoning
There is no result