arxiv-daily
arxiv-daily copied to clipboard
New submissions for Tue, 15 Nov 22
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
DEYO: DETR with YOLO for Step-by-Step Object Detection
- Authors: Haodong Ouyang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.06588
- Pdf link: https://arxiv.org/pdf/2211.06588
- Abstract Object detection is an important topic in computer vision, with post-processing, an essential part of the typical object detection pipeline, posing a significant bottleneck affecting the performance of traditional object detection models. The detection transformer (DETR), as the first end-to-end target detection model, discards the requirement of manual components like the anchor and non-maximum suppression (NMS), significantly simplifying the target detection process. However, compared with most traditional object detection models, DETR converges very slowly, and a query's meaning is obscure. Thus, inspired by the Step-by-Step concept, this paper proposes a new two-stage object detection model, named DETR with YOLO (DEYO), which relies on a progressive inference to solve the above problems. DEYO is a two-stage architecture comprising a classic target detection model and a DETR-like model as the first and second stages, respectively. Specifically, the first stage provides high-quality query and anchor feeding into the second stage, improving the performance and efficiency of the second stage compared to the original DETR model. Meanwhile, the second stage compensates for the performance degradation caused by the first stage detector's limitations. Extensive experiments demonstrate that DEYO attains 50.6 AP and 52.1 AP in 12 and 36 epochs, respectively, while utilizing ResNet-50 as the backbone and multi-scale features on the COCO dataset. Compared with DINO, an optimal DETR-like model, the developed DEYO model affords a significant performance improvement of 1.6 AP and 1.2 AP in two epoch settings.
Multistep feature aggregation framework for salient object detection
- Authors: Xiaogang Liu Shuang Song
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.06697
- Pdf link: https://arxiv.org/pdf/2211.06697
- Abstract Recent works on salient object detection have made use of multi-scale features in a way such that high-level features and low-level features can collaborate in locating salient objects. Many of the previous methods have achieved great performance in salient object detection. By merging the high-level and low-level features, a large number of feature information can be extracted. Generally, they are doing these in a one-way framework, and interweaving the variable features all the way to the final feature output. Which may cause some blurring or inaccurate localization of saliency maps. To overcome these difficulties, we introduce a multistep feature aggregation (MSFA) framework for salient object detection, which is composed of three modules, including the Diverse Reception (DR) module, multiscale interaction (MSI) module and Feature Enhancement (FE) module to accomplish better multi-level feature fusion. Experimental results on six benchmark datasets demonstrate that MSFA achieves state-of-the-art performance.
Boosting Semi-Supervised 3D Object Detection with Semi-Sampling
- Authors: Xiaopei Wu, Yang Zhao, Liang Peng, Hua Chen, Xiaoshui Huang, Binbin Lin, Haifeng Liu, Deng Cai, Wanli Ouyang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.07084
- Pdf link: https://arxiv.org/pdf/2211.07084
- Abstract Current 3D object detection methods heavily rely on an enormous amount of annotations. Semi-supervised learning can be used to alleviate this issue. Previous semi-supervised 3D object detection methods directly follow the practice of fully-supervised methods to augment labeled and unlabeled data, which is sub-optimal. In this paper, we design a data augmentation method for semi-supervised learning, which we call Semi-Sampling. Specifically, we use ground truth labels and pseudo labels to crop gt samples and pseudo samples on labeled frames and unlabeled frames, respectively. Then we can generate a gt sample database and a pseudo sample database. When training a teacher-student semi-supervised framework, we randomly select gt samples and pseudo samples to both labeled frames and unlabeled frames, making a strong data augmentation for them. Our semi-sampling can be regarded as an extension of gt-sampling to semi-supervised learning. Our method is simple but effective. We consistently improve state-of-the-art methods on ScanNet, SUN-RGBD, and KITTI benchmarks by large margins. For example, when training using only 10% labeled data on ScanNet, we achieve 3.1 mAP and 6.4 mAP improvement upon 3DIoUMatch in terms of [email protected] and [email protected]. When training using only 1% labeled data on KITTI, we boost 3DIoUMatch by 3.5 mAP, 6.7 mAP and 14.1 mAP on car, pedestrian and cyclist classes. Codes will be made publicly available at https://github.com/LittlePey/Semi-Sampling.
Recursive Cross-View: Use Only 2D Detectors to Achieve 3D Object Detection without 3D Annotations
- Authors: Shun Gui, Yan Luximon
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.07108
- Pdf link: https://arxiv.org/pdf/2211.07108
- Abstract Heavily relying on 3D annotations limits the real-world application of 3D object detection. In this paper, we propose a method that does not demand any 3D annotation, while being able to predict full-oriented 3D bounding boxes. Our method, called Recursive Cross-View (RCV), transforms 3D detection into several 2D detection tasks, which only consume some 2D labels, based on the three-view principle. We propose a recursive paradigm, in which instance segmentation and 3D bounding box generation by Cross-View are implemented recursively until convergence. Specifically, a frustum is proposed via a 2D detector, followed by the recursive paradigm that finally outputs a full-oriented 3D box, class, and score. To justify that our method can be quickly used to new tasks in real-world scenarios, we do three experiments, namely indoor 3D human detection, full-oriented 3D hand detection, and real-time detection on a real 3D sensor. RCV achieves decent performance in these experiments. Once trained, our method can be viewed as a 3D annotation tool. Consequently, we formulate two 3D labeled dataset, namely '3D_HUMAN' and 'D_HAND', based on RCV, which could be used to pre-train other 3D detectors. Furthermore, estimated on the SUN RGB-D benchmark, our method achieves comparable performance with some full 3D supervised learning methods. RCV is the first 3D detection method that does not consume 3D labels and yields full-oriented 3D boxes on point clouds.
Robust Collaborative 3D Object Detection in Presence of Pose Errors
- Authors: Yifan Lu, Quanhao Li, Baoan Liu, Mehrdad Dianat, Chen Feng, Siheng Chen, Yanfeng Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2211.07214
- Pdf link: https://arxiv.org/pdf/2211.07214
- Abstract Collaborative 3D object detection exploits information exchange among multiple agents to enhance accuracy of object detection in presence of sensor impairments such as occlusion. However, in practice, pose estimation errors due to imperfect localization would cause spatial message misalignment and significantly reduce the performance of collaboration. To alleviate adverse impacts of pose errors, we propose CoAlign, a novel hybrid collaboration framework that is robust to unknown pose errors. The proposed solution relies on a novel agent-object pose graph modeling to enhance pose consistency among collaborating agents. Furthermore, we adopt a multi-scale data fusion strategy to aggregate intermediate features at multiple spatial resolutions. Comparing with previous works, which require ground-truth pose for training supervision, our proposed CoAlign is more practical since it doesn't require any ground-truth pose supervision in the training and makes no specific assumptions on pose errors. Extensive evaluation of the proposed method is carried out on multiple datasets, certifying that CoAlign significantly reduce relative localization error and achieving the state of art detection performance when pose errors exist. Code are made available for the use of the research community at https://github.com/yifanlu0227/CoAlign.
The Role of Local Alignment and Uniformity in Image-Text Contrastive Learning on Medical Images
- Authors: Philip Müller, Georgios Kaissis, Daniel Rueckert
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2211.07254
- Pdf link: https://arxiv.org/pdf/2211.07254
- Abstract Image-text contrastive learning has proven effective for pretraining medical image models. When targeting localized downstream tasks like semantic segmentation or object detection, additional local contrastive losses that align image regions with sentences have shown promising results. We study how local contrastive losses are related to global (per-sample) contrastive losses and which effects they have on localized medical downstream tasks. Based on a theoretical comparison, we propose to remove some components of local losses and replace others by a novel distribution prior which enforces uniformity of representations within each sample. We empirically study this approach on chest X-ray tasks and find it to be very effective, outperforming methods without local losses on 12 of 18 tasks.
Butterfly Effect Attack: Tiny and Seemingly Unrelated Perturbations for Object Detection
- Authors: Nguyen Anh Vu Doan, Arda Yüksel, Chih-Hong Cheng
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
- Arxiv link: https://arxiv.org/abs/2211.07483
- Pdf link: https://arxiv.org/pdf/2211.07483
- Abstract This work aims to explore and identify tiny and seemingly unrelated perturbations of images in object detection that will lead to performance degradation. While tininess can naturally be defined using $L_p$ norms, we characterize the degree of "unrelatedness" of an object by the pixel distance between the occurred perturbation and the object. Triggering errors in prediction while satisfying two objectives can be formulated as a multi-objective optimization problem where we utilize genetic algorithms to guide the search. The result successfully demonstrates that (invisible) perturbations on the right part of the image can drastically change the outcome of object detection on the left. An extensive evaluation reaffirms our conjecture that transformer-based object detection networks are more susceptible to butterfly effects in comparison to single-stage object detection networks such as YOLOv5.
Discovering a Variety of Objects in Spatio-Temporal Human-Object Interactions
- Authors: Yong-Lu Li, Hongwei Fan, Zuoyu Qiu, Yiming Dou, Liang Xu, Hao-Shu Fang, Peiyang Guo, Haisheng Su, Dongliang Wang, Wei Wu, Cewu Lu
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2211.07501
- Pdf link: https://arxiv.org/pdf/2211.07501
- Abstract Spatio-temporal Human-Object Interaction (ST-HOI) detection aims at detecting HOIs from videos, which is crucial for activity understanding. In daily HOIs, humans often interact with a variety of objects, e.g., holding and touching dozens of household items in cleaning. However, existing whole body-object interaction video benchmarks usually provide limited object classes. Here, we introduce a new benchmark based on AVA: Discovering Interacted Objects (DIO) including 51 interactions and 1,000+ objects. Accordingly, an ST-HOI learning task is proposed expecting vision systems to track human actors, detect interactions and simultaneously discover interacted objects. Even though today's detectors/trackers excel in object detection/tracking tasks, they perform unsatisfied to localize diverse/unseen objects in DIO. This profoundly reveals the limitation of current vision systems and poses a great challenge. Thus, how to leverage spatio-temporal cues to address object discovery is explored, and a Hierarchical Probe Network (HPN) is devised to discover interacted objects utilizing hierarchical spatio-temporal human/context cues. In extensive experiments, HPN demonstrates impressive performance. Data and code are available at https://github.com/DirtyHarryLYL/HAKE-AVA.
PKCAM: Previous Knowledge Channel Attention Module
- Authors: Eslam Mohamed Bakar, Ahmad El Sallab, Mohsen A. Rashwan
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.07521
- Pdf link: https://arxiv.org/pdf/2211.07521
- Abstract Recently, attention mechanisms have been explored with ConvNets, both across the spatial and channel dimensions. However, from our knowledge, all the existing methods devote the attention modules to capture local interactions from a uni-scale. In this paper, we propose a Previous Knowledge Channel Attention Module(PKCAM), that captures channel-wise relations across different layers to model the global context. Our proposed module PKCAM is easily integrated into any feed-forward CNN architectures and trained in an end-to-end fashion with a negligible footprint due to its lightweight property. We validate our novel architecture through extensive experiments on image classification and object detection tasks with different backbones. Our experiments show consistent improvements in performances against their counterparts. Our code is published at https://github.com/eslambakr/EMCA.
Marine Microalgae Detection in Microscopy Images: A New Dataset
- Authors: Shizheng Zhou, Juntao Jiang, Xiaohan Hong, Yajun Fang, Yan Hong, Pengcheng Fu
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
- Arxiv link: https://arxiv.org/abs/2211.07546
- Pdf link: https://arxiv.org/pdf/2211.07546
- Abstract Marine microalgae are widespread in the ocean and play a crucial role in the ecosystem. Automatic identification and location of marine microalgae in microscopy images would help establish marine ecological environment monitoring and water quality evaluation system. A new dataset for marine microalgae detection is proposed in this paper. Six classes of microalgae commonlyfound in the ocean (Bacillariophyta, Chlorella pyrenoidosa, Platymonas, Dunaliella salina, Chrysophyta, Symbiodiniaceae) are microscopically imaged in real-time. Images of Symbiodiniaceae in three physiological states known as normal, bleaching, and translating are also included. We annotated these images with bounding boxes using Labelme software and split them into the training and testing sets. The total number of images in the dataset is 937 and all the objects in these images were annotated. The total number of annotated objects is 4201. The training set contains 537 images and the testing set contains 430 images. Baselines of different object detection algorithms are trained, validated and tested on this dataset. This data set can be got accessed via tianchi.aliyun.com/competition/entrance/532036/information.
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
- Authors: Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2211.07636
- Pdf link: https://arxiv.org/pdf/2211.07636
- Abstract We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVISv1.0 dataset with over a thousand categories and COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we will release all the code and models at \url{https://github.com/baaivision/EVA}.
Keyword: transformer
Grid-forming control of three-phase and single-phase converters across unbalanced transmission and distribution systems
- Authors: Shahin S. Nudehi, Dominic Groß
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2211.06464
- Pdf link: https://arxiv.org/pdf/2211.06464
- Abstract In this work, we investigate grid-forming control for power systems containing three-phase and single-phase converters connected to unbalanced distribution and transmission networks, investigate self-balancing between single-phase converters, and propose a novel balancing feedback for grid-forming control that explicitly allows to trade-off unbalances in voltage and power. We develop a quasi-steady-state power network model that allows to analyze the interactions between three-phase and single-phase power converters across transmission, distribution, and standard transformer interconnections. We first investigate conditions under which this general network admits a well-posed kron-reduced quasi-steady-state network model. Our main contribution leverages this reduced-order model to develop analytical conditions for stability of the overall network with grid-forming three-phase and single-phase converters connected through standard transformer interconnections. Specifically, we provide conditions on the network topology under which (i) single-phase converters autonomously self-synchronize to a phase-balanced operating point and (ii) single-phase converters phase-balance through synchronization with three-phase converters. Moreover, we establish that the conditions can be relaxed if a phase-balancing feedback control is used. Finally, case studies combining detailed models of transmission systems (i.e., IEEE 9-bus) and distribution systems (i.e., IEEE 13-bus) are used to illustrate the results for (i) a power system containing a mix of transmission and distribution connected converters and, (ii) a power system solely using distribution-connected converters at the grid edge.
End-to-End Machine Learning Framework for Facial AU Detection in Intensive Care Units
- Authors: Subhash Nerella, Kia Khezeli, Andrea Davidson, Patrick Tighe, Azra Bihorac, Parisa Rashidi
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2211.06570
- Pdf link: https://arxiv.org/pdf/2211.06570
- Abstract Pain is a common occurrence among patients admitted to Intensive Care Units. Pain assessment in ICU patients still remains a challenge for clinicians and ICU staff, specifically in cases of non-verbal sedated, mechanically ventilated, and intubated patients. Current manual observation-based pain assessment tools are limited by the frequency of pain observations administered and are subjective to the observer. Facial behavior is a major component in observation-based tools. Furthermore, previous literature shows the feasibility of painful facial expression detection using facial action units (AUs). However, these approaches are limited to controlled or semi-controlled environments and have never been validated in clinical settings. In this study, we present our Pain-ICU dataset, the largest dataset available targeting facial behavior analysis in the dynamic ICU environment. Our dataset comprises 76,388 patient facial image frames annotated with AUs obtained from 49 adult patients admitted to ICUs at the University of Florida Health Shands hospital. In this work, we evaluated two vision transformer models, namely ViT and SWIN, for AU detection on our Pain-ICU dataset and also external datasets. We developed a completely end-to-end AU detection pipeline with the objective of performing real-time AU detection in the ICU. The SWIN transformer Base variant achieved 0.88 F1-score and 0.85 accuracy on the held-out test partition of the Pain-ICU dataset.
DEYO: DETR with YOLO for Step-by-Step Object Detection
- Authors: Haodong Ouyang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.06588
- Pdf link: https://arxiv.org/pdf/2211.06588
- Abstract Object detection is an important topic in computer vision, with post-processing, an essential part of the typical object detection pipeline, posing a significant bottleneck affecting the performance of traditional object detection models. The detection transformer (DETR), as the first end-to-end target detection model, discards the requirement of manual components like the anchor and non-maximum suppression (NMS), significantly simplifying the target detection process. However, compared with most traditional object detection models, DETR converges very slowly, and a query's meaning is obscure. Thus, inspired by the Step-by-Step concept, this paper proposes a new two-stage object detection model, named DETR with YOLO (DEYO), which relies on a progressive inference to solve the above problems. DEYO is a two-stage architecture comprising a classic target detection model and a DETR-like model as the first and second stages, respectively. Specifically, the first stage provides high-quality query and anchor feeding into the second stage, improving the performance and efficiency of the second stage compared to the original DETR model. Meanwhile, the second stage compensates for the performance degradation caused by the first stage detector's limitations. Extensive experiments demonstrate that DEYO attains 50.6 AP and 52.1 AP in 12 and 36 epochs, respectively, while utilizing ResNet-50 as the backbone and multi-scale features on the COCO dataset. Compared with DINO, an optimal DETR-like model, the developed DEYO model affords a significant performance improvement of 1.6 AP and 1.2 AP in two epoch settings.
AU-Aware Vision Transformers for Biased Facial Expression Recognition
- Authors: Shuyi Mao, Xinpeng Li, Qingyang Wu, Xiaojiang Peng
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.06609
- Pdf link: https://arxiv.org/pdf/2211.06609
- Abstract Studies have proven that domain bias and label bias exist in different Facial Expression Recognition (FER) datasets, making it hard to improve the performance of a specific dataset by adding other datasets. For the FER bias issue, recent researches mainly focus on the cross-domain issue with advanced domain adaption algorithms. This paper addresses another problem: how to boost FER performance by leveraging cross-domain datasets. Unlike the coarse and biased expression label, the facial Action Unit (AU) is fine-grained and objective suggested by psychological studies. Motivated by this, we resort to the AU information of different FER datasets for performance boosting and make contributions as follows. First, we experimentally show that the naive joint training of multiple FER datasets is harmful to the FER performance of individual datasets. We further introduce expression-specific mean images and AU cosine distances to measure FER dataset bias. This novel measurement shows consistent conclusions with experimental degradation of joint training. Second, we propose a simple yet conceptually-new framework, AU-aware Vision Transformer (AU-ViT). It improves the performance of individual datasets by jointly training auxiliary datasets with AU or pseudo-AU labels. We also find that the AU-ViT is robust to real-world occlusions. Moreover, for the first time, we prove that a carefully-initialized ViT achieves comparable performance to advanced deep convolutional networks. Our AU-ViT achieves state-of-the-art performance on three popular datasets, namely 91.10% on RAF-DB, 65.59% on AffectNet, and 90.15% on FERPlus. The code and models will be released soon.
Kinematics Transformer: Solving The Inverse Modeling Problem of Soft Robots using Transformers
- Authors: Abdelrahman Alkhodary, Berke Gur
- Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2211.06643
- Pdf link: https://arxiv.org/pdf/2211.06643
- Abstract Soft robotic manipulators provide numerous advantages over conventional rigid manipulators in fragile environments such as the marine environment. However, developing analytic inverse models necessary for shape, motion, and force control of such robots remains a challenging problem. As an alternative to analytic models, numerical models can be learned using powerful machine learned methods. In this paper, the Kinematics Transformer is proposed for developing accurate and precise inverse kinematic models of soft robotic limbs. The proposed method re-casts the inverse kinematics problem as a sequential prediction problem and is based on the transformer architecture. Numerical simulations reveal that the proposed method can effectively be used in controlling a soft limb. Benchmark studies also reveal that the proposed method has better accuracy and precision compared to the baseline feed-forward neural network
Far Away in the Deep Space: Nearest-Neighbor-Based Dense Out-of-Distribution Detection
- Authors: Silvio Galesso, Max Argus, Thomas Brox
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.06660
- Pdf link: https://arxiv.org/pdf/2211.06660
- Abstract The key to out-of-distribution detection is density estimation of the in-distribution data or of its feature representations. While good parametric solutions to this problem exist for well curated classification data, these are less suitable for complex domains, such as semantic segmentation. In this paper, we show that a k-Nearest-Neighbors approach can achieve surprisingly good results with small reference datasets and runtimes, and be robust with respect to hyperparameters, such as the number of neighbors and the choice of the support set size. Moreover, we show that it combines well with anomaly scores from standard parametric approaches, and we find that transformer features are particularly well suited to detect novel objects in combination with k-Nearest-Neighbors. Ultimately, the approach is simple and non-invasive, i.e., it does not affect the primary segmentation performance, avoids training on examples of anomalies, and achieves state-of-the-art results on the common benchmarks with +23% and +16% AP improvements on on RoadAnomaly and StreetHazards respectively.
MultiCrossViT: Multimodal Vision Transformer for Schizophrenia Prediction using Structural MRI and Functional Network Connectivity Data
- Authors: Yuda Bi, Anees Abrol, Zening Fu, Vince Calhoun
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.06726
- Pdf link: https://arxiv.org/pdf/2211.06726
- Abstract Vision Transformer (ViT) is a pioneering deep learning framework that can address real-world computer vision issues, such as image classification and object recognition. Importantly, ViTs are proven to outperform traditional deep learning models, such as convolutional neural networks (CNNs). Relatively recently, a number of ViT mutations have been transplanted into the field of medical imaging, thereby resolving a variety of critical classification and segmentation challenges, especially in terms of brain imaging data. In this work, we provide a novel multimodal deep learning pipeline, MultiCrossViT, which is capable of analyzing both structural MRI (sMRI) and static functional network connectivity (sFNC) data for the prediction of schizophrenia disease. On a dataset with minimal training subjects, our novel model can achieve an AUC of 0.832. Finally, we visualize multiple brain regions and covariance patterns most relevant to schizophrenia based on the resulting ViT attention maps by extracting features from transformer encoders.
Integrating Transformer and Autoencoder Techniques with Spectral Graph Algorithms for the Prediction of Scarcely Labeled Molecular Data
- Authors: Nicole Hayes, Ekaterina Merkurjev, Guo-Wei Wei
- Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
- Arxiv link: https://arxiv.org/abs/2211.06759
- Pdf link: https://arxiv.org/pdf/2211.06759
- Abstract In molecular and biological sciences, experiments are expensive, time-consuming, and often subject to ethical constraints. Consequently, one often faces the challenging task of predicting desirable properties from small data sets or scarcely-labeled data sets. Although transfer learning can be advantageous, it requires the existence of a related large data set. This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge. Specifically, graph-based modifications of the MBO scheme is integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder, in order to deal with scarcely-labeled data sets. In addition, a consensus technique is detailed. The proposed models are validated using five benchmark data sets. We also provide a thorough comparison to other competing methods, such as support vector machines, random forests, and gradient boosted decision trees, which are known for their good performance on small data sets. The performances of various methods are analyzed using residue-similarity (R-S) scores and R-S indices. Extensive computational experiments and theoretical analysis show that the new models perform very well even when as little as 1% of the data set is used as labeled data.
Adversarial and Random Transformations for Robust Domain Adaptation and Generalization
- Authors: Liang Xiao, Jiaolong Xu, Dawei Zhao, Erke Shang, Qi Zhu, Bin Dai
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2211.06788
- Pdf link: https://arxiv.org/pdf/2211.06788
- Abstract Data augmentation has been widely used to improve generalization in training deep neural networks. Recent works show that using worst-case transformations or adversarial augmentation strategies can significantly improve the accuracy and robustness. However, due to the non-differentiable properties of image transformations, searching algorithms such as reinforcement learning or evolution strategy have to be applied, which are not computationally practical for large scale problems. In this work, we show that by simply applying consistency training with random data augmentation, state-of-the-art results on domain adaptation (DA) and generalization (DG) can be obtained. To further improve the accuracy and robustness with adversarial examples, we propose a differentiable adversarial data augmentation method based on spatial transformer networks (STN). The combined adversarial and random transformations based method outperforms the state-of-the-art on multiple DA and DG benchmark datasets. Besides, the proposed method shows desirable robustness to corruption, which is also validated on commonly used datasets.
Enhancing Few-shot Image Classification with Cosine Transformer
- Authors: Quang-Huy Nguyen, Cuong Q. Nguyen, Dung D. Le, Hieu H. Pham, Minh N. Do
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.06828
- Pdf link: https://arxiv.org/pdf/2211.06828
- Abstract This paper addresses the few-shot image classification problem. One notable limitation of few-shot learning is the variation in describing the same category, which might result in a significant difference between small labeled support and large unlabeled query sets. Our approach is to obtain a relation heatmap between the two sets in order to label the latter one in a transductive setting manner. This can be solved by using cross-attention with the scaled dot-product mechanism. However, the magnitude differences between two separate sets of embedding vectors may cause a significant impact on the output attention map and affect model performance. We tackle this problem by improving the attention mechanism with cosine similarity. Specifically, we develop FS-CT (Few-shot Cosine Transformer), a few-shot image classification method based on prototypical embedding and transformer-based framework. The proposed Cosine attention improves FS-CT performances significantly from nearly 5% to over 20% in accuracy compared to the baseline scaled dot-product attention in various scenarios on three few-shot datasets mini-ImageNet, CUB-200, and CIFAR-FS. Additionally, we enhance the prototypical embedding for categorical representation with learnable weights before feeding them to the attention module. Our proposed method FS-CT along with the Cosine attention is simple to implement and can be applied for a wide range of applications. Our codes are available at https://github.com/vinuni-vishc/Few-Shot-Cosine-Transformer
Learning from partially labeled data for multi-organ and tumor segmentation
- Authors: Yutong Xie, Jianpeng Zhang, Yong Xia, Chunhua Shen
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.06894
- Pdf link: https://arxiv.org/pdf/2211.06894
- Abstract Medical image benchmarks for the segmentation of organs and tumors suffer from the partially labeling issue due to its intensive cost of labor and expertise. Current mainstream approaches follow the practice of one network solving one task. With this pipeline, not only the performance is limited by the typically small dataset of a single task, but also the computation cost linearly increases with the number of tasks. To address this, we propose a Transformer based dynamic on-demand network (TransDoDNet) that learns to segment organs and tumors on multiple partially labeled datasets. Specifically, TransDoDNet has a hybrid backbone that is composed of the convolutional neural network and Transformer. A dynamic head enables the network to accomplish multiple segmentation tasks flexibly. Unlike existing approaches that fix kernels after training, the kernels in the dynamic head are generated adaptively by the Transformer, which employs the self-attention mechanism to model long-range organ-wise dependencies and decodes the organ embedding that can represent each organ. We create a large-scale partially labeled Multi-Organ and Tumor Segmentation benchmark, termed MOTS, and demonstrate the superior performance of our TransDoDNet over other competitors on seven organ and tumor segmentation tasks. This study also provides a general 3D medical image segmentation model, which has been pre-trained on the large-scale MOTS benchmark and has demonstrated advanced performance over BYOL, the current predominant self-supervised learning method. Code will be available at \url{https://git.io/DoDNet}.
ALBERT with Knowledge Graph Encoder Utilizing Semantic Similarity for Commonsense Question Answering
- Authors: Byeongmin Choi, YongHyun Lee, Yeunwoong Kyung, Eunchan Kim
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2211.07065
- Pdf link: https://arxiv.org/pdf/2211.07065
- Abstract Recently, pre-trained language representation models such as bidirectional encoder representations from transformers (BERT) have been performing well in commonsense question answering (CSQA). However, there is a problem that the models do not directly use explicit information of knowledge sources existing outside. To augment this, additional methods such as knowledge-aware graph network (KagNet) and multi-hop graph relation network (MHGRN) have been proposed. In this study, we propose to use the latest pre-trained language model a lite bidirectional encoder representations from transformers (ALBERT) with knowledge graph information extraction technique. We also propose to applying the novel method, schema graph expansion to recent language models. Then, we analyze the effect of applying knowledge graph-based knowledge extraction techniques to recent pre-trained language models and confirm that schema graph expansion is effective in some extent. Furthermore, we show that our proposed model can achieve better performance than existing KagNet and MHGRN models in CommonsenseQA dataset.
BiViT: Extremely Compressed Binary Vision Transformer
- Authors: Yefei He, Zhenyu Lou, Luoming Zhang, Weijia Wu, Bohan Zhuang, Hong Zhou
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2211.07091
- Pdf link: https://arxiv.org/pdf/2211.07091
- Abstract Model binarization can significantly compress model size, reduce energy consumption, and accelerate inference through efficient bit-wise operations. Although binarizing convolutional neural networks have been extensively studied, there is little work on exploring binarization on vision Transformers which underpin most recent breakthroughs in visual recognition. To this end, we propose to solve two fundamental challenges to push the horizon of Binary Vision Transformers (BiViT). First, the traditional binary method does not take the long-tailed distribution of softmax attention into consideration, bringing large binarization errors in the attention module. To solve this, we propose Softmax-aware Binarization, which dynamically adapts to the data distribution and reduces the error caused by binarization. Second, to better exploit the information of the pretrained model and restore accuracy, we propose a Cross-layer Binarization scheme and introduce learnable channel-wise scaling factors for weight binarization. The former decouples the binarization of self-attention and MLP to avoid mutual interference while the latter enhances the representation capacity of binarized models. Overall, our method performs favorably against state-of-the-arts by 19.8% on the TinyImageNet dataset. On ImageNet, BiViT achieves a competitive 70.8% Top-1 accuracy over Swin-T model, outperforming the existing SOTA methods by a clear margin.
ParCNetV2: Oversized Kernel with Enhanced Attention
- Authors: Ruihan Xu, Haokui Zhang, Wenze Hu, Shiliang Zhang, Xiaoyu Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.07157
- Pdf link: https://arxiv.org/pdf/2211.07157
- Abstract Transformers have achieved tremendous success in various computer vision tasks. By borrowing design concepts from transformers, many studies revolutionized CNNs and showed remarkable results. This paper falls in this line of studies. More specifically, we introduce a convolutional neural network architecture named ParCNetV2, which extends position-aware circular convolution (ParCNet) with oversized convolutions and strengthens attention through bifurcate gate units. The oversized convolution utilizes a kernel with $2\times$ the input size to model long-range dependencies through a global receptive field. Simultaneously, it achieves implicit positional encoding by removing the shift-invariant property from convolutional kernels, i.e., the effective kernels at different spatial locations are different when the kernel size is twice as large as the input size. The bifurcate gate unit implements an attention mechanism similar to self-attention in transformers. It splits the input into two branches, one serves as feature transformation while the other serves as attention weights. The attention is applied through element-wise multiplication of the two branches. Besides, we introduce a unified local-global convolution block to unify the design of the early and late stage convolutional blocks. Extensive experiments demonstrate that our method outperforms other pure convolutional neural networks as well as neural networks hybridizing CNNs and transformers.
Evade the Trap of Mediocrity: Promoting Diversity and Novelty in Text Generation via Concentrating Attention
- Authors: Wenhao Li, Xiaoyuan Yi, Jinyi Hu, Maosong Sun, Xing Xie
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2211.07164
- Pdf link: https://arxiv.org/pdf/2211.07164
- Abstract Recently, powerful Transformer architectures have proven superior in generating high-quality sentences. Nevertheless, these models tend to produce dull high-frequency phrases, severely hurting the diversity and novelty of generated text. In this work, we dig into the intrinsic mechanism of this problem and found that sparser attention values in Transformer could improve diversity. To understand such a phenomenon, we first conduct both empirical and theoretical analysis and then attribute it to representation degeneration caused by the attentive mixture of the hidden states during training. We term this process the Trap of Mediocrity. To escape from such a trap, we introduce a novel attention regularization loss to control the sharpness of the attention distribution, which is transparent to model structures and can be easily implemented within 20 lines of python code. We prove that this method could be mathematically regarded as learning a Bayesian approximation of posterior attention. Experiments show that our method improved the diversity and novelty of the generated text while maintaining comparable quality on a variety of conditional and unconditional generation tasks.
CabViT: Cross Attention among Blocks for Vision Transformer
- Authors: Haokui Zhang, Wenze Hu, Xiaoyu Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.07198
- Pdf link: https://arxiv.org/pdf/2211.07198
- Abstract Since the vision transformer (ViT) has achieved impressive performance in image classification, an increasing number of researchers pay their attentions to designing more efficient vision transformer models. A general research line is reducing computational cost of self attention modules by adopting sparse attention or using local attention windows. In contrast, we propose to design high performance transformer based architectures by densifying the attention pattern. Specifically, we propose cross attention among blocks of ViT (CabViT), which uses tokens from previous blocks in the same stage as extra input to the multi-head attention of transformers. The proposed CabViT enhances the interactions of tokens across blocks with potentially different semantics, and encourages more information flows to the lower levels, which together improves model performance and model convergence with limited extra cost. Based on the proposed CabViT, we design a series of CabViT models which achieve the best trade-off between model size, computational cost and accuracy. For instance without the need of knowledge distillation to strength the training, CabViT achieves 83.0% top-1 accuracy on Imagenet with only 16.3 million parameters and about 3.9G FLOPs, saving almost half parameters and 13% computational cost while gaining 0.9% higher accuracy compared with ConvNext, use 52% of parameters but gaining 0.6% accuracy compared with distilled EfficientFormer
Finding Skill Neurons in Pre-trained Transformer-based Language Models
- Authors: Xiaozhi Wang, Kaiyue Wen, Zhengyan Zhang, Lei Hou, Zhiyuan Liu, Juanzi Li
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2211.07349
- Pdf link: https://arxiv.org/pdf/2211.07349
- Abstract Transformer-based pre-trained language models have demonstrated superior performance on various natural language processing tasks. However, it remains unclear how the skills required to handle these tasks distribute among model parameters. In this paper, we find that after prompt tuning for specific tasks, the activations of some neurons within pre-trained Transformers are highly predictive of the task labels. We dub these neurons skill neurons and confirm they encode task-specific skills by finding that: (1) Skill neurons are crucial for handling tasks. Performances of pre-trained Transformers on a task significantly drop when corresponding skill neurons are perturbed. (2) Skill neurons are task-specific. Similar tasks tend to have similar distributions of skill neurons. Furthermore, we demonstrate the skill neurons are most likely generated in pre-training rather than fine-tuning by showing that the skill neurons found with prompt tuning are also crucial for other fine-tuning methods freezing neuron weights, such as the adapter-based tuning and BitFit. We also explore the applications of skill neurons, including accelerating Transformers with network pruning and building better transferability indicators. These findings may promote further research on understanding Transformers. The source code can be obtained from https://github.com/THU-KEG/Skill-Neuron.
Language models are good pathologists: using attention-based sequence reduction and text-pretrained transformers for efficient WSI classification
- Authors: Juan I. Pisula, Katarzyna Bozek
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2211.07384
- Pdf link: https://arxiv.org/pdf/2211.07384
- Abstract In digital pathology, Whole Slide Image (WSI) analysis is usually formulated as a Multiple Instance Learning (MIL) problem. Although transformer-based architectures have been used for WSI classification, these methods require modifications to adapt them to specific challenges of this type of image data. Despite their power across domains, reference transformer models in classical Computer Vision (CV) and Natural Language Processing (NLP) tasks are not used for pathology slide analysis. In this work we demonstrate the use of standard, frozen, text-pretrained, transformer language models in application to WSI classification. We propose SeqShort, a multi-head attention-based sequence reduction input layer to summarize each WSI in a fixed and short size sequence of instances. This allows us to reduce the computational costs of self-attention on long sequences, and to include positional information that is unavailable in other MIL approaches. We demonstrate the effectiveness of our methods in the task of cancer subtype classification, without the need of designing a WSI-specific transformer or performing in-domain self-supervised pretraining, while keeping a reduced compute budget and number of trainable parameters.
Cracking Double-Blind Review: Authorship Attribution with Deep Learning
- Authors: Leonard Bauersfeld, Angel Romero, Manasi Muglikar, Davide Scaramuzza
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2211.07467
- Pdf link: https://arxiv.org/pdf/2211.07467
- Abstract Double-blind peer review is considered a pillar of academic research because it is perceived to ensure a fair, unbiased, and fact-centered scientific discussion. Yet, experienced researchers can often correctly guess from which research group an anonymous submission originates, biasing the peer-review process. In this work, we present a transformer-based, neural-network architecture that only uses the text content and the author names in the bibliography to atttribute an anonymous manuscript to an author. To train and evaluate our method, we created the largest authorship-identification dataset to date. It leverages all research papers publicly available on arXiv amounting to over 2 million manuscripts. In arXiv-subsets with up to 2,000 different authors, our method achieves an unprecedented authorship attribution accuracy, where up to 95% of papers are attributed correctly. Thanks to our method, we are not only able to predict the author of an anonymous work but we also identify weaknesses of the double-blind review process by finding the key aspects that make a paper attributable. We believe that this work gives precious insights into how a submission can remain anonymous in order to support an unbiased double-blind review process.
Electroadhesive Clutches for Programmable Shape Morphing of Soft Actuators
- Authors: Gregory M. Campbell, Jessica Yin, Yuyang Song, Umesh Gandhi, Mark Yim, James Pikul
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2211.07480
- Pdf link: https://arxiv.org/pdf/2211.07480
- Abstract Soft robotic actuators are safe and adaptable devices with inherent compliance, which makes them attractive for manipulating delicate and complex objects. Researchers have integrated stiff materials into soft actuators to increase their force capacity and direct their deformation. However, these embedded materials have largely been pre-prescribed and static, which constrains the actuators to a predetermined range of motion. In this work, electroadhesive (EA) clutches integrated on a single-chamber soft pneumatic actuator (SPA) provide local programmable stiffness modulation to control the actuator deformation. We show that activating different clutch patterns inflates a silicone membrane into pyramidal, round, and plateau shapes. Curvatures from these shapes are combined during actuation to apply forces on both a 3.7 g and 820 g object along five different degrees of freedom (DoF). The actuator workspace is up to 12 mm for light objects. Clutch deactivation, which results in local elastomeric expansion, rapidly applies forces up to 3.2 N to an object resting on the surface and launches a 3.7 g object in controlled directions. The actuator also rotates a heavier, 820 g, object by 5 degrees and rapidly restores it to horizontal alignment after clutch deactivation. This actuator is fully powered by a 5 V battery, AA battery, DC-DC transformer, and 4.5 V (63 g) DC air pump. These results demonstrate a first step towards realizing a soft actuator with high DoF shape change that preserves the inherent benefits of pneumatic actuation while gaining the electrical controllability and strength of EA clutches. We envision such a system supplying human contact forces in the form of a low-profile sit-to-stand assistance device, bed-ridden patient manipulator, or other ergonomic mechanism. This technology was also demonstrated at ICRA 2022: https://www.youtube.com/watch?v=6Y6-iHWNi6s
Butterfly Effect Attack: Tiny and Seemingly Unrelated Perturbations for Object Detection
- Authors: Nguyen Anh Vu Doan, Arda Yüksel, Chih-Hong Cheng
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
- Arxiv link: https://arxiv.org/abs/2211.07483
- Pdf link: https://arxiv.org/pdf/2211.07483
- Abstract This work aims to explore and identify tiny and seemingly unrelated perturbations of images in object detection that will lead to performance degradation. While tininess can naturally be defined using $L_p$ norms, we characterize the degree of "unrelatedness" of an object by the pixel distance between the occurred perturbation and the object. Triggering errors in prediction while satisfying two objectives can be formulated as a multi-objective optimization problem where we utilize genetic algorithms to guide the search. The result successfully demonstrates that (invisible) perturbations on the right part of the image can drastically change the outcome of object detection on the left. An extensive evaluation reaffirms our conjecture that transformer-based object detection networks are more susceptible to butterfly effects in comparison to single-stage object detection networks such as YOLOv5.
On Analyzing the Role of Image for Visual-enhanced Relation Extraction
- Authors: Lei Li, Xiang Chen, Shuofei Qiao, Feiyu Xiong, Huajun Chen, Ningyu Zhang
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2211.07504
- Pdf link: https://arxiv.org/pdf/2211.07504
- Abstract Multimodal relation extraction is an essential task for knowledge graph construction. In this paper, we take an in-depth empirical analysis that indicates the inaccurate information in the visual scene graph leads to poor modal alignment weights, further degrading performance. Moreover, the visual shuffle experiments illustrate that the current approaches may not take full advantage of visual information. Based on the above observation, we further propose a strong baseline with an implicit fine-grained multimodal alignment based on Transformer for multimodal relation extraction. Experimental results demonstrate the better performance of our method. Codes are available at https://github.com/zjunlp/DeepKE/tree/main/example/re/multimodal.
Imagination is All You Need! Curved Contrastive Learning for Abstract Sequence Modeling Utilized on Long Short-Term Dialogue Planning
- Authors: Justus-Jonas Erker, Gerasimos Spanakis, Stefan Schaffer
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2211.07591
- Pdf link: https://arxiv.org/pdf/2211.07591
- Abstract Motivated by the entailment property of multi-turn dialogues through contrastive learning sentence embeddings, we introduce a novel technique, Curved Contrastive Learning (CCL), for generating semantically meaningful and conversational graph curved utterance embeddings that can be compared using cosine similarity. The resulting bi-encoder models can guide transformers as a response ranking model towards a goal in a zero-shot fashion by projecting the goal utterance and the corresponding reply candidates into a latent space. Here the cosine similarity indicates the distance/reachability of a candidate utterance towards the corresponding goal which we define as curved space. Furthermore, we explore how these forward-entailing language representations can be utilized for assessing the likelihood of sequences by the entailment strength i.e. through the cosine similarity of its individual members (encoded separately) as an emergent property in the curved space. This allows us to imagine the likelihood of future patterns in dialogues, specifically by ordering/identifying future goal utterances that are multiple turns away, given a dialogue context. As part of our analysis, we investigate characteristics that make conversations (un)plannable and find strong evidence of planning capability over multiple turns (in 61.56% over 3 turns) in conversations from the DailyDialog dataset. Finally, we will show how we can exploit the curved property to rank one million utterance & context pairs, in terms of GPU computation time over 7 million times faster than DialogRPT, while being in average 2.8% qualitatively superior for sequences longer than 2 turns.
Keyword: scene understanding
There is no result
Keyword: visual reasoning
There is no result