Paper-Daily-Notice
New submissions for Mon, 27 Jun 22
Keyword: SLAM
There is no result
Keyword: odometry
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
Contrastive Learning of Features between Images and LiDAR
- Authors: Peng Jiang, Srikanth Saripalli
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.12071
- Pdf link: https://arxiv.org/pdf/2206.12071
- Abstract Images and point clouds provide different information for robots. Finding the correspondences between data from different sensors is crucial for various tasks such as localization, mapping, and navigation. Learning-based descriptors have been developed for single sensors; there is little work on cross-modal features. This work treats learning cross-modal features as a dense contrastive learning problem. We propose a Tuple-Circle loss function for cross-modality feature learning. Furthermore, to learn good features and not lose generality, we developed a variant of the widely used PointNet++ architecture for point clouds and a U-Net CNN architecture for images. Moreover, we conduct experiments on a real-world dataset to show the effectiveness of our loss function and network structure. We show that our models indeed learn information from both images and LiDAR by visualizing the features.
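The Tuple-Circle loss is only named in the abstract; as a rough illustration of the cross-modal dense contrastive setup it builds on, a minimal InfoNCE-style sketch (matched image-pixel / LiDAR-point feature pairs as positives, all other pairs as negatives; tensor shapes and temperature are assumptions) might look like this:

```python
# Hedged sketch: a generic dense InfoNCE loss between per-pixel image features and
# per-point LiDAR features. The paper's Tuple-Circle loss is not specified in the
# abstract; this only illustrates the cross-modal contrastive setup.
import torch
import torch.nn.functional as F

def dense_cross_modal_infonce(img_feats, pcd_feats, temperature=0.07):
    """img_feats, pcd_feats: (N, D) features for N corresponding image pixels and
    LiDAR points; row i of each tensor describes the same physical point."""
    img_feats = F.normalize(img_feats, dim=1)
    pcd_feats = F.normalize(pcd_feats, dim=1)
    logits = img_feats @ pcd_feats.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(img_feats.size(0), device=img_feats.device)
    # symmetric loss: match image -> point and point -> image
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = dense_cross_modal_infonce(torch.randn(128, 64), torch.randn(128, 64))
```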
MaskRange: A Mask-classification Model for Range-view based LiDAR Segmentation
- Authors: Yi Gu, Yuming Huang, Chengzhong Xu, Hui Kong
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.12073
- Pdf link: https://arxiv.org/pdf/2206.12073
- Abstract Range-view based LiDAR segmentation methods are attractive for practical applications due to their direct inheritance from efficient 2D CNN architectures. In literature, most range-view based methods follow the per-pixel classification paradigm. Recently, in the image segmentation domain, another paradigm formulates segmentation as a mask-classification problem and has achieved remarkable performance. This raises an interesting question: can the mask-classification paradigm benefit the range-view based LiDAR segmentation and achieve better performance than the counterpart per-pixel paradigm? To answer this question, we propose a unified mask-classification model, MaskRange, for the range-view based LiDAR semantic and panoptic segmentation. Along with the new paradigm, we also propose a novel data augmentation method to deal with overfitting, context-reliance, and class-imbalance problems. Extensive experiments are conducted on the SemanticKITTI benchmark. Among all published range-view based methods, our MaskRange achieves state-of-the-art performance with $66.10$ mIoU on semantic segmentation and promising results with $53.10$ PQ on panoptic segmentation with high efficiency. Our code will be released.
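The abstract does not detail MaskRange's architecture; a minimal sketch of the mask-classification paradigm it adopts (MaskFormer-style: N learned queries, each predicting a class and a full-resolution mask over the range image) could look like the following, with all layer sizes assumed:

```python
# Hedged sketch: mask-classification head over a range image. N queries yield
# N (class, mask) pairs instead of per-pixel class logits.
import torch
import torch.nn as nn

class MaskClassificationHead(nn.Module):
    def __init__(self, num_queries=100, num_classes=20, dim=128):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)      # one embedding per mask proposal
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, pixel_feats):
        # pixel_feats: (B, dim, H, W) features of the range image
        q = self.queries.weight.unsqueeze(0).expand(pixel_feats.size(0), -1, -1)  # (B, Q, dim)
        class_logits = self.class_head(q)                                          # (B, Q, C+1)
        mask_embed = self.mask_head(q)                                             # (B, Q, dim)
        mask_logits = torch.einsum("bqd,bdhw->bqhw", mask_embed, pixel_feats)      # (B, Q, H, W)
        return class_logits, mask_logits

head = MaskClassificationHead()
cls_logits, masks = head(torch.randn(2, 128, 64, 2048))  # a typical 64x2048 range view
```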
Keyword: loop detection
There is no result
Keyword: nerf
UNeRF: Time and Memory Conscious U-Shaped Network for Training Neural Radiance Fields
- Authors: Abiramy Kuganesan, Shih-yang Su, James J. Little, Helge Rhodin
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
- Arxiv link: https://arxiv.org/abs/2206.11952
- Pdf link: https://arxiv.org/pdf/2206.11952
- Abstract Neural Radiance Fields (NeRFs) increase reconstruction detail for novel view synthesis and scene reconstruction, with applications ranging from large static scenes to dynamic human motion. However, the increased resolution and model-free nature of such neural fields come at the cost of high training times and excessive memory requirements. Recent advances improve the inference time by using complementary data structures yet these methods are ill-suited for dynamic scenes and often increase memory consumption. Little has been done to reduce the resources required at training time. We propose a method to exploit the redundancy of NeRF's sample-based computations by partially sharing evaluations across neighboring sample points. Our UNeRF architecture is inspired by the UNet, where spatial resolution is reduced in the middle of the network and information is shared between adjacent samples. Although this change violates the strict and conscious separation of view-dependent appearance and view-independent density estimation in the NeRF method, we show that it improves novel view synthesis. We also introduce an alternative subsampling strategy which shares computation while minimizing any violation of view invariance. UNeRF is a plug-in module for the original NeRF network. Our major contributions include reduction of the memory footprint, improved accuracy, and reduced amortized processing time both during training and inference. With only weak assumptions on locality, we achieve improved resource utilization on a variety of neural radiance fields tasks. We demonstrate applications to the novel view synthesis of static scenes as well as dynamic human shape and motion.
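As a rough illustration of the stated idea of sharing evaluations across neighboring ray samples with a U-shaped network (not the authors' UNeRF architecture), a toy 1D U-block over the sample dimension might look like this:

```python
# Hedged sketch: a tiny U-shaped block along the ray's sample dimension. Adjacent
# samples are merged (fewer evaluations in the middle), processed, and redistributed.
import torch
import torch.nn as nn

class TinyUBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.down = nn.Conv1d(dim, dim, kernel_size=2, stride=2)   # merge adjacent samples
        self.mid = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.up = nn.ConvTranspose1d(dim, dim, kernel_size=2, stride=2)

    def forward(self, x):
        # x: (rays, dim, samples) per-sample features along each ray
        skip = x
        x = torch.relu(self.down(x))   # half as many sample evaluations
        x = torch.relu(self.mid(x))
        x = self.up(x)                 # redistribute shared information to all samples
        return x + skip                # skip connection keeps per-sample detail

feats = torch.randn(1024, 64, 128)     # 1024 rays, 128 samples each
out = TinyUBlock()(feats)
```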
Keyword: mapping
Contrastive Learning of Features between Images and LiDAR
- Authors: Peng Jiang, Srikanth Saripalli
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.12071
- Pdf link: https://arxiv.org/pdf/2206.12071
- Abstract Images and point clouds provide different information for robots. Finding the correspondences between data from different sensors is crucial for various tasks such as localization, mapping, and navigation. Learning-based descriptors have been developed for single sensors; there is little work on cross-modal features. This work treats learning cross-modal features as a dense contrastive learning problem. We propose a Tuple-Circle loss function for cross-modality feature learning. Furthermore, to learn good features and not lose generality, we developed a variant of the widely used PointNet++ architecture for point clouds and a U-Net CNN architecture for images. Moreover, we conduct experiments on a real-world dataset to show the effectiveness of our loss function and network structure. We show that our models indeed learn information from both images and LiDAR by visualizing the features.
Keyword: localization
Contrastive Learning of Features between Images and LiDAR
- Authors: Peng Jiang, Srikanth Saripalli
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.12071
- Pdf link: https://arxiv.org/pdf/2206.12071
- Abstract Images and point clouds provide different information for robots. Finding the correspondences between data from different sensors is crucial for various tasks such as localization, mapping, and navigation. Learning-based descriptors have been developed for single sensors; there is little work on cross-modal features. This work treats learning cross-modal features as a dense contrastive learning problem. We propose a Tuple-Circle loss function for cross-modality feature learning. Furthermore, to learn good features and not lose generality, we developed a variant of the widely used PointNet++ architecture for point clouds and a U-Net CNN architecture for images. Moreover, we conduct experiments on a real-world dataset to show the effectiveness of our loss function and network structure. We show that our models indeed learn information from both images and LiDAR by visualizing the features.
Augmented Reality-Empowered Network Planning Services for Private Networks
- Authors: Qi Liao, Tianlun Hu, Nikolaj Marchenko
- Subjects: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.12139
- Pdf link: https://arxiv.org/pdf/2206.12139
- Abstract To support Industry 4.0 applications with haptics and human-machine interaction, the sixth generation (6G) requires a new framework that is fully autonomous, visual, and interactive. In this paper, we propose a novel framework for private network planning services, providing an end-to-end solution that receives visual and sensory data from the user device, reconstructs the 3D network environment and performs network planning on the server, and visualizes the network performance with augmented reality (AR) on the display of the user devices. The solution is empowered by three key technical components: 1) vision- and sensor fusion-based 3D environment reconstruction, 2) ray tracing-based radio map generation and network planning, and 3) AR-empowered network visualization enabled by real-time camera relocalization. We conducted the proof-of-concept in a Bosch plant in Germany and showed good network coverage of the optimized antenna location, as well as high accuracy in both environment reconstruction and camera relocalization. We also achieved real-time AR-supported network monitoring with an end-to-end latency of about 32 ms per frame.
Source Localization of Graph Diffusion via Variational Autoencoders for Graph Inverse Problems
- Authors: Chen Ling, Junji Jiang, Junxiang Wang, Liang Zhao
- Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
- Arxiv link: https://arxiv.org/abs/2206.12327
- Pdf link: https://arxiv.org/pdf/2206.12327
- Abstract Graph diffusion problems such as the propagation of rumors, computer viruses, or smart grid failures are ubiquitous and of societal importance. Hence it is usually crucial to identify diffusion sources according to the current graph diffusion observations. Despite its tremendous necessity and significance in practice, source localization, as the inverse problem of graph diffusion, is extremely challenging as it is ill-posed: different sources may lead to the same graph diffusion patterns. Different from most traditional source localization methods, this paper adopts a probabilistic approach to account for the uncertainty of different candidate sources. Such endeavors require overcoming several challenges: 1) the uncertainty in graph diffusion source localization is hard to quantify; 2) the complex patterns of graph diffusion sources are difficult to characterize probabilistically; and 3) generalization under arbitrary underlying diffusion patterns is hard to enforce. To solve the above challenges, this paper presents a generic framework: Source Localization Variational AutoEncoder (SL-VAE) for locating the diffusion sources under arbitrary diffusion patterns. Particularly, we propose a probabilistic model that leverages the forward diffusion estimation model along with deep generative models to approximate the diffusion source distribution for quantifying the uncertainty. SL-VAE further utilizes prior knowledge of the source-observation pairs to characterize the complex patterns of diffusion sources by a learned generative prior. Lastly, a unified objective that integrates the forward diffusion estimation model is derived to enforce the model to generalize under arbitrary diffusion patterns. Extensive experiments are conducted on 7 real-world datasets to demonstrate the superiority of SL-VAE in reconstructing the diffusion sources, outperforming other methods by 20% on average in AUC score.
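A minimal sketch of the general idea, assuming a conditional VAE that encodes source-observation pairs and decodes per-node source probabilities (the coupling with a forward diffusion estimation model, central to SL-VAE, is omitted):

```python
# Hedged sketch: conditional VAE over graph node vectors for source localization.
# Node vectors are flattened indicator/probability vectors; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SourceVAE(nn.Module):
    def __init__(self, num_nodes, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(2 * num_nodes, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * latent))        # mu and logvar
        self.dec = nn.Sequential(nn.Linear(latent + num_nodes, 128), nn.ReLU(),
                                 nn.Linear(128, num_nodes))         # per-node source logits

    def forward(self, source, observation):
        mu, logvar = self.enc(torch.cat([source, observation], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # reparameterization
        logits = self.dec(torch.cat([z, observation], -1))
        recon = F.binary_cross_entropy_with_logits(logits, source)  # reconstruct the source set
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kld

model = SourceVAE(num_nodes=200)
loss = model(torch.randint(0, 2, (8, 200)).float(), torch.rand(8, 200))
```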
Keyword: transformer
Agriculture-Vision Challenge 2022 -- The Runner-Up Solution for Agricultural Pattern Recognition via Transformer-based Models
- Authors: Zhicheng Yang, Jui-Hsin Lai, Jun Zhou, Hang Zhou, Chen Du, Zhongcheng Lai
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.11920
- Pdf link: https://arxiv.org/pdf/2206.11920
- Abstract The Agriculture-Vision Challenge in CVPR is one of the most famous and competitive challenges for global researchers to break the boundary between the computer vision and agriculture sectors, aiming at agricultural pattern recognition from aerial images. In this paper, we propose our solution to the third Agriculture-Vision Challenge in CVPR 2022. We leverage a data pre-processing scheme and several Transformer-based models as well as data augmentation techniques to achieve a mIoU of 0.582, securing 2nd place in this challenge.
Set Norm and Equivariant Skip Connections: Putting the Deep in Deep Sets
- Authors: Lily H. Zhang, Veronica Tozzo, John M. Higgins, Rajesh Ranganath
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.11925
- Pdf link: https://arxiv.org/pdf/2206.11925
- Abstract Permutation invariant neural networks are a promising tool for making predictions from sets. However, we show that existing permutation invariant architectures, Deep Sets and Set Transformer, can suffer from vanishing or exploding gradients when they are deep. Additionally, layer norm, the normalization of choice in Set Transformer, can hurt performance by removing information useful for prediction. To address these issues, we introduce the clean path principle for equivariant residual connections and develop set norm, a normalization tailored for sets. With these, we build Deep Sets++ and Set Transformer++, models that reach high depths with comparable or better performance than their original counterparts on a diverse suite of tasks. We additionally introduce Flow-RBC, a new single-cell dataset and real-world application of permutation invariant prediction. We open-source our data and code here: https://github.com/rajesh-lab/deep_permutation_invariant.
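The abstract does not give the set norm formula; one plausible reading, standardizing each set jointly over its elements and features so that differences between elements are preserved, is sketched below (treat it as an assumption, not the paper's exact definition):

```python
# Hedged sketch: a permutation-equivariant "set norm" that normalizes each set over
# both the element and feature dimensions, with per-feature learnable scale/shift.
import torch
import torch.nn as nn

class SetNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x):
        # x: (batch, set_size, dim); statistics are shared by all elements of a set,
        # so relative differences between elements are not erased
        mean = x.mean(dim=(1, 2), keepdim=True)
        var = x.var(dim=(1, 2), keepdim=True, unbiased=False)
        return (x - mean) / (var + self.eps).sqrt() * self.gamma + self.beta

x = torch.randn(4, 50, 64)   # 4 sets of 50 elements with 64 features each
print(SetNorm(64)(x).shape)
```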
Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs
- Authors: Yi-Lun Liao, Tess Smidt
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
- Arxiv link: https://arxiv.org/abs/2206.11990
- Pdf link: https://arxiv.org/pdf/2206.11990
- Abstract 3D-related inductive biases like translational invariance and rotational equivariance are indispensable to graph neural networks operating on 3D atomistic graphs such as molecules. Inspired by the success of Transformers in various domains, we study how to incorporate these inductive biases into Transformers. In this paper, we present Equiformer, a graph neural network leveraging the strength of Transformer architectures and incorporating $SE(3)/E(3)$-equivariant features based on irreducible representations (irreps). Irreps features encode equivariant information in channel dimensions without complicating graph structures. The simplicity enables us to directly incorporate them by replacing original operations with equivariant counterparts. Moreover, to better adapt Transformers to 3D graphs, we propose a novel equivariant graph attention, which considers both content and geometric information such as relative position contained in irreps features. To improve expressivity of the attention, we replace dot product attention with multi-layer perceptron attention and include non-linear message passing. We benchmark Equiformer on two quantum properties prediction datasets, QM9 and OC20. For QM9, among models trained with the same data partition, Equiformer achieves best results on 11 out of 12 regression tasks. For OC20, under the setting of training with IS2RE data and optionally IS2RS data, Equiformer improves upon state-of-the-art models. Code reproducing all main results will be available soon.
A multi-model-based deep learning framework for short text multiclass classification with the imbalanced and extremely small data set
- Authors: Jiajun Tong, Zhixiao Wang, Xiaobin Rui
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2206.12027
- Pdf link: https://arxiv.org/pdf/2206.12027
- Abstract Text classification plays an important role in many practical applications. In the real world, there are extremely small datasets. Most existing methods adopt pre-trained neural network models to handle this kind of dataset. However, these methods are either difficult to deploy on mobile devices because of their large output size or cannot fully extract the deep semantic information between phrases and clauses. This paper proposes a multimodel-based deep learning framework for short-text multiclass classification with an imbalanced and extremely small data set. Our framework mainly includes five layers: The encoder layer uses DISTILBERT to obtain context-sensitive dynamic word vectors that are difficult to represent in traditional feature engineering methods. Since the transformer part of this layer is distilled, our framework is compressed. Then, we use the next two layers to extract deep semantic information. The output of the encoder layer is sent to a bidirectional LSTM network, and the feature matrix is extracted hierarchically through the LSTM at the word and sentence level to obtain the fine-grained semantic representation. After that, the max-pooling layer converts the feature matrix into a lower-dimensional matrix, preserving only the obvious features. Finally, the feature matrix is taken as the input of a fully connected softmax layer, which contains a function that can convert the predicted linear vector into the output value as the probability of the text in each classification. Extensive experiments on two public benchmarks demonstrate the effectiveness of our proposed approach on an extremely small data set. It retains the state-of-the-art baseline performance in terms of precision, recall, accuracy, and F1 score, and through the model size, training time, and convergence epoch, we can conclude that our method can be deployed faster and lighter on mobile devices.
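Since the abstract walks through the pipeline layer by layer, a minimal sketch of that flow (DistilBERT encoder, bidirectional LSTM, max pooling, fully connected softmax) may help; hidden sizes and the checkpoint name are assumptions, and the paper's hierarchical word/sentence-level LSTM is collapsed into a single LSTM here:

```python
# Hedged sketch of the described five-layer pipeline; not the authors' exact model.
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast

class ShortTextClassifier(nn.Module):
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        tokens = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        seq, _ = self.lstm(tokens)        # (B, T, 2*hidden) word-level features
        pooled, _ = seq.max(dim=1)        # max pooling keeps only the salient features
        return self.fc(pooled)            # logits; softmax is applied by the loss

tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
batch = tok(["a short text example"], return_tensors="pt", padding=True)
logits = ShortTextClassifier(num_classes=5)(batch["input_ids"], batch["attention_mask"])
```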
The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation
- Authors: Leilei Cao, Zhuang Li, Bo Yan, Feng Zhang, Fengliang Qi, Yuchen Hu, Hongbin Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
- Arxiv link: https://arxiv.org/abs/2206.12035
- Pdf link: https://arxiv.org/pdf/2206.12035
- Abstract The referring video object segmentation task (RVOS) aims to segment, across all frames of a given video, the object instances referred to by a language expression. Due to the requirement of understanding cross-modal semantics within individual instances, this task is more challenging than traditional semi-supervised video object segmentation, where the ground truth object masks in the first frame are given. With the great achievements of Transformers in object detection and object segmentation, remarkable progress has been made in RVOS, with ReferFormer achieving state-of-the-art performance. In this work, building on the strong ReferFormer baseline, we propose several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference. The improved ReferFormer ranks 2nd in the CVPR 2022 Referring Youtube-VOS Challenge.
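Of the listed tricks, cyclical learning rates are easy to illustrate with PyTorch's built-in scheduler; the bounds and step sizes below are placeholders, not the authors' settings:

```python
# Hedged sketch: cyclical learning rate schedule with torch's built-in CyclicLR.
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-5, max_lr=1e-3,
    step_size_up=2000, mode="triangular2")

for step in range(10):      # inside the usual training loop
    optimizer.step()        # after loss.backward() in real training
    scheduler.step()        # learning rate cycles between the two bounds
```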
BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping
- Authors: Gasser Elbanna, Neil Scheidwasser-Clow, Mikolaj Kegler, Pierre Beckmann, Karl El Hajal, Milos Cernak
- Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2206.12038
- Pdf link: https://arxiv.org/pdf/2206.12038
- Abstract Methods for extracting audio and speech features have been studied since pioneering work on spectrum analysis decades ago. Recent efforts are guided by the ambition to develop general-purpose audio representations. For example, deep neural networks can extract optimal embeddings if they are trained on large audio datasets. This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets. Lastly, we present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features. All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks. Our results indicate that the hybrid model with a convolutional transformer as the encoder yields superior performance in most HEAR challenge tasks.
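A minimal sketch of the BYOL-style bootstrapping objective the work extends (an online network predicting a slowly updated target network's output); the encoder here is a placeholder MLP, whereas the paper's contribution lies in the encoder architectures and pre-training data:

```python
# Hedged sketch: bootstrapping objective (BYOL-style), not the BYOL-S model itself.
import copy
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    # negative cosine similarity between L2-normalized representations
    return 2 - 2 * F.cosine_similarity(online_pred, target_proj.detach(), dim=-1).mean()

@torch.no_grad()
def ema_update(target, online, tau=0.99):
    # target network parameters trail the online network (the bootstrapping step)
    for pt, po in zip(target.parameters(), online.parameters()):
        pt.mul_(tau).add_(po, alpha=1 - tau)

online = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU(), torch.nn.Linear(32, 32))
target = copy.deepcopy(online)
view_a, view_b = torch.randn(16, 64), torch.randn(16, 64)   # two augmented views of the audio
loss = byol_loss(online(view_a), target(view_b))
loss.backward()
ema_update(target, online)
```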
Bilateral Network with Channel Splitting Network and Transformer for Thermal Image Super-Resolution
- Authors: Bo Yan, Leilei Cao, Fengliang Qi, Hongbin Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2206.12046
- Pdf link: https://arxiv.org/pdf/2206.12046
- Abstract In recent years, the Thermal Image Super-Resolution (TISR) problem has become an attractive research topic. TISR can be used in a wide range of fields, including the military, medical, agricultural, and animal ecology domains. Due to the success of the PBVS-2020 and PBVS-2021 workshop challenges, TISR results keep improving and more researchers are attracted to the PBVS-2022 challenge. In this paper, we introduce the technical details of our submission to the PBVS-2022 challenge, which designs a Bilateral Network with Channel Splitting Network and Transformer (BN-CSNT) to tackle the TISR problem. First, we designed a context branch based on a channel splitting network with a transformer to obtain sufficient context information. Second, we designed a spatial branch with a shallow transformer to extract low-level features that preserve spatial information. Finally, to fuse the features from the channel splitting network and the transformer in the context branch, we proposed an attention refinement module; features from the context branch and the spatial branch are then fused by the proposed feature fusion module. The proposed method achieves PSNR = 33.64, SSIM = 0.9263 for x4 and PSNR = 21.08, SSIM = 0.7803 for x2 on the PBVS-2022 challenge test dataset.
TreeDRNet: A Robust Deep Model for Long Term Time Series Forecasting
- Authors: Tian Zhou, Jianqing Zhu, Xue Wang, Ziqing Ma, Qingsong Wen, Liang Sun, Rong Jin
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.12106
- Pdf link: https://arxiv.org/pdf/2206.12106
- Abstract Various deep learning models, especially some of the latest Transformer-based approaches, have greatly improved the state-of-the-art performance for long-term time series forecasting. However, those transformer-based models suffer severe performance deterioration with prolonged input length, which prohibits them from using extended historical information. Moreover, these methods tend to handle complex examples in long-term forecasting with increased model complexity, which often leads to a significant increase in computation and less robust performance (e.g., overfitting). We propose a novel neural network architecture, called TreeDRNet, for more effective long-term forecasting. Inspired by robust regression, we introduce a doubly residual link structure to make prediction more robust. Built upon the Kolmogorov-Arnold representation theorem, we explicitly introduce feature selection, model ensembling, and a tree structure to further utilize the extended input sequence, which improves the robustness and representation power of TreeDRNet. Unlike previous deep models for sequential forecasting, TreeDRNet is built entirely on multilayer perceptrons and thus enjoys high computational efficiency. Our extensive empirical studies show that TreeDRNet is significantly more effective than state-of-the-art methods, reducing prediction errors by 20% to 40% for multivariate time series. In particular, TreeDRNet is over 10 times more efficient than transformer-based methods. The code will be released soon.
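The abstract does not define the doubly residual link structure; one common reading (N-BEATS-style backcast/forecast residuals stacked block by block) is sketched below as an assumption, without TreeDRNet's feature selection or tree ensemble:

```python
# Hedged sketch: a doubly residual MLP forecaster. Each block explains part of the
# input (backcast) and contributes part of the prediction (forecast).
import torch
import torch.nn as nn

class DoublyResidualBlock(nn.Module):
    def __init__(self, input_len, horizon, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_len, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.backcast = nn.Linear(hidden, input_len)   # what this block explains
        self.forecast = nn.Linear(hidden, horizon)     # this block's partial prediction

    def forward(self, x):
        h = self.body(x)
        return x - self.backcast(h), self.forecast(h)  # residual input, partial forecast

def forecast(x, blocks):
    total = 0
    for block in blocks:
        x, f = block(x)       # each block works on what previous blocks left unexplained
        total = total + f     # forecasts are accumulated across blocks
    return total

blocks = nn.ModuleList(DoublyResidualBlock(96, 24) for _ in range(3))
y_hat = forecast(torch.randn(8, 96), blocks)   # 8 series, 96-step history, 24-step horizon
```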
Capture Salient Historical Information: A Fast and Accurate Non-Autoregressive Model for Multi-turn Spoken Language Understanding
- Authors: Lizhi Cheng, Weijia Jia, Wenmian Yang
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2206.12209
- Pdf link: https://arxiv.org/pdf/2206.12209
- Abstract Spoken Language Understanding (SLU), a core component of the task-oriented dialogue system, expects shorter inference times given the impatience of human users. Existing work increases inference speed by designing non-autoregressive models for single-turn SLU tasks but fails to apply to multi-turn SLU when confronting dialogue history. The intuitive idea is to concatenate all historical utterances and utilize the non-autoregressive models directly. However, this approach seriously misses salient historical information and suffers from uncoordinated-slot problems. To overcome those shortcomings, we propose a novel model for multi-turn SLU named Salient History Attention with Layer-Refined Transformer (SHA-LRT), which consists of an SHA module, a Layer-Refined Mechanism (LRM), and a Slot Label Generation (SLG) task. SHA captures salient historical information for the current dialogue from both historical utterances and results via a well-designed history-attention mechanism. LRM predicts preliminary SLU results from Transformer's middle states and utilizes them to guide the final prediction, and SLG obtains the sequential dependency information for the non-autoregressive encoder. Experiments on public datasets indicate that our model significantly improves multi-turn SLU performance (17.5% on Overall) while accelerating the inference process (by nearly 15 times) over the state-of-the-art baseline, and is also effective on single-turn SLU tasks.
Text and author-level political inference using heterogeneous knowledge representations
- Authors: Samuel Caetano da Silva, Ivandre Paraboni
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2206.12293
- Pdf link: https://arxiv.org/pdf/2206.12293
- Abstract The inference of politically-charged information from text data is a popular research topic in Natural Language Processing (NLP) at both text- and author-level. In recent years, studies of this kind have been implemented with the aid of representations from transformers such as BERT. Despite considerable success, however, we may ask whether results may be improved even further by combining transformer-based models with additional knowledge representations. To shed light on this issue, the present work describes a series of experiments to compare alternative model configurations for political inference from text in both English and Portuguese. Results suggest that certain text representations - in particular, the combined use of BERT pre-trained language models with a syntactic dependency model - may outperform the alternatives across multiple experimental settings, making a potentially strong case for further research in the use of heterogeneous text representations in these and possibly other NLP tasks.
Megapixel Image Generation with Step-Unrolled Denoising Autoencoders
- Authors: Alex F. McKinney, Chris G. Willcocks
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.12351
- Pdf link: https://arxiv.org/pdf/2206.12351
- Abstract An ongoing trend in generative modelling research has been to push sample resolutions higher whilst simultaneously reducing computational requirements for training and sampling. We aim to push this trend further via the combination of techniques - each component representing the current pinnacle of efficiency in their respective areas. These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy - but perceptually insignificant - compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model. Unexpectedly, our method highlights weaknesses in the original formulation of hourglass transformers when applied to multidimensional data. In light of this, we propose modifications to the resampling mechanism, applicable in any task applying hierarchical transformers to multidimensional data. Additionally, we demonstrate the scalability of SUNDAE to long sequence lengths - four times longer than prior work. Our proposed framework scales to high-resolutions ($1024 \times 1024$) and trains quickly (2-4 days). Crucially, the trained model produces diverse and realistic megapixel samples in approximately 2 seconds on a consumer-grade GPU (GTX 1080Ti). In general, the framework is flexible: supporting an arbitrary number of sampling steps, sample-wise self-stopping, self-correction capabilities, conditional generation, and a NAR formulation that allows for arbitrary inpainting masks. We obtain FID scores of 10.56 on FFHQ256 - close to the original VQ-GAN in less than half the sampling steps - and 21.85 on FFHQ1024 in only 100 sampling steps.
Defending Backdoor Attacks on Vision Transformer via Patch Processing
- Authors: Khoa D. Doan, Yingjie Lao, Peng Yang, Ping Li
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2206.12381
- Pdf link: https://arxiv.org/pdf/2206.12381
- Abstract Vision Transformers (ViTs) have a radically different architecture with significantly less inductive bias than Convolutional Neural Networks. Along with the improvement in performance, security and robustness of ViTs are also of great importance to study. In contrast to many recent works that exploit the robustness of ViTs against adversarial examples, this paper investigates a representative causative attack, i.e., backdoor. We first examine the vulnerability of ViTs against various backdoor attacks and find that ViTs are also quite vulnerable to existing attacks. However, we observe that the clean-data accuracy and backdoor attack success rate of ViTs respond distinctively to patch transformations before the positional encoding. Then, based on this finding, we propose an effective method for ViTs to defend both patch-based and blending-based trigger backdoor attacks via patch processing. The performances are evaluated on several benchmark datasets, including CIFAR10, GTSRB, and TinyImageNet, which show the proposed novel defense is very successful in mitigating backdoor attacks for ViTs. To the best of our knowledge, this paper presents the first defensive strategy that utilizes a unique characteristic of ViTs against backdoor attacks.
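The defense itself is not specified in the abstract; as an example of the kind of patch processing it refers to, a simple patch-shuffling transform applied before patch embedding might look like this (patch size and shuffle ratio are assumptions, and the paper's actual procedure may differ):

```python
# Hedged sketch: randomly permute a fraction of non-overlapping image patches.
import torch

def shuffle_patches(images, patch_size=16, shuffle_ratio=0.25):
    # images: (B, C, H, W); split into non-overlapping patches and permute a subset
    b, c, h, w = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.reshape(b, c, -1, patch_size, patch_size)        # (B, C, N, p, p)
    n = patches.size(2)
    k = int(n * shuffle_ratio)
    idx = torch.randperm(n)[:k]
    patches[:, :, idx] = patches[:, :, idx[torch.randperm(k)]]         # permute the chosen patches
    grid = h // patch_size
    patches = patches.reshape(b, c, grid, grid, patch_size, patch_size)
    return patches.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)       # reassemble the image

out = shuffle_patches(torch.randn(2, 3, 224, 224))
```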
QAGAN: Adversarial Approach To Learning Domain Invariant Language Features
- Authors: Shubham Shrivastava, Kaiyue Wang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2206.12388
- Pdf link: https://arxiv.org/pdf/2206.12388
- Abstract Training models that are robust to data domain shift has gained increasing interest in both academia and industry. Question-answering language models, addressing one of the typical problems in Natural Language Processing (NLP) research, have received much success with the advent of large transformer models. However, existing approaches mostly work under the assumption that data is drawn from the same distribution during training and testing, which is unrealistic and non-scalable in the wild. In this paper, we explore an adversarial training approach to learning domain-invariant features so that language models can generalize well to out-of-domain datasets. We also inspect various other ways to boost our model performance, including data augmentation by paraphrasing sentences, conditioning end-of-answer-span prediction on the start word, and a carefully designed annealing function. Our initial results show that in combination with these methods, we are able to achieve a $15.2\%$ improvement in EM score and a $5.6\%$ boost in F1 score on the out-of-domain validation dataset over the baseline. We also dissect our model outputs and visualize the model hidden-states by projecting them onto a lower-dimensional space, and discover that our specific adversarial training approach indeed encourages the model to learn domain-invariant embeddings and bring them closer together in the multi-dimensional space.
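The abstract describes adversarial training for domain-invariant features; a standard way to realize this is a gradient reversal layer (as in DANN), sketched below as an assumption about the general mechanism rather than QAGAN's exact discriminator/QA setup:

```python
# Hedged sketch: gradient reversal layer for adversarial domain-invariant features.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # flip the gradient sign so the encoder learns to *fool* the domain classifier
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

encoder = torch.nn.Linear(128, 64)
domain_clf = torch.nn.Linear(64, 2)              # predicts which domain a feature came from
feats = encoder(torch.randn(8, 128))
domain_logits = domain_clf(grad_reverse(feats))  # encoder receives reversed gradients
loss = torch.nn.functional.cross_entropy(domain_logits, torch.randint(0, 2, (8,)))
loss.backward()
```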
Keyword: autonomous driving
Adversarial Zoom Lens: A Novel Physical-World Attack to DNNs
- Authors: Chengyin Hu, Weiwen Shi
- Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2206.12251
- Pdf link: https://arxiv.org/pdf/2206.12251
- Abstract Although deep neural networks (DNNs) are known to be fragile, no one has studied the effects of zooming in and out of images in the physical world on DNN performance. In this paper, we demonstrate a novel physical adversarial attack technique called Adversarial Zoom Lens (AdvZL), which uses a zoom lens to zoom in and out of pictures of the physical world, fooling DNNs without changing the characteristics of the target object. The proposed method is so far the only adversarial attack technique that attacks DNNs without adding a physical adversarial perturbation. In a digital environment, we construct a dataset based on AdvZL to verify the adversarial effect of equal-scale enlarged images on DNNs. In the physical environment, we manipulate the zoom lens to zoom in and out of the target object and generate adversarial samples. The experimental results demonstrate the effectiveness of AdvZL in both digital and physical environments. We further analyze the adversarial effect of the proposed dataset on improved DNNs. On the other hand, we provide a guideline for defense against AdvZL by means of adversarial training. Finally, we discuss the threat posed by the proposed approach to future autonomous driving, as well as variant attack ideas similar to the proposed attack.
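The physical attack uses a real zoom lens; its digital analogue described in the abstract (equal-scale enlargement) can be approximated with a crop-and-resize, e.g. to probe how a classifier's prediction changes with zoom factor (the model choice, zoom factors, and recent-torchvision API below are arbitrary assumptions):

```python
# Hedged sketch: digitally simulate zoom in/out and check prediction stability.
import torch
import torchvision.transforms.functional as TF
from torchvision.models import resnet18

def zoom(img, factor):
    # img: (C, H, W); factor > 1 zooms in (crop then resize), factor < 1 zooms out (pad then resize)
    _, h, w = img.shape
    if factor >= 1:
        ch, cw = int(h / factor), int(w / factor)
        img = TF.center_crop(img, [ch, cw])
    else:
        pad_h, pad_w = int(h * (1 / factor - 1) / 2), int(w * (1 / factor - 1) / 2)
        img = TF.pad(img, [pad_w, pad_h, pad_w, pad_h])
    return TF.resize(img, [h, w], antialias=True)

model = resnet18(weights=None).eval()
img = torch.rand(3, 224, 224)
preds = {f: model(zoom(img, f).unsqueeze(0)).argmax(1).item() for f in (0.5, 1.0, 2.0)}
print(preds)   # a change in prediction across zoom factors signals vulnerability
```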