arxiv-daily
arxiv-daily copied to clipboard
New submissions for Fri, 30 Sep 22
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
Out-of-Distribution Detection for LiDAR-based 3D Object Detection
- Authors: Chengjie Huang, Van Duong Nguyen, Vahdat Abdelzad, Christopher Gus Mannes, Luke Rowe, Benjamin Therien, Rick Salay, Krzysztof Czarnecki
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2209.14435
- Pdf link: https://arxiv.org/pdf/2209.14435
- Abstract 3D object detection is an essential part of automated driving, and deep neural networks (DNNs) have achieved state-of-the-art performance for this task. However, deep models are notorious for assigning high confidence scores to out-of-distribution (OOD) inputs, that is, inputs that are not drawn from the training distribution. Detecting OOD inputs is challenging and essential for the safe deployment of models. OOD detection has been studied extensively for the classification task, but it has not received enough attention for the object detection task, specifically LiDAR-based 3D object detection. In this paper, we focus on the detection of OOD inputs for LiDAR-based 3D object detection. We formulate what OOD inputs mean for object detection and propose to adapt several OOD detection methods for object detection. We accomplish this by our proposed feature extraction method. To evaluate OOD detection methods, we develop a simple but effective technique of generating OOD objects for a given object detection model. Our evaluation based on the KITTI dataset shows that different OOD detection methods have biases toward detecting specific OOD objects. It emphasizes the importance of combined OOD detection methods and more research in this direction.
CompNet: A Designated Model to Handle Combinations of Images and Designed features
- Authors: Bowen Qiu, Daniela Raicu, Jacob Furst, Roselyne Tchoua
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2209.14454
- Pdf link: https://arxiv.org/pdf/2209.14454
- Abstract Convolutional neural networks (CNNs) are one of the most popular models of Artificial Neural Networks (ANN)s in Computer Vision (CV). A variety of CNN-based structures were developed by researchers to solve problems like image classification, object detection, and image similarity measurement. Although CNNs have shown their value in most cases, they still have a downside: they easily overfit when there are not enough samples in the dataset. Most medical image datasets are examples of such a dataset. Additionally, many datasets also contain both designed features and images, but CNNs can only deal with images directly. This represents a missed opportunity to leverage additional information. For this reason, we propose a new structure of CNN-based model: CompNet, a composite convolutional neural network. This is a specially designed neural network that accepts combinations of images and designed features as input in order to leverage all available information. The novelty of this structure is that it uses learned features from images to weight designed features in order to gain all information from both images and designed features. With the use of this structure on classification tasks, the results indicate that our approach has the capability to significantly reduce overfitting. Furthermore, we also found several similar approaches proposed by other researchers that can combine images and designed features. To make comparison, we first applied those similar approaches on LIDC and compared the results with the CompNet results, then we applied our CompNet on the datasets that those similar approaches originally used in their works and compared the results with the results they proposed in their papers. All these comparison results showed that our model outperformed those similar approaches on classification tasks either on LIDC dataset or on their proposed datasets.
Self-Configurable Stabilized Real-Time Detection Learning for Autonomous Driving Applications
- Authors: Won Joon Yun, Soohyun Park, Joongheon Kim, David Mohaisen
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2209.14525
- Pdf link: https://arxiv.org/pdf/2209.14525
- Abstract Guaranteeing real-time and accurate object detection simultaneously is paramount in autonomous driving environments. However, the existing object detection neural network systems are characterized by a tradeoff between computation time and accuracy, making it essential to optimize such a tradeoff. Fortunately, in many autonomous driving environments, images come in a continuous form, providing an opportunity to use optical flow. In this paper, we improve the performance of an object detection neural network utilizing optical flow estimation. In addition, we propose a Lyapunov optimization framework for time-average performance maximization subject to stability. It adaptively determines whether to use optical flow to suit the dynamic vehicle environment, thereby ensuring the vehicle's queue stability and the time-average maximum performance simultaneously. To verify the key ideas, we conduct numerical experiments with various object detection neural networks and optical flow estimation networks. In addition, we demonstrate the self-configurable stabilized detection with YOLOv3-tiny and FlowNet2-S, which are the real-time object detection network and an optical flow estimation network, respectively. In the demonstration, our proposed framework improves the accuracy by 3.02%, the number of detected objects by 59.6%, and the queue stability for computing capabilities.
Access Control with Encrypted Feature Maps for Object Detection Models
- Authors: Teru Nagamori, Hiroki Ito, AprilPyone MaungMaung, Hitoshi Kiya
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.14831
- Pdf link: https://arxiv.org/pdf/2209.14831
- Abstract In this paper, we propose an access control method with a secret key for object detection models for the first time so that unauthorized users without a secret key cannot benefit from the performance of trained models. The method enables us not only to provide a high detection performance to authorized users but to also degrade the performance for unauthorized users. The use of transformed images was proposed for the access control of image classification models, but these images cannot be used for object detection models due to performance degradation. Accordingly, in this paper, selected feature maps are encrypted with a secret key for training and testing models, instead of input images. In an experiment, the protected models allowed authorized users to obtain almost the same performance as that of non-protected models but also with robustness against unauthorized access without a key.
GDIP: Gated Differentiable Image Processing for Object-Detection in Adverse Conditions
- Authors: Sanket Kalwar, Dhruv Patel, Aakash Aanegola, Krishna Reddy Konda, Sourav Garg, K Madhava Krishna
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2209.14922
- Pdf link: https://arxiv.org/pdf/2209.14922
- Abstract Detecting objects under adverse weather and lighting conditions is crucial for the safe and continuous operation of an autonomous vehicle, and remains an unsolved problem. We present a Gated Differentiable Image Processing (GDIP) block, a domain-agnostic network architecture, which can be plugged into existing object detection networks (e.g., Yolo) and trained end-to-end with adverse condition images such as those captured under fog and low lighting. Our proposed GDIP block learns to enhance images directly through the downstream object detection loss. This is achieved by learning parameters of multiple image pre-processing (IP) techniques that operate concurrently, with their outputs combined using weights learned through a novel gating mechanism. We further improve GDIP through a multi-stage guidance procedure for progressive image enhancement. Finally, trading off accuracy for speed, we propose a variant of GDIP that can be used as a regularizer for training Yolo, which eliminates the need for GDIP-based image enhancement during inference, resulting in higher throughput and plausible real-world deployment. We demonstrate significant improvement in detection performance over several state-of-the-art methods through quantitative and qualitative studies on synthetic datasets such as PascalVOC, and real-world foggy (RTTS) and low-lighting (ExDark) datasets.
DirectTracker: 3D Multi-Object Tracking Using Direct Image Alignment and Photometric Bundle Adjustment
- Authors: Mariia Gladkova, Nikita Korobov, Nikolaus Demmel, Aljoša Ošep, Laura Leal-Taixé, Daniel Cremers
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.14965
- Pdf link: https://arxiv.org/pdf/2209.14965
- Abstract Direct methods have shown excellent performance in the applications of visual odometry and SLAM. In this work we propose to leverage their effectiveness for the task of 3D multi-object tracking. To this end, we propose DirectTracker, a framework that effectively combines direct image alignment for the short-term tracking and sliding-window photometric bundle adjustment for 3D object detection. Object proposals are estimated based on the sparse sliding-window pointcloud and further refined using an optimization-based cost function that carefully combines 3D and 2D cues to ensure consistency in image and world space. We propose to evaluate 3D tracking using the recently introduced higher-order tracking accuracy (HOTA) metric and the generalized intersection over union similarity measure to mitigate the limitations of the conventional use of intersection over union for the evaluation of vision-based trackers. We perform evaluation on the KITTI Tracking benchmark for the Car class and show competitive performance in tracking objects both in 2D and 3D.
Dilated Neighborhood Attention Transformer
- Authors: Ali Hassani, Humphrey Shi
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.15001
- Pdf link: https://arxiv.org/pdf/2209.15001
- Abstract Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over attention-based baselines such as NAT and Swin, as well as modern convolutional baseline ConvNeXt. Our Large model is ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation, and faster in throughput. We believe combinations of NA and DiNA have the potential to empower various tasks beyond those presented in this paper. To support and encourage research in this direction, in vision and beyond, we open-source our project at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer.
Keyword: transformer
UNesT: Local Spatial Representation Learning with Hierarchical Transformer for Efficient Medical Segmentation
- Authors: Xin Yu, Qi Yang, Yinchi Zhou, Leon Y. Cai, Riqiang Gao, Ho Hin Lee, Thomas Li, Shunxing Bao, Zhoubing Xu, Thomas A. Lasko, Richard G. Abramson, Zizhao Zhang, Yuankai Huo, Bennett A. Landman, Yucheng Tang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.14378
- Pdf link: https://arxiv.org/pdf/2209.14378
- Abstract Transformer-based models, capable of learning better global dependencies, have recently demonstrated exceptional representation learning capabilities in computer vision and medical image analysis. Transformer reformats the image into separate patches and realize global communication via the self-attention mechanism. However, positional information between patches is hard to preserve in such 1D sequences, and loss of it can lead to sub-optimal performance when dealing with large amounts of heterogeneous tissues of various sizes in 3D medical image segmentation. Additionally, current methods are not robust and efficient for heavy-duty medical segmentation tasks such as predicting a large number of tissue classes or modeling globally inter-connected tissues structures. Inspired by the nested hierarchical structures in vision transformer, we proposed a novel 3D medical image segmentation method (UNesT), employing a simplified and faster-converging transformer encoder design that achieves local communication among spatially adjacent patch sequences by aggregating them hierarchically. We extensively validate our method on multiple challenging datasets, consisting anatomies of 133 structures in brain, 14 organs in abdomen, 4 hierarchical components in kidney, and inter-connected kidney tumors). We show that UNesT consistently achieves state-of-the-art performance and evaluate its generalizability and data efficiency. Particularly, the model achieves whole brain segmentation task complete ROI with 133 tissue classes in single network, outperforms prior state-of-the-art method SLANT27 ensembled with 27 network tiles, our model performance increases the mean DSC score of the publicly available Colin and CANDI dataset from 0.7264 to 0.7444 and from 0.6968 to 0.7025, respectively.
Downstream Datasets Make Surprisingly Good Pretraining Corpora
- Authors: Kundan Krishna, Saurabh Garg, Jeffrey P. Bigham, Zachary C. Lipton
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.14389
- Pdf link: https://arxiv.org/pdf/2209.14389
- Abstract For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning. In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus (despite using around $10\times$--$500\times$ less data), outperforming the latter on $7$ and $5$ datasets, respectively. Surprisingly, these task-specific pretrained models often perform well on other tasks, including the GLUE benchmark. Our results suggest that in many scenarios, performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the incorporation of massive datasets. These findings are especially relevant in light of concerns about intellectual property and offensive content in web-scale pretraining data.
Multi-stage Information Retrieval for Vietnamese Legal Texts
- Authors: Nhat-Minh Pham, Ha-Thanh Nguyen, Trong-Hop Do
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.14494
- Pdf link: https://arxiv.org/pdf/2209.14494
- Abstract This study deals with the problem of information retrieval (IR) for Vietnamese legal texts. Despite being well researched in many languages, information retrieval has still not received much attention from the Vietnamese research community. This is especially true for the case of legal documents, which are hard to process. This study proposes a new approach for information retrieval for Vietnamese legal documents using sentence-transformer. Besides, various experiments are conducted to make comparisons between different transformer models, ranking scores, syllable-level, and word-level training. The experiment results show that the proposed model outperforms models used in current research on information retrieval for Vietnamese documents.
DiGress: Discrete Denoising diffusion for graph generation
- Authors: Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, Pascal Frossard
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.14734
- Pdf link: https://arxiv.org/pdf/2209.14734
- Abstract This work introduces DiGress, a discrete denoising diffusion model for generating graphs with categorical node and edge attributes. Our model defines a diffusion process that progressively edits a graph with noise (adding or removing edges, changing the categories), and a graph transformer network that learns to revert this process. With these two ingredients in place, we reduce distribution learning over graphs to a simple sequence of classification tasks. We further improve sample quality by proposing a new Markovian noise model that preserves the marginal distribution of node and edge types during diffusion, and by adding auxiliary graph-theoretic features derived from the noisy graph at each diffusion step. Finally, we propose a guidance procedure for conditioning the generation on graph-level features. Overall, DiGress achieves state-of-the-art performance on both molecular and non-molecular datasets, with up to 3x validity improvement on a dataset of planar graphs. In particular, it is the first model that scales to the large GuacaMol dataset containing 1.3M drug-like molecules without using a molecule-specific representation such as SMILES or fragments.
Named Entity Recognition in Industrial Tables using Tabular Language Models
- Authors: Aneta Koleva, Martin Ringsquandl, Mark Buckley, Rakebul Hasan, Volker Tresp
- Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.14812
- Pdf link: https://arxiv.org/pdf/2209.14812
- Abstract Specialized transformer-based models for encoding tabular data have gained interest in academia. Although tabular data is omnipresent in industry, applications of table transformers are still missing. In this paper, we study how these models can be applied to an industrial Named Entity Recognition (NER) problem where the entities are mentioned in tabular-structured spreadsheets. The highly technical nature of spreadsheets as well as the lack of labeled data present major challenges for fine-tuning transformer-based models. Therefore, we develop a dedicated table data augmentation strategy based on available domain-specific knowledge graphs. We show that this boosts performance in our low-resource scenario considerably. Further, we investigate the benefits of tabular structure as inductive bias compared to tables as linearized sequences. Our experiments confirm that a table transformer outperforms other baselines and that its tabular inductive bias is vital for convergence of transformer-based models.
Towards Lightweight Black-Box Attacks against Deep Neural Networks
- Authors: Chenghao Sun, Yonggang Zhang, Wan Chaoqun, Qizhou Wang, Ya Li, Tongliang Liu, Bo Han, Xinmei Tian
- Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2209.14826
- Pdf link: https://arxiv.org/pdf/2209.14826
- Abstract Black-box attacks can generate adversarial examples without accessing the parameters of target model, largely exacerbating the threats of deployed deep neural networks (DNNs). However, previous works state that black-box attacks fail to mislead target models when their training data and outputs are inaccessible. In this work, we argue that black-box attacks can pose practical attacks in this extremely restrictive scenario where only several test samples are available. Specifically, we find that attacking the shallow layers of DNNs trained on a few test samples can generate powerful adversarial examples. As only a few samples are required, we refer to these attacks as lightweight black-box attacks. The main challenge to promoting lightweight attacks is to mitigate the adverse impact caused by the approximation error of shallow layers. As it is hard to mitigate the approximation error with few available samples, we propose Error TransFormer (ETF) for lightweight attacks. Namely, ETF transforms the approximation error in the parameter space into a perturbation in the feature space and alleviates the error by disturbing features. In experiments, lightweight black-box attacks with the proposed ETF achieve surprising results. For example, even if only 1 sample per category available, the attack success rate in lightweight black-box attacks is only about 3% lower than that of the black-box attacks with complete training data.
Lightweight Monocular Depth Estimation with an Edge Guided Network
- Authors: Xingshuai Dong, Matthew A. Garratt, Sreenatha G. Anavatti, Hussein A. Abbass, Junyu Dong
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2209.14829
- Pdf link: https://arxiv.org/pdf/2209.14829
- Abstract Monocular depth estimation is an important task that can be applied to many robotic applications. Existing methods focus on improving depth estimation accuracy via training increasingly deeper and wider networks, however these suffer from large computational complexity. Recent studies found that edge information are important cues for convolutional neural networks (CNNs) to estimate depth. Inspired by the above observations, we present a novel lightweight Edge Guided Depth Estimation Network (EGD-Net) in this study. In particular, we start out with a lightweight encoder-decoder architecture and embed an edge guidance branch which takes as input image gradients and multi-scale feature maps from the backbone to learn the edge attention features. In order to aggregate the context information and edge attention features, we design a transformer-based feature aggregation module (TRFA). TRFA captures the long-range dependencies between the context information and edge attention features through cross-attention mechanism. We perform extensive experiments on the NYU depth v2 dataset. Experimental results show that the proposed method runs about 96 fps on a Nvidia GTX 1080 GPU whilst achieving the state-of-the-art performance in terms of accuracy.
ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition
- Authors: Martin Radfar, Rohit Barnwal, Rupak Vignesh Swaminathan, Feng-Ju Chang, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris
- Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2209.14868
- Pdf link: https://arxiv.org/pdf/2209.14868
- Abstract The recurrent neural network transducer (RNN-T) is a prominent streaming end-to-end (E2E) ASR technology. In RNN-T, the acoustic encoder commonly consists of stacks of LSTMs. Very recently, as an alternative to LSTM layers, the Conformer architecture was introduced where the encoder of RNN-T is replaced with a modified Transformer encoder composed of convolutional layers at the frontend and between attention layers. In this paper, we introduce a new streaming ASR model, Convolutional Augmented Recurrent Neural Network Transducers (ConvRNN-T) in which we augment the LSTM-based RNN-T with a novel convolutional frontend consisting of local and global context CNN encoders. ConvRNN-T takes advantage of causal 1-D convolutional layers, squeeze-and-excitation, dilation, and residual blocks to provide both global and local audio context representation to LSTM layers. We show ConvRNN-T outperforms RNN-T, Conformer, and ContextNet on Librispeech and in-house data. In addition, ConvRNN-T offers less computational complexity compared to Conformer. ConvRNN-T's superior accuracy along with its low footprint make it a promising candidate for on-device streaming ASR technologies.
Human Motion Diffusion Model
- Authors: Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Amit H. Bermano, Daniel Cohen-Or
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
- Arxiv link: https://arxiv.org/abs/2209.14916
- Pdf link: https://arxiv.org/pdf/2209.14916
- Abstract Natural and expressive human motion generation is the holy grail of computer animation. It is a challenging task, due to the diversity of possible motion, human perceptual sensitivity to it, and the difficulty of accurately describing it. Therefore, current generative solutions are either low-quality or limited in expressiveness. Diffusion models, which have already shown remarkable generative capabilities in other domains, are promising candidates for human motion due to their many-to-many nature, but they tend to be resource hungry and hard to control. In this paper, we introduce Motion Diffusion Model (MDM), a carefully adapted classifier-free diffusion-based generative model for the human motion domain. MDM is transformer-based, combining insights from motion generation literature. A notable design-choice is the prediction of the sample, rather than the noise, in each diffusion step. This facilitates the use of established geometric losses on the locations and velocities of the motion, such as the foot contact loss. As we demonstrate, MDM is a generic approach, enabling different modes of conditioning, and different generation tasks. We show that our model is trained with lightweight resources and yet achieves state-of-the-art results on leading benchmarks for text-to-motion and action-to-motion. https://guytevet.github.io/mdm-page/ .
Transfer Learning with Pretrained Remote Sensing Transformers
- Authors: Anthony Fuller, Koreen Millard, James R. Green
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.14969
- Pdf link: https://arxiv.org/pdf/2209.14969
- Abstract Although the remote sensing (RS) community has begun to pretrain transformers (intended to be fine-tuned on RS tasks), it is unclear how these models perform under distribution shifts. Here, we pretrain a new RS transformer--called SatViT-V2--on 1.3 million satellite-derived RS images, then fine-tune it (along with five other models) to investigate how it performs on distributions not seen during training. We split an expertly labeled land cover dataset into 14 datasets based on source biome. We train each model on each biome separately and test them on all other biomes. In all, this amounts to 1638 biome transfer experiments. After fine-tuning, we find that SatViT-V2 outperforms SatViT-V1 by 3.1% on in-distribution (matching biomes) and 2.8% on out-of-distribution (mismatching biomes) data. Additionally, we find that initializing fine-tuning from the linear probed solution (i.e., leveraging LPFT [1]) improves SatViT-V2's performance by another 1.2% on in-distribution and 2.4% on out-of-distribution data. Next, we find that pretrained RS transformers are better calibrated under distribution shifts than non-pretrained models and leveraging LPFT results in further improvements in model calibration. Lastly, we find that five measures of distribution shift are moderately correlated with biome transfer performance. We share code and pretrained model weights. (https://github.com/antofuller/SatViT)
Transformer Meets Boundary Value Inverse Problems
- Authors: Ruchi Guo, Shuhao Cao, Long Chen
- Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
- Arxiv link: https://arxiv.org/abs/2209.14977
- Pdf link: https://arxiv.org/pdf/2209.14977
- Abstract A Transformer-based deep direct sampling method is proposed for solving a class of boundary value inverse problem. A real-time reconstruction is achieved by evaluating the learned inverse operator between carefully designed data and the reconstructed images. An effort is made to give a case study for a fundamental and critical question: whether and how one can benefit from the theoretical structure of a mathematical problem to develop task-oriented and structure-conforming deep neural network? Inspired by direct sampling methods for inverse problems, the 1D boundary data are preprocessed by a partial differential equation-based feature map to yield 2D harmonic extensions in different frequency input channels. Then, by introducing learnable non-local kernel, the approximation of direct sampling is recast to a modified attention mechanism. The proposed method is then applied to electrical impedance tomography, a well-known severely ill-posed nonlinear inverse problem. The new method achieves superior accuracy over its predecessors and contemporary operator learners, as well as shows robustness with respect to noise. This research shall strengthen the insights that the attention mechanism, despite being invented for natural language processing tasks, offers great flexibility to be modified in conformity with the a priori mathematical knowledge, which ultimately leads to the design of more physics-compatible neural architectures.
Dilated Neighborhood Attention Transformer
- Authors: Ali Hassani, Humphrey Shi
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.15001
- Pdf link: https://arxiv.org/pdf/2209.15001
- Abstract Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over attention-based baselines such as NAT and Swin, as well as modern convolutional baseline ConvNeXt. Our Large model is ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation, and faster in throughput. We believe combinations of NA and DiNA have the potential to empower various tasks beyond those presented in this paper. To support and encourage research in this direction, in vision and beyond, we open-source our project at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer.
Effective Vision Transformer Training: A Data-Centric Perspective
- Authors: Benjia Zhou, Pichao Wang, Jun Wan, Yanyan Liang, Fan Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.15006
- Pdf link: https://arxiv.org/pdf/2209.15006
-
Abstract
Vision Transformers (ViTs) have shown promising performance compared with Convolutional Neural Networks (CNNs), but the training of ViTs is much harder than CNNs. In this paper, we define several metrics, including Dynamic Data Proportion (DDP) and Knowledge Assimilation Rate (KAR), to investigate the training process, and divide it into three periods accordingly: formation, growth and exploration. In particular, at the last stage of training, we observe that only a tiny portion of training examples is used to optimize the model. Given the data-hungry nature of ViTs, we thus ask a simple but important question: is it possible to provide abundant
effective'' training examples at EVERY stage of training? To address this issue, we need to address two critical questions, \ie, how to measure the
effectiveness'' of individual training examples, and how to systematically generate enough number ofeffective'' examples when they are running out. To answer the first question, we find that the
difficulty'' of training samples can be adopted as an indicator to measure theeffectiveness'' of training samples. To cope with the second question, we propose to dynamically adjust the
difficulty'' distribution of the training data in these evolution stages. To achieve these two purposes, we propose a novel data-centric ViT training framework to dynamically measure thedifficulty'' of training samples and generate
effective'' samples for models at different training stages. Furthermore, to further enlarge the number of ``effective'' samples and alleviate the overfitting problem in the late training stage of ViTs, we propose a patch-level erasing strategy dubbed PatchErasing. Extensive experiments demonstrate the effectiveness of the proposed data-centric ViT training framework and techniques.
Keyword: scene understanding
There is no result
Keyword: visual reasoning
There is no result