arxiv-daily
arxiv-daily copied to clipboard
New submissions for Mon, 17 Oct 22
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds
- Authors: Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, Dragomir Anguelov
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.07372
- Pdf link: https://arxiv.org/pdf/2210.07372
- Abstract 3D object detection in point clouds is a core component for modern robotics and autonomous driving systems. A key challenge in 3D object detection comes from the inherent sparse nature of point occupancy within the 3D scene. In this paper, we propose Sparse Window Transformer (SWFormer ), a scalable and accurate model for 3D object detection, which can take full advantage of the sparsity of point clouds. Built upon the idea of window-based Transformers, SWFormer converts 3D points into sparse voxels and windows, and then processes these variable-length sparse windows efficiently using a bucketing scheme. In addition to self-attention within each spatial window, our SWFormer also captures cross-window correlation with multi-scale feature fusion and window shifting operations. To further address the unique challenge of detecting 3D objects accurately from sparse features, we propose a new voxel diffusion technique. Experimental results on the Waymo Open Dataset show our SWFormer achieves state-of-the-art 73.36 L2 mAPH on vehicle and pedestrian for 3D object detection on the official test set, outperforming all previous single-stage and two-stage models, while being much more efficient.
SAILOR: Scaling Anchors via Insights into Latent Object
- Authors: Dušan Malić, Christian Fruhwirth-Reisinger, Horst Possegger, Horst Bischof
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.07811
- Pdf link: https://arxiv.org/pdf/2210.07811
- Abstract LiDAR 3D object detection models are inevitably biased towards their training dataset. The detector clearly exhibits this bias when employed on a target dataset, particularly towards object sizes. However, object sizes vary heavily between domains due to, for instance, different labeling policies or geographical locations. State-of-the-art unsupervised domain adaptation approaches outsource methods to overcome the object size bias. Mainstream size adaptation approaches exploit target domain statistics, contradicting the original unsupervised assumption. Our novel unsupervised anchor calibration method addresses this limitation. Given a model trained on the source data, we estimate the optimal target anchors in a completely unsupervised manner. The main idea stems from an intuitive observation: by varying the anchor sizes for the target domain, we inevitably introduce noise or even remove valuable object cues. The latent object representation, perturbed by the anchor size, is closest to the learned source features only under the optimal target anchors. We leverage this observation for anchor size optimization. Our experimental results show that, without any retraining, we achieve competitive results even compared to state-of-the-art weakly-supervised size adaptation approaches. In addition, our anchor calibration can be combined with such existing methods, making them completely unsupervised.
MOVE: Unsupervised Movable Object Segmentation and Detection
- Authors: Adam Bielski, Paolo Favaro
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.07920
- Pdf link: https://arxiv.org/pdf/2210.07920
- Abstract We introduce MOVE, a novel method to segment objects without any form of supervision. MOVE exploits the fact that foreground objects can be shifted locally relative to their initial position and result in realistic (undistorted) new images. This property allows us to train a segmentation model on a dataset of images without annotation and to achieve state of the art (SotA) performance on several evaluation datasets for unsupervised salient object detection and segmentation. In unsupervised single object discovery, MOVE gives an average CorLoc improvement of 7.2% over the SotA, and in unsupervised class-agnostic object detection it gives a relative AP improvement of 53% on average. Our approach is built on top of self-supervised features (e.g. from DINO or MAE), an inpainting network (based on the Masked AutoEncoder) and adversarial training.
Keyword: transformer
Can Demographic Factors Improve Text Classification? Revisiting Demographic Adaptation in the Age of Transformers
- Authors: Chia-Chien Hung, Anne Lauscher, Dirk Hovy, Simone Paolo Ponzetto, Goran Glavaš
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2210.07362
- Pdf link: https://arxiv.org/pdf/2210.07362
- Abstract Demographic factors (e.g., gender or age) shape our language. Previous work showed that incorporating demographic factors can consistently improve performance for various NLP tasks with traditional NLP models. In this work, we investigate whether these previous findings still hold with state-of-the-art pretrained Transformer-based language models (PLMs). We use three common specialization methods proven effective for incorporating external knowledge into pretrained Transformers (e.g., domain-specific or geographic knowledge). We adapt the language representations for the demographic dimensions of gender and age, using continuous language modeling and dynamic multi-task learning for adaptation, where we couple language modeling objectives with the prediction of demographic classes. Our results when employing a multilingual PLM show substantial performance gains across four languages (English, German, French, and Danish), which is consistent with the results of previous work. However, controlling for confounding factors -- primarily domain and language proficiency of Transformer-based PLMs -- shows that downstream performance gains from our demographic adaptation do not actually stem from demographic knowledge. Our results indicate that demographic specialization of PLMs, while holding promise for positive societal impact, still represents an unsolved problem for (modern) NLP.
SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds
- Authors: Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, Dragomir Anguelov
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.07372
- Pdf link: https://arxiv.org/pdf/2210.07372
- Abstract 3D object detection in point clouds is a core component for modern robotics and autonomous driving systems. A key challenge in 3D object detection comes from the inherent sparse nature of point occupancy within the 3D scene. In this paper, we propose Sparse Window Transformer (SWFormer ), a scalable and accurate model for 3D object detection, which can take full advantage of the sparsity of point clouds. Built upon the idea of window-based Transformers, SWFormer converts 3D points into sparse voxels and windows, and then processes these variable-length sparse windows efficiently using a bucketing scheme. In addition to self-attention within each spatial window, our SWFormer also captures cross-window correlation with multi-scale feature fusion and window shifting operations. To further address the unique challenge of detecting 3D objects accurately from sparse features, we propose a new voxel diffusion technique. Experimental results on the Waymo Open Dataset show our SWFormer achieves state-of-the-art 73.36 L2 mAPH on vehicle and pedestrian for 3D object detection on the official test set, outperforming all previous single-stage and two-stage models, while being much more efficient.
Behavior Cloned Transformers are Neurosymbolic Reasoners
- Authors: Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, Prithviraj Ammanabrolu
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2210.07382
- Pdf link: https://arxiv.org/pdf/2210.07382
- Abstract In this work, we explore techniques for augmenting interactive agents with information from symbolic modules, much like humans use tools like calculators and GPS systems to assist with arithmetic and navigation. We test our agent's abilities in text games -- challenging benchmarks for evaluating the multi-step reasoning abilities of game agents in grounded, language-based environments. Our experimental study indicates that injecting the actions from these symbolic modules into the action space of a behavior cloned transformer agent increases performance on four text game benchmarks that test arithmetic, navigation, sorting, and common sense reasoning by an average of 22%, allowing an agent to reach the highest possible performance on unseen games. This action injection technique is easily extended to new agents, environments, and symbolic modules.
Using Graph Algorithms to Pretrain Graph Completion Transformers
- Authors: Jonathan Pilault, Michael Galkin, Bahare Fatemi, Perouz Taslakian, David Vasquez, Christopher Pal
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.07453
- Pdf link: https://arxiv.org/pdf/2210.07453
- Abstract Recent work on Graph Neural Networks has demonstrated that self-supervised pretraining can further enhance performance on downstream graph, link, and node classification tasks. However, the efficacy of pretraining tasks has not been fully investigated for downstream large knowledge graph completion tasks. Using a contextualized knowledge graph embedding approach, we investigate five different pretraining signals, constructed using several graph algorithms and no external data, as well as their combination. We leverage the versatility of our Transformer-based model to explore graph structure generation pretraining tasks, typically inapplicable to most graph embedding methods. We further propose a new path-finding algorithm guided by information gain and find that it is the best-performing pretraining task across three downstream knowledge graph completion datasets. In a multitask setting that combines all pretraining tasks, our method surpasses some of the latest and strong performing knowledge graph embedding methods on all metrics for FB15K-237, on MRR and Hit@1 for WN18RR and on MRR and hit@10 for JF17K (a knowledge hypergraph dataset).
Adaptable Claim Rewriting with Offline Reinforcement Learning for Effective Misinformation Discovery
- Authors: Ashkan Kazemi, Artem Abzaliev, Naihao Deng, Rui Hou, Davis Liang, Scott A. Hale, Verónica Pérez-Rosas, Rada Mihalcea
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2210.07467
- Pdf link: https://arxiv.org/pdf/2210.07467
- Abstract We propose a novel system to help fact-checkers formulate search queries for known misinformation claims and effectively search across multiple social media platforms. We introduce an adaptable rewriting strategy, where editing actions (e.g., swap a word with its synonym; change verb tense into present simple) for queries containing claims are automatically learned through offline reinforcement learning. Specifically, we use a decision transformer to learn a sequence of editing actions that maximize query retrieval metrics such as mean average precision. Through several experiments, we show that our approach can increase the effectiveness of the queries by up to 42% relatively, while producing editing action sequences that are human readable, thus making the system easy to use and explain.
STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition
- Authors: Dasom Ahn, Sangwon Kim, Hyunsu Hong, Byoung Chul Ko
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2210.07503
- Pdf link: https://arxiv.org/pdf/2210.07503
- Abstract In action recognition, although the combination of spatio-temporal videos and skeleton features can improve the recognition performance, a separate model and balancing feature representation for cross-modal data are required. To solve these problems, we propose Spatio-TemporAl cRoss (STAR)-transformer, which can effectively represent two cross-modal features as a recognizable vector. First, from the input video and skeleton sequence, video frames are output as global grid tokens and skeletons are output as joint map tokens, respectively. These tokens are then aggregated into multi-class tokens and input into STAR-transformer. The STAR-transformer encoder layer consists of a full self-attention (FAttn) module and a proposed zigzag spatio-temporal attention (ZAttn) module. Similarly, the continuous decoder consists of a FAttn module and a proposed binary spatio-temporal attention (BAttn) module. STAR-transformer learns an efficient multi-feature representation of the spatio-temporal features by properly arranging pairings of the FAttn, ZAttn, and BAttn modules. Experimental results on the Penn-Action, NTU RGB+D 60, and 120 datasets show that the proposed method achieves a promising improvement in performance in comparison to previous state-of-the-art methods.
Can Language Representation Models Think in Bets?
- Authors: Zhisheng Tang, Mayank Kejriwal
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2210.07519
- Pdf link: https://arxiv.org/pdf/2210.07519
- Abstract In recent years, transformer-based language representation models (LRMs) have achieved state-of-the-art results on difficult natural language understanding problems, such as question answering and text summarization. As these models are integrated into real-world applications, evaluating their ability to make rational decisions is an important research agenda, with practical ramifications. This article investigates LRMs' rational decision-making ability through a carefully designed set of decision-making benchmarks and experiments. Inspired by classic work in cognitive science, we model the decision-making problem as a bet. We then investigate an LRM's ability to choose outcomes that have optimal, or at minimum, positive expected gain. Through a robust body of experiments on four established LRMs, we show that a model is only able to `think in bets' if it is first fine-tuned on bet questions with an identical structure. Modifying the bet question's structure, while still retaining its fundamental characteristics, decreases an LRM's performance by more than 25%, on average, although absolute performance remains well above random. LRMs are also found to be more rational when selecting outcomes with non-negative expected gain, rather than optimal or strictly positive expected gain. Our results suggest that LRMs could potentially be applied to tasks that rely on cognitive decision-making skills, but that more research is necessary before they can robustly make rational decisions.
AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers
- Authors: Ganesh Jawahar, Subhabrata Mukherjee, Xiaodong Liu, Young Jin Kim, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah, Sebastien Bubeck, Jianfeng Gao
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.07535
- Pdf link: https://arxiv.org/pdf/2210.07535
- Abstract Neural architecture search (NAS) has demonstrated promising results on identifying efficient Transformer architectures which outperform manually designed ones for natural language tasks like neural machine translation (NMT). Existing NAS methods operate on a space of dense architectures, where all of the sub-architecture weights are activated for every input. Motivated by the recent advances in sparsely activated models like the Mixture-of-Experts (MoE) model, we introduce sparse architectures with conditional computation into the NAS search space. Given this expressive search space which subsumes prior densely activated architectures, we develop a new framework AutoMoE to search for efficient sparsely activated sub-Transformers. AutoMoE-generated sparse models obtain (i) 3x FLOPs reduction over manually designed dense Transformers and (ii) 23% FLOPs reduction over state-of-the-art NAS-generated dense sub-Transformers with parity in BLEU score on benchmark datasets for NMT. AutoMoE consists of three training phases: (a) Heterogeneous search space design with dense and sparsely activated Transformer modules (e.g., how many experts? where to place them? what should be their sizes?); (b) SuperNet training that jointly trains several subnetworks sampled from the large search space by weight-sharing; (c) Evolutionary search for the architecture with the optimal trade-off between task performance and computational constraint like FLOPs and latency. AutoMoE code, data and trained models are available at https://github.com/microsoft/AutoMoE.
When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture
- Authors: Yichuan Mo, Dongxian Wu, Yifei Wang, Yiwen Guo, Yisen Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.07540
- Pdf link: https://arxiv.org/pdf/2210.07540
- Abstract Vision Transformers (ViTs) have recently achieved competitive performance in broad vision tasks. Unfortunately, on popular threat models, naturally trained ViTs are shown to provide no more adversarial robustness than convolutional neural networks (CNNs). Adversarial training is still required for ViTs to defend against such adversarial attacks. In this paper, we provide the first and comprehensive study on the adversarial training recipe of ViTs via extensive evaluation of various training techniques across benchmark datasets. We find that pre-training and SGD optimizer are necessary for ViTs' adversarial training. Further considering ViT as a new type of model architecture, we investigate its adversarial robustness from the perspective of its unique architectural components. We find, when randomly masking gradients from some attention blocks or masking perturbations on some patches during adversarial training, the adversarial robustness of ViTs can be remarkably improved, which may potentially open up a line of work to explore the architectural information inside the newly designed models like ViTs. Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation
- Authors: Abhay Shukla, Paheli Bhattacharya, Soham Poddar, Rajdeep Mukherjee, Kripabandhu Ghosh, Pawan Goyal, Saptarshi Ghosh
- Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2210.07544
- Pdf link: https://arxiv.org/pdf/2210.07544
- Abstract Summarization of legal case judgement documents is a challenging problem in Legal NLP. However, not much analyses exist on how different families of summarization models (e.g., extractive vs. abstractive) perform when applied to legal case documents. This question is particularly important since many recent transformer-based abstractive summarization models have restrictions on the number of input tokens, and legal documents are known to be very long. Also, it is an open question on how best to evaluate legal case document summarization systems. In this paper, we carry out extensive experiments with several extractive and abstractive summarization methods (both supervised and unsupervised) over three legal summarization datasets that we have developed. Our analyses, that includes evaluation by law practitioners, lead to several interesting insights on legal summarization in specific and long document summarization in general.
Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario
- Authors: Emily R. Bartusiak, Edward J. Delp
- Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2210.07546
- Pdf link: https://arxiv.org/pdf/2210.07546
- Abstract Speech synthesis methods can create realistic-sounding speech, which may be used for fraud, spoofing, and misinformation campaigns. Forensic methods that detect synthesized speech are important for protection against such attacks. Forensic attribution methods provide even more information about the nature of synthesized speech signals because they identify the specific speech synthesis method (i.e., speech synthesizer) used to create a speech signal. Due to the increasing number of realistic-sounding speech synthesizers, we propose a speech attribution method that generalizes to new synthesizers not seen during training. To do so, we investigate speech synthesizer attribution in both a closed set scenario and an open set scenario. In other words, we consider some speech synthesizers to be "known" synthesizers (i.e., part of the closed set) and others to be "unknown" synthesizers (i.e., part of the open set). We represent speech signals as spectrograms and train our proposed method, known as compact attribution transformer (CAT), on the closed set for multi-class classification. Then, we extend our analysis to the open set to attribute synthesized speech signals to both known and unknown synthesizers. We utilize a t-distributed stochastic neighbor embedding (tSNE) on the latent space of the trained CAT to differentiate between each unknown synthesizer. Additionally, we explore poly-1 loss formulations to improve attribution results. Our proposed approach successfully attributes synthesized speech signals to their respective speech synthesizers in both closed and open set scenarios.
TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers
- Authors: Hyeong Kyu Choi, Joonmyung Choi, Hyunwoo J. Kim
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.07562
- Pdf link: https://arxiv.org/pdf/2210.07562
- Abstract Mixup is a commonly adopted data augmentation technique for image classification. Recent advances in mixup methods primarily focus on mixing based on saliency. However, many saliency detectors require intense computation and are especially burdensome for parameter-heavy transformer models. To this end, we propose TokenMixup, an efficient attention-guided token-level data augmentation method that aims to maximize the saliency of a mixed set of tokens. TokenMixup provides x15 faster saliency-aware data augmentation compared to gradient-based methods. Moreover, we introduce a variant of TokenMixup which mixes tokens within a single instance, thereby enabling multi-scale feature augmentation. Experiments show that our methods significantly improve the baseline models' performance on CIFAR and ImageNet-1K, while being more efficient than previous methods. We also reach state-of-the-art performance on CIFAR-100 among from-scratch transformer models. Code is available at https://github.com/mlvlab/TokenMixup.
MCTNet: A Multi-Scale CNN-Transformer Network for Change Detection in Optical Remote Sensing Images
- Authors: Weiming Li, Lihui Xue, Xueqian Wang, Gang Li
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.07601
- Pdf link: https://arxiv.org/pdf/2210.07601
- Abstract For the task of change detection (CD) in remote sensing images, deep convolution neural networks (CNNs)-based methods have recently aggregated transformer modules to improve the capability of global feature extraction. However, they suffer degraded CD performance on small changed areas due to the simple single-scale integration of deep CNNs and transformer modules. To address this issue, we propose a hybrid network based on multi-scale CNN-transformer structure, termed MCTNet, where the multi-scale global and local information are exploited to enhance the robustness of the CD performance on changed areas with different sizes. Especially, we design the ConvTrans block to adaptively aggregate global features from transformer modules and local features from CNN layers, which provides abundant global-local features with different scales. Experimental results demonstrate that our MCTNet achieves better detection performance than existing state-of-the-art CD methods.
Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?
- Authors: Van-Anh Nguyen, Khanh Pham Dinh, Long Tung Vuong, Thanh-Toan Do, Quan Hung Tran, Dinh Phung, Trung Le
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.07646
- Pdf link: https://arxiv.org/pdf/2210.07646
- Abstract Recently vision transformers (ViT) have been applied successfully for various tasks in computer vision. However, important questions such as why they work or how they behave still remain largely unknown. In this paper, we propose an effective visualization technique, to assist us in exposing the information carried in neurons and feature embeddings across the ViT's layers. Our approach departs from the computational process of ViTs with a focus on visualizing the local and global information in input images and the latent feature embeddings at multiple levels. Visualizations at the input and embeddings at level 0 reveal interesting findings such as providing support as to why ViTs are rather generally robust to image occlusions and patch shuffling; or unlike CNNs, level 0 embeddings already carry rich semantic details. Next, we develop a rigorous framework to perform effective visualizations across layers, exposing the effects of ViTs filters and grouping/clustering behaviors to object patches. Finally, we provide comprehensive experiments on real datasets to qualitatively and quantitatively demonstrate the merit of our proposed methods as well as our findings. https://github.com/byM1902/ViT_visualization
Towards Transformer-based Homogenization of Satellite Imagery for Landsat-8 and Sentinel-2
- Authors: Venkatesh Thirugnana Sambandham, Konstantin Kirchheim, Sayan Mukhopadhaya, Frank Ortmeier
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2210.07654
- Pdf link: https://arxiv.org/pdf/2210.07654
- Abstract Landsat-8 (NASA) and Sentinel-2 (ESA) are two prominent multi-spectral imaging satellite projects that provide publicly available data. The multi-spectral imaging sensors of the satellites capture images of the earth's surface in the visible and infrared region of the electromagnetic spectrum. Since the majority of the earth's surface is constantly covered with clouds, which are not transparent at these wavelengths, many images do not provide much information. To increase the temporal availability of cloud-free images of a certain area, one can combine the observations from multiple sources. However, the sensors of satellites might differ in their properties, making the images incompatible. This work provides a first glance at the possibility of using a transformer-based model to reduce the spectral and spatial differences between observations from both satellite projects. We compare the results to a model based on a fully convolutional UNet architecture. Somewhat surprisingly, we find that, while deep models outperform classical approaches, the UNet significantly outperforms the transformer in our experiments.
CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling
- Authors: Jun Zhang, Shuyang Jiang, Jiangtao Feng, Lin Zheng, Lingpeng Kong
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.07661
- Pdf link: https://arxiv.org/pdf/2210.07661
- Abstract Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve transformer's efficiency while largely preserving its efficacy, especially in modeling long sequences. A widely-used benchmark to test these efficient methods' capability on long-range modeling is Long Range Arena (LRA). However, LRA only focuses on the standard bidirectional (or noncausal) self attention, and completely ignores cross attentions and unidirectional (or causal) attentions, which are equally important to downstream applications. Although designing cross and causal variants of an attention method is straightforward for vanilla attention, it is often challenging for efficient attentions with subquadratic time and memory complexity. In this paper, we propose Comprehensive Attention Benchmark (CAB) under a fine-grained attention taxonomy with four distinguishable attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross attentions. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Among these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performances of nine widely-used efficient attention architectures designed with different philosophies on CAB. Extensive experimental results also shed light on the fundamental problems of efficient attentions, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling.
Pretrained Transformers Do not Always Improve Robustness
- Authors: Swaroop Mishra, Bhavdeep Singh Sachdeva, Chitta Baral
- Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.07663
- Pdf link: https://arxiv.org/pdf/2210.07663
- Abstract Pretrained Transformers (PT) have been shown to improve Out of Distribution (OOD) robustness than traditional models such as Bag of Words (BOW), LSTMs, Convolutional Neural Networks (CNN) powered by Word2Vec and Glove embeddings. How does the robustness comparison hold in a real world setting where some part of the dataset can be noisy? Do PT also provide more robust representation than traditional models on exposure to noisy data? We perform a comparative study on 10 models and find an empirical evidence that PT provide less robust representation than traditional models on exposure to noisy data. We investigate further and augment PT with an adversarial filtering (AF) mechanism that has been shown to improve OOD generalization. However, increase in generalization does not necessarily increase robustness, as we find that noisy data fools the AF method powered by PT.
Low-power In-pixel Computing with Current-modulated Switched Capacitors
- Authors: David Zhang, Gooitzen van der Wal, Saurabh Farkya, Thomas Senko, Aswin Raghavan, Michael Isnardi, Michael Piacentino
- Subjects: Hardware Architecture (cs.AR)
- Arxiv link: https://arxiv.org/abs/2210.07826
- Pdf link: https://arxiv.org/pdf/2210.07826
- Abstract We present a scalable in-pixel processing architecture that can reduce the data throughput by 10X and consume less than 30 mW per megapixel at the imager frontend. Unlike the state-of-the-art (SOA) analog process-in-pixel (PIP) that modulates the exposure time of photosensors when performing matrix-vector multiplications, we use switched capacitors and pulse width modulation (PWM). This non-destructive approach decouples the sensor exposure and computing, providing processing parallelism and high data fidelity. Our design minimizes the computational complexity and chip density by leveraging the patch-based feature extraction that can perform as well as the CNN. We further reduce data using partial observation of the attended objects, which performs closely to the full frame observations. We have been studying the reduction of output features as a function of accuracy, chip density and power consumption from a transformer-based backend model for object classification and detection.
A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing
- Authors: Amir Zeldes, Nick Howell, Noam Ordan, Yifat Ben Moshe
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2210.07873
- Pdf link: https://arxiv.org/pdf/2210.07873
- Abstract Foundational Hebrew NLP tasks such as segmentation, tagging and parsing, have relied to date on various versions of the Hebrew Treebank (HTB, Sima'an et al. 2001). However, the data in HTB, a single-source newswire corpus, is now over 30 years old, and does not cover many aspects of contemporary Hebrew on the web. This paper presents a new, freely available UD treebank of Hebrew stratified from a range of topics selected from Hebrew Wikipedia. In addition to introducing the corpus and evaluating the quality of its annotations, we deploy automatic validation tools based on grew (Guillaume, 2021), and conduct the first cross domain parsing experiments in Hebrew. We obtain new state-of-the-art (SOTA) results on UD NLP tasks, using a combination of the latest language modelling and some incremental improvements to existing transformer based approaches. We also release a new version of the UD HTB matching annotation scheme updates from our new corpus.
PedFormer: Pedestrian Behavior Prediction via Cross-Modal Attention Modulation and Gated Multitask Learning
- Authors: Amir Rasouli, Iuliia Kotseruba
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2210.07886
- Pdf link: https://arxiv.org/pdf/2210.07886
- Abstract Predicting pedestrian behavior is a crucial task for intelligent driving systems. Accurate predictions require a deep understanding of various contextual elements that potentially impact the way pedestrians behave. To address this challenge, we propose a novel framework that relies on different data modalities to predict future trajectories and crossing actions of pedestrians from an ego-centric perspective. Specifically, our model utilizes a cross-modal Transformer architecture to capture dependencies between different data types. The output of the Transformer is augmented with representations of interactions between pedestrians and other traffic agents conditioned on the pedestrian and ego-vehicle dynamics that are generated via a semantic attentive interaction module. Lastly, the context encodings are fed into a multi-stream decoder framework using a gated-shared network. We evaluate our algorithm on public pedestrian behavior benchmarks, PIE and JAAD, and show that our model improves state-of-the-art in trajectory and action prediction by up to 22% and 13% respectively on various metrics. The advantages brought by components of our model are investigated via extensive ablation studies.
HashFormers: Towards Vocabulary-independent Pre-trained Transformers
- Authors: Huiyin Xue, Nikolaos Aletras
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2210.07904
- Pdf link: https://arxiv.org/pdf/2210.07904
- Abstract Transformer-based pre-trained language models are vocabulary-dependent, mapping by default each token to its corresponding embedding. This one-to-one mapping results into embedding matrices that occupy a lot of memory (i.e. millions of parameters) and grow linearly with the size of the vocabulary. Previous work on on-device transformers dynamically generate token embeddings on-the-fly without embedding matrices using locality-sensitive hashing over morphological information. These embeddings are subsequently fed into transformer layers for text classification. However, these methods are not pre-trained. Inspired by this line of work, we propose HashFormers, a new family of vocabulary-independent pre-trained transformers that support an unlimited vocabulary (i.e. all possible tokens in a corpus) given a substantially smaller fixed-sized embedding matrix. We achieve this by first introducing computationally cheap hashing functions that bucket together individual tokens to embeddings. We also propose three variants that do not require an embedding matrix at all, further reducing the memory requirements. We empirically demonstrate that HashFormers are more memory efficient compared to standard pre-trained transformers while achieving comparable predictive performance when fine-tuned on multiple text classification tasks. For example, our most efficient HashFormer variant has a negligible performance degradation (0.4% on GLUE) using only 99.1K parameters for representing the embeddings compared to 12.3-38M parameters of state-of-the-art models.
Efficiently Controlling Multiple Risks with Pareto Testing
- Authors: Bracha Laufer-Goldshtein, Adam Fisch, Regina Barzilay, Tommi Jaakkola
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2210.07913
- Pdf link: https://arxiv.org/pdf/2210.07913
- Abstract Machine learning applications frequently come with multiple diverse objectives and constraints that can change over time. Accordingly, trained models can be tuned with sets of hyper-parameters that affect their predictive behavior (e.g., their run-time efficiency versus error rate). As the number of constraints and hyper-parameter dimensions grow, naively selected settings may lead to sub-optimal and/or unreliable results. We develop an efficient method for calibrating models such that their predictions provably satisfy multiple explicit and simultaneous statistical guarantees (e.g., upper-bounded error rates), while also optimizing any number of additional, unconstrained objectives (e.g., total run-time cost). Building on recent results in distribution-free, finite-sample risk control for general losses, we propose Pareto Testing: a two-stage process which combines multi-objective optimization with multiple hypothesis testing. The optimization stage constructs a set of promising combinations on the Pareto frontier. We then apply statistical testing to this frontier only to identify configurations that have (i) high utility with respect to our objectives, and (ii) guaranteed risk levels with respect to our constraints, with specifiable high probability. We demonstrate the effectiveness of our approach to reliably accelerate the execution of large-scale Transformer models in natural language processing (NLP) applications. In particular, we show how Pareto Testing can be used to dynamically configure multiple inter-dependent model attributes -- including the number of layers computed before exiting, number of attention heads pruned, or number of text tokens considered -- to simultaneously control and optimize various accuracy and cost metrics.
Trailers12k: Evaluating Transfer Learning for Movie Trailer Genre Classification
- Authors: Ricardo Montalvo-Lezama, Berenice Montalvo-Lezama, Gibran Fuentes-Pineda
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.07983
- Pdf link: https://arxiv.org/pdf/2210.07983
- Abstract Transfer learning is a cornerstone for a wide range of computer vision problems.It has been broadly studied for image analysis tasks. However, literature for video analysis is scarce and has been mainly focused on transferring representations learned from ImageNet to human action recognition tasks. In this paper, we study transfer learning for Multi-label Movie Trailer Genre Classification (MTGC). In particular, we introduce Trailers12k}, a new manually-curated movie trailer dataset and evaluate the transferability of spatial and spatio-temporal representations learned from ImageNet and/or Kinetics to Trailers12k MTGC. In order to reduce the spatio-temporal structure gap between the source and target tasks and improve transferability, we propose a method that performs shot detection so as to segment the trailer into highly correlated clips. We study different aspects that influence transferability, such as segmentation strategy, frame rate, input video extension, and spatio-temporal modeling. Our results demonstrate that representations learned on either ImageNet or Kinetics are comparatively transferable to Trailers12k, although they provide complementary information that can be combined to improve classification performance. Having a similar number of parameters and FLOPS, Transformers provide a better transferability base than ConvNets. Nevertheless, competitive performance can be achieved using lightweight ConvNets, becoming an attractive option for low-resource environments.
Keyword: scene understanding
SQA3D: Situated Question Answering in 3D Scenes
- Authors: Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, Siyuan Huang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.07474
- Pdf link: https://arxiv.org/pdf/2210.07474
- Abstract We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D imposes a significant challenge to current multi-modal especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one only achieves an overall score of 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.
Novel 3D Scene Understanding Applications From Recurrence in a Single Image
- Authors: Shimian Zhang, Skanda Bharadwaj, Keaton Kraiger, Yashasvi Asthana, Hong Zhang, Robert Collins, Yanxi Liu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.07991
- Pdf link: https://arxiv.org/pdf/2210.07991
- Abstract We demonstrate the utility of recurring pattern discovery from a single image for spatial understanding of a 3D scene in terms of (1) vanishing point detection, (2) hypothesizing 3D translation symmetry and (3) counting the number of RP instances in the image. Furthermore, we illustrate the feasibility of leveraging RP discovery output to form a more precise, quantitative text description of the scene. Our quantitative evaluations on a new 1K+ Recurring Pattern (RP) benchmark with diverse variations show that visual perception of recurrence from one single view leads to scene understanding outcomes that are as good as or better than existing supervised methods and/or unsupervised methods that use millions of images.
Keyword: visual reasoning
There is no result