arxiv-daily

New submissions for Wed, 16 Nov 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Local Magnification for Data and Feature Augmentation

  • Authors: Kun He, Chang Liu, Stephen Lin, John E. Hopcroft
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.07859
  • Pdf link: https://arxiv.org/pdf/2211.07859
  • Abstract In recent years, many data augmentation techniques have been proposed to increase the diversity of input data and reduce the risk of overfitting on deep neural networks. In this work, we propose an easy-to-implement and model-free data augmentation method called Local Magnification (LOMA). Different from other geometric data augmentation methods that perform global transformations on images, LOMA generates additional training data by randomly magnifying a local area of the image. This local magnification results in geometric changes that significantly broaden the range of augmentations while maintaining the recognizability of objects. Moreover, we extend the idea of LOMA and random cropping to the feature space to augment the feature map, which further boosts the classification accuracy considerably. Experiments show that our proposed LOMA, though straightforward, can be combined with standard data augmentation to significantly improve the performance on image classification and object detection. And further combination with our feature augmentation techniques, termed LOMA_IF&FO, can continue to strengthen the model and outperform advanced intensity transformation methods for data augmentation.

PAI3D: Painting Adaptive Instance-Prior for 3D Object Detection

  • Authors: Hao Liu, Zhuoran Xu, Dan Wang, Baofeng Zhang, Guan Wang, Bo Dong, Xin Wen, Xinyu Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.08055
  • Pdf link: https://arxiv.org/pdf/2211.08055
  • Abstract 3D object detection is a critical task in autonomous driving. Recently multi-modal fusion-based 3D object detection methods, which combine the complementary advantages of LiDAR and camera, have shown great performance improvements over mono-modal methods. However, so far, no methods have attempted to utilize the instance-level contextual image semantics to guide the 3D object detection. In this paper, we propose a simple and effective Painting Adaptive Instance-prior for 3D object detection (PAI3D) to fuse instance-level image semantics flexibly with point cloud features. PAI3D is a multi-modal sequential instance-level fusion framework. It first extracts instance-level semantic information from images; the extracted information, including each object's categorical label, point-to-object membership and object position, is then used to augment each LiDAR point in the subsequent 3D detection network to guide and improve detection performance. PAI3D outperforms the state-of-the-art by a large margin on the nuScenes dataset, achieving 71.4 in mAP and 74.2 in NDS on the test split. Our comprehensive experiments show that instance-level image semantics contribute the most to the performance gain, and PAI3D works well with any good-quality instance segmentation models and any modern point cloud 3D encoders, making it a strong candidate for deployment on autonomous vehicles.

3D Cascade RCNN: High Quality Object Detection in Point Clouds

  • Authors: Qi Cai, Yingwei Pan, Ting Yao, Tao Mei
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.08248
  • Pdf link: https://arxiv.org/pdf/2211.08248
  • Abstract Recent progress on 2D object detection has featured Cascade RCNN, which capitalizes on a sequence of cascade detectors to progressively improve proposal quality, towards high-quality object detection. However, there has not been evidence in support of building such cascade structures for 3D object detection, a challenging detection scenario with highly sparse LiDAR point clouds. In this work, we present a simple yet effective cascade architecture, named 3D Cascade RCNN, that allocates multiple detectors based on the voxelized point clouds in a cascade paradigm, pursuing higher quality 3D object detector progressively. Furthermore, we quantitatively define the sparsity level of the points within 3D bounding box of each object as the point completeness score, which is exploited as the task weight for each proposal to guide the learning of each stage detector. The spirit behind is to assign higher weights for high-quality proposals with relatively complete point distribution, while down-weight the proposals with extremely sparse points that often incur noise during training. This design of completeness-aware re-weighting elegantly upgrades the cascade paradigm to be better applicable for the sparse input data, without increasing any FLOP budgets. Through extensive experiments on both the KITTI dataset and Waymo Open Dataset, we validate the superiority of our proposed 3D Cascade RCNN, when comparing to state-of-the-art 3D object detection techniques. The source code is publicly available at https://github.com/caiqi/Cascasde-3D.

Keyword: transformer

Multilevel Transformer For Multimodal Emotion Recognition

  • Authors: Junyi He, Meimei Wu, Meng Li, Xiaobo Zhu, Feng Ye
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.07711
  • Pdf link: https://arxiv.org/pdf/2211.07711
  • Abstract Multimodal emotion recognition has attracted much attention recently. Fusing multiple modalities effectively with limited labeled data is a challenging task. Considering the success of pre-trained model and fine-grained nature of emotion expression, it is reasonable to take these two aspects into consideration. Unlike previous methods that mainly focus on one aspect, we introduce a novel multi-granularity framework, which combines fine-grained representation with pre-trained utterance-level representation. Inspired by Transformer TTS, we propose a multilevel transformer model to perform fine-grained multimodal emotion recognition. Specifically, we explore different methods to incorporate phoneme-level embedding with word-level embedding. To perform multi-granularity learning, we simply combine multilevel transformer model with Albert. Extensive experimental results show that both our multilevel transformer model and multi-granularity model outperform previous state-of-the-art approaches on IEMOCAP dataset with text transcripts and speech signal.

Fast DistilBERT on CPUs

  • Authors: Haihao Shen, Ofir Zafrir, Bo Dong, Hengyu Meng, Xinyu Ye, Zhe Wang, Yi Ding, Hanwen Chang, Guy Boudoukh, Moshe Wasserblat
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.07715
  • Pdf link: https://arxiv.org/pdf/2211.07715
  • Abstract Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires the maximum throughput to comply with certain latency constraints that prevent Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning may be used to improve inference efficiency. However, these compression techniques require specialized software to apply and deploy at scale. In this work, we propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model showing minimal accuracy loss on the question-answering SQuADv1.1 benchmark, and throughput results under typical production constraints and environments. Our results outperform existing state-of-the-art Neural Magic's DeepSparse runtime performance by up to 50% and up to 4.1x performance speedup over ONNX Runtime.

Zero-Shot Text Matching for Automated Auditing using Sentence Transformers

  • Authors: David Biesner, Maren Pielka, Rajkumar Ramamurthy, Tim Dilmaghani, Bernd Kliem, Rüdiger Loitz, Rafet Sifa
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.07716
  • Pdf link: https://arxiv.org/pdf/2211.07716
  • Abstract Natural language processing methods have several applications in automated auditing, including document or passage classification, information retrieval, and question answering. However, training such models requires a large amount of annotated data which is scarce in industrial settings. At the same time, techniques like zero-shot and unsupervised learning allow for application of models pre-trained using general domain data to unseen domains. In this work, we study the efficiency of unsupervised text matching using Sentence-Bert, a transformer-based model, by applying it to the semantic similarity of financial passages. Experimental results show that this model is robust to documents from in- and out-of-domain data.

The Birds Need Attention Too: Analysing usage of Self Attention in identifying bird calls in soundscapes

  • Authors: Chandra Kanth Nagesh, Abhishek Purushothama
  • Subjects: Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2211.07722
  • Pdf link: https://arxiv.org/pdf/2211.07722
  • Abstract Birds are vital parts of ecosystems across the world and are an excellent measure of the quality of life on earth. Many bird species are endangered while others are already extinct. Ecological efforts in understanding and monitoring bird populations are important to conserve their habitat and species, but this mostly relies on manual methods in rough terrains. Recent advances in Machine Learning and Deep Learning have made automatic bird recognition in diverse environments possible. Birdcall recognition till now has been performed using convolutional neural networks. In this work, we try and understand how self-attention can aid in this endeavor. With that we build a pre-trained Attention-based Spectrogram Transformer baseline for BirdCLEF 2022 and compare the results against the pre-trained Convolution-based baseline. Our results show that the transformer models outperformed the convolutional model and we further validate our results by building baselines and analyzing the results for the previous year BirdCLEF 2021 challenge. Source code available at https://github.com/ck090/BirdCLEF-22

QueryForm: A Simple Zero-shot Form Entity Query Framework

  • Authors: Zifeng Wang, Zizhao Zhang, Jacob Devlin, Chen-Yu Lee, Guolong Su, Hao Zhang, Jennifer Dy, Vincent Perot, Tomas Pfister
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2211.07730
  • Pdf link: https://arxiv.org/pdf/2211.07730
  • Abstract Zero-shot transfer learning for document understanding is a crucial yet under-investigated scenario to help reduce the high cost involved in annotating document entities. We present a novel query-based framework, QueryForm, that extracts entity values from form-like documents in a zero-shot fashion. QueryForm contains a dual prompting mechanism that composes both the document schema and a specific entity type into a query, which is used to prompt a Transformer model to perform a single entity extraction task. Furthermore, we propose to leverage large-scale query-entity pairs generated from form-like webpages with weak HTML annotations to pre-train QueryForm. By unifying pre-training and fine-tuning into the same query-based framework, QueryForm enables models to learn from structured documents containing various entities and layouts, leading to better generalization to target document types without the need for target-specific training data. QueryForm sets new state-of-the-art average F1 score on both the XFUND (+4.6%~10.1%) and the Payment (+3.2%~9.5%) zero-shot benchmark, with a smaller model size and no additional image input.

General Intelligence Requires Rethinking Exploration

  • Authors: Minqi Jiang, Tim Rocktäschel, Edward Grefenstette
  • Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.07819
  • Pdf link: https://arxiv.org/pdf/2211.07819
  • Abstract We are at the cusp of a transition from "learning from data" to "learning what data to learn from" as a central focus of artificial intelligence (AI) research. While the first-order learning problem is not completely solved, large models under unified architectures, such as transformers, have shifted the learning bottleneck from how to effectively train our models to how to effectively acquire and use task-relevant data. This problem, which we frame as exploration, is a universal aspect of learning in open-ended domains, such as the real world. Although the study of exploration in AI is largely limited to the field of reinforcement learning, we argue that exploration is essential to all learning systems, including supervised learning. We propose the problem of generalized exploration to conceptually unify exploration-driven learning between supervised learning and reinforcement learning, allowing us to highlight key similarities across learning settings and open research challenges. Importantly, generalized exploration serves as a necessary objective for maintaining open-ended learning processes, which in continually learning to discover and solve new problems, provides a promising path to more general intelligence.

Using Human Perception to Regularize Transfer Learning

  • Authors: Justin Dulay, Walter J. Scheirer
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.07885
  • Pdf link: https://arxiv.org/pdf/2211.07885
  • Abstract Recent trends in the machine learning community show that models with fidelity toward human perceptual measurements perform strongly on vision tasks. Likewise, human behavioral measurements have been used to regularize model performance. But can we transfer latent knowledge gained from this across different learning objectives? In this work, we introduce PERCEP-TL (Perceptual Transfer Learning), a methodology for improving transfer learning with the regularization power of psychophysical labels in models. We demonstrate which models are affected the most by perceptual transfer learning and find that models with high behavioral fidelity -- including vision transformers -- improve the most from this regularization by as much as 1.9% Top@1 accuracy points. These findings suggest that biologically inspired learning agents can benefit from human behavioral measurements as regularizers and psychophysical learned representations can be transferred to independent evaluation tasks.

YORO -- Lightweight End to End Visual Grounding

  • Authors: Chih-Hui Ho, Srikar Appalaraju, Bhavan Jasani, R. Manmatha, Nuno Vasconcelos
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.07912
  • Pdf link: https://arxiv.org/pdf/2211.07912
  • Abstract We present YORO - a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task. This task involves localizing, in an image, an object referred via natural language. Unlike the recent trend in the literature of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a better trade-off between speed and accuracy by embracing a single-stage design, without CNN backbone. YORO consumes natural language queries, image patches, and learnable detection tokens and predicts coordinates of the referred object, using a single transformer encoder. To assist the alignment between text and visual objects, a novel patch-text alignment loss is proposed. Extensive experiments are conducted on 5 different datasets with ablations on architecture design choices. YORO is shown to support real-time inference and outperform all approaches in this class (single-stage methods) by large margins. It is also the fastest VG model and achieves the best speed/accuracy trade-off in the literature.

Breakpoint Transformers for Modeling and Tracking Intermediate Beliefs

  • Authors: Kyle Richardson, Ronen Tamari, Oren Sultan, Reut Tsarfaty, Dafna Shahaf, Ashish Sabharwal
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2211.07950
  • Pdf link: https://arxiv.org/pdf/2211.07950
  • Abstract Can we teach natural language understanding models to track their beliefs through intermediate points in text? We propose a representation learning framework called breakpoint modeling that allows for learning of this type. Given any text encoder and data marked with intermediate states (breakpoints) along with corresponding textual queries viewed as true/false propositions (i.e., the candidate beliefs of a model, consisting of information changing through time) our approach trains models in an efficient and end-to-end fashion to build intermediate representations that facilitate teaching and direct querying of beliefs at arbitrary points alongside solving other end tasks. To show the benefit of our approach, we experiment with a diverse set of NLU tasks including relational reasoning on CLUTRR and narrative understanding on bAbI. Using novel belief prediction tasks for both tasks, we show the benefit of our main breakpoint transformer, based on T5, over conventional representation learning approaches in terms of processing efficiency, prediction accuracy and prediction consistency, all with minimal to no effect on corresponding QA end tasks. To show the feasibility of incorporating our belief tracker into more complex reasoning pipelines, we also obtain SOTA performance on the three-tiered reasoning challenge for the TRIP benchmark (around 23-32% absolute improvement on Tasks 2-3).

Adaptive Multi-Neighborhood Attention based Transformer for Graph Representation Learning

  • Authors: Gaichao Li, Jinsong Chen, Kun He
  • Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
  • Arxiv link: https://arxiv.org/abs/2211.07970
  • Pdf link: https://arxiv.org/pdf/2211.07970
  • Abstract By incorporating the graph structural information into Transformers, graph Transformers have exhibited promising performance for graph representation learning in recent years. Existing graph Transformers leverage specific strategies, such as Laplacian eigenvectors and shortest paths of the node pairs, to preserve the structural features of nodes and feed them into the vanilla Transformer to learn the representations of nodes. It is hard for such predefined rules to extract informative graph structural features for arbitrary graphs whose topology structure varies greatly, limiting the learning capacity of the models. To this end, we propose an adaptive graph Transformer, termed Multi-Neighborhood Attention based Graph Transformer (MNA-GT), which captures the graph structural information for each node from the multi-neighborhood attention mechanism adaptively. By defining the input to perform scaled-dot product as an attention kernel, MNA-GT constructs multiple attention kernels based on different hops of neighborhoods such that each attention kernel can capture specific graph structural information of the corresponding neighborhood for each node pair. In this way, MNA-GT can preserve the graph structural information efficiently by incorporating node representations learned by different attention kernels. MNA-GT further employs an attention layer to learn the importance of different attention kernels to enable the model to adaptively capture the graph structural information for different nodes. Extensive experiments are conducted on a variety of graph benchmarks, and the empirical results show that MNA-GT outperforms many strong baselines.
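
The abstract above refers to scaled dot-product attention as the operation behind each attention kernel. As a rough illustration only, the standard operation softmax(QK^T / sqrt(d))V can be sketched in plain Python. This is the generic textbook form, not MNA-GT itself; the function names and the list-of-lists representation are assumptions made for this sketch.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Generic attention: softmax(Q K^T / sqrt(d)) V on plain lists.

    queries: list of vectors of length d
    keys:    list of vectors of length d
    values:  list of vectors, one per key
    Hypothetical helper for illustration; not the MNA-GT implementation.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    q = [[0.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 0.0], [0.0, 1.0]]
    print(scaled_dot_product_attention(q, k, v))
```

With a zero query every key receives the same score, so the output is the plain average of the value vectors. Restricting `keys` and `values` to the nodes within a given hop distance would be one way to picture a per-neighborhood kernel, though the paper's actual construction may differ.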

Contextual Transformer for Offline Meta Reinforcement Learning

  • Authors: Runji Lin, Ye Li, Xidong Feng, Zhaowei Zhang, Xian Hong Wu Fung, Haifeng Zhang, Jun Wang, Yali Du, Yaodong Yang
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.08016
  • Pdf link: https://arxiv.org/pdf/2211.08016
  • Abstract The pretrain-finetuning paradigm in large-scale sequence models has made significant progress in natural language processing and computer vision tasks. However, such a paradigm is still hindered by several challenges in Reinforcement Learning (RL), including the lack of self-supervised pretraining algorithms based on offline data and efficient fine-tuning/prompt-tuning over unseen downstream tasks. In this work, we explore how prompts can improve sequence modeling-based offline reinforcement learning (offline-RL) algorithms. Firstly, we propose prompt tuning for offline RL, where a context vector sequence is concatenated with the input to guide the conditional policy generation. As such, we can pretrain a model on the offline dataset with self-supervised loss and learn a prompt to guide the policy towards desired actions. Secondly, we extend our framework to Meta-RL settings and propose Contextual Meta Transformer (CMT); CMT leverages the context among different tasks as the prompt to improve generalization on unseen tasks. We conduct extensive experiments across three different offline-RL settings: offline single-agent RL on the D4RL dataset, offline Meta-RL on the MuJoCo benchmark, and offline MARL on the SMAC benchmark. Superior results validate the strong performance and generality of our methods.

NAR-Former: Neural Architecture Representation Learning towards Holistic Attributes Prediction

  • Authors: Yun Yi, Haokui Zhang, Wenze Hu, Nannan Wang, Xiaoyu Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.08024
  • Pdf link: https://arxiv.org/pdf/2211.08024
  • Abstract With the wide and deep adoption of deep learning models in real applications, there is an increasing need to model and learn the representations of the neural networks themselves. These models can be used to estimate attributes of different neural network architectures such as the accuracy and latency, without running the actual training or inference tasks. In this paper, we propose a neural architecture representation model that can be used to estimate these attributes holistically. Specifically, we first propose a simple and effective tokenizer to encode both the operation and topology information of a neural network into a single sequence. Then, we design a multi-stage fusion transformer to build a compact vector representation from the converted sequence. For efficient model training, we further propose an information flow consistency augmentation and correspondingly design an architecture consistency loss, which brings more benefits with less augmentation samples compared with previous random augmentation strategies. Experiment results on NAS-Bench-101, NAS-Bench-201, DARTS search space and NNLQP show that our proposed framework can be used to predict the aforementioned latency and accuracy attributes of both cell architectures and whole deep neural networks, and achieves promising performance.

FedTune: A Deep Dive into Efficient Federated Fine-Tuning with Pre-trained Transformers

  • Authors: Jinyu Chen, Wenchao Xu, Song Guo, Junxiao Wang, Jie Zhang, Haozhao Wang
  • Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.08025
  • Pdf link: https://arxiv.org/pdf/2211.08025
  • Abstract Federated Learning (FL) is an emerging paradigm that enables distributed users to collaboratively and iteratively train machine learning models without sharing their private data. Motivated by the effectiveness and robustness of self-attention-based architectures, researchers are turning to using pre-trained Transformers (i.e., foundation models) instead of traditional convolutional neural networks in FL to leverage their excellent transfer learning capabilities. Despite recent progress, how pre-trained Transformer models play a role in FL remains obscure, that is, how to efficiently fine-tune these pre-trained models in FL and how FL users could benefit from this new paradigm. In this paper, we explore this issue and demonstrate that the fine-tuned Transformers achieve extraordinary performance on FL, and that the lightweight fine-tuning method facilitates a fast convergence rate and low communication costs. Concretely, we conduct a rigorous empirical study of three tuning methods (i.e., modifying the input, adding extra modules, and adjusting the backbone) using two types of pre-trained models (i.e., vision-language models and vision models) for FL. Our experiments show that 1) Fine-tuning the bias term of the backbone performs best when relying on a strong pre-trained model; 2) The vision-language model (e.g., CLIP) outperforms the pure vision model (e.g., ViT) and is more robust to the few-shot settings; 3) Compared to pure local training, FL with pre-trained models has a higher accuracy because it alleviates the problem of over-fitting. We will release our code and encourage further exploration of pre-trained Transformers and FL.

Persian Emotion Detection using ParsBERT and Imbalanced Data Handling Approaches

  • Authors: Amirhossein Abaskohi, Nazanin Sabri, Behnam Bahrak
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2211.08029
  • Pdf link: https://arxiv.org/pdf/2211.08029
  • Abstract Emotion recognition is one of the machine learning applications which can be done using text, speech, or image data gathered from social media spaces. Detecting emotion can help us in different fields, including opinion mining. With the spread of social media, different platforms like Twitter have become data sources, and the language used in these platforms is informal, making the emotion detection task difficult. EmoPars and ArmanEmo are two new human-labeled emotion datasets for the Persian language. These datasets, especially EmoPars, suffer from inequality in the number of samples between classes. In this paper, we evaluate EmoPars and compare it with ArmanEmo. Throughout this analysis, we use data augmentation techniques, data re-sampling, and class-weights with Transformer-based Pretrained Language Models (PLMs) to handle the imbalance problem of these datasets. Moreover, feature selection is used to enhance the models' performance by emphasizing the text's specific features. In addition, we provide a new policy for selecting data from EmoPars, which selects the high-confidence samples; as a result, the model does not see samples that do not have specific emotion during training. Our model reaches a Macro-averaged F1-score of 0.81 and 0.76 on ArmanEmo and EmoPars, respectively, which are new state-of-the-art results in these benchmarks.
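
The abstract above mentions class weights for imbalanced data and reports a macro-averaged F1-score. As a minimal sketch of what those two terms usually mean, the snippet below computes inverse-frequency class weights (one common heuristic, assumed here; the paper may use a different scheme) and macro F1 from label lists. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    A common balancing heuristic; assumed here for illustration.
    Rare classes receive larger weights.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    train_labels = ["joy", "joy", "joy", "anger"]  # toy, imbalanced
    print(inverse_frequency_weights(train_labels))
    print(macro_f1(["a", "a", "b"], ["a", "b", "b"]))
```

Because macro F1 averages per-class scores without regard to class size, it penalizes a model that ignores minority classes, which is why it is a natural metric for imbalanced datasets such as those described above.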

Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling

  • Authors: Yu Wang, Xin Li, Shengzhao Wen, Fukui Yang, Wanping Zhang, Gang Zhang, Haocheng Feng, Junyu Han, Errui Ding
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.08071
  • Pdf link: https://arxiv.org/pdf/2211.08071
  • Abstract DETR is a novel end-to-end transformer architecture object detector, which significantly outperforms classic detectors when scaling up the model size. In this paper, we focus on the compression of DETR with knowledge distillation. While knowledge distillation has been well-studied in classic detectors, there is a lack of research on how to make it work effectively on DETR. We first provide experimental and theoretical analysis to point out that the main challenge in DETR distillation is the lack of consistent distillation points. Distillation points refer to the corresponding inputs of the predictions for student to mimic, and reliable distillation requires sufficient distillation points which are consistent between teacher and student. Based on this observation, we propose a general knowledge distillation paradigm for DETR (KD-DETR) with consistent distillation points sampling. Specifically, we decouple detection and distillation tasks by introducing a set of specialized object queries to construct distillation points. In this paradigm, we further propose a general-to-specific distillation points sampling strategy to explore the extensibility of KD-DETR. Extensive experiments on different DETR architectures with various scales of backbones and transformer layers validate the effectiveness and generalization of KD-DETR. KD-DETR boosts the performance of DAB-DETR with ResNet-18 and ResNet-50 backbone to 41.4% and 45.7% mAP, respectively, which are 5.2% and 3.5% higher than the baseline, and ResNet-50 even surpasses the teacher model by 2.2%.

ShadowDiffusion: Diffusion-based Shadow Removal using Classifier-driven Attention and Structure Preservation

  • Authors: Yeying Jin, Wenhan Yang, Wei Ye, Yuan Yuan, Robby T. Tan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.08089
  • Pdf link: https://arxiv.org/pdf/2211.08089
  • Abstract Shadow removal from a single image is challenging, particularly with the presence of soft and self shadows. Unlike hard shadows, soft shadows do not show any clear boundaries, while self shadows are shadows that cast on the object itself. Most existing methods require the detection/annotation of binary shadow masks, without taking into account the ambiguous boundaries of soft and self shadows. Most deep learning shadow removal methods are GAN-based and require statistical similarity between shadow and shadow-free domains. In contrast to these methods, in this paper, we present ShadowDiffusion, the first diffusion-based shadow removal method. ShadowDiffusion focuses on single-image shadow removal, even in the presence of soft and self shadows. To guide the diffusion process to recover semantically meaningful structures during the reverse diffusion, we introduce a structure preservation loss, where we extract features from the pre-trained Vision Transformer (DINO-ViT). Moreover, to focus on the recovery of shadow regions, we inject classifier-driven attention into the architecture of the diffusion model. To maintain the consistent colors of the regions where the shadows have been removed, we introduce a chromaticity consistency loss. Our ShadowDiffusion outperforms state-of-the-art methods on the SRD, AISTD, LRSS, USR and UIUC datasets, removing hard, soft, and self shadows robustly. Our method outperforms the SOTA method by 20% of the RMSE of the whole image on the SRD dataset.

HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers

  • Authors: Peiyan Dong, Mengshu Sun, Alec Lu, Yanyue Xie, Kenneth Liu, Zhenglun Kong, Xin Meng, Zhengang Li, Xue Lin, Zhenman Fang, Yanzhi Wang
  • Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.08110
  • Pdf link: https://arxiv.org/pdf/2211.08110
  • Abstract While vision transformers (ViTs) have continuously achieved new milestones in the field of computer vision, their sophisticated network architectures with high computation and memory costs have impeded their deployment on resource-limited edge devices. In this paper, we propose a hardware-efficient image-adaptive token pruning framework called HeatViT for efficient yet accurate ViT acceleration on embedded FPGAs. By analyzing the inherent computational patterns in ViTs, we first design an effective attention-based multi-head token selector, which can be progressively inserted before transformer blocks to dynamically identify and consolidate the non-informative tokens from input images. Moreover, we implement the token selector on hardware by adding miniature control logic to heavily reuse existing hardware components built for the backbone ViT. To improve the hardware efficiency, we further employ 8-bit fixed-point quantization, and propose polynomial approximations with regularization effect on quantization error for the frequently used nonlinear functions in ViTs. Finally, we propose a latency-aware multi-stage training strategy to determine the transformer blocks for inserting token selectors and optimize the desired (average) pruning rates for inserted token selectors, in order to improve both the model accuracy and inference latency on hardware. Compared to existing ViT pruning studies, under the similar computation cost, HeatViT can achieve 0.7%~8.9% higher accuracy; while under the similar model accuracy, HeatViT can achieve more than 28.4%~65.3% computation reduction, for various widely used ViTs, including DeiT-T, DeiT-S, DeiT-B, LV-ViT-S, and LV-ViT-M, on the ImageNet dataset. Compared to the baseline hardware accelerator, our implementations of HeatViT on the Xilinx ZCU102 FPGA achieve 3.46x~4.89x speedup.
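
The abstract above mentions 8-bit fixed-point quantization. As a generic illustration of that idea (not HeatViT's scheme, whose bit allocation the abstract does not specify), a real value can be scaled by a power of two, rounded, and clipped to the signed 8-bit range. The function names and the default of 4 fractional bits are assumptions chosen for this sketch.

```python
def quantize_fixed_point(x, frac_bits=4, total_bits=8):
    """Map a float to a signed fixed-point integer code.

    frac_bits and total_bits are illustrative defaults, not values
    taken from the paper.
    """
    scale = 1 << frac_bits                # 2 ** frac_bits
    qmin = -(1 << (total_bits - 1))       # -128 for 8 bits
    qmax = (1 << (total_bits - 1)) - 1    # 127 for 8 bits
    q = round(x * scale)                  # Python rounds exact ties to even
    return max(qmin, min(qmax, q))        # saturate instead of wrapping


def dequantize_fixed_point(q, frac_bits=4):
    """Recover the real value represented by an integer code."""
    return q / (1 << frac_bits)


if __name__ == "__main__":
    for value in (1.3, -0.07, 100.0):
        code = quantize_fixed_point(value)
        print(value, code, dequantize_fixed_point(code))
```

With 4 fractional bits the step size is 1/16, so representable values span -8.0 to 7.9375; anything outside that range saturates, and the gap between a value and its dequantized code is the quantization error the abstract refers to.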

Design of Unmanned Air Vehicles Using Transformer Surrogate Models

  • Authors: Adam D. Cobb, Anirban Roy, Daniel Elenius, Susmit Jha
  • Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2211.08138
  • Pdf link: https://arxiv.org/pdf/2211.08138
  • Abstract Computer-aided design (CAD) is a promising new area for the application of artificial intelligence (AI) and machine learning (ML). The current practice of design of cyber-physical systems uses the digital twin methodology, wherein the actual physical design is preceded by building detailed models that can be evaluated by physics simulation models. These physics models are often slow and the manual design process often relies on exploring near-by variations of existing designs. AI holds the promise of breaking these design silos and increasing the diversity and performance of designs by accelerating the exploration of the design space. In this paper, we focus on the design of electrical unmanned aerial vehicles (UAVs). The high-density batteries and purely electrical propulsion systems have disrupted the space of UAV design, making this domain an ideal target for AI-based design. In this paper, we develop an AI Designer that synthesizes novel UAV designs. Our approach uses a deep transformer model with a novel domain-specific encoding such that we can evaluate the performance of new proposed designs without running expensive flight dynamics models and CAD tools. We demonstrate that our approach significantly reduces the overall compute requirements for the design process and accelerates the design space exploration. Finally, we identify future research directions to achieve full-scale deployment of AI-assisted CAD for UAVs.

Multi-Task Learning for massive MIMO CSI Feedback

  • Authors: Sharan Mourya, SaiDhiraj Amuru, Kiran Kumar Kuchi
  • Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2211.08173
  • Pdf link: https://arxiv.org/pdf/2211.08173
  • Abstract Deep learning-based massive MIMO CSI feedback has received a lot of attention in recent years. Now, there exists a plethora of CSI feedback models that exploit a wide variety of deep learning models and techniques ranging from convolutional neural networks (CNNs) to the recent attention-based transformer networks. Most of the models are based on auto-encoders (AE) architecture with an encoder network at the user equipment (UE) and a decoder network at the gNB (base station). However, these models are trained for a single user in a single channel scenario, making them ineffective in scenarios where a gNB is addressing various users while each user has different abilities and may employ a different CSI feedback encoder network and also in scenarios where the users are employing the same encoder network but are experiencing different channel conditions. In this work, we address these specific issues by exploiting the techniques of multi-task learning (MTL) in the context of massive MIMO CSI feedback.

Premonition Net, A Multi-Timeline Transformer Network Architecture Towards Strawberry Tabletop Yield Forecasting

  • Authors: George Onoufriou, Marc Hanheide, Georgios Leontidis
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.08177
  • Pdf link: https://arxiv.org/pdf/2211.08177
  • Abstract Yield forecasting is a critical first step necessary for yield optimisation, with important consequences for the broader food supply chain, procurement, price-negotiation, logistics, and supply. However, yield forecasting is notoriously difficult, and oft-inaccurate. Premonition Net is a multi-timeline, time sequence ingesting approach towards processing the past, the present, and premonitions of the future. We show how this structure combined with transformers attains critical yield forecasting proficiency towards improving food security, lowering prices, and reducing waste. We find data availability to be a continued difficulty; however, using our premonition network and our own collected data we attain yield forecasts 3 weeks ahead with a testing set RMSE loss of ~0.08 across our latest season.
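
For readers unfamiliar with the metric quoted above, RMSE is the square root of the mean squared difference between forecasts and observations. The sketch below shows the standard definition only. The abstract does not state the units or normalisation behind the ~0.08 figure, so the function name `rmse` and the sample values in the usage note are purely illustrative.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

For example, `rmse([1.0, 2.0], [1.0, 4.0])` averages the squared errors 0 and 4 to get 2, then takes the square root.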

RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use

  • Authors: Pieter Delobelle, Thomas Winters, Bettina Berendt
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.08192
  • Pdf link: https://arxiv.org/pdf/2211.08192
  • Abstract Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base-model for finetuning on a particular task. Since the pre-training step is usually not repeated, base models are not up-to-date with the latest information. In this paper, we update RobBERT, a RoBERTa-based state-of-the-art Dutch language model, which was trained in 2019. First, the tokenizer of RobBERT is updated to include new high-frequent tokens present in the latest Dutch OSCAR corpus, e.g. corona-related words. Then we further pre-train the RobBERT model using this dataset. To evaluate if our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens. We found that for certain language tasks this update results in a significant performance increase. These results highlight the benefit of continually updating a language model to account for evolving language use.

Dynamic Temporal Filtering in Video Models

  • Authors: Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.08252
  • Pdf link: https://arxiv.org/pdf/2211.08252
  • Abstract Video temporal dynamics is conventionally modeled with 3D spatial-temporal kernel or its factorized version comprised of 2D spatial kernel and 1D temporal kernel. The modeling power, nevertheless, is limited by the fixed window size and static weights of a kernel along the temporal dimension. The pre-determined kernel size severely limits the temporal receptive fields and the fixed weights treat each spatial location across frames equally, resulting in sub-optimal solution for long-range temporal modeling in natural scenes. In this paper, we present a new recipe of temporal feature learning, namely Dynamic Temporal Filter (DTF), that novelly performs spatial-aware temporal modeling in frequency domain with large temporal receptive field. Specifically, DTF dynamically learns a specialized frequency filter for every spatial location to model its long-range temporal dynamics. Meanwhile, the temporal feature of each spatial location is also transformed into frequency feature spectrum via 1D Fast Fourier Transform (FFT). The spectrum is modulated by the learnt frequency filter, and then transformed back to temporal domain with inverse FFT. In addition, to facilitate the learning of frequency filter in DTF, we perform frame-wise aggregation to enhance the primary temporal feature with its temporal neighbors by inter-frame correlation. It is feasible to plug DTF block into ConvNets and Transformer, yielding DTF-Net and DTF-Transformer. Extensive experiments conducted on three datasets demonstrate the superiority of our proposals. More remarkably, DTF-Transformer achieves an accuracy of 83.5% on Kinetics-400 dataset. Source code is available at https://github.com/FuchenUSTC/DTF.
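
The transform, modulate, and inverse-transform pipeline described in the abstract above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation (their code is at the linked repository): the per-location filters are passed in as fixed inputs, whereas DTF learns them dynamically, and a naive O(T²) DFT from the standard library stands in for the FFT. The names `dft`, `idft`, and `temporal_frequency_filter` are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def temporal_frequency_filter(features, filters):
    """Filter each spatial location's temporal sequence in the frequency domain.

    features[p]: list of T floats, the temporal feature at spatial location p
    filters[p]:  list of T gains for location p (assumed given here; learned in DTF)
    """
    out = []
    for seq, gains in zip(features, filters):
        spectrum = dft(seq)                                   # time -> frequency
        modulated = [s * g for s, g in zip(spectrum, gains)]  # per-frequency modulation
        # Keep the real part; the result is exactly real only if the gains are
        # conjugate-symmetric, otherwise the imaginary part is discarded.
        out.append([v.real for v in idft(modulated)])         # frequency -> time
    return out
```

Because every location has its own gain vector, two pixels with the same temporal signal can be filtered differently, which is the "spatial-aware" part of the description. A filter that keeps only the zero-frequency bin, for instance, replaces a location's sequence with its temporal mean.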

Empowering Language Models with Knowledge Graph Reasoning for Question Answering

  • Authors: Ziniu Hu, Yichong Xu, Wenhao Yu, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Kai-Wei Chang, Yizhou Sun
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.08380
  • Pdf link: https://arxiv.org/pdf/2211.08380
  • Abstract Answering open-domain questions requires world knowledge about in-context entities. As pre-trained Language Models (LMs) lack the power to store all required knowledge, external knowledge sources, such as knowledge graphs, are often used to augment LMs. In this work, we propose knOwledge REasOning empowered Language Model (OREO-LM), which consists of a novel Knowledge Interaction Layer that can be flexibly plugged into existing Transformer-based LMs to interact with a differentiable Knowledge Graph Reasoning module collaboratively. In this way, LM guides KG to walk towards the desired answer, while the retrieved knowledge improves LM. By adopting OREO-LM to RoBERTa and T5, we show significant performance gain, achieving state-of-the-art results in the Closed-Book setting. The performance enhancement is mainly from the KG reasoning's capacity to infer missing relational facts. In addition, OREO-LM provides reasoning paths as rationales to interpret the model's decision.

Structured Knowledge Distillation Towards Efficient and Compact Multi-View 3D Detection

  • Authors: Linfeng Zhang, Yukang Shi, Hung-Shuo Tai, Zhipeng Zhang, Yuan He, Ke Wang, Kaisheng Ma
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.08398
  • Pdf link: https://arxiv.org/pdf/2211.08398
  • Abstract Detecting 3D objects from multi-view images is a fundamental problem in 3D computer vision. Recently, significant breakthrough has been made in multi-view 3D detection tasks. However, the unprecedented detection performance of these vision BEV (bird's-eye-view) detection models is accompanied with enormous parameters and computation, which make them unaffordable on edge devices. To address this problem, in this paper, we propose a structured knowledge distillation framework, aiming to improve the efficiency of modern vision-only BEV detection models. The proposed framework mainly includes: (a) spatial-temporal distillation which distills teacher knowledge of information fusion from different timestamps and views, (b) BEV response distillation which distills teacher response to different pillars, and (c) weight-inheriting which solves the problem of inconsistent inputs between students and teacher in modern transformer architectures. Experimental results show that our method leads to an average improvement of 2.16 mAP and 2.27 NDS on the nuScenes benchmark, outperforming multiple baselines by a large margin.

Keyword: scene understanding

Visually Grounded VQA by Lattice-based Retrieval

  • Authors: Daniel Reich, Felix Putze, Tanja Schultz
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.08086
  • Pdf link: https://arxiv.org/pdf/2211.08086
  • Abstract Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions. Systems with strong VG are considered intuitively interpretable and suggest an improved scene understanding. While VQA accuracy performances have seen impressive gains over the past few years, explicit improvements to VG performance and evaluation thereof have often taken a back seat on the road to overall accuracy improvements. A cause of this originates in the predominant choice of learning paradigm for VQA systems, which consists of training a discriminative classifier over a predetermined set of answer options. In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task. As such, the developed system directly ties VG into its core search procedure. Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question. We give a detailed analysis of our approach and discuss its distinctive properties and limitations. Our approach achieves the strongest VG performance among examined systems and exhibits exceptional generalization capabilities in a number of scenarios.
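
The abstract above does not spell out the search procedure, so the following is only a generic sketch of one way to search a weighted, directed, acyclic graph: finding the highest-scoring path from a start node to an end node by dynamic programming over a topological order. The function name `best_path`, the edge-list format, and the node labels in the usage note are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict, deque


def best_path(edges, source, target):
    """Return (score, path) for the maximum-weight path in a DAG, or (None, [])."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = {source, target}
    for u, v, w in edges:
        graph[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))

    # Topological order via Kahn's algorithm; a leftover node means a cycle.
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in graph[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a lattice")

    # Relax edges in topological order, remembering the best predecessor.
    best = {source: 0.0}
    parent = {}
    for u in order:
        if u not in best:
            continue  # not reachable from the source
        for v, w in graph[u]:
            if v not in best or best[u] + w > best[v]:
                best[v] = best[u] + w
                parent[v] = u
    if target not in best:
        return None, []
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[target], path[::-1]
```

Given edges `("q", "a", 1.0)`, `("q", "b", 2.0)`, `("a", "ans", 5.0)`, and `("b", "ans", 1.0)`, the search from `"q"` to `"ans"` prefers the route through `"a"`, whose total weight of 6.0 beats the 3.0 of the route through `"b"`. Because the returned path names every node it passes through, a retrieval formulated this way exposes which graph nodes supported the answer.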

Monocular BEV Perception of Road Scenes via Front-to-Top View Projection

  • Authors: Wenxi Liu, Qi Li, Weixiang Yang, Jiaxin Cai, Yuanlong Yu, Yuexin Ma, Shengfeng He, Jia Pan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.08144
  • Pdf link: https://arxiv.org/pdf/2211.08144
  • Abstract HD map reconstruction is crucial for autonomous driving. LiDAR-based methods are limited due to expensive sensors and time-consuming computation. Camera-based methods usually need to perform road segmentation and view transformation separately, which often causes distortion and missing content. To push the limits of the technology, we present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view given a front-view monocular image only. We propose a front-to-top view projection (FTVP) module, which takes the constraint of cycle consistency between views into account and makes full use of their correlation to strengthen the view transformation and scene understanding. In addition, we also apply multi-scale FTVP modules to propagate the rich spatial information of low-level features to mitigate spatial deviation of the predicted object location. Experiments on public benchmarks show that our method achieves the state-of-the-art performance in the tasks of road layout estimation, vehicle occupancy estimation, and multi-class semantic estimation. For multi-class semantic estimation, in particular, our model outperforms all competitors by a large margin. Furthermore, our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.

FlowGrad: Using Motion for Visual Sound Source Localization

  • Authors: Rajsuryan Singh, Pablo Zinemanas, Xavier Serra, Juan Pablo Bello, Magdalena Fuentes
  • Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2211.08367
  • Pdf link: https://arxiv.org/pdf/2211.08367
  • Abstract Most recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner, and by design excludes temporal information present in videos. While it proves to be effective for widely used benchmark datasets, the method falls short for challenging scenarios like urban traffic. This work introduces temporal context into the state-of-the-art methods for sound source localization in urban scenes using optical flow as a means to encode motion information. An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.

Keyword: visual reasoning

There is no result

DongZhouGu, Nov 16 '22 03:11