arxiv-daily New submissions for Thu, 4 Jan 24

New submissions for Thu, 4 Jan 24

Open DongZhouGu opened this issue 1 year ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Assisting Blind People Using Object Detection with Vocal Feedback

Authors: Heba Najm, Khirallah Elferjani, Alhaam Alariyibi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.01362
Pdf link: https://arxiv.org/pdf/2401.01362
Abstract For visually impaired people, it is highly difficult to make independent movement and safely move in both indoors and outdoors environment. Furthermore, these physically and visually challenges prevent them from in day-today live activities. Similarly, they have problem perceiving objects of surrounding environment that may pose a risk to them. The proposed approach suggests detection of objects in real-time video by using a web camera, for the object identification, process. You Look Only Once (YOLO) model is utilized which is CNN-based real-time object detection technique. Additionally, The OpenCV libraries of Python is used to implement the software program as well as deep learning process is performed. Image recognition results are transferred to the visually impaired users in audible form by means of Google text-to-speech library and determine object location relative to its position in the screen. The obtaining result was evaluated by using the mean Average Precision (mAP), and it was found that the proposed approach achieves excellent results when it compared to previous approaches.

Fast Quantum Convolutional Neural Networks for Low-Complexity Object Detection in Autonomous Driving Applications

Authors: Hankyul Baek, Donghyeon Kim, Joongheon Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2401.01370
Pdf link: https://arxiv.org/pdf/2401.01370
Abstract Spurred by consistent advances and innovation in deep learning, object detection applications have become prevalent, particularly in autonomous driving that leverages various visual data. As convolutional neural networks (CNNs) are being optimized, the performances and computation speeds of object detection in autonomous driving have been significantly improved. However, due to the exponentially rapid growth in the complexity and scale of data used in object detection, there are limitations in terms of computation speeds while conducting object detection solely with classical computing. Motivated by this, quantum convolution-based object detection (QCOD) is proposed to adopt quantum computing to perform object detection at high speed. The QCOD utilizes our proposed fast quantum convolution that uploads input channel information and re-constructs output channels for achieving reduced computational complexity and thus improving performances. Lastly, the extensive experiments with KITTI autonomous driving object detection dataset verify that the proposed fast quantum convolution and QCOD are successfully operated in real object detection applications.

DiffYOLO: Object Detection for Anti-Noise via YOLO and Diffusion Models

Authors: Yichen Liu, Huajian Zhang, Daqing Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.01659
Pdf link: https://arxiv.org/pdf/2401.01659
Abstract Object detection models represented by YOLO series have been widely used and have achieved great results on the high quality datasets, but not all the working conditions are ideal. To settle down the problem of locating targets on low quality datasets, the existing methods either train a new object detection network, or need a large collection of low-quality datasets to train. However, we propose a framework in this paper and apply it on the YOLO models called DiffYOLO. Specifically, we extract feature maps from the denoising diffusion probabilistic models to enhance the well-trained models, which allows us fine-tune YOLO on high-quality datasets and test on low-quality datasets. The results proved this framework can not only prove the performance on noisy datasets, but also prove the detection results on high-quality test datasets. We will supplement more experiments later (with various datasets and network architectures).

Keyword: transformer

SwapTransformer: highway overtaking tactical planner model via imitation learning on OSHA dataset

Authors: Alireza Shamsoshoara, Safin B Salih, Pedram Aghazadeh
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2401.01425
Pdf link: https://arxiv.org/pdf/2401.01425
Abstract This paper investigates the high-level decision-making problem in highway scenarios regarding lane changing and over-taking other slower vehicles. In particular, this paper aims to improve the Travel Assist feature for automatic overtaking and lane changes on highways. About 9 million samples including lane images and other dynamic objects are collected in simulation. This data; Overtaking on Simulated HighwAys (OSHA) dataset is released to tackle this challenge. To solve this problem, an architecture called SwapTransformer is designed and implemented as an imitation learning approach on the OSHA dataset. Moreover, auxiliary tasks such as future points and car distance network predictions are proposed to aid the model in better understanding the surrounding environment. The performance of the proposed solution is compared with a multi-layer perceptron (MLP) and multi-head self-attention networks as baselines in a simulation environment. We also demonstrate the performance of the model with and without auxiliary tasks. All models are evaluated based on different metrics such as time to finish each lap, number of overtakes, and speed difference with speed limit. The evaluation shows that the SwapTransformer model outperforms other models in different traffic densities in the inference phase.

ProbMCL: Simple Probabilistic Contrastive Learning for Multi-label Visual Classification

Authors: Ahmad Sajedi, Samir Khaki, Yuri A. Lawryshyn, Konstantinos N. Plataniotis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.01448
Pdf link: https://arxiv.org/pdf/2401.01448
Abstract Multi-label image classification presents a challenging task in many domains, including computer vision and medical imaging. Recent advancements have introduced graph-based and transformer-based methods to improve performance and capture label dependencies. However, these methods often include complex modules that entail heavy computation and lack interpretability. In this paper, we propose Probabilistic Multi-label Contrastive Learning (ProbMCL), a novel framework to address these challenges in multi-label image classification tasks. Our simple yet effective approach employs supervised contrastive learning, in which samples that share enough labels with an anchor image based on a decision threshold are introduced as a positive set. This structure captures label dependencies by pulling positive pair embeddings together and pushing away negative samples that fall below the threshold. We enhance representation learning by incorporating a mixture density network into contrastive learning and generating Gaussian mixture distributions to explore the epistemic uncertainty of the feature encoder. We validate the effectiveness of our framework through experimentation with datasets from the computer vision and medical imaging domains. Our method outperforms the existing state-of-the-art methods while achieving a low computational footprint on both datasets. Visualization analyses also demonstrate that ProbMCL-learned classifiers maintain a meaningful semantic topology.

Token Propagation Controller for Efficient Vision Transformer

Authors: Wentao Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2401.01470
Pdf link: https://arxiv.org/pdf/2401.01470
Abstract Vision transformers (ViTs) have achieved promising results on a variety of Computer Vision tasks, however their quadratic complexity in the number of input tokens has limited their application specially in resource-constrained settings. Previous approaches that employ gradual token reduction to address this challenge assume that token redundancy in one layer implies redundancy in all the following layers. We empirically demonstrate that this assumption is often not correct, i.e., tokens that are redundant in one layer can be useful in later layers. We employ this key insight to propose a novel token propagation controller (TPC) that incorporates two different token-distributions, i.e., pause probability and restart probability to control the reduction and reuse of tokens respectively, which results in more efficient token utilization. To improve the estimates of token distributions, we propose a smoothing mechanism that acts as a regularizer and helps remove noisy outliers. Furthermore, to improve the training-stability of our proposed TPC, we introduce a model stabilizer that is able to implicitly encode local image structures and minimize accuracy fluctuations during model training. We present extensive experimental results on the ImageNet-1K dataset using DeiT, LV-ViT and Swin models to demonstrate the effectiveness of our proposed method. For example, compared to baseline models, our proposed method improves the inference speed of the DeiT-S by 250% while increasing the classification accuracy by 1.0%.

Kernel-U-Net: Hierarchical and Symmetrical Framework for Multivariate Time Series Forecasting

Authors: Jiang You, Reńe Natowicz, Arben Cela, Jacob Ouanounou, Patrick Siarry
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.01479
Pdf link: https://arxiv.org/pdf/2401.01479
Abstract Time series forecasting task predicts future trends based on historical information. Recent U-Net-based methods have demonstrated superior performance in predicting real-world datasets. However, the performance of these models is lower than patch-based models or linear models. In this work, we propose a symmetric and hierarchical framework, Kernel-U-Net, which cuts the input sequence into slices at each layer of the network and then computes them using kernels. Furthermore, it generalizes the concept of convolutional kernels in classic U-Net to accept custom kernels that follow the same design pattern. Compared to the existing linear or transformer-based solution, our model contains 3 advantages: 1) A small number of parameters: the parameters size is $O(log(L)^2)$ where $L$ is the look-back window size, 2) Flexibility: its kernels can be customized and fitted to the datasets, 3) Computation efficiency: the computation complexity of transformer modules is reduced to $O(log(L)^2)$ if they are placed close to the latent vector. Kernel-U-Net accuracy was greater than or equal to the state-of-the-art model on six (out of seven) real-world datasets.

Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports

Authors: Haopeng Li, Andong Deng, Qiuhong Ke, Jun Liu, Hossein Rahmani, Yulan Guo, Bernt Schiele, Chen Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.01505
Pdf link: https://arxiv.org/pdf/2401.01505
Abstract Reasoning over sports videos for question answering is an important task with numerous applications, such as player training and information retrieval. However, this task has not been explored due to the lack of relevant datasets and the challenging nature it presents. Most datasets for video question answering (VideoQA) focus mainly on general and coarse-grained understanding of daily-life videos, which is not applicable to sports scenarios requiring professional action understanding and fine-grained motion analysis. In this paper, we introduce the first dataset, named Sports-QA, specifically designed for the sports VideoQA task. The Sports-QA dataset includes various types of questions, such as descriptions, chronologies, causalities, and counterfactual conditions, covering multiple sports. Furthermore, to address the characteristics of the sports VideoQA task, we propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering. We conduct extensive experiments on Sports-QA, including baseline studies and the evaluation of different methods. The results demonstrate that our AFT achieves state-of-the-art performance.

CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

Authors: Yi Rong, Haoran Zhou, Lixin Yuan, Cheng Mei, Jiahao Wang, Tong Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.01552
Pdf link: https://arxiv.org/pdf/2401.01552
Abstract Point cloud completion is an indispensable task for recovering complete point clouds due to incompleteness caused by occlusion, limited sensor resolution, etc. The family of coarse-to-fine generation architectures has recently exhibited great success in point cloud completion and gradually became mainstream. In this work, we unveil one of the key ingredients behind these methods: meticulously devised feature extraction operations with explicit cross-resolution aggregation. We present Cross-Resolution Transformer that efficiently performs cross-resolution aggregation with local attention mechanisms. With the help of our recursive designs, the proposed operation can capture more scales of features than common aggregation operations, which is beneficial for capturing fine geometric characteristics. While prior methodologies have ventured into various manifestations of inter-level cross-resolution aggregation, the effectiveness of intra-level one and their combination has not been analyzed. With unified designs, Cross-Resolution Transformer can perform intra- or inter-level cross-resolution aggregation by switching inputs. We integrate two forms of Cross-Resolution Transformers into one up-sampling block for point generation, and following the coarse-to-fine manner, we construct CRA-PCN to incrementally predict complete shapes with stacked up-sampling blocks. Extensive experiments demonstrate that our method outperforms state-of-the-art methods by a large margin on several widely used benchmarks. Codes are available at https://github.com/EasyRy/CRA-PCN.

A Transformer-Based Adaptive Semantic Aggregation Method for UAV Visual Geo-Localization

Authors: Shishen Li, Cuiwei Liu, Huaijun Qiu, Zhaokui Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.01574
Pdf link: https://arxiv.org/pdf/2401.01574
Abstract This paper addresses the task of Unmanned Aerial Vehicles (UAV) visual geo-localization, which aims to match images of the same geographic target taken by different platforms, i.e., UAVs and satellites. In general, the key to achieving accurate UAV-satellite image matching lies in extracting visual features that are robust against viewpoint changes, scale variations, and rotations. Current works have shown that part matching is crucial for UAV visual geo-localization since part-level representations can capture image details and help to understand the semantic information of scenes. However, the importance of preserving semantic characteristics in part-level representations is not well discussed. In this paper, we introduce a transformer-based adaptive semantic aggregation method that regards parts as the most representative semantics in an image. Correlations of image patches to different parts are learned in terms of the transformer's feature map. Then our method decomposes part-level features into an adaptive sum of all patch features. By doing this, the learned parts are encouraged to focus on patches with typical semantics. Extensive experiments on the University-1652 dataset have shown the superiority of our method over the current works.

Context-Guided Spatio-Temporal Video Grounding

Authors: Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, Libo Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.01578
Pdf link: https://arxiv.org/pdf/2401.01578
Abstract Spatio-temporal video grounding (or STVG) task aims at locating a spatio-temporal tube for a specific instance given a text query. Despite advancements, current methods easily suffer the distractors or heavy object appearance variations in videos due to insufficient object information from the text, leading to degradation. Addressing this, we propose a novel framework, context-guided STVG (CG-STVG), which mines discriminative instance context for object in videos and applies it as a supplementary guidance for target localization. The key of CG-STVG lies in two specially designed modules, including instance context generation (ICG), which focuses on discovering visual context information (in both appearance and motion) of the instance, and instance context refinement (ICR), which aims to improve the instance context from ICG by eliminating irrelevant or even harmful information from the context. During grounding, ICG, together with ICR, are deployed at each decoding stage of a Transformer architecture for instance context learning. Particularly, instance context learned from one decoding stage is fed to the next stage, and leveraged as a guidance containing rich and discriminative object feature to enhance the target-awareness in decoding feature, which conversely benefits generating better new instance context for improving localization finally. Compared to existing methods, CG-STVG enjoys object information in text query and guidance from mined instance visual context for more accurate target localization. In our experiments on three benchmarks, including HCSTVG-v1/-v2 and VidSTG, CG-STVG sets new state-of-the-arts in m_tIoU and m_vIoU on all of them, showing its efficacy. The code will be released at https://github.com/HengLan/CGSTVG.

Entropy-based Probing Beam Selection and Beam Prediction via Deep Learning

Authors: Fan Meng, Cheng Zhang, Yongming Huang, Zhilei Zhang, Xiaoyu Bai, Zhaohua Lu
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2401.01609
Pdf link: https://arxiv.org/pdf/2401.01609
Abstract Hierarchical beam search in mmWave communications incurs substantial training overhead, necessitating deep learning-enabled beam predictions to effectively leverage channel priors and mitigate this overhead. In this study, we introduce a comprehensive probabilistic model of power distribution in beamspace, and formulate the joint optimization problem of probing beam selection and probabilistic beam prediction as an entropy minimization problem. Then, we propose a greedy scheme to iteratively and alternately solve this problem, where a transformer-based beam predictor is trained to estimate the conditional power distribution based on the probing beams and user location within each iteration, and the trained predictor selects an unmeasured beam that minimizes the entropy of remaining beams. To further reduce the number of interactions and the computational complexity of the iterative scheme, we propose a two-stage probing beam selection scheme. Firstly, probing beams are selected from a location-specific codebook designed by an entropy-based criterion, and predictions are made with corresponding feedback. Secondly, the optimal beam is identified using additional probing beams with the highest predicted power values. Simulation results demonstrate the superiority of the proposed schemes compared to hierarchical beam search and beam prediction with uniform probing beams.

MLPs Compass: What is learned when MLPs are combined with PLMs?

Authors: Li Zhou, Wenyu Chen, Yong Cao, Dingyi Zeng, Wanlong Liu, Hong Qu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.01667
Pdf link: https://arxiv.org/pdf/2401.01667
Abstract While Transformer-based pre-trained language models and their variants exhibit strong semantic representation capabilities, the question of comprehending the information gain derived from the additional components of PLMs remains an open question in this field. Motivated by recent efforts that prove Multilayer-Perceptrons (MLPs) modules achieving robust structural capture capabilities, even outperforming Graph Neural Networks (GNNs), this paper aims to quantify whether simple MLPs can further enhance the already potent ability of PLMs to capture linguistic information. Specifically, we design a simple yet effective probing framework containing MLPs components based on BERT structure and conduct extensive experiments encompassing 10 probing tasks spanning three distinct linguistic levels. The experimental results demonstrate that MLPs can indeed enhance the comprehension of linguistic structure by PLMs. Our research provides interpretable and valuable insights into crafting variations of PLMs utilizing MLPs for tasks that emphasize diverse linguistic structures.

Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

Authors: Dengdi Sun, Yajie Pan, Andong Lu, Chenglong Li, Bin Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.01674
Pdf link: https://arxiv.org/pdf/2401.01674
Abstract Many RGBT tracking researches primarily focus on modal fusion design, while overlooking the effective handling of target appearance changes. While some approaches have introduced historical frames or fuse and replace initial templates to incorporate temporal information, they have the risk of disrupting the original target appearance and accumulating errors over time. To alleviate these limitations, we propose a novel Transformer RGBT tracking approach, which mixes spatio-temporal multimodal tokens from the static multimodal templates and multimodal search regions in Transformer to handle target appearance changes, for robust RGBT tracking. We introduce independent dynamic template tokens to interact with the search region, embedding temporal information to address appearance changes, while also retaining the involvement of the initial static template tokens in the joint feature extraction process to ensure the preservation of the original reliable target appearance information that prevent deviations from the target appearance caused by traditional temporal updates. We also use attention mechanisms to enhance the target features of multimodal template tokens by incorporating supplementary modal cues, and make the multimodal search region tokens interact with multimodal dynamic template tokens via attention mechanisms, which facilitates the conveyance of multimodal-enhanced target change information. Our module is inserted into the transformer backbone network and inherits joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three RGBT benchmark datasets show that the proposed approach maintains competitive performance compared to other state-of-the-art tracking algorithms while running at 39.1 FPS.

Towards Robust Semantic Segmentation against Patch-based Attack via Attention Refinement

Authors: Zheng Yuan, Jie Zhang, Yude Wang, Shiguang Shan, Xilin Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.01750
Pdf link: https://arxiv.org/pdf/2401.01750
Abstract The attention mechanism has been proven effective on various visual tasks in recent years. In the semantic segmentation task, the attention mechanism is applied in various methods, including the case of both Convolution Neural Networks (CNN) and Vision Transformer (ViT) as backbones. However, we observe that the attention mechanism is vulnerable to patch-based adversarial attacks. Through the analysis of the effective receptive field, we attribute it to the fact that the wide receptive field brought by global attention may lead to the spread of the adversarial patch. To address this issue, in this paper, we propose a Robust Attention Mechanism (RAM) to improve the robustness of the semantic segmentation model, which can notably relieve the vulnerability against patch-based attacks. Compared to the vallina attention mechanism, RAM introduces two novel modules called Max Attention Suppression and Random Attention Dropout, both of which aim to refine the attention matrix and limit the influence of a single adversarial patch on the semantic segmentation results of other positions. Extensive experiments demonstrate the effectiveness of our RAM to improve the robustness of semantic segmentation models against various patch-based attack methods under different attack settings.

FullLoRA-AT: Efficiently Boosting the Robustness of Pretrained Vision Transformers

Authors: Zheng Yuan, Jie Zhang, Shiguang Shan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.01752
Pdf link: https://arxiv.org/pdf/2401.01752
Abstract In recent years, the Vision Transformer (ViT) model has gradually become mainstream in various computer vision tasks, and the robustness of the model has received increasing attention. However, existing large models tend to prioritize performance during training, potentially neglecting the robustness, which may lead to serious security concerns. In this paper, we establish a new challenge: exploring how to use a small number of additional parameters for adversarial finetuning to quickly and effectively enhance the adversarial robustness of a standardly trained model. To address this challenge, we develop the novel LNLoRA module, incorporating a learnable layer normalization before the conventional LoRA module, which helps mitigate magnitude differences in parameters between the adversarial and standard training paradigms. Furthermore, we propose the FullLoRA-AT framework by integrating the learnable LNLoRA modules into all key components of ViT-based models while keeping the pretrained model frozen, which can significantly improve the model robustness via adversarial finetuning in a parameter-efficient manner. Extensive experiments on CIFAR-10, CIFAR-100, and Imagenette demonstrate the superiority of our proposed FullLoRA-AT framework. It achieves comparable robustness with full finetuning while only requiring about 5% of the learnable parameters. This also effectively addresses concerns regarding extra model storage space and enormous training time caused by adversarial finetuning.

Incremental FastPitch: Chunk-based High Quality Text to Speech

Authors: Muyang Du, Chuan Liu, Junjie Lai
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2401.01755
Pdf link: https://arxiv.org/pdf/2401.01755
Abstract Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process compared with conventional auto-regressive models. Although parallel models have benefits in many aspects, they become naturally unfit for incremental synthesis due to their fully parallel architecture such as transformer. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and inference with fixed size past model states. Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch, with a significant lower latency that allows even lower response time for real-time speech applications.

Iterative Mask Filling: An Effective Text Augmentation Method Using Masked Language Modeling

Authors: Himmet Toprak Kesgin, Mehmet Fatih Amasyali
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.01830
Pdf link: https://arxiv.org/pdf/2401.01830
Abstract Data augmentation is an effective technique for improving the performance of machine learning models. However, it has not been explored as extensively in natural language processing (NLP) as it has in computer vision. In this paper, we propose a novel text augmentation method that leverages the Fill-Mask feature of the transformer-based BERT model. Our method involves iteratively masking words in a sentence and replacing them with language model predictions. We have tested our proposed method on various NLP tasks and found it to be effective in many cases. Our results are presented along with a comparison to existing augmentation methods. Experimental results show that our proposed method significantly improves performance, especially on topic classification datasets.

Transformer Neural Autoregressive Flows

Authors: Massimiliano Patacchiola, Aliaksandra Shysheya, Katja Hofmann, Richard E. Turner
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.01855
Pdf link: https://arxiv.org/pdf/2401.01855
Abstract Density estimation, a central problem in machine learning, can be performed using Normalizing Flows (NFs). NFs comprise a sequence of invertible transformations, that turn a complex target distribution into a simple one, by exploiting the change of variables theorem. Neural Autoregressive Flows (NAFs) and Block Neural Autoregressive Flows (B-NAFs) are arguably the most perfomant members of the NF family. However, they suffer scalability issues and training instability due to the constraints imposed on the network structure. In this paper, we propose a novel solution to these challenges by exploiting transformers to define a new class of neural flows called Transformer Neural Autoregressive Flows (T-NAFs). T-NAFs treat each dimension of a random variable as a separate input token, using attention masking to enforce an autoregressive constraint. We take an amortization-inspired approach where the transformer outputs the parameters of an invertible transformation. The experimental results demonstrate that T-NAFs consistently match or outperform NAFs and B-NAFs across multiple datasets from the UCI benchmark. Remarkably, T-NAFs achieve these results using an order of magnitude fewer parameters than previous approaches, without composing multiple flows.

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

Jan 04 '24 02:01 DongZhouGu

arxiv-daily arxiv-daily copied to clipboard

New submissions for Thu, 4 Jan 24

Keyword: human object interaction

Keyword: visual relation detection

Keyword: object detection

Assisting Blind People Using Object Detection with Vocal Feedback

Fast Quantum Convolutional Neural Networks for Low-Complexity Object Detection in Autonomous Driving Applications

DiffYOLO: Object Detection for Anti-Noise via YOLO and Diffusion Models

Keyword: transformer

SwapTransformer: highway overtaking tactical planner model via imitation learning on OSHA dataset

ProbMCL: Simple Probabilistic Contrastive Learning for Multi-label Visual Classification

Token Propagation Controller for Efficient Vision Transformer

Kernel-U-Net: Hierarchical and Symmetrical Framework for Multivariate Time Series Forecasting

Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports

CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

A Transformer-Based Adaptive Semantic Aggregation Method for UAV Visual Geo-Localization

Context-Guided Spatio-Temporal Video Grounding

Entropy-based Probing Beam Selection and Beam Prediction via Deep Learning

MLPs Compass: What is learned when MLPs are combined with PLMs?

Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

Towards Robust Semantic Segmentation against Patch-based Attack via Attention Refinement

FullLoRA-AT: Efficiently Boosting the Robustness of Pretrained Vision Transformers

Incremental FastPitch: Chunk-based High Quality Text to Speech

Iterative Mask Filling: An Effective Text Augmentation Method Using Masked Language Modeling

Transformer Neural Autoregressive Flows

Keyword: scene understanding

Keyword: visual reasoning

arxiv-daily
arxiv-daily copied to clipboard