arxiv-daily New submissions for Fri, 12 Jan 24

New submissions for Fri, 12 Jan 24

Open DongZhouGu opened this issue 1 year ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

DualTeacher: Bridging Coexistence of Unlabelled Classes for Semi-supervised Incremental Object Detection

Authors: Ziqi Yuan, Liyuan Wang, Wenbo Ding, Xingxing Zhang, Jiachen Zhong, Jianyong Ai, Jianmin Li, Jun Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.05362
Pdf link: https://arxiv.org/pdf/2401.05362
Abstract In real-world applications, an object detector often encounters object instances from new classes and needs to accommodate them effectively. Previous work formulated this critical problem as incremental object detection (IOD), which assumes the object instances of new classes to be fully annotated in incremental data. However, as supervisory signals are usually rare and expensive, the supervised IOD may not be practical for implementation. In this work, we consider a more realistic setting named semi-supervised IOD (SSIOD), where the object detector needs to learn new classes incrementally from a few labelled data and massive unlabelled data without catastrophic forgetting of old classes. A commonly-used strategy for supervised IOD is to encourage the current model (as a student) to mimic the behavior of the old model (as a teacher), but it generally fails in SSIOD because a dominant number of object instances from old and new classes are coexisting and unlabelled, with the teacher only recognizing a fraction of them. Observing that learning only the classes of interest tends to preclude detection of other classes, we propose to bridge the coexistence of unlabelled classes by constructing two teacher models respectively for old and new classes, and using the concatenation of their predictions to instruct the student. This approach is referred to as DualTeacher, which can serve as a strong baseline for SSIOD with limited resource overhead and no extra hyperparameters. We build various benchmarks for SSIOD and perform extensive experiments to demonstrate the superiority of our approach (e.g., the performance lead is up to 18.28 AP on MS-COCO). Our code is available at \url{https://github.com/chuxiuhong/DualTeacher}.

Domain Similarity-Perceived Label Assignment for Domain Generalized Underwater Object Detection

Authors: Xisheng Li, Wei Li, Pinhao Song, Mingjun Zhang, Jie Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.05401
Pdf link: https://arxiv.org/pdf/2401.05401
Abstract The inherent characteristics and light fluctuations of water bodies give rise to the huge difference between different layers and regions in underwater environments. When the test set is collected in a different marine area from the training set, the issue of domain shift emerges, significantly compromising the model's ability to generalize. The Domain Adversarial Learning (DAL) training strategy has been previously utilized to tackle such challenges. However, DAL heavily depends on manually one-hot domain labels, which implies no difference among the samples in the same domain. Such an assumption results in the instability of DAL. This paper introduces the concept of Domain Similarity-Perceived Label Assignment (DSP). The domain label for each image is regarded as its similarity to the specified domains. Through domain-specific data augmentation techniques, we achieved state-of-the-art results on the underwater cross-domain object detection benchmark S-UODAC2020. Furthermore, we validated the effectiveness of our method in the Cityscapes dataset.

Consensus Focus for Object Detection and minority classes

Authors: Erik Isai Valle Salgado, Chen Li, Yaqi Han, Linchao Shi, Xinghui Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.05530
Pdf link: https://arxiv.org/pdf/2401.05530
Abstract Ensemble methods exploit the availability of a given number of classifiers or detectors trained in single or multiple source domains and tasks to address machine learning problems such as domain adaptation or multi-source transfer learning. Existing research measures the domain distance between the sources and the target dataset, trains multiple networks on the same data with different samples per class, or combines predictions from models trained under varied hyperparameters and settings. Their solutions enhanced the performance on small or tail categories but hurt the rest. To this end, we propose a modified consensus focus for semi-supervised and long-tailed object detection. We introduce a voting system based on source confidence that spots the contribution of each model in a consensus, lets the user choose the relevance of each class in the target label space so that it relaxes minority bounding boxes suppression, and combines multiple models' results without discarding the poisonous networks. Our tests on synthetic driving datasets retrieved higher confidence and more accurate bounding boxes than the NMS, soft-NMS, and WBF.

Wasserstein Distance-based Expansion of Low-Density Latent Regions for Unknown Class Detection

Authors: Prakash Mallick, Feras Dayoub, Jamie Sherrah
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.05594
Pdf link: https://arxiv.org/pdf/2401.05594
Abstract This paper addresses the significant challenge in open-set object detection (OSOD): the tendency of state-of-the-art detectors to erroneously classify unknown objects as known categories with high confidence. We present a novel approach that effectively identifies unknown objects by distinguishing between high and low-density regions in latent space. Our method builds upon the Open-Det (OD) framework, introducing two new elements to the loss function. These elements enhance the known embedding space's clustering and expand the unknown space's low-density regions. The first addition is the Class Wasserstein Anchor (CWA), a new function that refines the classification boundaries. The second is a spectral normalisation step, improving the robustness of the model. Together, these augmentations to the existing Contrastive Feature Learner (CFL) and Unknown Probability Learner (UPL) loss functions significantly improve OSOD performance. Our proposed OpenDet-CWA (OD-CWA) method demonstrates: a) a reduction in open-set errors by approximately 17%-22%, b) an enhancement in novelty detection capability by 1.5%-16%, and c) a decrease in the wilderness index by 2%-20% across various open-set scenarios. These results represent a substantial advancement in the field, showcasing the potential of our approach in managing the complexities of open-set object detection.

Keyword: transformer

STR-Cert: Robustness Certification for Deep Text Recognition on Deep Learning Pipelines and Vision Transformers

Authors: Daqian Shao, Lukas Fesser, Marta Kwiatkowska
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.05338
Pdf link: https://arxiv.org/pdf/2401.05338
Abstract Robustness certification, which aims to formally certify the predictions of neural networks against adversarial inputs, has become an integral part of important tool for safety-critical applications. Despite considerable progress, existing certification methods are limited to elementary architectures, such as convolutional networks, recurrent networks and recently Transformers, on benchmark datasets such as MNIST. In this paper, we focus on the robustness certification of scene text recognition (STR), which is a complex and extensively deployed image-based sequence prediction problem. We tackle three types of STR model architectures, including the standard STR pipelines and the Vision Transformer. We propose STR-Cert, the first certification method for STR models, by significantly extending the DeepPoly polyhedral verification framework via deriving novel polyhedral bounds and algorithms for key STR model components. Finally, we certify and compare STR models on six datasets, demonstrating the efficiency and scalability of robustness certification, particularly for the Vision Transformer.

Efficient LLM inference solution on Intel GPU

Authors: Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.05391
Pdf link: https://arxiv.org/pdf/2401.05391
Abstract Transformer based Large Language Models (LLMs) have been widely used in many fields, and the efficiency of LLM inference becomes hot topic in real applications. However, LLMs are usually complicatedly designed in model structure with massive operations and perform inference in the auto-regressive mode, making it a challenging task to design a system with high efficiency. In this paper, we propose an efficient LLM inference solution with low latency and high throughput. Firstly, we simplify the LLM decoder layer by fusing data movement and element-wise operations to reduce the memory access frequency and lower system latency. We also propose a segment KV cache policy to keep key/value of the request and response tokens in separate physical memory for effective device memory management, helping enlarge the runtime batch size and improve system throughput. A customized Scaled-Dot-Product-Attention kernel is designed to match our fusion policy based on the segment KV cache solution. We implement our LLM inference solution on Intel GPU and publish it publicly. Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput for some popular LLMs on Intel GPU.

Automated Assessment of Students' Code Comprehension using LLMs

Authors: Priti Oli, Rabin Banjade, Jeevan Chapagain, Vasile Rus
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.05399
Pdf link: https://arxiv.org/pdf/2401.05399
Abstract Assessing student's answers and in particular natural language answers is a crucial challenge in the field of education. Advances in machine learning, including transformer-based models such as Large Language Models(LLMs), have led to significant progress in various natural language tasks. Nevertheless, amidst the growing trend of evaluating LLMs across diverse tasks, evaluating LLMs in the realm of automated answer assesment has not received much attention. To address this gap, we explore the potential of using LLMs for automated assessment of student's short and open-ended answer. Particularly, we use LLMs to compare students' explanations with expert explanations in the context of line-by-line explanations of computer programs. For comparison purposes, we assess both Large Language Models (LLMs) and encoder-based Semantic Textual Similarity (STS) models in the context of assessing the correctness of students' explanation of computer code. Our findings indicate that LLMs, when prompted in few-shot and chain-of-thought setting perform comparable to fine-tuned encoder-based models in evaluating students' short answers in programming domain.

Spatial-Related Sensors Matters: 3D Human Motion Reconstruction Assisted with Textual Semantics

Authors: Xueyuan Yang, Chao Yao, Xiaojuan Ban
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2401.05412
Pdf link: https://arxiv.org/pdf/2401.05412
Abstract Leveraging wearable devices for motion reconstruction has emerged as an economical and viable technique. Certain methodologies employ sparse Inertial Measurement Units (IMUs) on the human body and harness data-driven strategies to model human poses. However, the reconstruction of motion based solely on sparse IMUs data is inherently fraught with ambiguity, a consequence of numerous identical IMU readings corresponding to different poses. In this paper, we explore the spatial importance of multiple sensors, supervised by text that describes specific actions. Specifically, uncertainty is introduced to derive weighted features for each IMU. We also design a Hierarchical Temporal Transformer (HTT) and apply contrastive learning to achieve precise temporal and feature alignment of sensor data with textual semantics. Experimental results demonstrate our proposed approach achieves significant improvements in multiple metrics compared to existing methods. Notably, with textual supervision, our method not only differentiates between ambiguous actions such as sitting and standing but also produces more precise and natural motion.

FourCastNeXt: Improving FourCastNet Training with Limited Compute

Authors: Edison Guo, Maruf Ahmed, Yue Sun, Rahul Mahendru, Rui Yang, Harrison Cook, Tennessee Leeuwenburg, Ben Evans
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.05584
Pdf link: https://arxiv.org/pdf/2401.05584
Abstract Recently, the FourCastNet Neural Earth System Model (NESM) has shown impressive results on predicting various atmospheric variables, trained on the ERA5 reanalysis dataset. While FourCastNet enjoys quasi-linear time and memory complexity in sequence length compared to quadratic complexity in vanilla transformers, training FourCastNet on ERA5 from scratch still requires large amount of compute resources, which is expensive or even inaccessible to most researchers. In this work, we will show improved methods that can train FourCastNet using only 1% of the compute required by the baseline, while maintaining model performance or par or even better than the baseline.

Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach

Authors: Gang Wu, Junjun Jiang, Junpeng Jiang, Xianming Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2401.05633
Pdf link: https://arxiv.org/pdf/2401.05633
Abstract Recent progress in single-image super-resolution (SISR) has achieved remarkable performance, yet the computational costs of these methods remain a challenge for deployment on resource-constrained devices. Especially for transformer-based methods, the self-attention mechanism in such models brings great breakthroughs while incurring substantial computational costs. To tackle this issue, we introduce the Convolutional Transformer layer (ConvFormer) and the ConvFormer-based Super-Resolution network (CFSR), which offer an effective and efficient solution for lightweight image super-resolution tasks. In detail, CFSR leverages the large kernel convolution as the feature mixer to replace the self-attention module, efficiently modeling long-range dependencies and extensive receptive fields with a slight computational cost. Furthermore, we propose an edge-preserving feed-forward network, simplified as EFN, to obtain local feature aggregation and simultaneously preserve more high-frequency information. Extensive experiments demonstrate that CFSR can achieve an advanced trade-off between computational cost and performance when compared to existing lightweight SR methods. Compared to state-of-the-art methods, e.g. ShuffleMixer, the proposed CFSR achieves 0.39 dB gains on Urban100 dataset for x2 SR task while containing 26% and 31% fewer parameters and FLOPs, respectively. Code and pre-trained models are available at https://github.com/Aitical/CFSR.

Masked Attribute Description Embedding for Cloth-Changing Person Re-identification

Authors: Chunlei Peng, Boyu Wang, Decheng Liu, Nannan Wang, Ruimin Hu, Xinbo Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.05646
Pdf link: https://arxiv.org/pdf/2401.05646
Abstract Cloth-changing person re-identification (CC-ReID) aims to match persons who change clothes over long periods. The key challenge in CC-ReID is to extract clothing-independent features, such as face, hairstyle, body shape, and gait. Current research mainly focuses on modeling body shape using multi-modal biological features (such as silhouettes and sketches). However, it does not fully leverage the personal description information hidden in the original RGB image. Considering that there are certain attribute descriptions which remain unchanged after the changing of cloth, we propose a Masked Attribute Description Embedding (MADE) method that unifies personal visual appearance and attribute description for CC-ReID. Specifically, handling variable clothing-sensitive information, such as color and type, is challenging for effective modeling. To address this, we mask the clothing and color information in the personal attribute description extracted through an attribute detection model. The masked attribute description is then connected and embedded into Transformer blocks at various levels, fusing it with the low-level to high-level features of the image. This approach compels the model to discard clothing information. Experiments are conducted on several CC-ReID benchmarks, including PRCC, LTCC, Celeb-reID-light, and LaST. Results demonstrate that MADE effectively utilizes attribute description, enhancing cloth-changing person re-identification performance, and compares favorably with state-of-the-art methods. The code is available at https://github.com/moon-wh/MADE.

LKCA: Large Kernel Convolutional Attention

Authors: Chenghao Li, Boheng Zeng, Yi Lu, Pengbo Shi, Qingzi Chen, Jirui Liu, Lingyun Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.05738
Pdf link: https://arxiv.org/pdf/2401.05738
Abstract We revisit the relationship between attention mechanisms and large kernel ConvNets in visual transformers and propose a new spatial attention named Large Kernel Convolutional Attention (LKCA). It simplifies the attention operation by replacing it with a single large kernel convolution. LKCA combines the advantages of convolutional neural networks and visual transformers, possessing a large receptive field, locality, and parameter sharing. We explained the superiority of LKCA from both convolution and attention perspectives, providing equivalent code implementations for each view. Experiments confirm that LKCA implemented from both the convolutional and attention perspectives exhibit equivalent performance. We extensively experimented with the LKCA variant of ViT in both classification and segmentation tasks. The experiments demonstrated that LKCA exhibits competitive performance in visual tasks. Our code will be made publicly available at https://github.com/CatworldLee/LKCA.

Surface Normal Estimation with Transformers

Authors: Barry Shichen Hu, Siyun Liang, Johannes Paetzold, Huy H. Nguyen, Isao Echizen, Jiapeng Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.05745
Pdf link: https://arxiv.org/pdf/2401.05745
Abstract We propose the use of a Transformer to accurately predict normals from point clouds with noise and density variations. Previous learning-based methods utilize PointNet variants to explicitly extract multi-scale features at different input scales, then focus on a surface fitting method by which local point cloud neighborhoods are fitted to a geometric surface approximated by either a polynomial function or a multi-layer perceptron (MLP). However, fitting surfaces to fixed-order polynomial functions can suffer from overfitting or underfitting, and learning MLP-represented hyper-surfaces requires pre-generated per-point weights. To avoid these limitations, we first unify the design choices in previous works and then propose a simplified Transformer-based model to extract richer and more robust geometric features for the surface normal estimation task. Through extensive experiments, we demonstrate that our Transformer-based method achieves state-of-the-art performance on both the synthetic shape dataset PCPNet, and the real-world indoor scene dataset SceneNN, exhibiting more noise-resilient behavior and significantly faster inference. Most importantly, we demonstrate that the sophisticated hand-designed modules in existing works are not necessary to excel at the task of surface normal estimation.

Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection

Authors: Heqing Zou, Meng Shen, Yuchen Hu, Chen Chen, Eng Siong Chng, Deepu Rajan
Subjects: Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2401.05746
Pdf link: https://arxiv.org/pdf/2401.05746
Abstract Audio-visual deepfake detection scrutinizes manipulations in public video using complementary multimodal cues. Current methods, which train on fused multimodal data for multimodal targets face challenges due to uncertainties and inconsistencies in learned representations caused by independent modality manipulations in deepfake videos. To address this, we propose cross-modality and within-modality regularization to preserve modality distinctions during multimodal representation learning. Our approach includes an audio-visual transformer module for modality correspondence and a cross-modality regularization module to align paired audio-visual signals, preserving modality distinctions. Simultaneously, a within-modality regularization module refines unimodal representations with modality-specific targets to retain modal-specific details. Experimental results on the public audio-visual dataset, FakeAVCeleb, demonstrate the effectiveness and competitiveness of our approach.

Evaluating Data Augmentation Techniques for Coffee Leaf Disease Classification

Authors: Adrian Gheorghiu, Iulian-Marius Tăiatu, Dumitru-Clementin Cercel, Iuliana Marin, Florin Pop
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.05768
Pdf link: https://arxiv.org/pdf/2401.05768
Abstract The detection and classification of diseases in Robusta coffee leaves are essential to ensure that plants are healthy and the crop yield is kept high. However, this job requires extensive botanical knowledge and much wasted time. Therefore, this task and others similar to it have been extensively researched subjects in image classification. Regarding leaf disease classification, most approaches have used the more popular PlantVillage dataset while completely disregarding other datasets, like the Robusta Coffee Leaf (RoCoLe) dataset. As the RoCoLe dataset is imbalanced and does not have many samples, fine-tuning of pre-trained models and multiple augmentation techniques need to be used. The current paper uses the RoCoLe dataset and approaches based on deep learning for classifying coffee leaf diseases from images, incorporating the pix2pix model for segmentation and cycle-generative adversarial network (CycleGAN) for augmentation. Our study demonstrates the effectiveness of Transformer-based models, online augmentations, and CycleGAN augmentation in improving leaf disease classification. While synthetic data has limitations, it complements real data, enhancing model performance. These findings contribute to developing robust techniques for plant disease detection and classification.

Efficient Image Deblurring Networks based on Diffusion Models

Authors: Kang Chen, Yuanjie Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.05907
Pdf link: https://arxiv.org/pdf/2401.05907
Abstract This article introduces a sliding window model for defocus deblurring that achieves the best performance to date with extremely low memory usage. Named Swintormer, the method utilizes a diffusion model to generate latent prior features that assist in restoring more detailed images. It also extends the sliding window strategy to specialized Transformer blocks for efficient inference. Additionally, we have further optimized Multiply-Accumulate operations (Macs). Compared to the currently top-performing GRL method, our Swintormer model drastically reduces computational complexity from 140.35 GMACs to 8.02 GMacs, while also improving the Signal-to-Noise Ratio (SNR) for defocus deblurring from 27.04 dB to 27.07 dB. This new method allows for the processing of higher resolution images on devices with limited memory, significantly expanding potential application scenarios. The article concludes with an ablation study that provides an in-depth analysis of the impact of each network module on final performance. The source code and model will be available at the following website: https://github.com/bnm6900030/swintormer.

Learning Cognitive Maps from Transformer Representations for Efficient Planning in Partially Observed Environments

Authors: Antoine Dedieu, Wolfgang Lehrach, Guangyao Zhou, Dileep George, Miguel Lázaro-Gredilla
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.05946
Pdf link: https://arxiv.org/pdf/2401.05946
Abstract Despite their stellar performance on a wide range of tasks, including in-context tasks only revealed during inference, vanilla transformers and variants trained for next-token predictions (a) do not learn an explicit world model of their environment which can be flexibly queried and (b) cannot be used for planning or navigation. In this paper, we consider partially observed environments (POEs), where an agent receives perceptually aliased observations as it navigates, which makes path planning hard. We introduce a transformer with (multiple) discrete bottleneck(s), TDB, whose latent codes learn a compressed representation of the history of observations and actions. After training a TDB to predict the future observation(s) given the history, we extract interpretable cognitive maps of the environment from its active bottleneck(s) indices. These maps are then paired with an external solver to solve (constrained) path planning problems. First, we show that a TDB trained on POEs (a) retains the near perfect predictive performance of a vanilla transformer or an LSTM while (b) solving shortest path problems exponentially faster. Second, a TDB extracts interpretable representations from text datasets, while reaching higher in-context accuracy than vanilla sequence models. Finally, in new POEs, a TDB (a) reaches near-perfect in-context accuracy, (b) learns accurate in-context cognitive maps (c) solves in-context path planning problems.

Autocompletion of Chief Complaints in the Electronic Health Records using Large Language Models

Authors: K M Sajjadul Islam, Ayesha Siddika Nipu, Praveen Madiraju, Priya Deshpande
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.06088
Pdf link: https://arxiv.org/pdf/2401.06088
Abstract The Chief Complaint (CC) is a crucial component of a patient's medical record as it describes the main reason or concern for seeking medical care. It provides critical information for healthcare providers to make informed decisions about patient care. However, documenting CCs can be time-consuming for healthcare providers, especially in busy emergency departments. To address this issue, an autocompletion tool that suggests accurate and well-formatted phrases or sentences for clinical notes can be a valuable resource for triage nurses. In this study, we utilized text generation techniques to develop machine learning models using CC data. In our proposed work, we train a Long Short-Term Memory (LSTM) model and fine-tune three different variants of Biomedical Generative Pretrained Transformers (BioGPT), namely microsoft/biogpt, microsoft/BioGPT-Large, and microsoft/BioGPT-Large-PubMedQA. Additionally, we tune a prompt by incorporating exemplar CC sentences, utilizing the OpenAI API of GPT-4. We evaluate the models' performance based on the perplexity score, modified BERTScore, and cosine similarity score. The results show that BioGPT-Large exhibits superior performance compared to the other models. It consistently achieves a remarkably low perplexity score of 1.65 when generating CC, whereas the baseline LSTM model achieves the best perplexity score of 170. Further, we evaluate and assess the proposed models' performance and the outcome of GPT-4.0. Our study demonstrates that utilizing LLMs such as BioGPT, leads to the development of an effective autocompletion tool for generating CC documentation in healthcare settings.

Transformers are Multi-State RNNs

Authors: Matanel Oren, Michael Hassid, Yossi Adi, Roy Schwartz
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.06104
Pdf link: https://arxiv.org/pdf/2401.06104
Abstract Transformers are considered conceptually different compared to the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as infinite multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that pretrained transformers can be converted into $\textit{finite}$ multi-state RNNs by fixing the size of their hidden state. We observe that several existing transformers cache compression techniques can be framed as such conversion policies, and introduce a novel policy, TOVA, which is simpler compared to these policies. Our experiments with several long range tasks indicate that TOVA outperforms all other baseline policies, while being nearly on par with the full (infinite) model, and using in some cases only $\frac{1}{8}$ of the original cache size. Our results indicate that transformer decoder LLMs often behave in practice as RNNs. They also lay out the option of mitigating one of their most painful computational bottlenecks - the size of their cache memory. We publicly release our code at https://github.com/schwartz-lab-NLP/TOVA.

Keyword: scene understanding

VLP: Vision Language Planning for Autonomous Driving

Authors: Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, Liu Ren
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.05577
Pdf link: https://arxiv.org/pdf/2401.05577
Abstract Autonomous driving is a complex and challenging task that aims at safe motion planning through scene understanding and reasoning. While vision-only autonomous driving methods have recently achieved notable performance, through enhanced scene understanding, several key issues, including lack of reasoning, low generalization performance and long-tail scenarios, still need to be addressed. In this paper, we present VLP, a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving. VLP enhances autonomous driving systems by strengthening both the source memory foundation and the self-driving car's contextual understanding. VLP achieves state-of-the-art end-to-end planning performance on the challenging NuScenes dataset by achieving 35.9% and 60.5% reduction in terms of average L2 error and collision rates, respectively, compared to the previous best method. Moreover, VLP shows improved performance in challenging long-tail scenarios and strong generalization capabilities when faced with new urban environments.

Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection

Authors: Weibo Jiang, Weihong Ren, Jiandong Tian, Liangqiong Qu, Zhiyong Wang, Honghai Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.05676
Pdf link: https://arxiv.org/pdf/2401.05676
Abstract Human-Object Interaction (HOI) detection plays a vital role in scene understanding, which aims to predict the HOI triplet in the form of <human, object, action>. Existing methods mainly extract multi-modal features (e.g., appearance, object semantics, human pose) and then fuse them together to directly predict HOI triplets. However, most of these methods focus on seeking for self-triplet aggregation, but ignore the potential cross-triplet dependencies, resulting in ambiguity of action prediction. In this work, we propose to explore Self- and Cross-Triplet Correlations (SCTC) for HOI detection. Specifically, we regard each triplet proposal as a graph where Human, Object represent nodes and Action indicates edge, to aggregate self-triplet correlation. Also, we try to explore cross-triplet dependencies by jointly considering instance-level, semantic-level, and layout-level relations. Besides, we leverage the CLIP model to assist our SCTC obtain interaction-aware feature by knowledge distillation, which provides useful action clues for HOI detection. Extensive experiments on HICO-DET and V-COCO datasets verify the effectiveness of our proposed SCTC.

Keyword: visual reasoning

There is no result

Jan 12 '24 02:01 DongZhouGu

arxiv-daily arxiv-daily copied to clipboard

New submissions for Fri, 12 Jan 24

Keyword: human object interaction

Keyword: visual relation detection

Keyword: object detection

DualTeacher: Bridging Coexistence of Unlabelled Classes for Semi-supervised Incremental Object Detection

Domain Similarity-Perceived Label Assignment for Domain Generalized Underwater Object Detection

Consensus Focus for Object Detection and minority classes

Wasserstein Distance-based Expansion of Low-Density Latent Regions for Unknown Class Detection

Keyword: transformer

STR-Cert: Robustness Certification for Deep Text Recognition on Deep Learning Pipelines and Vision Transformers

Efficient LLM inference solution on Intel GPU

Automated Assessment of Students' Code Comprehension using LLMs

Spatial-Related Sensors Matters: 3D Human Motion Reconstruction Assisted with Textual Semantics

FourCastNeXt: Improving FourCastNet Training with Limited Compute

Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach

Masked Attribute Description Embedding for Cloth-Changing Person Re-identification

LKCA: Large Kernel Convolutional Attention

Surface Normal Estimation with Transformers

Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection

Evaluating Data Augmentation Techniques for Coffee Leaf Disease Classification

Efficient Image Deblurring Networks based on Diffusion Models

Learning Cognitive Maps from Transformer Representations for Efficient Planning in Partially Observed Environments

Autocompletion of Chief Complaints in the Electronic Health Records using Large Language Models

Transformers are Multi-State RNNs

Keyword: scene understanding

VLP: Vision Language Planning for Autonomous Driving

Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection

Keyword: visual reasoning

arxiv-daily
arxiv-daily copied to clipboard