arxiv-daily New submissions for Tue, 6 Feb 24

Open DongZhouGu opened this issue 1 year ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Resource-efficient In-orbit Detection of Earth Objects

Authors: Qiyang Zhang, Xin Yuan, Ruolin Xing, Yiran Zhang, Zimu Zheng, Xiao Ma, Mengwei Xu, Schahram Dustdar, Shangguang Wang
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2402.01675
Pdf link: https://arxiv.org/pdf/2402.01675
Abstract With the rapid proliferation of large Low Earth Orbit (LEO) satellite constellations, a huge amount of in-orbit data is generated and needs to be transmitted to the ground for processing. However, traditional LEO satellite constellations, which downlink raw data to the ground, are significantly restricted in transmission capability. Orbital edge computing (OEC), which exploits the computation capacities of LEO satellites and processes the raw data in orbit, is envisioned as a promising solution to relieve the downlink burden. Yet, with OEC, the bottleneck is shifted to the inelastic computation capacities. The computational bottleneck arises from two primary challenges that existing satellite systems have not adequately addressed: the inability to process all captured images and the limited energy supply available for satellite operations. In this work, we seek to fully exploit the scarce satellite computation and communication resources to achieve satellite-ground collaboration and present a satellite-ground collaborative system named TargetFuse for onboard object detection. TargetFuse incorporates a combination of techniques to minimize detection errors under energy and bandwidth constraints. Extensive experiments show that TargetFuse can reduce detection errors by 3.4 times on average, compared to onboard computing. TargetFuse achieves a 9.6 times improvement in bandwidth efficiency compared to the vanilla baseline under the limited bandwidth budget constraint.

Multimodal-Enhanced Objectness Learner for Corner Case Detection in Autonomous Driving

Authors: Lixing Xiao, Ruixiao Shi, Xiaoyang Tang, Yi Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.02026
Pdf link: https://arxiv.org/pdf/2402.02026
Abstract Previous works on object detection have achieved high accuracy in closed-set scenarios, but their performance in open-world scenarios is not satisfactory. One of the challenging open-world problems is corner case detection in autonomous driving. Existing detectors struggle with these cases, relying heavily on visual appearance and exhibiting poor generalization ability. In this paper, we propose a solution by reducing the discrepancy between known and unknown classes and introduce a multimodal-enhanced objectness notion learner. Leveraging both vision-centric and image-text modalities, our semi-supervised learning framework imparts objectness knowledge to the student model, enabling class-aware detection. Our approach, Multimodal-Enhanced Objectness Learner (MENOL) for Corner Case Detection, significantly improves recall for novel classes with lower training costs. By achieving a 76.6% mAR-corner and 79.8% mAR-agnostic on the CODA-val dataset with just 5100 labeled training images, MENOL outperforms the baseline ORE by 71.3% and 60.6%, respectively. The code will be available at https://github.com/tryhiseyyysum/MENOL.

MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning

Authors: Zhe Li, Laurence T. Yang, Bocheng Ren, Xin Nie, Zhangyang Gao, Cheng Tan, Stan Z. Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02045
Pdf link: https://arxiv.org/pdf/2402.02045
Abstract The scarcity of annotated data has sparked significant interest in unsupervised pre-training methods that leverage medical reports as auxiliary signals for medical visual representation learning. However, existing research overlooks the multi-granularity nature of medical visual representation and lacks suitable contrastive learning techniques to improve the models' generalizability across different granularities, leading to the underutilization of image-text information. To address this, we propose MLIP, a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning. Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge. Experimental evaluations reveal the efficacy of our model in enhancing transfer performance for tasks such as image classification, object detection, and semantic segmentation. Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.

RIDERS: Radar-Infrared Depth Estimation for Robust Sensing

Authors: Han Li, Yukai Ma, Yuehao Huang, Yaqing Gu, Weihua Xu, Yong Liu, Xingxing Zuo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02067
Pdf link: https://arxiv.org/pdf/2402.02067
Abstract Dense depth recovery is crucial in autonomous driving, serving as a foundational element for obstacle avoidance, 3D object detection, and local path planning. Adverse weather conditions, including haze, dust, rain, snow, and darkness, introduce significant challenges to accurate dense depth estimation, thereby posing substantial safety risks in autonomous driving. These challenges are particularly pronounced for traditional depth estimation methods that rely on short electromagnetic wave sensors, such as visible spectrum cameras and near-infrared LiDAR, due to their susceptibility to diffraction noise and occlusion in such environments. To fundamentally overcome this issue, we present a novel approach for robust metric depth estimation by fusing a millimeter-wave Radar and a monocular infrared thermal camera, which are capable of penetrating atmospheric particles and unaffected by lighting conditions. Our proposed Radar-Infrared fusion method achieves highly accurate and finely detailed dense depth estimation through three stages, including monocular depth prediction with global scale alignment, quasi-dense Radar augmentation by learning Radar-pixels correspondences, and local scale refinement of dense depth using a scale map learner. Our method achieves exceptional visual quality and accurate metric estimation by addressing the challenges of ambiguity and misalignment that arise from directly fusing multi-modal long-wave features. We evaluate the performance of our approach on the NTU4DRadLM dataset and our self-collected challenging ZJU-Multispectrum dataset. Especially noteworthy is the unprecedented robustness demonstrated by our proposed method in smoky scenarios. Our code will be released at https://github.com/MMOCKING/RIDERS.

Decomposition-based and Interference Perception for Infrared and Visible Image Fusion in Complex Scenes

Authors: Xilai Li, Xiaosong Li, Haishu Tan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02096
Pdf link: https://arxiv.org/pdf/2402.02096
Abstract Infrared and visible image fusion has emerged as a prominent research topic in computer vision. However, little attention has been paid to fusion in complex scenes, causing existing techniques to produce sub-optimal results when they suffer from real interference. To fill this gap, we propose a decomposition-based and interference perception image fusion method. Specifically, we classify the pixels of the visible image by the degree of scattering of light transmission, based on which we then separate the detail and energy information of the image. This refined decomposition helps the proposed model identify more interfering pixels in complex scenes. To strike a balance between denoising and detail preservation, we propose an adaptive denoising scheme for fusing detail components. Meanwhile, we propose a new weighted fusion rule by considering the distribution of image energy information from the perspective of multiple directions. Extensive experiments on complex-scene fusion, covering adverse weather, noise, blur, overexposure, and fire, as well as downstream tasks including semantic segmentation, object detection, salient object detection and depth estimation, consistently indicate the effectiveness and superiority of the proposed method compared with recent representative methods.

CoFiNet: Unveiling Camouflaged Objects with Multi-Scale Finesse

Authors: Cunhan Guo, Heyan Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02217
Pdf link: https://arxiv.org/pdf/2402.02217
Abstract Camouflaged Object Detection (COD) is a critical aspect of computer vision aimed at identifying concealed objects, with applications spanning military, industrial, medical and monitoring domains. To address the problem of poor detail segmentation effect, we introduce a novel method for camouflage object detection, named CoFiNet. Our approach primarily focuses on multi-scale feature fusion and extraction, with special attention to the model's segmentation effectiveness for detailed features, enhancing its ability to effectively detect camouflaged objects. CoFiNet adopts a coarse-to-fine strategy. A multi-scale feature integration module is leveraged to enhance the model's capability of fusing context features. A multi-activation selective kernel module is leveraged to grant the model the ability to autonomously alter its receptive field, enabling it to selectively choose an appropriate receptive field for camouflaged objects of different sizes. During mask generation, we employ the dual-mask strategy for image segmentation, separating the reconstruction of coarse and fine masks, which significantly enhances the model's learning capacity for details. Comprehensive experiments were conducted on four different datasets, demonstrating that CoFiNet achieves state-of-the-art performance across all datasets. The experiment results of CoFiNet underscore its effectiveness in camouflage object detection and highlight its potential in various practical application scenarios.

A Contrario Paradigm for YOLO-based Infrared Small Target Detection

Authors: Alina Ciocarlan, Sylvie Le Hégarat-Mascle, Sidonie Lefebvre, Arnaud Woiselle, Clara Barbanson
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02288
Pdf link: https://arxiv.org/pdf/2402.02288
Abstract Detecting small to tiny targets in infrared images is a challenging task in computer vision, especially when it comes to differentiating these targets from noisy or textured backgrounds. Traditional object detection methods such as YOLO struggle to detect tiny objects compared to segmentation neural networks, resulting in weaker performance when detecting small targets. To reduce the number of false alarms while maintaining a high detection rate, we introduce an a contrario decision criterion into the training of a YOLO detector. The latter takes advantage of the unexpectedness of small targets to discriminate them from complex backgrounds. Adding this statistical criterion to a YOLOv7-tiny bridges the performance gap between state-of-the-art segmentation methods for infrared small target detection and object detection networks. It also significantly increases the robustness of YOLO towards few-shot settings.

DeSparsify: Adversarial Attack Against Token Sparsification Mechanisms in Vision Transformers

Authors: Oryan Yehezkel, Alon Zolfi, Amit Baras, Yuval Elovici, Asaf Shabtai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02554
Pdf link: https://arxiv.org/pdf/2402.02554
Abstract Vision transformers have contributed greatly to advancements in the computer vision domain, demonstrating state-of-the-art performance in diverse tasks (e.g., image classification, object detection). However, their high computational requirements grow quadratically with the number of tokens used. Token sparsification techniques have been proposed to address this issue. These techniques employ an input-dependent strategy, in which uninformative tokens are discarded from the computation pipeline, improving the model's efficiency. However, their dynamism and average-case assumption make them vulnerable to a new threat vector - carefully crafted adversarial examples capable of fooling the sparsification mechanism, resulting in worst-case performance. In this paper, we present DeSparsify, an attack targeting the availability of vision transformers that use token sparsification mechanisms. The attack aims to exhaust the operating system's resources, while maintaining its stealthiness. Our evaluation demonstrates the attack's effectiveness on three token sparsification techniques and examines the attack's transferability between them and its effect on the GPU resources. To mitigate the impact of the attack, we propose various countermeasures.

Gazebo Plants: Simulating Plant-Robot Interaction with Cosserat Rods

Authors: Junchen Deng, Samhita Marri, Jonathan Klein, Wojtek Pałubicki, Sören Pirk, Girish Chowdhary, Dominik L. Michels
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02570
Pdf link: https://arxiv.org/pdf/2402.02570
Abstract Robotic harvesting has the potential to positively impact agricultural productivity, reduce costs, improve food quality, enhance sustainability, and to address labor shortage. In the rapidly advancing field of agricultural robotics, the necessity of training robots in a virtual environment has become essential. Generating training data to automatize the underlying computer vision tasks such as image segmentation, object detection and classification, also heavily relies on such virtual environments as synthetic data is often required to overcome the shortage and lack of variety of real data sets. However, physics engines commonly employed within the robotics community, such as ODE, Simbody, Bullet, and DART, primarily support motion and collision interaction of rigid bodies. This inherent limitation hinders experimentation and progress in handling non-rigid objects such as plants and crops. In this contribution, we present a plugin for the Gazebo simulation platform based on Cosserat rods to model plant motion. It enables the simulation of plants and their interaction with the environment. We demonstrate that, using our plugin, users can conduct harvesting simulations in Gazebo by simulating a robotic arm picking fruits and achieve results comparable to real-world experiments.

Spatio-temporal Prompting Network for Robust Video Feature Extraction

Authors: Guanxiong Sun, Chi Wang, Zhaoyu Zhang, Jiankang Deng, Stefanos Zafeiriou, Yang Hua
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02574
Pdf link: https://arxiv.org/pdf/2402.02574
Abstract Frame quality deterioration is one of the main challenges in the field of video understanding. To compensate for the information loss caused by deteriorated frames, recent approaches exploit transformer-based integration modules to obtain spatio-temporal information. However, these integration modules are heavy and complex. Furthermore, each integration module is specifically tailored for its target task, making it difficult to generalise to multiple tasks. In this paper, we present a neat and unified framework, called Spatio-Temporal Prompting Network (STPN). It can efficiently extract robust and accurate video features by dynamically adjusting the input features in the backbone network. Specifically, STPN predicts several video prompts containing spatio-temporal information of neighbour frames. Then, these video prompts are prepended to the patch embeddings of the current frame as the updated input for video feature extraction. Moreover, STPN is easy to generalise to various video tasks because it does not contain task-specific modules. Without bells and whistles, STPN achieves state-of-the-art performance on three widely-used datasets for different video understanding tasks, i.e., ImageNetVID for video object detection, YouTubeVIS for video instance segmentation, and GOT-10k for visual object tracking. Code is available at https://github.com/guanxiongsun/vfe.pytorch.
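The prompt-prepending step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `prepend_prompts`, the list-of-lists token representation, and the toy values are all assumptions made for illustration, and how the prompts are predicted from neighbour frames is not modelled here.

```python
def prepend_prompts(video_prompts, patch_embeddings):
    """Place prompt tokens in front of the current frame's patch tokens.

    Both arguments are lists of equal-length vectors (lists of floats).
    Hypothetical helper: the abstract only states that prompts are
    prepended to the patch embeddings, not how this is implemented.
    """
    dim = len(patch_embeddings[0])
    if any(len(token) != dim for token in video_prompts + patch_embeddings):
        raise ValueError("prompts and patches must share one embedding size")
    # The backbone then sees a longer sequence: prompts first, patches after.
    return video_prompts + patch_embeddings


# Toy example: 2 prompt tokens and 4 patch tokens, embedding size 3.
prompts = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
tokens = prepend_prompts(prompts, patches)
```

In this sketch the sequence length grows from 4 to 6 tokens while the embedding size is unchanged, which is why no task-specific module is needed downstream of the backbone.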

Improving Robustness of LiDAR-Camera Fusion Model against Weather Corruption from Fusion Strategy Perspective

Authors: Yihao Huang, Kaiyuan Yu, Qing Guo, Felix Juefei-Xu, Xiaojun Jia, Tianlin Li, Geguang Pu, Yang Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02738
Pdf link: https://arxiv.org/pdf/2402.02738
Abstract In recent years, LiDAR-camera fusion models have markedly advanced 3D object detection tasks in autonomous driving. However, their robustness against common weather corruption such as fog, rain, snow, and sunlight in the intricate physical world remains underexplored. In this paper, we evaluate the robustness of fusion models from the perspective of fusion strategies on the corrupted dataset. Based on the evaluation, we further propose a concise yet practical fusion strategy to enhance the robustness of the fusion models, namely flexibly weighted fusing features from LiDAR and camera sources to adapt to varying weather scenarios. Experiments conducted on four types of fusion models, each with two distinct lightweight implementations, confirm the broad applicability and effectiveness of the approach.

Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector

Authors: Yuqian Fu, Yu Wang, Yixuan Pan, Lian Huai, Xingyu Qiu, Zeyu Shangguan, Tong Liu, Lingjie Kong, Yanwei Fu, Luc Van Gool, Xingqun Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.03094
Pdf link: https://arxiv.org/pdf/2402.03094
Abstract This paper addresses the challenge of cross-domain few-shot object detection (CD-FSOD), aiming to develop an accurate object detector for novel domains with minimal labeled examples. While transformer-based open-set detectors, e.g., DE-ViT, have excelled in both open-vocabulary object detection and traditional few-shot object detection, detecting categories beyond those seen during training, we naturally raise two key questions: 1) can such open-set detection methods easily generalize to CD-FSOD? 2) If not, how to enhance the results of open-set methods when faced with significant domain gaps? To address the first question, we introduce several metrics to quantify domain variances and establish a new CD-FSOD benchmark with diverse domain metric values. Some State-Of-The-Art (SOTA) open-set object detection methods are evaluated on this benchmark, with evident performance degradation observed across out-of-domain datasets. This indicates the failure of adopting open-set detectors directly for CD-FSOD. Subsequently, to overcome the performance degradation issue and also to answer the second proposed question, we endeavor to enhance the vanilla DE-ViT. With several novel components including finetuning, a learnable prototype module, and a lightweight attention module, we present an improved Cross-Domain Vision Transformer for CD-FSOD (CD-ViTO). Experiments show that our CD-ViTO achieves impressive results on both out-of-domain and in-domain target datasets, establishing new SOTAs for both CD-FSOD and FSOD. All the datasets, codes, and models will be released to the community.

ActiveAnno3D - An Active Learning Framework for Multi-Modal 3D Object Detection

Authors: Ahmed Ghita, Bjørk Antoniussen, Walter Zimmer, Ross Greer, Christian Creß, Andreas Møgelmose, Mohan M. Trivedi, Alois C. Knoll
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.03235
Pdf link: https://arxiv.org/pdf/2402.03235
Abstract The curation of large-scale datasets is still costly and requires much time and resources. Data is often manually labeled, and the challenge of creating high-quality datasets remains. In this work, we fill the research gap using active learning for multi-modal 3D object detection. We propose ActiveAnno3D, an active learning framework to select data samples for labeling that are of maximum informativeness for training. We explore various continuous training methods and integrate the most efficient method regarding computational demand and detection performance. Furthermore, we perform extensive experiments and ablation studies with BEVFusion and PV-RCNN on the nuScenes and TUM Traffic Intersection dataset. We show that we can achieve almost the same performance with PV-RCNN and the entropy-based query strategy when using only half of the training data (77.25 mAP compared to 83.50 mAP) of the TUM Traffic Intersection dataset. BEVFusion achieved an mAP of 64.31 when using half of the training data and 75.0 mAP when using the complete nuScenes dataset. We integrate our active learning framework into the proAnno labeling tool to enable AI-assisted data selection and labeling and minimize the labeling costs. Finally, we provide code, weights, and visualization results on our website: https://active3d-framework.github.io/active3d-framework.
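As a rough illustration of the entropy-based query strategy mentioned in this abstract, the sketch below ranks unlabeled samples by the Shannon entropy of their predicted class probabilities and selects the most uncertain ones. It is a generic active-learning example, not the ActiveAnno3D code: the function names, the sample ids, and the probabilities are made up, and the abstract does not say how per-detection scores are aggregated per frame.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` samples with the highest entropy.

    `predictions` maps a sample id to its predicted class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda sample_id: entropy(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical pool of unlabeled frames with made-up class probabilities.
pool = {
    "frame_a": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "frame_b": [0.34, 0.33, 0.33],  # near-uniform prediction, high entropy
    "frame_c": [0.70, 0.20, 0.10],
}
picked = select_for_labeling(pool, budget=2)
```

Here `picked` holds `frame_b` and `frame_c`, the two frames the model is least certain about; the confident `frame_a` is left unlabeled.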

HASSOD: Hierarchical Adaptive Self-Supervised Object Detection

Authors: Shengcao Cao, Dhiraj Joshi, Liang-Yan Gui, Yu-Xiong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.03311
Pdf link: https://arxiv.org/pdf/2402.03311
Abstract The human visual perception system demonstrates exceptional capabilities in learning without explicit supervision and understanding the part-to-whole composition of objects. Drawing inspiration from these two abilities, we propose Hierarchical Adaptive Self-Supervised Object Detection (HASSOD), a novel approach that learns to detect objects and understand their compositions without human supervision. HASSOD employs a hierarchical adaptive clustering strategy to group regions into object masks based on self-supervised visual representations, adaptively determining the number of objects per image. Furthermore, HASSOD identifies the hierarchical levels of objects in terms of composition, by analyzing coverage relations between masks and constructing tree structures. This additional self-supervised learning task leads to improved detection performance and enhanced interpretability. Lastly, we abandon the inefficient multi-round self-training process utilized in prior methods and instead adapt the Mean Teacher framework from semi-supervised learning, which leads to a smoother and more efficient training process. Through extensive experiments on prevalent image datasets, we demonstrate the superiority of HASSOD over existing methods, thereby advancing the state of the art in self-supervised object detection. Notably, we improve Mask AR from 20.2 to 22.5 on LVIS, and from 17.0 to 26.0 on SA-1B. Project page: https://HASSOD-NeurIPS23.github.io.

Keyword: transformer

Detection of Machine-Generated Text: Literature Survey

Authors: Dmytro Valiaiev
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.01642
Pdf link: https://arxiv.org/pdf/2402.01642
Abstract Since language models produce fake text quickly and easily, there is an oversupply of such content in the public domain. The degree of sophistication and writing style has reached a point where differentiating between human authored and machine-generated content is nearly impossible. As a result, works generated by language models rather than human authors have gained significant media attention and stirred controversy. Concerns regarding the possible influence of advanced language models on society have also arisen, needing a fuller knowledge of these processes. Natural language generation (NLG) and generative pre-trained transformer (GPT) models have revolutionized a variety of sectors: the scope not only permeated throughout journalism and customer service but also reached academia. To mitigate the hazardous implications that may arise from the use of these models, preventative measures must be implemented, such as providing human agents with the capacity to distinguish between artificially made and human composed texts utilizing automated systems and possibly reverse-engineered language models. Furthermore, to ensure a balanced and responsible approach, it is critical to have a full grasp of the socio-technological ramifications of these breakthroughs. This literature survey aims to compile and synthesize accomplishments and developments in the aforementioned work, while also identifying future prospects. It also gives an overview of machine-generated text trends and explores the larger societal implications. Ultimately, this survey intends to contribute to the development of robust and effective approaches for resolving the issues connected with the usage and detection of machine-generated text by exploring the interplay between the capabilities of language models and their possible implications.

Killer Apps: Low-Speed, Large-Scale AI Weapons

Authors: Philip Feldman, Aaron Dant, James R. Foulds
Subjects: Computers and Society (cs.CY); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.01663
Pdf link: https://arxiv.org/pdf/2402.01663
Abstract The accelerating advancements in Artificial Intelligence (AI) and Machine Learning (ML), highlighted by the development of cutting-edge Generative Pre-trained Transformer (GPT) models by organizations such as OpenAI, Meta, and Anthropic, present new challenges and opportunities in warfare and security. Much of the current focus is on AI's integration within weapons systems and its role in rapid decision-making in kinetic conflict. However, an equally important but often overlooked aspect is the potential of AI-based psychological manipulation at internet scales within the information domain. These capabilities could pose significant threats to individuals, organizations, and societies globally. This paper explores the concept of AI weapons, their deployment, detection, and potential countermeasures.

Linguistic-Based Mild Cognitive Impairment Detection Using Informative Loss

Authors: Ali Pourramezan Fard, Mohammad H. Mahoor, Muath Alsuhaibani, Hiroko H. Dodge
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.01690
Pdf link: https://arxiv.org/pdf/2402.01690
Abstract This paper presents a deep learning method using Natural Language Processing (NLP) techniques, to distinguish between Mild Cognitive Impairment (MCI) and Normal Cognitive (NC) conditions in older adults. We propose a framework that analyzes transcripts generated from video interviews collected within the I-CONECT study project, a randomized controlled trial aimed at improving cognitive functions through video chats. Our proposed NLP framework consists of two Transformer-based modules, namely Sentence Embedding (SE) and Sentence Cross Attention (SCA). First, the SE module captures contextual relationships between words within each sentence. Subsequently, the SCA module extracts temporal features from a sequence of sentences. This feature is then used by a Multi-Layer Perceptron (MLP) for the classification of subjects into MCI or NC. To build a robust model, we propose a novel loss function, called InfoLoss, that considers the reduction in entropy by observing each sequence of sentences to ultimately enhance the classification accuracy. The results of our comprehensive model evaluation using the I-CONECT dataset show that our framework can distinguish between MCI and NC with an average area under the curve of 84.75%.

Language-Guided World Models: A Model-Based Approach to AI Control

Authors: Alex Zhang, Khanh Nguyen, Jens Tuyls, Albert Lin, Karthik Narasimhan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.01695
Pdf link: https://arxiv.org/pdf/2402.01695
Abstract Installing probabilistic world models into artificial agents opens an efficient channel for humans to communicate with and control these agents. In addition to updating agent policies, humans can modify their internal world models in order to influence their decisions. The challenge, however, is that currently existing world models are difficult for humans to adapt because they lack a natural communication interface. Aimed at addressing this shortcoming, we develop Language-Guided World Models (LWMs), which can capture environment dynamics by reading language descriptions. These models enhance agent communication efficiency, allowing humans to simultaneously alter their behavior on multiple tasks with concise language feedback. They also enable agents to self-learn from texts originally written to instruct humans. To facilitate the development of LWMs, we design a challenging benchmark based on the game of MESSENGER (Hanjie et al., 2021), requiring compositional generalization to new language descriptions and environment dynamics. Our experiments reveal that the current state-of-the-art Transformer architecture performs poorly on this benchmark, motivating us to design a more robust architecture. To showcase the practicality of our proposed LWMs, we simulate a scenario where these models augment the interpretability and safety of an agent by enabling it to generate and discuss plans with a human before execution. By effectively incorporating language feedback on the plan, the models boost the agent performance in the real environment by up to three times without collecting any interactive experiences in this environment.

Fortifying Ethical Boundaries in AI: Advanced Strategies for Enhancing Security in Large Language Models

Authors: Yunhong He, Jianling Qiu, Wei Zhang, Zhengqing Yuan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.01725
Pdf link: https://arxiv.org/pdf/2402.01725
Abstract Recent advancements in large language models (LLMs) have significantly enhanced capabilities in natural language processing and artificial intelligence. These models, including GPT-3.5 and LLaMA-2, have revolutionized text generation, translation, and question-answering tasks due to the transformative Transformer model. Despite their widespread use, LLMs present challenges such as ethical dilemmas when models are compelled to respond inappropriately, susceptibility to phishing attacks, and privacy violations. This paper addresses these challenges by introducing a multi-pronged approach that includes: 1) filtering sensitive vocabulary from user input to prevent unethical responses; 2) detecting role-playing to halt interactions that could lead to 'prison break' scenarios; 3) implementing custom rule engines to restrict the generation of prohibited content; and 4) extending these methodologies to various LLM derivatives like Multi-Model Large Language Models (MLLMs). Our approach not only fortifies models against unethical manipulations and privacy breaches but also maintains their high performance across tasks. We demonstrate state-of-the-art performance under various attack prompts, without compromising the model's core functionalities. Furthermore, the introduction of differentiated security levels empowers users to control their personal data disclosure. Our methods contribute to reducing social risks and conflicts arising from technological abuse, enhance data protection, and promote social equity. Collectively, this research provides a framework for balancing the efficiency of question-answering systems with user privacy and ethical standards, ensuring a safer user experience and fostering trust in AI technology.

3DG: A Framework for Using Generative AI for Handling Sparse Learner Performance Data From Intelligent Tutoring Systems

Authors: Liang Zhang, Jionghao Lin, Conrad Borchers, Meng Cao, Xiangen Hu
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.01746
Pdf link: https://arxiv.org/pdf/2402.01746
Abstract Learning performance data (e.g., quiz scores and attempts) is significant for understanding learner engagement and knowledge mastery level. However, the learning performance data collected from Intelligent Tutoring Systems (ITSs) often suffers from sparsity, impacting the accuracy of learner modeling and knowledge assessments. To address this, we introduce the 3DG framework (3-Dimensional tensor for Densification and Generation), a novel approach combining tensor factorization with advanced generative models, including Generative Adversarial Network (GAN) and Generative Pre-trained Transformer (GPT), for enhanced data imputation and augmentation. The framework operates by first representing the data as a three-dimensional tensor, capturing dimensions of learners, questions, and attempts. It then densifies the data through tensor factorization and augments it using Generative AI models, tailored to individual learning patterns identified via clustering. Applied to data from an AutoTutor lesson by the Center for the Study of Adult Literacy (CSAL), the 3DG framework effectively generated scalable, personalized simulations of learning performance. Comparative analysis revealed GAN's superior reliability over GPT-4 in this context, underscoring its potential in addressing data sparsity challenges in ITSs and contributing to the advancement of personalized educational technology.
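
The tensor representation described in the abstract can be sketched as follows. This is a minimal illustration only: the record layout, the `build_tensor` and `sparsity` helpers, and the toy data are assumptions, not the authors' code, and the factorization and generative steps are not shown.

```python
# Hypothetical records: (learner index, question index, attempt index, score).
records = [
    (0, 0, 0, 1.0),
    (0, 1, 0, 0.0),
    (0, 1, 1, 1.0),
    (1, 0, 0, 0.0),
]


def build_tensor(records, n_learners, n_questions, n_attempts):
    """Return a learners x questions x attempts tensor; None marks unobserved cells."""
    tensor = [
        [[None for _ in range(n_attempts)] for _ in range(n_questions)]
        for _ in range(n_learners)
    ]
    for learner, question, attempt, score in records:
        tensor[learner][question][attempt] = score
    return tensor


def sparsity(tensor):
    """Fraction of cells that carry no observation."""
    cells = [cell for plane in tensor for row in plane for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


tensor = build_tensor(records, n_learners=2, n_questions=2, n_attempts=2)
print(sparsity(tensor))  # 4 of the 8 cells are unobserved
```

The `None` cells are the entries that a densification step such as tensor factorization would be asked to fill in.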

Systematic Literature Review: Computational Approaches for Humour Style Classification

Authors: Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.01759
Pdf link: https://arxiv.org/pdf/2402.01759
Abstract Understanding various humour styles is essential for comprehending the multifaceted nature of humour and its impact on fields such as psychology and artificial intelligence. This understanding has revealed that humour, depending on the style employed, can have either therapeutic or detrimental effects on an individual's health and relationships. Although studies dedicated exclusively to computational humour style analysis remain somewhat rare, an expansive body of research thrives within related tasks, particularly binary humour and sarcasm recognition. In this systematic literature review (SLR), we survey the landscape of computational techniques applied to these related tasks and also uncover their fundamental relevance to humour style analysis. Through this study, we unveil common approaches, illuminate various datasets and evaluation metrics, and effectively navigate the complex terrain of humour research. Our efforts identify potential research gaps and outline promising directions. Furthermore, the SLR identifies a range of features and computational models that can seamlessly transition from related tasks like binary humour and sarcasm detection to invigorate humour style classification. These features encompass incongruity, sentiment and polarity analysis, ambiguity detection, acoustic nuances, visual cues, contextual insights, and more. The computational models that emerge include traditional machine learning paradigms, neural network architectures, transformer-based models, and specialised models attuned to the nuances of humour. Finally, the SLR provides access to existing datasets related to humour and sarcasm, facilitating the work of future researchers.

BlackMamba: Mixture of Experts for State-Space Models

Authors: Quentin Anthony, Yury Tokpanov, Paolo Glorioso, Beren Millidge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.01771
Pdf link: https://arxiv.org/pdf/2402.01771
Abstract State-space models (SSMs) have recently demonstrated competitive performance to transformers at large-scale language modeling benchmarks while achieving linear time and memory complexity as a function of sequence length. Mamba, a recently released SSM model, shows impressive performance in both language modeling and long sequence processing tasks. Simultaneously, mixture-of-expert (MoE) models have shown remarkable performance while significantly reducing the compute and latency costs of inference at the expense of a larger memory footprint. In this paper, we present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both. We demonstrate that BlackMamba performs competitively against both Mamba and transformer baselines, and outperforms in inference and training FLOPs. We fully train and open-source 340M/1.5B and 630M/2.8B BlackMamba models on 300B tokens of a custom dataset. We show that BlackMamba inherits and combines both of the benefits of SSM and MoE architectures, combining linear-complexity generation from SSM with cheap and fast inference from MoE. We release all weights, checkpoints, and inference code open-source. Inference code at: https://github.com/Zyphra/BlackMamba
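
The compute-versus-memory trade-off of mixture-of-experts layers mentioned in the abstract can be illustrated with a generic top-k router. This is a sketch of standard MoE routing under assumed names (`moe_forward`, `router_weights`, the toy experts), not BlackMamba's actual implementation: all experts are held in memory, but only the selected ones run for a given input.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_weights, experts, top_k=1):
    """Route input vector x to the top_k experts and mix their outputs.

    router_weights holds one weight vector per expert (hypothetical linear router);
    experts is a list of callables mapping a vector to a vector of the same size.
    """
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)  # renormalize gates over chosen experts
    out = [0.0] * len(x)
    for e in chosen:
        y = experts[e](x)  # only the selected experts are evaluated
        gate = probs[e] / norm
        out = [o + gate * y_i for o, y_i in zip(out, y)]
    return out, chosen


# Toy experts, purely illustrative.
experts = [
    lambda v: [2 * a for a in v],
    lambda v: [-a for a in v],
    lambda v: [a + 1 for a in v],
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

out, chosen = moe_forward([3.0, 1.0], router_weights, experts, top_k=1)
print(chosen, out)
```

With `top_k=1`, two of the three experts are never evaluated for this input, which is where the inference savings come from.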

An introduction to graphical tensor notation for mechanistic interpretability

Authors: Jordan K. Taylor
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.01790
Pdf link: https://arxiv.org/pdf/2402.01790
Abstract Graphical tensor notation is a simple way of denoting linear operations on tensors, originating from physics. Modern deep learning consists almost entirely of operations on or between tensors, so easily understanding tensor operations is quite important for understanding these systems. This is especially true when attempting to reverse-engineer the algorithms learned by a neural network in order to understand its behavior: a field known as mechanistic interpretability. It's often easy to get confused about which operations are happening between tensors and lose sight of the overall structure, but graphical tensor notation makes it easier to parse things at a glance and see interesting equivalences. The first half of this document introduces the notation and applies it to some decompositions (SVD, CP, Tucker, and tensor network decompositions), while the second half applies it to some existing foundational approaches for mechanistically understanding language models, loosely following "A Mathematical Framework for Transformer Circuits", then constructing an example "induction head" circuit in graphical tensor notation.

COMET: Generating Commit Messages using Delta Graph Context Representation

Authors: Abhinav Reddy Mandli, Saurabhsingh Rajput, Tushar Sharma
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2402.01841
Pdf link: https://arxiv.org/pdf/2402.01841
Abstract Commit messages explain code changes in a commit and facilitate collaboration among developers. Several commit message generation approaches have been proposed; however, they exhibit limited success in capturing the context of code changes. We propose Comet (Context-Aware Commit Message Generation), a novel approach that captures the context of code changes using a graph-based representation and leverages a transformer-based model to generate high-quality commit messages. Our proposed method utilizes a delta graph that we developed to effectively represent code differences. We also introduce a customizable quality assurance module to identify optimal messages, mitigating subjectivity in commit messages. Experiments show that Comet outperforms state-of-the-art techniques in terms of bleu-norm and meteor metrics while being comparable in terms of rouge-l. Additionally, we compare the proposed approach with the popular gpt-3.5-turbo model, along with gpt-4-turbo, the most capable GPT model, over zero-shot, one-shot, and multi-shot settings. We found that Comet outperforms the GPT models on five and four metrics respectively and provides competitive results on the two other metrics. The study has implications for researchers, tool developers, and software developers. Software developers may utilize Comet to generate context-aware commit messages. Researchers and tool developers can apply the proposed delta graph technique in similar contexts, like code review summarization.

From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers

Authors: Bharat Runwal, Tejaswini Pedapati, Pin-Yu Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.01911
Pdf link: https://arxiv.org/pdf/2402.01911
Abstract Pretrained Language Models (PLMs) have become the de facto starting point for fine-tuning on downstream tasks. However, as model sizes continue to increase, traditional fine-tuning of all parameters becomes challenging. To address this, parameter-efficient fine-tuning (PEFT) methods have gained popularity as a means to adapt PLMs effectively. In parallel, recent studies have revealed the presence of activation sparsity within the intermediate outputs of the multilayer perceptron (MLP) blocks in transformers. Low activation density enables efficient model inference on sparsity-aware hardware. Building upon this insight, in this work, we propose a novel density loss that encourages higher activation sparsity (equivalently, lower activation density) in the pre-trained models. We demonstrate the effectiveness of our approach by utilizing mainstream PEFT techniques including QLoRA, LoRA, Adapter, Prompt/Prefix Tuning to facilitate efficient model adaptation across diverse downstream tasks. Experiments show that our proposed method DEFT, Density-Efficient Fine-Tuning, can reduce the activation density consistently and by up to 50.72% on RoBERTa-Large, and 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5-XXL (11B) compared to PEFT, using GLUE and QA (SQuAD) benchmarks respectively, while maintaining competitive performance on downstream tasks. We also showcase that DEFT is complementary to quantized and pruned models.
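
To make the quantity being reduced concrete, here is a minimal sketch of measuring activation density as the fraction of non-zero entries after a ReLU. The helper names and the toy pre-activations are assumptions for illustration. The abstract does not specify the form of the density loss; since a hard count like this one is not differentiable, a training loss would need some smooth surrogate.

```python
def relu(x):
    """Standard ReLU on a single float."""
    return x if x > 0 else 0.0


def activation_density(activations):
    """Fraction of non-zero entries in a 2D list (tokens x hidden units)."""
    total = sum(len(row) for row in activations)
    nonzero = sum(1 for row in activations for a in row if a != 0.0)
    return nonzero / total


# Hypothetical pre-activation values for 2 tokens and 4 hidden units.
pre_activations = [
    [-1.0, 2.0, -0.5, 0.3],
    [0.7, -2.0, -1.0, -0.1],
]
post_activations = [[relu(v) for v in row] for row in pre_activations]

density = activation_density(post_activations)
print(density)        # 3 of 8 entries are non-zero
print(1.0 - density)  # the corresponding activation sparsity
```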

Hypergraph-Transformer (HGT) for Interactive Event Prediction in Laparoscopic and Robotic Surgery

Authors: Lianhao Yin, Yutong Ban, Jennifer Eckhoff, Ozanan Meireles, Daniela Rus, Guy Rosman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.01974
Pdf link: https://arxiv.org/pdf/2402.01974
Abstract Understanding and anticipating intraoperative events and actions is critical for intraoperative assistance and decision-making during minimally invasive surgery. Automated prediction of events, actions, and the following consequences is addressed through various computational approaches with the objective of augmenting surgeons' perception and decision-making capabilities. We propose a predictive neural network that is capable of understanding and predicting critical interactive aspects of surgical workflow from intra-abdominal video, while flexibly leveraging surgical knowledge graphs. The approach incorporates a hypergraph-transformer (HGT) structure that encodes expert knowledge into the network design and predicts the hidden embedding of the graph. We verify our approach on established surgical datasets and applications, including the detection and prediction of action triplets, and the achievement of the Critical View of Safety (CVS). Moreover, we address specific, safety-related tasks, such as predicting the clipping of cystic duct or artery without prior achievement of the CVS. Our results demonstrate the superiority of our approach compared to unstructured alternatives.

Topology-Informed Graph Transformer

Authors: Yun Young Choi, Sun Woo Park, Minho Lee, Youngho Woo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02005
Pdf link: https://arxiv.org/pdf/2402.02005
Abstract Transformers have revolutionized performance in Natural Language Processing and Vision, paving the way for their integration with Graph Neural Networks (GNNs). One key challenge in enhancing graph transformers is strengthening the discriminative power of distinguishing isomorphisms of graphs, which plays a crucial role in boosting their predictive performances. To address this challenge, we introduce 'Topology-Informed Graph Transformer (TIGT)', a novel transformer enhancing both discriminative power in detecting graph isomorphisms and the overall performance of Graph Transformers. TIGT consists of four components: (1) a topological positional embedding layer using non-isomorphic universal covers based on cyclic subgraphs of graphs to ensure unique graph representation; (2) a dual-path message-passing layer to explicitly encode topological characteristics throughout the encoder layers; (3) a global attention mechanism; and (4) a graph information layer to recalibrate channel-wise graph features for better feature representation. TIGT outperforms previous Graph Transformers in classifying a synthetic dataset aimed at distinguishing isomorphism classes of graphs. Additionally, mathematical analysis and empirical evaluations highlight our model's competitive edge over state-of-the-art Graph Transformers across various benchmark datasets.

GenFormer: A Deep-Learning-Based Approach for Generating Multivariate Stochastic Processes

Authors: Haoran Zhao, Wayne Isaac Tan Uy
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2402.02010
Pdf link: https://arxiv.org/pdf/2402.02010
Abstract Stochastic generators are essential to produce synthetic realizations that preserve target statistical properties. We propose GenFormer, a stochastic generator for spatio-temporal multivariate stochastic processes. It is constructed using a Transformer-based deep learning model that learns a mapping between a Markov state sequence and time series values. The synthetic data generated by the GenFormer model preserves the target marginal distributions and approximately captures other desired statistical properties even in challenging applications involving a large number of spatial locations and a long simulation horizon. The GenFormer model is applied to simulate synthetic wind speed data at various stations in Florida to calculate exceedance probabilities for risk management.

Precise Knowledge Transfer via Flow Matching

Authors: Shitong Shao, Zhiqiang Shen, Linrui Gong, Huanran Chen, Xu Dai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02012
Pdf link: https://arxiv.org/pdf/2402.02012
Abstract In this paper, we propose a novel knowledge transfer framework that introduces continuous normalizing flows for progressive knowledge transformation and leverages multi-step sampling strategies to achieve precise knowledge transfer. We name this framework Knowledge Transfer with Flow Matching (FM-KT), which can be integrated with a metric-based distillation method of any form (e.g., vanilla KD, DKD, PKD and DIST) and a meta-encoder with any available architecture (e.g., CNN, MLP and Transformer). By introducing stochastic interpolants, FM-KT is readily amenable to arbitrary noise schedules (e.g., VP-ODE, VE-ODE, Rectified flow) for normalized flow path estimation. We theoretically demonstrate that the training objective of FM-KT is equivalent to minimizing the upper bound of the teacher feature map or logit negative log-likelihood. Besides, FM-KT can be viewed as a unique implicit ensemble method that leads to performance gains. By slightly modifying the FM-KT framework, FM-KT can also be transformed into an online distillation framework OFM-KT with desirable performance gains. Through extensive experiments on CIFAR-100, ImageNet-1k, and MS-COCO datasets, we empirically validate the scalability and state-of-the-art performance of our proposed methods among relevant comparison approaches.

ScribFormer: Transformer Makes CNN Work Better for Scribble-based Medical Image Segmentation

Authors: Zihan Li, Yuan Zheng, Dandan Shan, Shuzhou Yang, Qingde Li, Beizhan Wang, Yuanting Zhang, Qingqi Hong, Dinggang Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02029
Pdf link: https://arxiv.org/pdf/2402.02029
Abstract Most recent scribble-supervised segmentation methods commonly adopt a CNN framework with an encoder-decoder architecture. Despite its multiple benefits, this framework generally can only capture short-range feature dependencies because convolutional layers have local receptive fields, which makes it difficult to learn global shape information from the limited information provided by scribble annotations. To address this issue, this paper proposes a new CNN-Transformer hybrid solution for scribble-supervised medical image segmentation called ScribFormer. The proposed ScribFormer model has a triple-branch structure, i.e., the hybrid of a CNN branch, a Transformer branch, and an attention-guided class activation map (ACAM) branch. Specifically, the CNN branch collaborates with the Transformer branch to fuse the local features learned from CNN with the global representations obtained from Transformer, which can effectively overcome limitations of existing scribble-supervised segmentation methods. Furthermore, the ACAM branch assists in unifying the shallow convolution features and the deep convolution features to further improve the model's performance. Extensive experiments on two public datasets and one private dataset show that our ScribFormer has superior performance over the state-of-the-art scribble-supervised segmentation methods, and achieves even better results than the fully-supervised segmentation methods. The code is released at https://github.com/HUANGLIZI/ScribFormer.

TCI-Former: Thermal Conduction-Inspired Transformer for Infrared Small Target Detection

Authors: Tianxiang Chen, Zhentao Tan, Qi Chu, Yue Wu, Bin Liu, Nenghai Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02046
Pdf link: https://arxiv.org/pdf/2402.02046
Abstract Infrared small target detection (ISTD) is critical to national security and has been extensively applied in military areas. ISTD aims to segment small target pixels from the background. Most ISTD networks focus on designing feature extraction blocks or feature fusion modules, but rarely describe the ISTD process from the feature map evolution perspective. In the ISTD process, the network attention gradually shifts towards target areas. We abstract this process as the directional movement of feature map pixels to target areas through convolution, pooling and interactions with surrounding pixels, which can be analogous to the movement of thermal particles constrained by surrounding variables and particles. In light of this analogy, we propose Thermal Conduction-Inspired Transformer (TCI-Former) based on the theoretical principles of thermal conduction. According to the thermal conduction differential equation in heat dynamics, we derive the pixel movement differential equation (PMDE) in the image domain and further develop two modules: Thermal Conduction-Inspired Attention (TCIA) and Thermal Conduction Boundary Module (TCBM). TCIA applies the finite difference method to the PMDE to reach a numerical approximation so that target body features can be extracted. To further remove errors in boundary areas, TCBM is designed and supervised by boundary masks to refine target body features with fine boundary details. Experiments on IRSTD-1k and NUAA-SIRST demonstrate the superiority of our method.

Neural Scaling Laws on Graphs

Authors: Jingzhe Liu, Haitao Mao, Zhikai Chen, Tong Zhao, Neil Shah, Jiliang Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.02054
Pdf link: https://arxiv.org/pdf/2402.02054
Abstract Deep graph models (e.g., graph neural networks and graph transformers) have become important techniques for leveraging knowledge across various types of graphs. Yet, the scaling properties of deep graph models have not been systematically investigated, casting doubt on the feasibility of achieving large graph models through enlarging the model and dataset sizes. In this work, we delve into neural scaling laws on graphs from both model and data perspectives. We first verify the validity of such laws on graphs, establishing formulations to describe the scaling behaviors. For model scaling, we investigate the phenomenon of scaling law collapse and identify overfitting as the potential reason. Moreover, we reveal that the model depth of deep graph models can impact the model scaling behaviors, which differs from observations in other domains such as CV and NLP. For data scaling, we suggest that the number of graphs cannot effectively measure the graph data volume in the scaling law since the sizes of different graphs are highly irregular. Instead, we reform the data scaling law with the number of edges as the metric to address the irregular graph sizes. We further demonstrate that the reformed law offers a unified view of the data scaling behaviors for various fundamental graph tasks including node classification, link prediction, and graph classification. This work provides valuable insights into neural scaling laws on graphs, which can serve as an essential step toward large graph models.

DiffVein: A Unified Diffusion Network for Finger Vein Segmentation and Authentication

Authors: Yanjun Liu, Wenming Yang, Qingmin Liao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02060
Pdf link: https://arxiv.org/pdf/2402.02060
Abstract Finger vein authentication, recognized for its high security and specificity, has become a focal point in biometric research. Traditional methods predominantly concentrate on vein feature extraction for discriminative modeling, with a limited exploration of generative approaches. Suffering from verification failure, existing methods often fail to obtain authentic vein patterns by segmentation. To fill this gap, we introduce DiffVein, a unified diffusion model-based framework which simultaneously addresses vein segmentation and authentication tasks. DiffVein is composed of two dedicated branches: one for segmentation and the other for denoising. For better feature interaction between these two branches, we introduce two specialized modules to improve their collective performance. The first, a mask condition module, incorporates the semantic information of vein patterns from the segmentation branch into the denoising process. The second, a Semantic Difference Transformer (SD-Former), employs Fourier-space self-attention and cross-attention modules to extract category embedding before feeding it to the segmentation task. In this way, our framework allows for a dynamic interplay between diffusion and segmentation embeddings, so that vein segmentation and authentication tasks can inform and enhance each other in the joint training. To further optimize our model, we introduce a Fourier-space Structural Similarity (FourierSIM) loss function, which is tailored to improve the denoising network's learning efficacy. Extensive experiments on the USM and THU-MVFV3V datasets substantiate DiffVein's superior performance, setting new benchmarks in both vein segmentation and authentication tasks.

Exploring the Robustness of Task-oriented Dialogue Systems for Colloquial German Varieties

Authors: Ekaterina Artemova, Verena Blaschke, Barbara Plank
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2402.02078
Pdf link: https://arxiv.org/pdf/2402.02078
Abstract Mainstream cross-lingual task-oriented dialogue (ToD) systems leverage the transfer learning paradigm by training a joint model for intent recognition and slot-filling in English and applying it, zero-shot, to other languages. We address a gap in prior research, which often overlooked the transfer to lower-resource colloquial varieties due to limited test data. Inspired by prior work on English varieties, we craft and manually evaluate perturbation rules that transform German sentences into colloquial forms and use them to synthesize test sets in four ToD datasets. Our perturbation rules cover 18 distinct language phenomena, enabling us to explore the impact of each perturbation on slot and intent performance. Using these new datasets, we conduct an experimental evaluation across six different transformers. Here, we demonstrate that when applied to colloquial varieties, ToD systems maintain their intent recognition performance, losing 6% (4.62 percentage points) in accuracy on average. However, they exhibit a significant drop in slot detection, with a decrease of 31% (21 percentage points) in slot F1 score. Our findings are further supported by a transfer experiment from Standard American English to synthetic Urban African American Vernacular English.

Revisiting the Markov Property for Machine Translation

Authors: Cunxiao Du, Hao Zhou, Zhaopeng Tu, Jing Jiang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2402.02084
Pdf link: https://arxiv.org/pdf/2402.02084
Abstract In this paper, we re-examine the Markov property in the context of neural machine translation. We design a Markov Autoregressive Transformer (MAT) and undertake a comprehensive assessment of its performance across four WMT benchmarks. Our findings indicate that MAT with an order larger than 4 can generate translations with quality on par with that of conventional autoregressive transformers. In addition, counter-intuitively, we also find that the advantages of utilizing a higher-order MAT do not specifically contribute to the translation of longer sentences.
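
One way to picture an order-k Markov restriction on a decoder is a banded causal attention mask, where each position sees only a fixed window of recent inputs. The sketch below is an assumption about how such a constraint could be expressed, not the MAT implementation from the paper; the function name and the convention that the window counts decoder input positions (including the current one) are hypothetical.

```python
def markov_causal_mask(seq_len, order):
    """Return mask[i][j] == True when position i may attend to position j.

    Position i sees itself and at most (order - 1) earlier input positions,
    so every prediction depends on a window of `order` input tokens.
    """
    return [
        [(i - order < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = markov_causal_mask(seq_len=4, order=2)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

Setting `order` to `seq_len` or larger recovers the ordinary full causal mask, which makes the Markov variant easy to compare against a conventional autoregressive decoder.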

Deep Semantic-Visual Alignment for Zero-Shot Remote Sensing Image Scene Classification

Authors: Wenjia Xu, Jiuniu Wang, Zhiwei Wei, Mugen Peng, Yirong Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.02094
Pdf link: https://arxiv.org/pdf/2402.02094
Abstract Deep neural networks have achieved promising progress in remote sensing (RS) image classification, for which the training process requires abundant samples for each class. However, it is time-consuming and unrealistic to annotate labels for each RS category, given the fact that the RS target database is increasing dynamically. Zero-shot learning (ZSL) allows for identifying novel classes that are not seen during training, which provides a promising solution for the aforementioned problem. However, previous ZSL models mainly depend on manually-labeled attributes or word embeddings extracted from language models to transfer knowledge from seen classes to novel classes. Besides, pioneering ZSL models use convolutional neural networks pre-trained on ImageNet, which focus on the main objects appearing in each image, neglecting the background context that also matters in RS scene classification. To address the above problems, we propose to collect visually detectable attributes automatically. We predict attributes for each class by depicting the semantic-visual similarity between attributes and images. In this way, the attribute annotation process is accomplished by machines instead of humans as in other methods. Moreover, we propose a Deep Semantic-Visual Alignment (DSVA) model that takes advantage of the self-attention mechanism in the transformer to associate local image regions together, integrating the background context information for prediction. The DSVA model further utilizes the attribute attention maps to focus on the informative image regions that are essential for knowledge transfer in ZSL, and maps the visual images into attribute space to perform ZSL classification. With extensive experiments, we show that our model outperforms other state-of-the-art models by a large margin on a challenging large-scale RS scene classification benchmark.

ParZC: Parametric Zero-Cost Proxies for Efficient NAS

Authors: Peijie Dong, Lujun Li, Xinglin Pan, Zimian Wei, Xiang Liu, Qiang Wang, Xiaowen Chu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02105
Pdf link: https://arxiv.org/pdf/2402.02105
Abstract Recent advancements in Zero-shot Neural Architecture Search (NAS) highlight the efficacy of zero-cost proxies in various NAS benchmarks. Several studies propose the automated design of zero-cost proxies to achieve SOTA performance but require a tedious search process. Furthermore, we identify a critical issue with current zero-cost proxies: they aggregate node-wise zero-cost statistics without considering the fact that not all nodes in a neural network equally impact performance estimation. Our observations reveal that node-wise zero-cost statistics significantly vary in their contributions to performance, with each node exhibiting a degree of uncertainty. Based on this insight, we introduce a novel framework called Parametric Zero-Cost Proxies (ParZC) to enhance the adaptability of zero-cost proxies through parameterization. To address the node indiscrimination, we propose a Mixer Architecture with Bayesian Network (MABN) to explore the node-wise zero-cost statistics and estimate node-specific uncertainty. Moreover, we propose DiffKendall as a loss function to directly optimize Kendall's Tau coefficient in a differentiable manner so that our ParZC can better handle the discrepancies in ranking architectures. Comprehensive experiments on NAS-Bench-101, 201, and NDS demonstrate the superiority of our proposed ParZC compared to existing zero-shot NAS methods. Additionally, we demonstrate the versatility and adaptability of ParZC by transferring it to the Vision Transformer search space.
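
For reference, the ranking metric named in the abstract can be computed as below. This is the standard, non-differentiable Kendall's Tau (the tau-a variant, which ignores tie corrections), not the DiffKendall loss itself, whose form the abstract does not give. The proxy scores and accuracies are made-up values.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical proxy scores and ground-truth accuracies for four architectures.
proxy_scores = [0.1, 0.4, 0.3, 0.9]
accuracies = [60.0, 70.0, 72.0, 80.0]

tau = kendall_tau(proxy_scores, accuracies)
print(round(tau, 4))  # 5 concordant pairs, 1 discordant pair, 6 pairs in total
```

A proxy that ranks architectures in exactly the same order as their true accuracy gives a tau of 1, and a fully reversed ranking gives -1.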

One Graph Model for Cross-domain Dynamic Link Prediction

Authors: Xuanwen Huang, Wei Chow, Yang Wang, Ziwei Chai, Chunping Wang, Lei Chen, Yang Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.02168
Pdf link: https://arxiv.org/pdf/2402.02168
Abstract This work proposes DyExpert, a dynamic graph model for cross-domain link prediction. It can explicitly model historical evolving processes to learn the evolution pattern of a specific downstream graph and subsequently make pattern-specific link predictions. DyExpert adopts a decoder-only transformer and is capable of efficient parallel training and inference via conditioned link generation, which integrates both evolution modeling and link prediction. DyExpert is trained on extensive dynamic graphs across diverse domains, comprising 6M dynamic edges. Extensive experiments on eight untrained graphs demonstrate that DyExpert achieves state-of-the-art performance in cross-domain link prediction. Compared to the advanced baseline under the same setting, DyExpert achieves an average improvement of 11.40% in Average Precision across eight graphs. More impressively, it surpasses the fully supervised performance of 8 advanced baselines on 6 untrained graphs.

Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times

Authors: Byung-Doh Oh, Shisen Yue, William Schuler
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02255
Pdf link: https://arxiv.org/pdf/2402.02255
Abstract Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades. The current work presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends. First, residual errors from four language model families on four corpora show that the inverse correlation between model size and fit to reading times is the strongest on the subset of least frequent words, which is driven by excessively accurate predictions of larger model variants. Additionally, training dynamics reveal that during later training steps, all model variants learn to predict rare words and that larger model variants do so more accurately, which explains the detrimental effect of both training data amount and model size on fit to reading times. Finally, a feature attribution analysis demonstrates that larger model variants are able to accurately predict rare words based on both an effectively longer context window size as well as stronger local associations compared to smaller model variants. Taken together, these results indicate that Transformer-based language models' surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.

XTSFormer: Cross-Temporal-Scale Transformer for Irregular Time Event Prediction

Authors: Tingsong Xiao, Zelin Xu, Wenchong He, Jim Su, Yupu Zhang, Raymond Opoku, Ronald Ison, Jason Petho, Jiang Bian, Patrick Tighe, Parisa Rashidi, Zhe Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.02258
Pdf link: https://arxiv.org/pdf/2402.02258
Abstract Event prediction aims to forecast the time and type of a future event based on a historical event sequence. Despite its significance, several challenges exist, including the irregularity of time intervals between consecutive events, the existence of cycles, periodicity, and multi-scale event interactions, as well as the high computational costs for long event sequences. Existing neural temporal point processes (TPPs) methods do not capture the multi-scale nature of event interactions, which is common in many real-world applications such as clinical event data. To address these issues, we propose the cross-temporal-scale transformer (XTSFormer), designed specifically for irregularly timed event data. Our model comprises two vital components: a novel Feature-based Cycle-aware Time Positional Encoding (FCPE) that adeptly captures the cyclical nature of time, and a hierarchical multi-scale temporal attention mechanism. These scales are determined by a bottom-up clustering algorithm. Extensive experiments on several real-world datasets show that our XTSFormer outperforms several baseline methods in prediction performance.

INViT: A Generalizable Routing Problem Solver with Invariant Nested View Transformer

Authors: Han Fang, Zhihao Song, Paul Weng, Yutong Ban
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02317
Pdf link: https://arxiv.org/pdf/2402.02317
Abstract Recently, deep reinforcement learning has shown promising results for learning fast heuristics to solve routing problems. However, most solvers struggle to generalize to an unseen distribution or to distributions with different scales. To address this issue, we propose a novel architecture, called Invariant Nested View Transformer (INViT), which is designed to enforce a nested design together with invariant views inside the encoders to promote the generalizability of the learned solver. It applies a modified policy gradient algorithm enhanced with data augmentations. We demonstrate that the proposed INViT achieves a dominant generalization performance on both TSP and CVRP problems with various distributions and different problem scales.

Spin: An Efficient Secure Computation Framework with GPU Acceleration

Authors: Wuxuan Jiang, Xiangjun Song, Shenbai Hong, Haijun Zhang, Wenxin Liu, Bo Zhao, Wei Xu, Yi Li
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02320
Pdf link: https://arxiv.org/pdf/2402.02320
Abstract Accuracy and efficiency remain challenges for multi-party computation (MPC) frameworks. Spin is a GPU-accelerated MPC framework that supports multiple computation parties and a dishonest majority adversarial setup. We propose optimized protocols for non-linear functions that are critical for machine learning, as well as several novel optimizations specific to attention, which is the fundamental unit of Transformer models, allowing Spin to perform non-trivial CNN training and Transformer inference without sacrificing security. At the backend level, Spin leverages GPU, CPU, and RDMA-enabled smart network cards for acceleration. Comprehensive evaluations demonstrate that Spin can be up to $2\times$ faster than the state-of-the-art for deep neural network training. For inference on a Transformer model with 18.9 million parameters, our attention-specific optimizations enable Spin to achieve better efficiency, less communication, and better accuracy.

Minusformer: Improving Time Series Forecasting by Progressively Learning Residuals

Authors: Daojun Liang, Haixia Zhang, Dongfeng Yuan, Bingzheng Zhang, Minggao Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02332
Pdf link: https://arxiv.org/pdf/2402.02332
Abstract In this paper, we find that ubiquitous time series (TS) forecasting models are prone to severe overfitting. To cope with this problem, we embrace a de-redundancy approach to progressively reinstate the intrinsic values of TS for future intervals. Specifically, we renovate the vanilla Transformer by reorienting the information aggregation mechanism from addition to subtraction. Then, we incorporate an auxiliary output branch into each block of the original model to construct a highway leading to the ultimate prediction. The output of subsequent modules in this branch will subtract the previously learned results, enabling the model to learn the residuals of the supervision signal, layer by layer. This design facilitates the learning-driven implicit progressive decomposition of the input and output streams, empowering the model with heightened versatility, interpretability, and resilience against overfitting. Since all aggregations in the model use minus signs, we call it Minusformer. Extensive experiments demonstrate that the proposed method outperforms existing state-of-the-art methods, yielding an average performance improvement of 11.9% across various datasets.

Arithmetic Feature Interaction Is Necessary for Deep Tabular Learning

Authors: Yi Cheng, Renjun Hu, Haochao Ying, Xing Shi, Jian Wu, Wei Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.02334
Pdf link: https://arxiv.org/pdf/2402.02334
Abstract Until recently, the question of the effective inductive bias of deep models on tabular data has remained unanswered. This paper investigates the hypothesis that arithmetic feature interaction is necessary for deep tabular learning. To test this point, we create a synthetic tabular dataset with a mild feature interaction assumption and examine a modified transformer architecture enabling arithmetical feature interactions, referred to as AMFormer. Results show that AMFormer outperforms strong counterparts in fine-grained tabular data modeling, data efficiency in training, and generalization. This is attributed to its parallel additive and multiplicative attention operators and prompt-based optimization, which facilitate the separation of tabular samples in an extended space with arithmetically-engineered features. Our extensive experiments on real-world data also validate the consistent effectiveness, efficiency, and rationale of AMFormer, suggesting it has established a strong inductive bias for deep learning on tabular data. Code is available at https://github.com/aigc-apps/AMFormer.

Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning

Authors: Li Ren, Chen Chen, Liqiang Wang, Kien Hua
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02340
Pdf link: https://arxiv.org/pdf/2402.02340
Abstract Deep Metric Learning (DML) has long attracted the attention of the machine learning community as a key objective. Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets. As a result of the success of recent pre-trained models trained from larger-scale datasets, it is challenging to adapt the model to the DML tasks in the local data domain while retaining the previously gained knowledge. In this paper, we investigate parameter-efficient methods for fine-tuning the pre-trained model for DML tasks. In particular, we propose a novel and effective framework based on learning Visual Prompts (VPT) in the pre-trained Vision Transformers (ViT). Based on the conventional proxy-based DML paradigm, we augment the proxy by incorporating the semantic information from the input image and the ViT, in which we optimize the visual prompts for each class. We demonstrate that our new approximations with semantic information are superior in representative capability, thereby improving metric learning performance. We conduct extensive experiments to demonstrate that our proposed framework is effective and efficient by evaluating on popular DML benchmarks. In particular, we demonstrate that our fine-tuning method achieves comparable or even better performance than recent state-of-the-art full fine-tuning works of DML while tuning only a small percentage of total parameters.

Vision Transformer-based Multimodal Feature Fusion Network for Lymphoma Segmentation on PET/CT Images

Authors: Huan Huang, Liheng Qiu, Shenmiao Yang, Longxi Li, Jiaofen Nan, Yanting Li, Chuang Han, Fubao Zhu, Chen Zhao, Weihua Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02349
Pdf link: https://arxiv.org/pdf/2402.02349
Abstract Background: Diffuse large B-cell lymphoma (DLBCL) segmentation is a challenge in medical image analysis. Traditional segmentation methods for lymphoma struggle with the complex patterns and the presence of DLBCL lesions. Objective: We aim to develop an accurate method for lymphoma segmentation with 18F-Fluorodeoxyglucose positron emission tomography (PET) and computed tomography (CT) images. Methods: Our lymphoma segmentation approach combines a vision transformer with dual encoders, adeptly fusing PET and CT data via a multimodal cross-attention fusion (MMCAF) module. In this study, PET and CT data from 165 DLBCL patients were analyzed. A 5-fold cross-validation was employed to evaluate the performance and generalization ability of our method. Ground truths were annotated by experienced nuclear medicine experts. We calculated the total metabolic tumor volume (TMTV) and performed a statistical analysis on our results. Results: The proposed method exhibited accurate performance in DLBCL lesion segmentation, achieving a Dice similarity coefficient of 0.9173$\pm$0.0071, a Hausdorff distance of 2.71$\pm$0.25 mm, a sensitivity of 0.9462$\pm$0.0223, and a specificity of 0.9986$\pm$0.0008. Additionally, a Pearson correlation coefficient of 0.9030$\pm$0.0179 and an R-square of 0.8586$\pm$0.0173 were observed in TMTV when measured on manual annotation compared to our segmentation results. Conclusion: This study highlights the advantages of MMCAF and vision transformer for lymphoma segmentation using PET and CT, offering great promise for computer-aided lymphoma diagnosis and treatment.
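
The overlap metrics reported above have standard definitions. The sketch below computes them on flattened binary masks; the function names and the toy masks are illustrative and not taken from the paper's evaluation code.

```python
from typing import Sequence, Tuple


def confusion_counts(pred: Sequence[int], truth: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice similarity coefficient: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return float("nan") if denom == 0 else 2 * tp / denom


def sensitivity(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of true lesion voxels that were predicted: TP / (TP + FN)."""
    tp, _, fn, _ = confusion_counts(pred, truth)
    return float("nan") if tp + fn == 0 else tp / (tp + fn)


def specificity(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of background voxels that were left unpredicted: TN / (TN + FP)."""
    _, fp, _, tn = confusion_counts(pred, truth)
    return float("nan") if tn + fp == 0 else tn / (tn + fp)


# Toy masks: one true lesion voxel, two predicted.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 0, 0]
```

With lesions occupying a small share of the volume, specificity is dominated by the background count, so Dice and sensitivity tend to be the more informative of the three.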

Unification of Symmetries Inside Neural Networks: Transformer, Feedforward and Neural ODE

Authors: Koji Hashimoto, Yuji Hirono, Akiyoshi Sannai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Theory (hep-th); Computational Physics (physics.comp-ph)
Arxiv link: https://arxiv.org/abs/2402.02362
Pdf link: https://arxiv.org/pdf/2402.02362
Abstract Understanding the inner workings of neural networks, including transformers, remains one of the most challenging puzzles in machine learning. This study introduces a novel approach by applying the principles of gauge symmetries, a key concept in physics, to neural network architectures. By regarding model functions as physical observables, we find that parametric redundancies of various machine learning models can be interpreted as gauge symmetries. We mathematically formulate the parametric redundancies in neural ODEs, and find that their gauge symmetries are given by spacetime diffeomorphisms, which play a fundamental role in Einstein's theory of gravity. Viewing neural ODEs as a continuum version of feedforward neural networks, we show that the parametric redundancies in feedforward neural networks are indeed lifted to diffeomorphisms in neural ODEs. We further extend our analysis to transformer models, finding natural correspondences with neural ODEs and their gauge symmetries. The concept of gauge symmetries sheds light on the complex behavior of deep learning models through physics and provides us with a unifying perspective for analyzing various machine learning architectures.

The Developmental Landscape of In-Context Learning

Authors: Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei, Daniel Murfet
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.02364
Pdf link: https://arxiv.org/pdf/2402.02364
Abstract We show that in-context learning emerges in transformers in discrete developmental stages, when they are trained on either language modeling or linear regression tasks. We introduce two methods for detecting the milestones that separate these stages, by probing the geometry of the population loss in both parameter space and function space. We study the stages revealed by these new methods using a range of behavioral and structural metrics to establish their validity.

Transolver: A Fast Transformer Solver for PDEs on General Geometries

Authors: Haixu Wu, Huakun Luo, Haowen Wang, Jianmin Wang, Mingsheng Long
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02366
Pdf link: https://arxiv.org/pdf/2402.02366
Abstract Transformers have empowered many milestones across various fields and have recently been applied to solve partial differential equations (PDEs). However, since PDEs are typically discretized into large-scale meshes with complex geometries, it is challenging for Transformers to capture intricate physical correlations directly from massive individual points. Going beyond superficial and unwieldy meshes, we present Transolver based on a more foundational idea, which is learning intrinsic physical states hidden behind discretized geometries. Specifically, we propose a new Physics-Attention to adaptively split the discretized domain into a series of learnable slices of flexible shapes, where mesh points under similar physical states will be ascribed to the same slice. By calculating attention to physics-aware tokens encoded from slices, Transolver can effectively capture intricate physical correlations under complex geometries, which also empowers the solver with endogenetic geometry-general modeling capacity and can be efficiently computed in linear complexity. Transolver achieves consistent state-of-the-art with 22% relative gain across six standard benchmarks and also excels in large-scale industrial simulations, including car and airfoil designs.

Timer: Transformers for Time Series Analysis at Scale

Authors: Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, Mingsheng Long
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02368
Pdf link: https://arxiv.org/pdf/2402.02368
Abstract Deep learning has contributed remarkably to the advancement of time series analysis. Still, deep models can encounter performance bottlenecks in real-world small-sample scenarios, which can be concealed due to the performance saturation with small models on current benchmarks. Meanwhile, large models have demonstrated great power in these scenarios through large-scale pre-training. Continuous progress has been achieved with the emergence of large language models, which exhibit unprecedented abilities in few-shot generalization, scalability, and task generality that are, however, absent in time series models. To change the current practices of training small models on specific datasets from scratch, this paper aims at an early development of large time series models (LTSM). During pre-training, we curate large-scale datasets with up to 1 billion time points, unify heterogeneous time series into single-series sequence (S3) format, and develop the GPT-style architecture toward LTSMs. To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task. The outcome of this study is a Time Series Transformer (Timer) that is pre-trained by autoregressive next-token prediction on large multi-domain datasets and is fine-tuned to downstream scenarios with promising abilities as an LTSM.

PromptRR: Diffusion Models as Prompt Generators for Single Image Reflection Removal

Authors: Tao Wang, Wanglong Lu, Kaihao Zhang, Wenhan Luo, Tae-Kyun Kim, Tong Lu, Hongdong Li, Ming-Hsuan Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02374
Pdf link: https://arxiv.org/pdf/2402.02374
Abstract Existing single image reflection removal (SIRR) methods using deep learning tend to miss key low-frequency (LF) and high-frequency (HF) differences in images, affecting their effectiveness in removing reflections. To address this problem, this paper proposes a novel prompt-guided reflection removal (PromptRR) framework that uses frequency information as new visual prompts for better reflection performance. Specifically, the proposed framework decouples the reflection removal process into the prompt generation and subsequent prompt-guided restoration. For the prompt generation, we first propose a prompt pre-training strategy to train a frequency prompt encoder that encodes the ground-truth image into LF and HF prompts. Then, we adopt diffusion models (DMs) as prompt generators to generate the LF and HF prompts estimated by the pre-trained frequency prompt encoder. For the prompt-guided restoration, we integrate specially generated prompts into the PromptFormer network, employing a novel Transformer-based prompt block to effectively steer the model toward enhanced reflection removal. The results on commonly used benchmarks show that our method outperforms state-of-the-art approaches. The codes and models are available at https://github.com/TaoWangzj/PromptRR.

NOAH: Learning Pairwise Object Category Attentions for Image Classification

Authors: Chao Li, Aojun Zhou, Anbang Yao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02377
Pdf link: https://arxiv.org/pdf/2402.02377
Abstract A modern deep neural network (DNN) for image classification tasks typically consists of two parts: a backbone for feature extraction, and a head for feature encoding and class prediction. We observe that the head structures of mainstream DNNs adopt a similar feature encoding pipeline, exploiting global feature dependencies while disregarding local ones. In this paper, we revisit the feature encoding problem, and propose Non-glObal Attentive Head (NOAH) that relies on a new form of dot-product attention called pairwise object category attention (POCA), efficiently exploiting spatially dense category-specific attentions to augment classification performance. NOAH introduces a neat combination of feature split, transform and merge operations to learn POCAs at local to global scales. As a drop-in design, NOAH can be easily used to replace existing heads of various types of DNNs, improving classification performance while maintaining similar model efficiency. We validate the effectiveness of NOAH on the ImageNet classification benchmark with 25 DNN architectures spanning convolutional neural networks, vision transformers and multi-layer perceptrons. In general, NOAH is able to significantly improve the performance of lightweight DNNs, e.g., showing 3.14%|5.3%|1.9% top-1 accuracy improvement to MobileNetV2 (0.5x)|Deit-Tiny (0.5x)|gMLP-Tiny (0.5x). NOAH also generalizes well when applied to medium-size and large-size DNNs. We further show that NOAH retains its efficacy on other popular multi-class and multi-label image classification benchmarks as well as in different training regimes, e.g., showing 3.6%|1.1% mAP improvement to large ResNet101|ViT-Large on the MS-COCO dataset. Project page: https://github.com/OSVAI/NOAH.

FreDF: Learning to Forecast in Frequency Domain

Authors: Hao Wang, Licheng Pan, Zhichao Chen, Degui Yang, Sen Zhang, Yifei Yang, Xinggao Liu, Haoxuan Li, Dacheng Tao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2402.02399
Pdf link: https://arxiv.org/pdf/2402.02399
Abstract Time series modeling is uniquely challenged by the presence of autocorrelation in both historical and label sequences. Current research predominantly focuses on handling autocorrelation within the historical sequence but often neglects its presence in the label sequence. Specifically, emerging forecast models mainly conform to the direct forecast (DF) paradigm, generating multi-step forecasts under the assumption of conditional independence within the label sequence. This assumption disregards the inherent autocorrelation in the label sequence, thereby limiting the performance of DF-based models. In response to this gap, we introduce the Frequency-enhanced Direct Forecast (FreDF), which bypasses the complexity of label autocorrelation by learning to forecast in the frequency domain. Our experiments demonstrate that FreDF substantially outperforms existing state-of-the-art methods including iTransformer and is compatible with a variety of forecast models.
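
One way to read "learning to forecast in the frequency domain" is to compare the forecast and the label sequence after a discrete Fourier transform rather than step by step in time. The sketch below is only that reading: the naive DFT, the mean absolute difference, and the name `frequency_loss` are assumptions for illustration, and the paper's actual objective may differ.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform, written out for clarity."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def frequency_loss(forecast: Sequence[float], label: Sequence[float]) -> float:
    """Mean absolute difference between the DFT coefficients of forecast and label.

    Hypothetical loss form, assumed here for illustration.
    """
    if len(forecast) != len(label) or len(label) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    f_hat = dft(forecast)
    y_hat = dft(label)
    return sum(abs(a - b) for a, b in zip(f_hat, y_hat)) / len(label)


label_seq = [1.0, 2.0, 3.0, 4.0]
shifted_forecast = [v + 1.0 for v in label_seq]  # constant offset error
offset_loss = frequency_loss(shifted_forecast, label_seq)
```

In this toy case the constant offset shows up only in the zero-frequency coefficient, so the whole error is carried by a single term of the sum.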

Learning Mutual Excitation for Hand-to-Hand and Human-to-Human Interaction Recognition

Authors: Mengyuan Liu, Chen Chen, Songtao Wu, Fanyang Meng, Hong Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02431
Pdf link: https://arxiv.org/pdf/2402.02431
Abstract Recognizing interactive actions, including hand-to-hand interaction and human-to-human interaction, has attracted increasing attention for various applications in the field of video analysis and human-robot interaction. Considering the success of graph convolution in modeling topology-aware features from skeleton data, recent methods commonly operate graph convolution on separate entities and use late fusion for interactive action recognition, which can barely model the mutual semantic relationships between pairwise entities. To this end, we propose a mutual excitation graph convolutional network (me-GCN) by stacking mutual excitation graph convolution (me-GC) layers. Specifically, me-GC uses a mutual topology excitation module to first extract adjacency matrices from individual entities and then adaptively model the mutual constraints between them. Moreover, me-GC extends the above idea and further uses a mutual feature excitation module to extract and merge deep features from pairwise entities. Compared with graph convolution, our proposed me-GC gradually learns mutual information in each layer and each stage of graph convolution operations. Extensive experiments on a challenging hand-to-hand interaction dataset, i.e., the Assembly101 dataset, and two large-scale human-to-human interaction datasets, i.e., NTU60-Interaction and NTU120-Interaction, consistently verify the superiority of our proposed method, which outperforms the state-of-the-art GCN-based and Transformer-based methods.

VM-UNet: Vision Mamba UNet for Medical Image Segmentation

Authors: Jiacheng Ruan, Suncheng Xiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02491
Pdf link: https://arxiv.org/pdf/2402.02491
Abstract In the realm of medical image segmentation, both CNN-based and Transformer-based models have been extensively explored. However, CNNs exhibit limitations in long-range modeling capabilities, whereas Transformers are hampered by their quadratic computational complexity. Recently, State Space Models (SSMs), exemplified by Mamba, have emerged as a promising approach. They not only excel in modeling long-range interactions but also maintain a linear computational complexity. In this paper, leveraging state space models, we propose a U-shape architecture model for medical image segmentation, named Vision Mamba UNet (VM-UNet). Specifically, the Visual State Space (VSS) block is introduced as the foundation block to capture extensive contextual information, and an asymmetrical encoder-decoder structure is constructed. We conduct comprehensive experiments on the ISIC17, ISIC18, and Synapse datasets, and the results indicate that VM-UNet performs competitively in medical image segmentation tasks. To the best of our knowledge, this is the first medical image segmentation model built on a purely SSM-based model. We aim to establish a baseline and provide valuable insights for the future development of more efficient and effective SSM-based segmentation systems. Our code is available at https://github.com/JCruan519/VM-UNet.

CompeteSMoE - Effective Training of Sparse Mixture of Experts via Competition

Authors: Quang Pham, Giang Do, Huy Nguyen, TrungTin Nguyen, Chenghao Liu, Mina Sartipi, Binh T. Nguyen, Savitha Ramasamy, Xiaoli Li, Steven Hoi, Nhat Ho
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02526
Pdf link: https://arxiv.org/pdf/2402.02526
Abstract Sparse mixture of experts (SMoE) offers an appealing solution to scale up the model complexity beyond the means of increasing the network's depth or width. However, effective training of SMoE has proven to be challenging due to the representation collapse issue, which causes parameter redundancy and limited representation potentials. In this work, we propose a competition mechanism to address this fundamental challenge of representation collapse. By routing inputs only to experts with the highest neural response, we show that, under mild assumptions, competition enjoys the same convergence rate as the optimal estimator. We further propose CompeteSMoE, an effective and efficient algorithm to train large language models by deploying a simple router that predicts the competition outcomes. Consequently, CompeteSMoE enjoys strong performance gains from the competition routing policy while having low computation overheads. Our extensive empirical evaluations on two transformer architectures and a wide range of tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies.
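
The competition idea, routing an input only to the experts that respond most strongly to it, can be sketched as follows. Measuring "neural response" as the L2 norm of an expert's output is an assumption made here for illustration, and the toy experts and the name `compete` are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def compete(x: Sequence[float], experts: Sequence[Expert], k: int = 1) -> List[int]:
    """Indices of the k experts with the largest response to x.

    Assumption: response is the L2 norm of the expert output.
    """
    responses = [l2_norm(expert(x)) for expert in experts]
    ranked = sorted(range(len(experts)), key=lambda i: responses[i], reverse=True)
    return ranked[:k]


# Toy experts that simply scale their input by different factors.
toy_experts: List[Expert] = [
    lambda x: [0.5 * v for v in x],
    lambda x: [2.0 * v for v in x],
    lambda x: [1.0 * v for v in x],
]
winners = compete([1.0, -1.0], toy_experts, k=2)
```

This version evaluates every expert before choosing, which is expensive; the abstract's router that predicts the competition outcome is described as the way to keep computation overhead low.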

"What's my model inside of?": Exploring the role of environments for grounded natural language understanding

Authors: Ronen Tamari
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2402.02548
Pdf link: https://arxiv.org/pdf/2402.02548
Abstract In contrast to classical cognitive science which studied brains in isolation, ecological approaches focused on the role of the body and environment in shaping cognition. Similarly, in this thesis we adopt an ecological approach to grounded natural language understanding (NLU) research. Grounded language understanding studies language understanding systems situated in the context of events, actions and percepts in naturalistic/simulated virtual environments. Where classic research tends to focus on designing new models and optimization methods while treating environments as given, we explore the potential of environment design for improving data collection and model development. We developed novel training and annotation approaches for procedural text understanding based on text-based game environments. We also drew upon embodied cognitive linguistics literature to propose a roadmap for grounded NLP research, and to inform the development of a new benchmark for measuring the progress of large language models on challenging commonsense reasoning tasks. We leveraged the richer supervision provided by text-based game environments to develop Breakpoint Transformers, a novel approach to modeling intermediate semantic information in long narrative or procedural texts. Finally, we integrated theories on the role of environments in collective human intelligence to propose a design for AI-augmented "social thinking environments" for knowledge workers like scientists.

DeSparsify: Adversarial Attack Against Token Sparsification Mechanisms in Vision Transformers

Authors: Oryan Yehezkel, Alon Zolfi, Amit Baras, Yuval Elovici, Asaf Shabtai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02554
Pdf link: https://arxiv.org/pdf/2402.02554
Abstract Vision transformers have contributed greatly to advancements in the computer vision domain, demonstrating state-of-the-art performance in diverse tasks (e.g., image classification, object detection). However, their high computational requirements grow quadratically with the number of tokens used. Token sparsification techniques have been proposed to address this issue. These techniques employ an input-dependent strategy, in which uninformative tokens are discarded from the computation pipeline, improving the model's efficiency. However, their dynamism and average-case assumption make them vulnerable to a new threat vector - carefully crafted adversarial examples capable of fooling the sparsification mechanism, resulting in worst-case performance. In this paper, we present DeSparsify, an attack targeting the availability of vision transformers that use token sparsification mechanisms. The attack aims to exhaust the operating system's resources, while maintaining its stealthiness. Our evaluation demonstrates the attack's effectiveness on three token sparsification techniques and examines the attack's transferability between them and its effect on the GPU resources. To mitigate the impact of the attack, we propose various countermeasures.
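
The cost argument can be made concrete with a toy model. The sketch below assumes a simple score-threshold sparsifier and counts token pairs as a stand-in for attention cost; the scores, the threshold, and the function names are hypothetical, and real sparsification mechanisms are more involved.

```python
from typing import List, Sequence


def keep_tokens(scores: Sequence[float], threshold: float) -> List[int]:
    """Indices of tokens whose informativeness score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def attention_pairs(num_tokens: int) -> int:
    """Self-attention compares every kept token with every kept token."""
    return num_tokens * num_tokens


threshold = 0.5
# Benign input: most tokens look uninformative and are dropped.
benign_scores = [0.9, 0.1, 0.2, 0.8, 0.05, 0.1]
# Adversarial input: crafted so that every token passes the threshold.
adversarial_scores = [0.9, 0.7, 0.6, 0.8, 0.75, 0.65]

benign_cost = attention_pairs(len(keep_tokens(benign_scores, threshold)))
adversarial_cost = attention_pairs(len(keep_tokens(adversarial_scores, threshold)))
```

Because cost grows with the square of the kept tokens, forcing the sparsifier to keep everything returns the model to its unsparsified, worst-case cost.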

Spatio-temporal Prompting Network for Robust Video Feature Extraction

Authors: Guanxiong Sun, Chi Wang, Zhaoyu Zhang, Jiankang Deng, Stefanos Zafeiriou, Yang Hua
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2402.02574
Pdf link: https://arxiv.org/pdf/2402.02574
Abstract Frame quality deterioration is one of the main challenges in the field of video understanding. To compensate for the information loss caused by deteriorated frames, recent approaches exploit transformer-based integration modules to obtain spatio-temporal information. However, these integration modules are heavy and complex. Furthermore, each integration module is specifically tailored for its target task, making it difficult to generalise to multiple tasks. In this paper, we present a neat and unified framework, called Spatio-Temporal Prompting Network (STPN). It can efficiently extract robust and accurate video features by dynamically adjusting the input features in the backbone network. Specifically, STPN predicts several video prompts containing spatio-temporal information of neighbour frames. Then, these video prompts are prepended to the patch embeddings of the current frame as the updated input for video feature extraction. Moreover, STPN is easy to generalise to various video tasks because it does not contain task-specific modules. Without bells and whistles, STPN achieves state-of-the-art performance on three widely-used datasets for different video understanding tasks, i.e., ImageNetVID for video object detection, YouTubeVIS for video instance segmentation, and GOT-10k for visual object tracking. Code is available at https://github.com/guanxiongsun/vfe.pytorch.

ClipFormer: Key-Value Clipping of Transformers on Memristive Crossbars for Write Noise Mitigation

Authors: Abhiroop Bhattacharjee, Abhishek Moitra, Priyadarshini Panda
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2402.02586
Pdf link: https://arxiv.org/pdf/2402.02586
Abstract Transformers have revolutionized various real-world applications from natural language processing to computer vision. However, the traditional von Neumann computing paradigm faces memory and bandwidth limitations in accelerating transformers owing to their massive model sizes. To this end, In-memory Computing (IMC) crossbars based on Non-volatile Memories (NVMs), due to their ability to perform highly parallelized Matrix-Vector-Multiplications (MVMs) with high energy-efficiencies, have emerged as a promising solution for accelerating transformers. However, analog MVM operations in crossbars introduce non-idealities, such as stochastic read & write noise, which affect the inference accuracy of the deployed transformers. Specifically, we find pre-trained Vision Transformers (ViTs) to be vulnerable on crossbars due to the impact of write noise on the dynamically-generated Key (K) and Value (V) matrices in the attention layers, an effect not accounted for in prior studies. We, thus, propose ClipFormer, a transformation on the K and V matrices during inference, to boost the non-ideal accuracies of pre-trained ViT models. ClipFormer requires no additional hardware and training overhead and is amenable to transformers deployed on any memristive crossbar platform. Our experiments on the ImageNet-1k dataset using pre-trained DeiT-S transformers, subjected to standard training and variation-aware-training, show >10-40% higher non-ideal accuracies at the high write noise regime by applying ClipFormer.
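
The abstract describes ClipFormer only as a transformation on the K and V matrices at inference time, and the title names it clipping. A minimal sketch of element-wise clipping is shown below; the symmetric clamp, the hand-picked limit, and the name `clip_matrix` are assumptions for illustration and are not the paper's exact procedure.

```python
from typing import List

Matrix = List[List[float]]


def clip_matrix(matrix: Matrix, limit: float) -> Matrix:
    """Clamp every entry to the range [-limit, limit] (hypothetical clipping rule)."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return [[max(-limit, min(limit, v)) for v in row] for row in matrix]


# Toy key matrix with two outlier entries.
keys = [[3.0, -0.5], [-4.0, 0.2]]
clipped_keys = clip_matrix(keys, limit=1.0)
```

Only the outlier entries change; values already inside the range pass through untouched, so the operation adds no parameters and needs no retraining.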

Unified Training of Universal Time Series Forecasting Transformers

Authors: Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, Doyen Sahoo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.02592
Pdf link: https://arxiv.org/pdf/2402.02592
Abstract Deep learning for time series forecasting has traditionally operated within a one-model-per-dataset framework, limiting its potential to leverage the game-changing impact of large pre-trained models. The concept of universal forecasting, emerging from pre-training on a vast collection of time series datasets, envisions a single Large Time Series Model capable of addressing diverse downstream forecasting tasks. However, constructing such a model poses unique challenges specific to time series data: i) cross-frequency learning, ii) accommodating an arbitrary number of variates for multivariate time series, and iii) addressing the varying distributional properties inherent in large-scale data. To address these challenges, we present novel enhancements to the conventional time series Transformer architecture, resulting in our proposed Masked Encoder-based Universal Time Series Forecasting Transformer (Moirai). Trained on our newly introduced Large-scale Open Time Series Archive (LOTSA) featuring over 27B observations across nine domains, Moirai achieves competitive or superior performance as a zero-shot forecaster when compared to full-shot models. Code, model weights, and data will be released.

Leveraging Continuously Differentiable Activation Functions for Learning in Quantized Noisy Environments

Authors: Vivswan Shah, Nathan Youngblood
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02593
Pdf link: https://arxiv.org/pdf/2402.02593
Abstract Real-world analog systems intrinsically suffer from noise that can impede model convergence and accuracy for a variety of deep learning models. We demonstrate that differentiable activations like GELU and SiLU enable robust propagation of gradients, which helps to mitigate the analog quantization error that is ubiquitous in analog systems. We perform analysis and training of convolutional, linear, and transformer networks in the presence of quantized noise. Here, we demonstrate that continuously differentiable activation functions are significantly more noise resilient than conventional rectified activations. For example, in the case of ReLU, the error in gradients is 100x higher than that in GELU near zero. Our findings provide guidance for selecting appropriate activations to realize performant and reliable hardware implementations across several machine learning domains such as computer vision, signal processing, and beyond.
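
To make the contrast between rectified and continuously differentiable activations concrete, here is a minimal sketch comparing how much the gradient changes when an input is nudged across zero. It uses the standard closed forms for the ReLU and exact (erf-based) GELU derivatives. The perturbation size `eps` is an arbitrary assumption, and the snippet does not attempt to reproduce the 100x figure reported in the abstract.

```python
import math


def relu_grad(x):
    """Derivative of ReLU: 0 for x <= 0, 1 for x > 0 (discontinuous at 0)."""
    return 1.0 if x > 0 else 0.0


def gelu_grad(x):
    """Derivative of GELU(x) = x * Phi(x), i.e. Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


eps = 1e-3  # assumed size of a small quantization-like perturbation
relu_jump = relu_grad(eps) - relu_grad(-eps)  # gradient flips from 0 to 1
gelu_jump = gelu_grad(eps) - gelu_grad(-eps)  # gradient changes smoothly
print(relu_jump, gelu_jump)
```

A tiny input error near zero flips the ReLU gradient entirely, while the GELU gradient moves only slightly around 0.5.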

Increasing Trust in Language Models through the Reuse of Verified Circuits

Authors: Philip Quirke, Clement Neo, Fazl Barez
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02619
Pdf link: https://arxiv.org/pdf/2402.02619
Abstract Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. Here, we define a stringent standard of trustworthiness whereby the task algorithm and circuit implementation must be verified, accounting for edge cases, with no known failure modes. We show that a transformer model can be trained to meet this standard if built using mathematically and logically specified frameworks. In this paper, we fully verify a model for n-digit integer addition. To exhibit the reusability of verified modules, we insert the trained integer addition model into an untrained model and train the combined model to perform both addition and subtraction. We find extensive reuse of the addition circuits for both tasks, easing verification of the more complex subtractor model. We discuss how inserting verified task modules into LMs can leverage model reuse to improve verifiability and trustworthiness of language models built using them. The reuse of verified circuits reduces the effort to verify more complex composite models which we believe to be a significant step towards safety of language models.

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

Authors: Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02622
Pdf link: https://arxiv.org/pdf/2402.02622
Abstract The transformer architecture from Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size -- adding a few thousand parameters for large-scale models in the 100B parameters range. Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations -- we refer to this operation as Depth-Weighted-Average (DWA). The learned DWA weights exhibit coherent patterns of information flow, revealing the strong and structured reuse of activations from distant layers. Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity as much deeper transformer models, and that for the same perplexity, these new models outperform transformer baselines in terms of memory efficiency and inference time.
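
The Depth-Weighted-Average step described in the DenseFormer abstract above might look roughly like the sketch below. It makes several assumptions: the "blocks" are toy functions on a single vector rather than transformer blocks, the weights are fixed numbers rather than learned parameters, and the history stores the already-averaged outputs of earlier blocks (the abstract does not say which representations are kept). The names `depth_weighted_average` and `forward` are hypothetical.

```python
def depth_weighted_average(history, weights):
    """Weighted sum over all representations seen so far (one weight each)."""
    if len(history) != len(weights):
        raise ValueError("need exactly one weight per stored representation")
    dim = len(history[0])
    return [sum(w * h[d] for w, h in zip(weights, history)) for d in range(dim)]


def forward(x0, blocks, dwa_weights):
    """Run blocks in order, applying a DWA step after each one."""
    history = [x0]  # index 0 holds the input embedding
    current = x0
    for block, weights in zip(blocks, dwa_weights):
        history.append(block(current))
        current = depth_weighted_average(history, weights)
        history[-1] = current  # assumption: later steps reuse the averaged output
    return current


# Toy stand-ins for transformer blocks.
blocks = [lambda v: [2 * a for a in v], lambda v: [a + 1 for a in v]]
x0 = [1.0, 2.0]
# Block i gets i + 2 weights: one per past representation plus the current one.
out = forward(x0, blocks, [[0.0, 1.0], [0.5, 0.0, 0.5]])
# With all weight on the current output, the plain stacked model is recovered.
plain = forward(x0, blocks, [[0.0, 1.0], [0.0, 0.0, 1.0]])
```

In this sketch a depth-`L` model needs only `2 + 3 + ... + (L + 1)` extra scalars, which fits the abstract's point that the parameter overhead is small.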

PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks

Authors: Ziquan Liu, Zhuo Zhi, Ilija Bogunovic, Carsten Gerner-Beuerle, Miguel Rodrigues
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02629
Pdf link: https://arxiv.org/pdf/2402.02629
Abstract It is widely known that state-of-the-art machine learning models, including vision and language models, can be seriously compromised by adversarial perturbations. It is therefore increasingly relevant to develop capabilities to certify their performance in the presence of the most effective adversarial attacks. Our paper offers a new approach to certify the performance of machine learning models in the presence of adversarial attacks with population level risk guarantees. In particular, we introduce the notion of $(\alpha,\zeta)$ machine learning model safety. We propose a hypothesis testing procedure, based on the availability of a calibration set, to derive statistical guarantees ensuring that the probability of declaring that the adversarial (population) risk of a machine learning model is less than $\alpha$ (i.e. the model is safe), while the model is in fact unsafe (i.e. the model adversarial population risk is higher than $\alpha$), is less than $\zeta$. We also propose Bayesian optimization algorithms to determine efficiently whether a machine learning model is $(\alpha,\zeta)$-safe in the presence of an adversarial attack, along with statistical guarantees. We apply our framework to a range of machine learning models including various sizes of vision Transformer (ViT) and ResNet models impaired by a variety of adversarial attacks, such as AutoAttack, SquareAttack and natural evolution strategy attack, to illustrate the operation of our approach. Importantly, we show that ViTs are generally more robust to adversarial attacks than ResNets, and ViT-large is more robust than smaller models. Our approach goes beyond existing empirical adversarial risk-based certification guarantees. It formulates rigorous (and provable) performance guarantees that can be used to satisfy regulatory requirements mandating the use of state-of-the-art technical tools.

Key-Graph Transformer for Image Restoration

Authors: Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Nicu Sebe
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02634
Pdf link: https://arxiv.org/pdf/2402.02634
Abstract While it is crucial to capture global information for effective image restoration (IR), integrating such cues into transformer-based methods becomes computationally expensive, especially with high input resolution. Furthermore, the self-attention mechanism in transformers is prone to considering unnecessary global cues from unrelated objects or regions, introducing computational inefficiencies. In response to these challenges, we introduce the Key-Graph Transformer (KGT) in this paper. Specifically, KGT views patch features as graph nodes. The proposed Key-Graph Constructor efficiently forms a sparse yet representative Key-Graph by selectively connecting essential nodes instead of all the nodes. Then the proposed Key-Graph Attention is conducted under the guidance of the Key-Graph only among selected nodes with linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed KGT's state-of-the-art performance, showcasing advancements both quantitatively and qualitatively.

Focal Modulation Networks for Interpretable Sound Classification

Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02754
Pdf link: https://arxiv.org/pdf/2402.02754
Abstract The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.

Approximate Attributions for Off-the-Shelf Siamese Transformers

Authors: Lucas Möller, Dmitry Nikolaev, Sebastian Padó
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02883
Pdf link: https://arxiv.org/pdf/2402.02883
Abstract Siamese encoders such as sentence transformers are among the least understood deep models. Established attribution methods cannot tackle this model class since it compares two inputs rather than processing a single one. To address this gap, we have recently proposed an attribution method specifically for Siamese encoders (Möller et al., 2023). However, it requires models to be adjusted and fine-tuned and therefore cannot be directly applied to off-the-shelf models. In this work, we reassess these restrictions and propose (i) a model with exact attribution ability that retains the original model's predictive performance and (ii) a way to compute approximate attributions for off-the-shelf models. We extensively compare approximate and exact attributions and use them to analyze the models' attention to different linguistic aspects. We gain insights into which syntactic roles Siamese transformers attend to, confirm that they mostly ignore negation, explore how they judge semantically opposite adjectives, and find that they exhibit lexical bias.

Time-, Memory- and Parameter-Efficient Visual Adaptation

Authors: Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02887
Pdf link: https://arxiv.org/pdf/2402.02887
Abstract As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many parameters are trained. They, however, typically still require backpropagating gradients throughout the model, meaning that their training-time and -memory cost does not reduce as significantly. We propose an adaptation method which does not backpropagate gradients through the backbone. We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone. As a result, our method is efficient not only in terms of parameters, but also in training-time and memory usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on the popular VTAB benchmark, and we further show how we outperform prior works with respect to training-time and -memory usage too. We further demonstrate the training efficiency and scalability of our method by adapting a vision transformer backbone of 4 billion parameters for the computationally demanding task of video classification, without any intricate model parallelism. Here, we outperform a prior adaptor-based method which could only scale to a 1 billion parameter backbone, or fully-finetuning a smaller backbone, with the same GPU and less training time.

Automated Cognate Detection as a Supervised Link Prediction Task with Cognate Transformer

Authors: V.S.D.S.Mahesh Akavarapu, Arnab Bhattacharya
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02926
Pdf link: https://arxiv.org/pdf/2402.02926
Abstract Identification of cognates across related languages is one of the primary problems in historical linguistics. Automated cognate identification is helpful for several downstream tasks including identifying sound correspondences, proto-language reconstruction, phylogenetic classification, etc. Previous state-of-the-art methods for cognate identification are mostly based on distributions of phonemes computed across multilingual wordlists and make little use of the cognacy labels that define links among cognate clusters. In this paper, we present a transformer-based architecture inspired by computational biology for the task of automated cognate detection. Beyond a certain amount of supervision, this method performs better than the existing methods, and shows steady improvement with further increase in supervision, thereby proving the efficacy of utilizing the labeled information. We also demonstrate that accepting multiple sequence alignments as input and having an end-to-end architecture with link prediction head saves much computation time while simultaneously yielding superior performance.

Exploring the Synergies of Hybrid CNNs and ViTs Architectures for Computer Vision: A survey

Authors: Haruna Yunusa, Shiyin Qin, Abdulrahman Hamman Adama Chukkol, Abdulganiyu Abdu Yusuf, Isah Bello, Adamu Lawan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.02941
Pdf link: https://arxiv.org/pdf/2402.02941
Abstract The hybrid of Convolutional Neural Network (CNN) and Vision Transformer (ViT) architectures has emerged as a groundbreaking approach, pushing the boundaries of computer vision (CV). This comprehensive review provides a thorough examination of the literature on state-of-the-art hybrid CNN-ViT architectures, exploring the synergies between these two approaches. The main content of this survey includes: (1) a background on the vanilla CNN and ViT, (2) a systematic review of various taxonomic hybrid designs to explore the synergy achieved through merging CNN and ViT models, (3) a comparative analysis of application task-specific synergy between different hybrid architectures, (4) challenges and future directions for hybrid models, and (5) a summary of key findings and recommendations. Through this exploration of hybrid CV architectures, the survey aims to serve as a guiding resource, fostering a deeper understanding of the intricate dynamics between CNNs and ViTs and their collective impact on shaping the future of CV architectures.

Putting Context in Context: the Impact of Discussion Structure on Text Classification

Authors: Nicolò Penzo, Antonio Longa, Bruno Lepri, Sara Tonelli, Marco Guerini
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2402.02975
Pdf link: https://arxiv.org/pdf/2402.02975
Abstract Current text classification approaches usually focus on the content to be classified. Contextual aspects (both linguistic and extra-linguistic) are usually neglected, even in tasks based on online discussions. Still in many cases the multi-party and multi-turn nature of the context from which these elements are selected can be fruitfully exploited. In this work, we propose a series of experiments on a large dataset for stance detection in English, in which we evaluate the contribution of different types of contextual information, i.e. linguistic, structural and temporal, by feeding them as natural language input into a transformer-based model. We also experiment with different amounts of training data and analyse the topology of local discussion networks in a privacy-compliant way. Results show that structural information can be highly beneficial to text classification but only under certain circumstances (e.g. depending on the amount of training data and on discussion chain complexity). Indeed, we show that contextual information on smaller datasets from other classification tasks does not yield significant improvements. Our framework, based on local discussion networks, allows the integration of structural information, while minimising user profiling, thus preserving their privacy.

UniMem: Towards a Unified View of Long-Context Large Language Models

Authors: Junjie Fang, Likai Tang, Hongzhe Bi, Yujia Qin, Si Sun, Zhenyu Li, Haolun Li, Yongjian Li, Xin Cong, Yukun Yan, Xiaodong Shi, Sen Song, Yankai Lin, Zhiyuan Liu, Maosong Sun
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.03009
Pdf link: https://arxiv.org/pdf/2402.03009
Abstract Long-context processing is a critical ability that constrains the applicability of large language models. Although there exist various methods devoted to enhancing the long-context processing ability of large language models (LLMs), they are developed in an isolated manner and lack systematic analysis and integration of their strengths, hindering further developments. In this paper, we introduce UniMem, a unified framework that reformulates existing long-context methods from the view of memory augmentation of LLMs. UniMem is characterized by four key dimensions: Memory Management, Memory Writing, Memory Reading, and Memory Injection, providing a systematic theory for understanding various long-context methods. We reformulate 16 existing methods based on UniMem and analyze four representative methods: Transformer-XL, Memorizing Transformer, RMT, and Longformer into equivalent UniMem forms to reveal their design principles and strengths. Based on these analyses, we propose UniMix, an innovative approach that integrates the strengths of these algorithms. Experimental results show that UniMix achieves superior performance in handling long contexts with significantly lower perplexity than baselines.

Taylor Videos for Action Recognition

Authors: Lei Wang, Xiuyuan Yuan, Tom Gedeon, Liang Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.03019
Pdf link: https://arxiv.org/pdf/2402.03019
Abstract Effectively extracting motions from video is a critical and long-standing problem for action recognition. This problem is very challenging because motions (i) do not have an explicit form, (ii) have various concepts such as displacement, velocity, and acceleration, and (iii) often contain noise caused by unstable pixels. Addressing these challenges, we propose the Taylor video, a new video format that highlights the dominant motions (e.g., a waving hand) in each of its frames, which we call Taylor frames. The Taylor video is named after the Taylor series, which approximates a function at a given point using important terms. In the scenario of videos, we define an implicit motion-extraction function which aims to extract motions from a video temporal block. In this block, using the frames, the difference frames, and higher-order difference frames, we perform Taylor expansion to approximate this function at the starting frame. We show that the summation of the higher-order terms in the Taylor series gives us dominant motion patterns, in which static objects and small, unstable motions are removed. Experimentally we show that Taylor videos are effective inputs to popular architectures including 2D CNNs, 3D CNNs, and transformers. When used individually, Taylor videos yield competitive action recognition accuracy compared to RGB videos and optical flow. When fused with RGB or optical flow videos, further accuracy improvement is achieved.
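
One way to picture the Taylor-frame construction described above is the following sketch. It treats each frame as a flat list of pixel intensities, takes repeated frame differences as stand-ins for temporal derivatives at the starting frame, and sums the terms of order 1 and above with `1/k!` coefficients. The exact terms, scaling, and normalization used by the authors are not given in the abstract, so those details (and the names `forward_differences` and `taylor_frame`) are assumptions.

```python
import math


def forward_differences(frames):
    """Return the k-th order differences at the first frame, for k = 0..T-1."""
    level = [list(f) for f in frames]
    out = [level[0]]
    while len(level) > 1:
        # Difference of consecutive frames at the current order.
        level = [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(level, level[1:])]
        out.append(level[0])
    return out


def taylor_frame(frames):
    """Sum the order >= 1 terms, each scaled by 1/k!, for every pixel."""
    diffs = forward_differences(frames)
    n_pixels = len(frames[0])
    return [
        sum(diffs[k][p] / math.factorial(k) for k in range(1, len(diffs)))
        for p in range(n_pixels)
    ]


# A 3-frame block with two pixels: pixel 0 is static, pixel 1 changes.
block = [[5, 0], [5, 1], [5, 4]]
frame = taylor_frame(block)
```

Because the zeroth-order term (the raw starting frame) is left out, the static pixel contributes nothing and only the changing pixel survives.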

Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector

Authors: Yuqian Fu, Yu Wang, Yixuan Pan, Lian Huai, Xingyu Qiu, Zeyu Shangguan, Tong Liu, Lingjie Kong, Yanwei Fu, Luc Van Gool, Xingqun Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.03094
Pdf link: https://arxiv.org/pdf/2402.03094
Abstract This paper addresses the challenge of cross-domain few-shot object detection (CD-FSOD), aiming to develop an accurate object detector for novel domains with minimal labeled examples. Transformer-based open-set detectors, e.g., DE-ViT, have excelled in both open-vocabulary object detection and traditional few-shot object detection, detecting categories beyond those seen during training. We thus naturally raise two key questions: 1) can such open-set detection methods easily generalize to CD-FSOD? 2) If not, how can the results of open-set methods be enhanced when faced with significant domain gaps? To address the first question, we introduce several metrics to quantify domain variances and establish a new CD-FSOD benchmark with diverse domain metric values. Some State-Of-The-Art (SOTA) open-set object detection methods are evaluated on this benchmark, with evident performance degradation observed across out-of-domain datasets. This indicates the failure of adopting open-set detectors directly for CD-FSOD. Subsequently, to overcome the performance degradation issue and also to answer the second question, we endeavor to enhance the vanilla DE-ViT. With several novel components including finetuning, a learnable prototype module, and a lightweight attention module, we present an improved Cross-Domain Vision Transformer for CD-FSOD (CD-ViTO). Experiments show that our CD-ViTO achieves impressive results on both out-of-domain and in-domain target datasets, establishing new SOTAs for both CD-FSOD and FSOD. All the datasets, codes, and models will be released to the community.

Less is KEN: a Universal and Simple Non-Parametric Pruning Algorithm for Large Language Models

Authors: Michele Mastromattei, Fabio Massimo Zanzotto
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.03142
Pdf link: https://arxiv.org/pdf/2402.03142
Abstract Neural network pruning has become increasingly crucial due to the complexity of neural network models and their widespread use in various fields. Existing pruning algorithms often suffer from limitations such as architecture specificity, excessive complexity and reliance on complex calculations, rendering them impractical for real-world applications. In this paper, we propose KEN: a straightforward, universal and unstructured pruning algorithm based on Kernel Density Estimation (KDE). KEN aims to construct optimized transformer models by selectively preserving the most significant parameters while restoring others to their pre-training state. This approach maintains model performance while allowing storage of only the optimized subnetwork, leading to significant memory savings. Extensive evaluations on seven transformer models demonstrate that KEN achieves equal or better performance than the original models with a minimum parameter reduction of 25%. In-depth comparisons against other pruning and PEFT algorithms confirm KEN's effectiveness. Furthermore, we introduce KEN_viz, an explainable tool that visualizes the optimized model composition and the subnetwork selected by KEN.

Is Mamba Capable of In-Context Learning?

Authors: Riccardo Grazzi, Julien Siems, Simon Schrodi, Thomas Brox, Frank Hutter
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.03170
Pdf link: https://arxiv.org/pdf/2402.03170
Abstract This work provides empirical evidence that Mamba, a newly proposed selective structured state space model, has in-context learning (ICL) capabilities similar to those of transformers. We evaluated Mamba on tasks involving simple function approximation as well as more complex natural language processing problems. Our results demonstrate that across both categories of tasks, Mamba matches the performance of transformer models for ICL. Further analysis reveals that, like transformers, Mamba appears to solve ICL problems by incrementally optimizing its internal representations. Overall, our work suggests that Mamba can be an efficient alternative to transformers for ICL tasks involving longer input sequences.

Homograph Attacks on Maghreb Sentiment Analyzers

Authors: Fatima Zahra Qachfar, Rakesh M. Verma
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.03171
Pdf link: https://arxiv.org/pdf/2402.03171
Abstract We examine the impact of homograph attacks on the Sentiment Analysis (SA) task of different Arabic dialects from the Maghreb North-African countries. Homograph attacks result in a 65.3% decrease in transformer classification performance, from an F1-score of 0.95 to 0.33, when data is written in "Arabizi". The goal of this study is to highlight LLMs' weaknesses and to prioritize ethical and responsible Machine Learning.
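
The 65.3% figure above is consistent with a relative (not absolute) drop between the two reported F1-scores, as this short check shows. Only the two F1 values come from the abstract; the variable names are illustrative.

```python
f1_clean = 0.95     # F1-score before the attack, from the abstract
f1_attacked = 0.33  # F1-score under homograph attack, from the abstract

absolute_drop = f1_clean - f1_attacked   # in F1 points
relative_drop = absolute_drop / f1_clean  # as a fraction of the clean score

print(f"absolute drop: {absolute_drop:.2f} F1 points")
print(f"relative drop: {relative_drop * 100:.1f}%")
```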

Accurate and Well-Calibrated ICD Code Assignment Through Attention Over Diverse Label Embeddings

Authors: Gonçalo Gomes, Isabel Coutinho, Bruno Martins
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.03172
Pdf link: https://arxiv.org/pdf/2402.03172
Abstract Although the International Classification of Diseases (ICD) has been adopted worldwide, manually assigning ICD codes to clinical text is time-consuming, error-prone, and expensive, motivating the development of automated approaches. This paper describes a novel approach for automated ICD coding, combining several ideas from previous related work. We specifically employ a strong Transformer-based model as a text encoder and, to handle lengthy clinical narratives, we explored either (a) adapting the base encoder model into a Longformer, or (b) dividing the text into chunks and processing each chunk independently. The representations produced by the encoder are combined with a label embedding mechanism that explores diverse ICD code synonyms. Experiments with different splits of the MIMIC-III dataset show that the proposed approach outperforms the current state-of-the-art models in ICD coding, with the label embeddings significantly contributing to the good performance. Our approach also leads to properly calibrated classification results, which can effectively inform downstream tasks such as quantification.

C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models

Authors: Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2402.03181
Pdf link: https://arxiv.org/pdf/2402.03181
Abstract Despite the impressive capabilities of large language models (LLMs) across diverse applications, they still suffer from trustworthiness issues, such as hallucinations and misalignments. Retrieval-augmented language models (RAG) have been proposed to enhance the credibility of generations by grounding them in external knowledge, but the theoretical understanding of their generation risks remains unexplored. In this paper, we answer: 1) whether RAG can indeed lead to low generation risks, 2) how to provide provable guarantees on the generation risks of RAG and vanilla LLMs, and 3) what sufficient conditions enable RAG models to reduce generation risks. We propose C-RAG, the first framework to certify generation risks for RAG models. Specifically, we provide conformal risk analysis for RAG models and certify an upper confidence bound of generation risks, which we refer to as conformal generation risk. We also provide theoretical guarantees on conformal generation risks for general bounded risk functions under test distribution shifts. We prove that RAG achieves a lower conformal generation risk than that of a single LLM when the quality of the retrieval model and transformer is non-trivial. Our intensive empirical results demonstrate the soundness and tightness of our conformal generation risk guarantees across four widely-used NLP datasets on four state-of-the-art retrieval models.

MobilityGPT: Enhanced Human Mobility Modeling with a GPT model

Authors: Ammar Haydari, Dongjie Chen, Zhengfeng Lai, Chen-Nee Chuah
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.03264
Pdf link: https://arxiv.org/pdf/2402.03264
Abstract Generative models have shown promising results in capturing human mobility characteristics and generating synthetic trajectories. However, it remains challenging to ensure that the generated geospatial mobility data is semantically realistic, including consistent location sequences, and reflects real-world characteristics, such as adherence to geospatial limits. To address these issues, we reformat human mobility modeling as an autoregressive generation task, leveraging Generative Pre-trained Transformer (GPT). To enable controllable generation and alleviate the above challenges, we propose a geospatially-aware generative model, MobilityGPT. We propose a gravity-based sampling method to train a transformer for semantic sequence similarity. Then, we constrain the training process via a road connectivity matrix that provides the connectivity of sequences in trajectory generation, thereby keeping generated trajectories within geospatial limits. Lastly, we construct a Reinforcement Learning from Trajectory Feedback (RLTF) mechanism to minimize the travel distance between training and the synthetically generated trajectories. Our experiments on real-world datasets demonstrate that MobilityGPT outperforms state-of-the-art methods in generating high-quality mobility trajectories that are closest to real data in terms of origin-destination similarity, trip length, travel radius, link, and gravity distributions.
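
A common way to keep an autoregressive model inside a road network is to mask the scores of unreachable next links before normalizing, and the sketch below shows that idea. It is an assumption about how a road connectivity matrix could be used, not the MobilityGPT implementation: the abstract only says the matrix constrains training. The function name, the 3-link network, and the logits are all hypothetical.

```python
import math


def masked_next_link_probs(logits, connectivity, current):
    """Softmax over next-link scores, restricted to links reachable from `current`.

    connectivity[i][j] is truthy when link j can directly follow link i.
    """
    allowed = connectivity[current]
    if not any(allowed):
        raise ValueError("current link has no outgoing connections")
    # Unreachable links get -inf so they receive exactly zero probability.
    masked = [l if ok else float("-inf") for l, ok in zip(logits, allowed)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(m - top) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical network: link 0 connects to links 1 and 2, but not to itself.
connectivity = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
logits = [5.0, 1.0, 1.0]  # the model prefers staying on link 0, which is invalid
probs = masked_next_link_probs(logits, connectivity, current=0)
```

Even though the unconstrained scores favor link 0, all probability mass ends up on the two connected links.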

Make Every Move Count: LLM-based High-Quality RTL Code Generation Using MCTS

Authors: Matthew DeLorenzo, Animesh Basak Chowdhury, Vasudev Gohil, Shailja Thakur, Ramesh Karri, Siddharth Garg, Jeyavijayan Rajendran
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2402.03289
Pdf link: https://arxiv.org/pdf/2402.03289
Abstract Existing large language models (LLMs) for register transfer level code generation face challenges like compilation failures and suboptimal power, performance, and area (PPA) efficiency. This is due to the lack of PPA awareness in conventional transformer decoding algorithms. In response, we present an automated transformer decoding algorithm that integrates Monte Carlo tree-search for lookahead, guiding the transformer to produce compilable, functionally correct, and PPA-optimized code. Empirical evaluation with a fine-tuned language model on RTL codesets shows that our proposed technique consistently generates functionally correct code compared to prompting-only methods and effectively addresses the PPA-unawareness drawback of naive large language models. For the largest design generated by the state-of-the-art LLM (16-bit adder), our technique can achieve a 31.8% improvement in the area-delay product.

Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining

Authors: Jiarun Liu, Hao Yang, Hong-Yu Zhou, Yan Xi, Lequan Yu, Yizhou Yu, Yong Liang, Guangming Shi, Shaoting Zhang, Hairong Zheng, Shanshan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2402.03302
Pdf link: https://arxiv.org/pdf/2402.03302
Abstract Accurate medical image segmentation demands the integration of multi-scale information, spanning from local features to global dependencies. However, it is challenging for existing methods to model long-range global information: convolutional neural networks (CNNs) are constrained by their local receptive fields, and vision transformers (ViTs) suffer from the high quadratic complexity of their attention mechanism. Recently, Mamba-based models have gained great attention for their impressive ability in long sequence modeling. Several studies have demonstrated that these models can outperform popular vision models in various tasks, offering higher accuracy, lower memory consumption, and less computational burden. However, existing Mamba-based models are mostly trained from scratch and do not explore the power of pretraining, which has been proven to be quite effective for data-efficient medical image analysis. This paper introduces a novel Mamba-based model, Swin-UMamba, designed specifically for medical image segmentation tasks, leveraging the advantages of ImageNet-based pretraining. Our experimental results reveal the vital role of ImageNet-based training in enhancing the performance of Mamba-based models. Swin-UMamba demonstrates superior performance by a large margin compared to CNNs, ViTs, and the latest Mamba-based models. Notably, on the AbdomenMRI, Endoscopy, and Microscopy datasets, Swin-UMamba outperforms its closest counterpart U-Mamba by an average score of 3.58%. The code and models of Swin-UMamba are publicly available at: https://github.com/JiarunLiu/Swin-UMamba

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

Feb 06 '24 02:02 DongZhouGu
