arxiv-daily New submissions for Tuesday, 2 July 2024 (showing 763 of 763 entries )

New submissions for Tuesday, 2 July 2024 (showing 763 of 763 entries )

Open DongZhouGu opened this issue 7 months ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Title:

      Instance Temperature Knowledge Distillation

Authors: Zhengbo Zhang, Yuxi Zhou, Jia Gong, Jun Liu, Zhigang Tu
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Knowledge distillation (KD) enhances the performance of a student network by allowing it to learn the knowledge transferred from a teacher network incrementally. Existing methods dynamically adjust the temperature to enable the student network to adapt to the varying learning difficulties at different learning stages of KD. KD is a continuous process, but when adjusting the temperature, these methods consider only the immediate benefits of the operation in the current learning phase and fail to take into account its future returns. To address this issue, we formulate the adjustment of temperature as a sequential decision-making task and propose a method based on reinforcement learning, termed RLKD. Importantly, we design a novel state representation to enable the agent to make more informed action (i.e. instance temperature adjustment). To handle the problem of delayed rewards in our method due to the KD setting, we explore an instance reward calibration approach. In addition,we devise an efficient exploration strategy that enables the agent to learn valuable instance temperature adjustment policy more efficiently. Our framework can serve as a plug-and-play technique to be inserted into various KD methods easily, and we validate its effectiveness on both image classification and object detection tasks. Our code is at this https URL

Title:

      Multi-Species Object Detection in Drone Imagery for Population Monitoring of Endangered Animals

Authors: Sowmya Sankaran
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Animal populations worldwide are rapidly declining, and a technology that can accurately count endangered species could be vital for monitoring population changes over several years. This research focused on fine-tuning object detection models for drone images to create accurate counts of animal species. Hundreds of images taken using a drone and large, openly available drone-image datasets were used to fine-tune machine learning models with the baseline YOLOv8 architecture. We trained 30 different models, with the largest having 43.7 million parameters and 365 layers, and used hyperparameter tuning and data augmentation techniques to improve accuracy. While the state-of-the-art YOLOv8 baseline had only 0.7% accuracy on a dataset of safari animals, our models had 95% accuracy on the same dataset. Finally, we deployed the models on the Jetson Orin Nano for demonstration of low-power real-time species detection for easy inference on drones.

Title:

      RepAct: The Re-parameterizable Adaptive Activation Function

Authors: Xian Wu, Qingchuan Tao, Shuang Wang
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Addressing the imperative need for efficient artificial intelligence in IoT and edge computing, this study presents RepAct, a re-parameterizable adaptive activation function tailored for optimizing lightweight neural networks within the computational limitations of edge devices. By employing a multi-branch structure with learnable adaptive weights, RepAct enriches feature processing and enhances cross-layer interpretability. When evaluated on tasks such as image classification and object detection, RepAct notably surpassed conventional activation functions in lightweight networks, delivering up to a 7.92% accuracy boost on MobileNetV3-Small for the ImageNet100 dataset, while maintaining computational complexity on par with HardSwish. This innovative approach not only maximizes model parameter efficiency but also significantly improves the performance and understanding capabilities of lightweight neural networks, demonstrating its potential for real-time edge computing applications.

Title:

      Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review

Authors: Moseli Mots'oehli
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract While supervised learning has achieved significant success in computer vision tasks, acquiring high-quality annotated data remains a bottleneck. This paper explores both scholarly and non-scholarly works in AI-assistive deep learning image annotation systems that provide textual suggestions, captions, or descriptions of the input image to the annotator. This potentially results in higher annotation efficiency and quality. Our exploration covers annotation for a range of computer vision tasks including image classification, object detection, regression, instance, semantic segmentation, and pose estimation. We review various datasets and how they contribute to the training and evaluation of AI-assistive annotation systems. We also examine methods leveraging neuro-symbolic learning, deep active learning, and self-supervised learning algorithms that enable semantic image understanding and generate free-text output. These include image captioning, visual question answering, and multi-modal reasoning. Despite the promising potential, there is limited publicly available work on AI-assistive image annotation with textual output capabilities. We conclude by suggesting future research directions to advance this field, emphasizing the need for more publicly accessible datasets and collaborative efforts between academia and industry.

Title:

      When Robots Get Chatty: Grounding Multimodal Human-Robot Conversation and Collaboration

Authors: Philipp Allgeuer, Hassan Ali, Stefan Wermter
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We investigate the use of Large Language Models (LLMs) to equip neural robotic agents with human-like social and cognitive competencies, for the purpose of open-ended human-robot conversation and collaboration. We introduce a modular and extensible methodology for grounding an LLM with the sensory perceptions and capabilities of a physical robot, and integrate multiple deep learning models throughout the architecture in a form of system integration. The integrated models encompass various functions such as speech recognition, speech generation, open-vocabulary object detection, human pose estimation, and gesture detection, with the LLM serving as the central text-based coordinating unit. The qualitative and quantitative results demonstrate the huge potential of LLMs in providing emergent cognition and interactive language-oriented control of robots in a natural and social manner.

Title:

      DroBoost: An Intelligent Score and Model Boosting Method for Drone Detection

Authors: Ogulcan Eryuksel, Kamil Anil Ozfuttu, Fatih Cagatay Akyon, Kadir Sahin, Efe Buyukborekci, Devrim Cavusoglu, Sinan Altinuc
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Drone detection is a challenging object detection task where visibility conditions and quality of the images may be unfavorable, and detections might become difficult due to complex backgrounds, small visible objects, and hard to distinguish objects. Both provide high confidence for drone detections, and eliminating false detections requires efficient algorithms and approaches. Our previous work, which uses YOLOv5, uses both real and synthetic data and a Kalman-based tracker to track the detections and increase their confidence using temporal information. Our current work improves on the previous approach by combining several improvements. We used a more diverse dataset combining multiple sources and combined with synthetic samples chosen from a large synthetic dataset based on the error analysis of the base model. Also, to obtain more resilient confidence scores for objects, we introduced a classification component that discriminates whether the object is a drone or not. Finally, we developed a more advanced scoring algorithm for object tracking that we use to adjust localization confidence. Furthermore, the proposed technique won 1st Place in the Drone vs. Bird Challenge (Workshop on Small-Drone Surveillance, Detection and Counteraction Techniques at ICIAP 2021).

Title:

      GSO-YOLO: Global Stability Optimization YOLO for Construction Site Detection

Authors: Yuming Zhang, Dongzhi Guan, Shouxin Zhang, Junhao Su, Yunzhi Han, Jiabin Liu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Safety issues at construction sites have long plagued the industry, posing risks to worker safety and causing economic damage due to potential hazards. With the advancement of artificial intelligence, particularly in the field of computer vision, the automation of safety monitoring on construction sites has emerged as a solution to this longstanding issue. Despite achieving impressive performance, advanced object detection methods like YOLOv8 still face challenges in handling the complex conditions found at construction sites. To solve these problems, this study presents the Global Stability Optimization YOLO (GSO-YOLO) model to address challenges in complex construction sites. The model integrates the Global Optimization Module (GOM) and Steady Capture Module (SCM) to enhance global contextual information capture and detection stability. The innovative AIoU loss function, which combines CIoU and EIoU, improves detection accuracy and efficiency. Experiments on datasets like SODA, MOCS, and CIS show that GSO-YOLO outperforms existing methods, achieving SOTA performance.

Title:

      Real-Time Neuromorphic Navigation: Integrating Event-Based Vision and Physics-Driven Planning on a Parrot Bebop2 Quadrotor

Authors: Amogh Joshi, Sourav Sanyal, Kaushik Roy
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In autonomous aerial navigation, real-time and energy-efficient obstacle avoidance remains a significant challenge, especially in dynamic and complex indoor environments. This work presents a novel integration of neuromorphic event cameras with physics-driven planning algorithms implemented on a Parrot Bebop2 quadrotor. Neuromorphic event cameras, characterized by their high dynamic range and low latency, offer significant advantages over traditional frame-based systems, particularly in poor lighting conditions or during high-speed maneuvers. We use a DVS camera with a shallow Spiking Neural Network (SNN) for event-based object detection of a moving ring in real-time in an indoor lab. Further, we enhance drone control with physics-guided empirical knowledge inside a neural network training mechanism, to predict energy-efficient flight paths to fly through the moving ring. This integration results in a real-time, low-latency navigation system capable of dynamically responding to environmental changes while minimizing energy consumption. We detail our hardware setup, control loop, and modifications necessary for real-world applications, including the challenges of sensor integration without burdening the flight capabilities. Experimental results demonstrate the effectiveness of our approach in achieving robust, collision-free, and energy-efficient flight paths, showcasing the potential of neuromorphic vision and physics-driven planning in enhancing autonomous navigation systems.

Title:

      SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection

Authors: Dingkang Liang, Wei Hua, Chunsheng Shi, Zhikang Zou, Xiaoqing Ye, Xiang Bai
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects common in aerial images unexplored. At the same time, the annotation cost of multi-oriented objects is significantly higher than that of their horizontal counterparts. Therefore, in this paper, we propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++. Specifically, we observe that objects from aerial images are usually arbitrary orientations, small scales, and aggregation, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various multi-oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V1.5 benchmark, the proposed method outperforms previous state-of-the-art (SOTA) by a large margin (+2.92, +2.39, and +2.57 mAP under 10%, 20%, and 30% labeled data settings, respectively) with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline with 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, resulting in a 72.48 mAP, pushing the new state-of-the-art. The code will be made available.

Title:

      No More Potentially Dynamic Objects: Static Point Cloud Map Generation based on 3D Object Detection and Ground Projection

Authors: Soojin Woo, Donghwi Jung, Seong-Woo Kim
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In this paper, we propose an algorithm to generate a static point cloud map based on LiDAR point cloud data. Our proposed pipeline detects dynamic objects using 3D object detectors and projects points of dynamic objects onto the ground. Typically, point cloud data acquired in real-time serves as a snapshot of the surrounding areas containing both static objects and dynamic objects. The static objects include buildings and trees, otherwise, the dynamic objects contain objects such as parked cars that change their position over time. Removing dynamic objects from the point cloud map is crucial as they can degrade the quality and localization accuracy of the map. To address this issue, in this paper, we propose an algorithm that creates a map only consisting of static objects. We apply a 3D object detection algorithm to the point cloud data which are obtained from LiDAR to implement our pipeline. We then stack the points to create the map after performing ground segmentation and projection. As a result, not only we can eliminate currently dynamic objects at the time of map generation but also potentially dynamic objects such as parked vehicles. We validate the performance of our method using two kinds of datasets collected on real roads: KITTI and our dataset. The result demonstrates the capability of our proposal to create an accurate static map excluding dynamic objects from input point clouds. Also, we verified the improved performance of localization using a generated map based on our method.

Title:

      Eliminating Position Bias of Language Models: A Mechanistic Approach

Authors: Ziqi Wang, Hanlin Zhang, Xiner Li, Kuan-Hao Huang, Chi Han, Shuiwang Ji, Sham M. Kakade, Hao Peng, Heng Ji
Subjects: Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. Specifically, we find that causal attention generally causes models to favor distant content, while relative positional encodings like RoPE prefer nearby ones based on the analysis of retrieval-augmented question answering (QA). Further, our empirical study on object detection reveals that position bias is also present in vision-language models (VLMs). Based on the above analyses, we propose to ELIMINATE position bias caused by different input segment orders (e.g., options in LM-as-a-judge, retrieved documents in QA) in a TRAINING-FREE ZERO-SHOT manner. Our method changes the causal attention to bidirectional attention between segments and utilizes model attention values to decide the relative orders of segments instead of using the order provided in input prompts, therefore enabling Position-INvariant inferencE (PINE) at the segment level. By eliminating position bias, models achieve better performance and reliability in downstream tasks where position bias widely exists, such as LM-as-a-judge and retrieval-augmented QA. Notably, PINE is especially useful when adapting LMs for evaluating reasoning pairs: it consistently provides 8 to 10 percentage points performance gains in most cases, and makes Llama-3-70B-Instruct perform even better than GPT-4-0125-preview on the RewardBench reasoning subset.

Title:

      Cross-Architecture Auxiliary Feature Space Translation for Efficient Few-Shot Personalized Object Detection

Authors: Francesco Barbato, Umberto Michieli, Jijoong Moon, Pietro Zanuttigh, Mete Ozay
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Recent years have seen object detection robotic systems deployed in several personal devices (e.g., home robots and appliances). This has highlighted a challenge in their design, i.e., they cannot efficiently update their knowledge to distinguish between general classes and user-specific instances (e.g., a dog vs. user's dog). We refer to this challenging task as Instance-level Personalized Object Detection (IPOD). The personalization task requires many samples for model tuning and optimization in a centralized server, raising privacy concerns. An alternative is provided by approaches based on recent large-scale Foundation Models, but their compute costs preclude on-device applications. In our work we tackle both problems at the same time, designing a Few-Shot IPOD strategy called AuXFT. We introduce a conditional coarse-to-fine few-shot learner to refine the coarse predictions made by an efficient object detector, showing that using an off-the-shelf model leads to poor personalization due to neural collapse. Therefore, we introduce a Translator block that generates an auxiliary feature space where features generated by a self-supervised model (e.g., DINOv2) are distilled without impacting the performance of the detector. We validate AuXFT on three publicly available datasets and one in-house benchmark designed for the IPOD task, achieving remarkable gains in all considered scenarios with excellent time-complexity trade-off: AuXFT reaches a performance of 80% its upper bound at just 32% of the inference time, 13% of VRAM and 19% of the model size.

Title:

      Formal Verification of Object Detection

Authors: Avraham Raviv, Yizhak Y. Elboher, Michelle Aluf-Medina, Yael Leibovich Weiss, Omer Cohen, Roy Assa, Guy Katz, Hillel Kugler
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Deep Neural Networks (DNNs) are ubiquitous in real-world applications, yet they remain vulnerable to errors and adversarial attacks. This work tackles the challenge of applying formal verification to ensure the safety of computer vision models, extending verification beyond image classification to object detection. We propose a general formulation for certifying the robustness of object detection models using formal verification and outline implementation strategies compatible with state-of-the-art verification tools. Our approach enables the application of these tools, originally designed for verifying classification models, to object detection. We define various attacks for object detection, illustrating the diverse ways adversarial inputs can compromise neural network outputs. Our experiments, conducted on several common datasets and networks, reveal potential errors in object detection models, highlighting system vulnerabilities and emphasizing the need for expanding formal verification to these new domains. This work paves the way for further research in integrating formal verification across a broader range of computer vision applications.

Title:

      Scarecrow monitoring system:employing mobilenet ssd for enhanced animal supervision

Authors: Balaji VS, Mahi AR, Anirudh Ganapathy PS, Manju M
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Agriculture faces a growing challenge with wildlife wreaking havoc on crops, threatening sustainability. The project employs advanced object detection, the system utilizes the Mobile Net SSD model for real-time animal classification. The methodology initiates with the creation of a dataset, where each animal is represented by annotated images. The SSD Mobile Net architecture facilitates the use of a model for image classification and object detection. The model undergoes fine-tuning and optimization during training, enhancing accuracy for precise animal classification. Real-time detection is achieved through a webcam and the OpenCV library, enabling prompt identification and categorization of approaching animals. By seamlessly integrating intelligent scarecrow technology with object detection, this system offers a robust solution to field protection, minimizing crop damage and promoting precision farming. It represents a valuable contribution to agricultural sustainability, addressing the challenge of wildlife interference with crops. The implementation of the Intelligent Scarecrow Monitoring System stands as a progressive tool for proactive field management and protection, empowering farmers with an advanced solution for precision agriculture. Keywords: Machine learning, Deep Learning, Computer Vision, MobileNet SSD

Keyword: transformer

Title:

      Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference

Authors: Anton Xue, Avishree Khare, Rajeev Alur, Surbhi Goel, Eric Wong
Subjects: Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We study how to subvert language models from following the rules. We model rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form "if $P$ and $Q$, then $R$" for some propositions $P$, $Q$, and $R$. We prove that although transformers can faithfully abide by such rules, maliciously crafted prompts can nevertheless mislead even theoretically constructed models. Empirically, we find that attacks on our theoretical models mirror popular attacks on large language models. Our work suggests that studying smaller theoretical models can help understand the behavior of large language models in rule-based settings like logical reasoning and jailbreak attacks.

Title:

      OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Authors: Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau$ = {$o_0$, $a_0$, $\dots$} and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials.

Title:

      Multimodal Prototyping for cancer survival prediction

Authors: Andrew H. Song, Richard J. Chen, Guillaume Jaume, Anurag J. Vaidya, Alexander S. Baras, Faisal Mahmood
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Multimodal survival methods combining gigapixel histology whole-slide images (WSIs) and transcriptomic profiles are particularly promising for patient prognostication and stratification. Current approaches involve tokenizing the WSIs into smaller patches (>10,000 patches) and transcriptomics into gene groups, which are then integrated using a Transformer for predicting outcomes. However, this process generates many tokens, which leads to high memory requirements for computing attention and complicates post-hoc interpretability analyses. Instead, we hypothesize that we can: (1) effectively summarize the morphological content of a WSI by condensing its constituting tokens using morphological prototypes, achieving more than 300x compression; and (2) accurately characterize cellular functions by encoding the transcriptomic profile with biological pathway prototypes, all in an unsupervised fashion. The resulting multimodal tokens are then processed by a fusion network, either with a Transformer or an optimal transport cross-alignment, which now operates with a small and fixed number of tokens without approximations. Extensive evaluation on six cancer types shows that our framework outperforms state-of-the-art methods with much less computation while unlocking new interpretability analyses.

Title:

      Transformer-based Image and Video Inpainting: Current Challenges and Future Directions

Authors: Omar Elharrouss, Rafat Damseh, Abdelkader Nasreddine Belkacem, Elarbi Badidi, Abderrahmane Lakas
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Image inpainting is currently a hot topic within the field of computer vision. It offers a viable solution for various applications, including photographic restoration, video editing, and medical imaging. Deep learning advancements, notably convolutional neural networks (CNNs) and generative adversarial networks (GANs), have significantly enhanced the inpainting task with an improved capability to fill missing or damaged regions in an image or video through the incorporation of contextually appropriate details. These advancements have improved other aspects, including efficiency, information preservation, and achieving both realistic textures and structures. Recently, visual transformers have been exploited and offer some improvements to image or video inpainting. The advent of transformer-based architectures, which were initially designed for natural language processing, has also been integrated into computer vision tasks. These methods utilize self-attention mechanisms that excel in capturing long-range dependencies within data; therefore, they are particularly effective for tasks requiring a comprehensive understanding of the global context of an image or video. In this paper, we provide a comprehensive review of the current image or video inpainting approaches, with a specific focus on transformer-based techniques, with the goal to highlight the significant improvements and provide a guideline for new researchers in the field of image or video inpainting using visual transformers. We categorized the transformer-based techniques by their architectural configurations, types of damage, and performance metrics. Furthermore, we present an organized synthesis of the current challenges, and suggest directions for future research in the field of image or video inpainting.

Title:

      Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription

Authors: Jaydeep Borkar, David A. Smith
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Historical documents frequently suffer from damage and inconsistencies, including missing or illegible text resulting from issues such as holes, ink problems, and storage damage. These missing portions or gaps are referred to as lacunae. In this study, we employ transformer-based optical character recognition (OCR) models trained on synthetic data containing lacunae in a supervised manner. We demonstrate their effectiveness in detecting and restoring lacunae, achieving a success rate of 65%, compared to a base model lacking knowledge of lacunae, which achieves only 5% restoration. Additionally, we investigate the mechanistic properties of the model, such as the log probability of transcription, which can identify lacunae and other errors (e.g., mistranscriptions due to complex writing or ink issues) in line images without directly inspecting the image. This capability could be valuable for scholars seeking to distinguish images containing lacunae or errors from clean ones. Although we explore the potential of attention mechanisms in flagging lacunae and transcription errors, our findings suggest it is not a significant factor. Our work highlights a promising direction in utilizing transformer-based OCR models for restoring or analyzing damaged historical documents.

Title:

      Personalised Outfit Recommendation via History-aware Transformers

Authors: David Jung, Julien Monteil, Philip Schulz, Volodymyr Vaskovych
Subjects: Subjects: Systems and Control (eess.SY); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We present the history-aware transformer (HAT), a transformer-based model that uses shoppers' purchase history to personalise outfit predictions. The aim of this work is to recommend outfits that are internally coherent while matching an individual shopper's style and taste. To achieve this, we stack two transformer models, one that produces outfit representations and another one that processes the history of purchased outfits for a given shopper. We use these models to score an outfit's compatibility in the context of a shopper's preferences as inferred from their previous purchases. During training, the model learns to discriminate between purchased and random outfits using 3 losses: the focal loss for outfit compatibility typically used in the literature, a contrastive loss to bring closer learned outfit embeddings from a shopper's history, and an adaptive margin loss to facilitate learning from weak negatives. Together, these losses enable the model to make personalised recommendations based on a shopper's purchase history. Our experiments on the IQON3000 and Polyvore datasets show that HAT outperforms strong baselines on the outfit Compatibility Prediction (CP) and the Fill In The Blank (FITB) tasks. The model improves AUC for the CP hard task by 15.7% (IQON3000) and 19.4% (Polyvore) compared to previous SOTA results. It further improves accuracy on the FITB hard task by 6.5% and 9.7%, respectively. We provide ablation studies on the personalisation, constrastive loss, and adaptive margin loss that highlight the importance of these modelling choices.

Title:

      UM2N: Towards Universal Mesh Movement Networks

Authors: Mingrui Zhang, Chunyang Wang, Stephan Kramer, Joseph G. Wallwork, Siyi Li, Jiancheng Liu, Xiang Chen, Matthew D. Piggott
Subjects: Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Solving complex Partial Differential Equations (PDEs) accurately and efficiently is an essential and challenging problem in all scientific and engineering disciplines. Mesh movement methods provide the capability to improve the accuracy of the numerical solution without increasing the overall mesh degree of freedom count. Conventional sophisticated mesh movement methods are extremely expensive and struggle to handle scenarios with complex boundary geometries. However, existing learning-based methods require re-training from scratch given a different PDE type or boundary geometry, which limits their applicability, and also often suffer from robustness issues in the form of inverted elements. In this paper, we introduce the Universal Mesh Movement Network (UM2N), which -- once trained -- can be applied in a non-intrusive, zero-shot manner to move meshes with different size distributions and structures, for solvers applicable to different PDE types and boundary geometries. UM2N consists of a Graph Transformer (GT) encoder for extracting features and a Graph Attention Network (GAT) based decoder for moving the mesh. We evaluate our method on advection and Navier-Stokes based examples, as well as a real-world tsunami simulation case. Our method outperforms existing learning-based mesh movement methods in terms of the benchmarks described above. In comparison to the conventional sophisticated Monge-Ampère PDE-solver based method, our approach not only significantly accelerates mesh movement, but also proves effective in scenarios where the conventional method fails. Our project page is at \url{this https URL}.

Title:

      Query-Efficient Hard-Label Black-Box Attack against Vision Transformers

Authors: Chao Zhou, Xiaowen Shi, Yuan-Gen Wang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Recent studies have revealed that vision transformers (ViTs) face similar security risks from adversarial attacks as deep convolutional neural networks (CNNs). However, directly applying attack methodology on CNNs to ViTs has been demonstrated to be ineffective since the ViTs typically work on patch-wise encoding. This article explores the vulnerability of ViTs against adversarial attacks under a black-box scenario, and proposes a novel query-efficient hard-label adversarial attack method called AdvViT. Specifically, considering that ViTs are highly sensitive to patch modification, we propose to optimize the adversarial perturbation on the individual patches. To reduce the dimension of perturbation search space, we modify only a handful of low-frequency components of each patch. Moreover, we design a weight mask matrix for all patches to further optimize the perturbation on different regions of a whole image. We test six mainstream ViT backbones on the ImageNet-1k dataset. Experimental results show that compared with the state-of-the-art attacks on CNNs, our AdvViT achieves much lower $L_2$-norm distortion under the same query budget, sufficiently validating the vulnerability of ViTs against adversarial attacks.

Title:

      eFontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts.A Cross-Genre Survey

Authors: Krzysztof Nowak, Jędrzej Ziębura, Krzysztof Wróbel, Aleksander Smywiński-Pohl
Subjects: Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This study introduces the eFontes models for automatic linguistic annotation of Medieval Latin texts, focusing on lemmatization, part-of-speech tagging, and morphological feature determination. Using the Transformers library, these models were trained on Universal Dependencies (UD) corpora and the newly developed eFontes corpus of Polish Medieval Latin. The research evaluates the models' performance, addressing challenges such as orthographic variations and the integration of Latinized vernacular terms. The models achieved high accuracy rates: lemmatization at 92.60%, part-of-speech tagging at 83.29%, and morphological feature determination at 88.57%. The findings underscore the importance of high-quality annotated corpora and propose future enhancements, including extending the models to Named Entity Recognition.

Title:

      Interpreting Pretrained Speech Models for Automatic Speech Assessment of Voice Disorders

Authors: Hok-Shing Lau, Mark Huntly, Nathon Morgan, Adesua Iyenoma, Biao Zeng, Tim Bashford
Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Speech contains information that is clinically relevant to some diseases, which has the potential to be used for health assessment. Recent work shows an interest in applying deep learning algorithms, especially pretrained large speech models to the applications of Automatic Speech Assessment. One question that has not been explored is how these models output the results based on their inputs. In this work, we train and compare two configurations of Audio Spectrogram Transformer in the context of Voice Disorder Detection and apply the attention rollout method to produce model relevance maps, the computed relevance of the spectrogram regions when the model makes predictions. We use these maps to analyse how models make predictions in different conditions and to show that the spread of attention is reduced as a model is finetuned, and the model attention is concentrated on specific phoneme regions.

Title:

      WallFacer: Guiding Transformer Model Training Out of the Long-Context Dark Forest with N-body Problem

Authors: Ziming Liu, Shaoyu Wang, Shenggan Cheng, Zhongkai Zhao, Yang Bai, Xuanlei Zhao, James Demmel, Yang You
Subjects: Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In recent years, Transformer-based Large Language Models (LLMs) have garnered significant attention due to their exceptional performance across a variety of tasks. However, training these models on long sequences presents a substantial challenge in terms of efficiency and scalability. Current methods are constrained either by the number of attention heads, limiting scalability, or by excessive communication overheads. In this paper, we propose an insight that Attention Computation can be considered as a special case of n-body problem with direct interactions. Based on this concept, this paper introduces WallFacer, an efficient long-sequence training system with a novel multi-dimensional ring sequence parallelism, fostering an efficient communication paradigm and extra tuning space for communication arrangement. Through comprehensive experiments under diverse environments and model settings, we demonstrate that WallFacer significantly surpasses state-of-the-art method that supports near-infinite sequence length, achieving performance improvements of up to 77.12%.

Title:

      LegalTurk Optimized BERT for Multi-Label Text Classification and NER

Authors: Farnaz Zeidi, Mehmet Fatih Amasyali, Çiğdem Erol
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The introduction of the Transformer neural network, along with techniques like self-supervised pre-training and transfer learning, has paved the way for advanced models like BERT. Despite BERT's impressive performance, opportunities for further enhancement exist. To our knowledge, most efforts are focusing on improving BERT's performance in English and in general domains, with no study specifically addressing the legal Turkish domain. Our study is primarily dedicated to enhancing the BERT model within the legal Turkish domain through modifications in the pre-training phase. In this work, we introduce our innovative modified pre-training approach by combining diverse masking strategies. In the fine-tuning task, we focus on two essential downstream tasks in the legal domain: name entity recognition and multi-label text classification. To evaluate our modified pre-training approach, we fine-tuned all customized models alongside the original BERT models to compare their performance. Our modified approach demonstrated significant improvements in both NER and multi-label text classification tasks compared to the original BERT model. Finally, to showcase the impact of our proposed models, we trained our best models with different corpus sizes and compared them with BERTurk models. The experimental results demonstrate that our innovative approach, despite being pre-trained on a smaller corpus, competes with BERTurk.

Title:

      Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation

Authors: Yuchuan Tian, Jianhong Han, Hanting Chen, Yuanyuan Xi, Guoyang Zhang, Jie Hu, Chao Xu, Yunhe Wang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Due to the unaffordable size and intensive computation costs of low-level vision models, All-in-One models that are designed to address a handful of low-level vision tasks simultaneously have been popular. However, existing All-in-One models are limited in terms of the range of tasks and performance. To overcome these limitations, we propose Instruct-IPT -- an All-in-One Image Processing Transformer that could effectively address manifold image restoration tasks with large inter-task gaps, such as denoising, deblurring, deraining, dehazing, and desnowing. Rather than popular feature adaptation methods, we propose weight modulation that adapts weights to specific tasks. Firstly, we figure out task-sensitive weights via a toy experiment and introduce task-specific biases on top of them. Secondly, we conduct rank analysis for a good compression strategy and perform low-rank decomposition on the biases. Thirdly, we propose synchronous training that updates the task-general backbone model and the task-specific biases simultaneously. In this way, the model is instructed to learn general and task-specific knowledge. Via our simple yet effective method that instructs the IPT to be task experts, Instruct-IPT could better cooperate between tasks with distinct characteristics at humble costs. Further, we propose to maneuver Instruct-IPT with text instructions for better user interfaces. We have conducted experiments on Instruct-IPT to demonstrate the effectiveness of our method on manifold tasks, and we have effectively extended our method to diffusion denoisers as well. The code is available at this https URL.

Title:

      Engineering an Efficient Object Tracker for Non-Linear Motion

Authors: Momir Adžemović, Predrag Tadić, Andrija Petrović, Mladen Nikolić
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The goal of multi-object tracking is to detect and track all objects in a scene while maintaining unique identifiers for each, by associating their bounding boxes across video frames. This association relies on matching motion and appearance patterns of detected objects. This task is especially hard in case of scenarios involving dynamic and non-linear motion patterns. In this paper, we introduce DeepMoveSORT, a novel, carefully engineered multi-object tracker designed specifically for such scenarios. In addition to standard methods of appearance-based association, we improve motion-based association by employing deep learnable filters (instead of the most commonly used Kalman filter) and a rich set of newly proposed heuristics. Our improvements to motion-based association methods are severalfold. First, we propose a new transformer-based filter architecture, TransFilter, which uses an object's motion history for both motion prediction and noise filtering. We further enhance the filter's performance by careful handling of its motion history and accounting for camera motion. Second, we propose a set of heuristics that exploit cues from the position, shape, and confidence of detected bounding boxes to improve association performance. Our experimental evaluation demonstrates that DeepMoveSORT outperforms existing trackers in scenarios featuring non-linear motion, surpassing state-of-the-art results on three such datasets. We also perform a thorough ablation study to evaluate the contributions of different tracker components which we proposed. Based on our study, we conclude that using a learnable filter instead of the Kalman filter, along with appearance-based association is key to achieving strong general tracking performance.

Title:

      Chest-Diffusion: A Light-Weight Text-to-Image Model for Report-to-CXR Generation

Authors: Peng Huang, Xue Gao, Lihong Huang, Jing Jiao, Xiaokang Li, Yuanyuan Wang, Yi Guo
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Text-to-image generation has important implications for generation of diverse and controllable images. Several attempts have been made to adapt Stable Diffusion (SD) to the medical domain. However, the large distribution difference between medical reports and natural texts, as well as high computational complexity in common stable diffusion limit the authenticity and feasibility of the generated medical images. To solve above problems, we propose a novel light-weight transformer-based diffusion model learning framework, Chest-Diffusion, for report-to-CXR generation. Chest-Diffusion employs a domain-specific text encoder to obtain accurate and expressive text features to guide image generation, improving the authenticity of the generated images. Meanwhile, we introduce a light-weight transformer architecture as the denoising model, reducing the computational complexity of the diffusion model. Experiments demonstrate that our Chest-Diffusion achieves the lowest FID score 24.456, under the computation budget of 118.918 GFLOPs, which is nearly one-third of the computational complexity of SD.

Title:

      Exploring a Physics-Informed Decision Transformer for Distribution System Restoration: Methodology and Performance Analysis

Authors: Hong Zhao, Jin Wei-Kocsis, Adel Heidari Akhijahani, Karen L Butler-Purry
Subjects: Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Driven by advancements in sensing and computing, deep reinforcement learning (DRL)-based methods have demonstrated significant potential in effectively tackling distribution system restoration (DSR) challenges under uncertain operational scenarios. However, the data-intensive nature of DRL poses obstacles in achieving satisfactory DSR solutions for large-scale, complex distribution systems. Inspired by the transformative impact of emerging foundation models, including large language models (LLMs), across various domains, this paper explores an innovative approach harnessing LLMs' powerful computing capabilities to address scalability challenges inherent in conventional DRL methods for solving DSR. To our knowledge, this study represents the first exploration of foundation models, including LLMs, in revolutionizing conventional DRL applications in power system operations. Our contributions are twofold: 1) introducing a novel LLM-powered Physics-Informed Decision Transformer (PIDT) framework that leverages LLMs to transform conventional DRL methods for DSR operations, and 2) conducting comparative studies to assess the performance of the proposed LLM-powered PIDT framework at its initial development stage for solving DSR problems. While our primary focus in this paper is on DSR operations, the proposed PIDT framework can be generalized to optimize sequential decision-making across various power system operations.

Title:

      NAIST Simultaneous Speech Translation System for IWSLT 2024

Authors: Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Tomoya Yanagita, Kosuke Doi, Mana Makinae, Haotian Tan, Makoto Sakai, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura
Subjects: Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper describes NAIST's submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-{German, Japanese, Chinese} speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.

Title:

      SAFE: a SAR Feature Extractor based on self-supervised learning and masked Siamese ViTs

Authors: Max Muzeau, Joana Frontera-Pons, Chengfang Ren, Jean-Philippe Ovarlez
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Due to its all-weather and day-and-night capabilities, Synthetic Aperture Radar imagery is essential for various applications such as disaster management, earth monitoring, change detection and target recognition. However, the scarcity of labeled SAR data limits the performance of most deep learning algorithms. To address this issue, we propose a novel self-supervised learning framework based on masked Siamese Vision Transformers to create a General SAR Feature Extractor coined SAFE. Our method leverages contrastive learning principles to train a model on unlabeled SAR data, extracting robust and generalizable features. SAFE is applicable across multiple SAR acquisition modes and resolutions. We introduce tailored data augmentation techniques specific to SAR imagery, such as sub-aperture decomposition and despeckling. Comprehensive evaluations on various downstream tasks, including few-shot classification, segmentation, visualization, and pattern detection, demonstrate the effectiveness and versatility of the proposed approach. Our network competes with or surpasses other state-of-the-art methods in few-shot classification and segmentation tasks, even without being trained on the sensors used for the evaluation.

Title:

      Mechanistic Interpretation through Contextual Decomposition in Transformers

Authors: Aliyah R. Hsu, Yeshwanth Cherapanamjeri, Anobel Y. Odisho, Peter R. Carroll, Bin Yu
Subjects: Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Transformers exhibit impressive capabilities but are often regarded as black boxes due to challenges in understanding the complex nonlinear relationships between features. Interpreting machine learning models is of paramount importance to mitigate risks, and mechanistic interpretability is in particular of current interest as it opens up a window for guiding manual modifications and reverse-engineering solutions. In this work, we introduce contextual decomposition for transformers (CD-T), extending a prior work on CD for RNNs and CNNs, to address mechanistic interpretation computationally efficiently. CD-T is a flexible interpretation method for transformers. It can capture contributions of combinations of input features or source internal components (e.g. attention heads, feed-forward networks) to (1) final predictions or (2) the output of any target internal component. Using CD-T, we propose a novel algorithm for circuit discovery. On a real-world pathology report classification task: we show CD-T distills a more faithful circuit of attention heads with improved computational efficiency (speed up 2x) than a prior benchmark, path patching. As a versatile interpretation method, CD-T also exhibits exceptional capabilities for local interpretations. CD-T is shown to reliably find words and phrases of contrasting sentiment/topic on SST-2 and AGNews datasets. Through human experiments, we demonstrate CD-T enables users to identify the more accurate of two models and to better trust a model's outputs compared to alternative interpretation methods such as SHAP and LIME.

Title:

      Papez: Resource-Efficient Speech Separation with Auditory Working Memory

Authors: Hyunseok Oh, Juheon Yi, Youngki Lee
Subjects: Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Transformer-based models recently reached state-of-the-art single-channel speech separation accuracy; However, their extreme computational load makes it difficult to deploy them in resource-constrained mobile or IoT devices. We thus present Papez, a lightweight and computation-efficient single-channel speech separation model. Papez is based on three key techniques. We first replace the inter-chunk Transformer with small-sized auditory working memory. Second, we adaptively prune the input tokens that do not need further processing. Finally, we reduce the number of parameters through the recurrent transformer. Our extensive evaluation shows that Papez achieves the best resource and accuracy tradeoffs with a large margin. We publicly share our source code at \texttt{this https URL}

Title:

      How to Leverage Digit Embeddings to Represent Numbers?

Authors: Jasivan Alex Sivakumar, Nafise Sadat Moosavi
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Apart from performing arithmetic operations, understanding numbers themselves is still a challenge for existing language models. Simple generalisations, such as solving 100+200 instead of 1+2, can substantially affect model performance (Sivakumar and Moosavi, 2023). Among various techniques, character-level embeddings of numbers have emerged as a promising approach to improve number representation. However, this method has limitations as it leaves the task of aggregating digit representations to the model, which lacks direct supervision for this process. In this paper, we explore the use of mathematical priors to compute aggregated digit embeddings and explicitly incorporate these aggregates into transformer models. This can be achieved either by adding a special token to the input embeddings or by introducing an additional loss function to enhance correct predictions. We evaluate the effectiveness of incorporating this explicit aggregation, analysing its strengths and shortcomings, and discuss future directions to better benefit from this approach. Our methods, while simple, are compatible with any pretrained model and require only a few lines of code, which we have made publicly available.

Title:

      Large Language Model Enhanced Knowledge Representation Learning: A Survey

Authors: Xin Wang, Zirui Chen, Haofen Wang, Leong Hou U, Zhao Li, Wenbin Guo
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The integration of Large Language Models (LLMs) with Knowledge Representation Learning (KRL) signifies a pivotal advancement in the field of artificial intelligence, enhancing the ability to capture and utilize complex knowledge structures. This synergy leverages the advanced linguistic and contextual understanding capabilities of LLMs to improve the accuracy, adaptability, and efficacy of KRL, thereby expanding its applications and potential. Despite the increasing volume of research focused on embedding LLMs within the domain of knowledge representation, a thorough review that examines the fundamental components and processes of these enhanced models is conspicuously absent. Our survey addresses this by categorizing these models based on three distinct Transformer architectures, and by analyzing experimental data from various KRL downstream tasks to evaluate the strengths and weaknesses of each approach. Finally, we identify and explore potential future research directions in this emerging yet underexplored domain, proposing pathways for continued progress.

Title:

      Diffusion Transformer Model With Compact Prior for Low-dose PET Reconstruction

Authors: Bin Huang, Xubiao Liu, Lei Fang, Qiegen Liu, Bingxuan Li
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Positron emission tomography (PET) is an advanced medical imaging technique that plays a crucial role in non-invasive clinical diagnosis. However, while reducing radiation exposure through low-dose PET scans is beneficial for patient safety, it often results in insufficient statistical data. This scarcity of data poses significant challenges for accurately reconstructing high-quality images, which are essential for reliable diagnostic outcomes. In this research, we propose a diffusion transformer model (DTM) guided by joint compact prior (JCP) to enhance the reconstruction quality of low-dose PET imaging. In light of current research findings, we present a pioneering PET reconstruction model that integrates diffusion and transformer models for joint optimization. This model combines the powerful distribution mapping abilities of diffusion models with the capacity of transformers to capture long-range dependencies, offering significant advantages for low-dose PET reconstruction. Additionally, the incorporation of the lesion refining block and penalized weighted least squares (PWLS) enhance the recovery capability of lesion regions and preserves detail information, solving blurring problems in lesion areas and texture details of most deep learning frameworks. Experimental results demonstrate the effectiveness of DTM in enhancing image quality and preserving critical clinical information for low-dose PET scans. Our approach not only reduces radiation exposure risks but also provides a more reliable PET imaging tool for early disease detection and patient management.

Title:

      SpectralKAN: Kolmogorov-Arnold Network for Hyperspectral Images Change Detection

Authors: Yanheng Wang, Xiaohan Yu, Yongsheng Gao, Jianjun Sha, Jian Wang, Lianru Gao, Yonggang Zhang, Xianhui Rong
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract It has been verified that deep learning methods, including convolutional neural networks (CNNs), graph neural networks (GNNs), and transformers, can accurately extract features from hyperspectral images (HSIs). These algorithms perform exceptionally well on HSIs change detection (HSIs-CD). However, the downside of these impressive results is the enormous number of parameters, FLOPs, GPU memory, training and test times required. In this paper, we propose an spectral Kolmogorov-Arnold Network for HSIs-CD (SpectralKAN). SpectralKAN represent a multivariate continuous function with a composition of activation functions to extract HSIs feature and classification. These activation functions are b-spline functions with different parameters that can simulate various functions. In SpectralKAN, a KAN encoder is proposed to enhance computational efficiency for HSIs. And a spatial-spectral KAN encoder is introduced, where the spatial KAN encoder extracts spatial features and compresses the spatial dimensions from patch size to one. The spectral KAN encoder then extracts spectral features and classifies them into changed and unchanged categories. We use five HSIs-CD datasets to verify the effectiveness of SpectralKAN. Experimental verification has shown that SpectralKAN maintains high HSIs-CD accuracy while requiring fewer parameters, FLOPs, GPU memory, training and testing times, thereby increasing the efficiency of HSIs-CD. The code will be available at this https URL.

Title:

      Universal Approximation Theory: The basic theory for large language models

Authors: Wei Wang, Qing Li
Subjects: Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Language models have emerged as a critical area of focus in artificial intelligence, particularly with the introduction of groundbreaking innovations like ChatGPT. Large-scale Transformer networks have quickly become the leading approach for advancing natural language processing algorithms. Built on the Transformer architecture, these models enable interactions that closely mimic human communication and, equipped with extensive knowledge, can even assist in guiding human tasks. Despite their impressive capabilities and growing complexity, a key question remains-the theoretical foundations of large language models (LLMs). What makes Transformer so effective for powering intelligent language applications, such as translation and coding? What underlies LLMs' ability for In-Context Learning (ICL)? How does the LoRA scheme enhance the fine-tuning of LLMs? And what supports the practicality of pruning LLMs? To address these critical questions and explore the technological strategies within LLMs, we leverage the Universal Approximation Theory (UAT) to offer a theoretical backdrop, shedding light on the mechanisms that underpin these advancements.

Title:

      How Does Overparameterization Affect Features?

Authors: Ahmet Cagri Duzgun, Samy Jelassi, Yuanzhi Li
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Overparameterization, the condition where models have more parameters than necessary to fit their training loss, is a crucial factor for the success of deep learning. However, the characteristics of the features learned by overparameterized networks are not well understood. In this work, we explore this question by comparing models with the same architecture but different widths. We first examine the expressivity of the features of these models, and show that the feature space of overparameterized networks cannot be spanned by concatenating many underparameterized features, and vice versa. This reveals that both overparameterized and underparameterized networks acquire some distinctive features. We then evaluate the performance of these models, and find that overparameterized networks outperform underparameterized networks, even when many of the latter are concatenated. We corroborate these findings using a VGG-16 and ResNet18 on CIFAR-10 and a Transformer on the MNLI classification dataset. Finally, we propose a toy setting to explain how overparameterized networks can learn some important features that the underparamaterized networks cannot learn.

Title:

      Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval

Authors: Hanwen Su, Ge Song, Kai Huang, Jiyan Wang, Ming Yang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In this paper, we study the problem of zero-shot sketch-based image retrieval (ZS-SBIR). The prior methods tackle the problem in a two-modality setting with only category labels or even no textual information involved. However, the growing prevalence of Large-scale pre-trained Language Models (LLMs), which have demonstrated great knowledge learned from web-scale data, can provide us with an opportunity to conclude collective textual information. Our key innovation lies in the usage of text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers. To this end, we propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval. The network consists of three components: (i) a Description Generation Module that generates textual descriptions for each training category by prompting an LLM with several interrogative sentences, (ii) a Feature Extraction Module that includes two ViTs for sketch and image data, a transformer for extracting tokens of sentences of each training category, finally (iii) a Cross-modal Alignment Module that exchanges the token features of both text-sketch and text-image using cross-attention mechanism, and align the tokens locally and globally. Extensive experiments on three benchmark datasets show our superior performances over the state-of-the-art ZS-SBIR methods.

Title:

      Embedded Prompt Tuning: Towards Enhanced Calibration of Pretrained Models for Medical Images

Authors: Wenqiang Zu, Shenghao Xie, Qing Zhao, Guoqi Li, Lei Ma
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Foundation models pre-trained on large-scale data have been widely witnessed to achieve success in various natural imaging downstream tasks. Parameter-efficient fine-tuning (PEFT) methods aim to adapt foundation models to new domains by updating only a small portion of parameters in order to reduce computational overhead. However, the effectiveness of these PEFT methods, especially in cross-domain few-shot scenarios, e.g., medical image analysis, has not been fully explored. In this work, we facilitate the study of the performance of PEFT when adapting foundation models to medical image classification tasks. Furthermore, to alleviate the limitations of prompt introducing ways and approximation capabilities on Transformer architectures of mainstream prompt tuning methods, we propose the Embedded Prompt Tuning (EPT) method by embedding prompt tokens into the expanded channels. We also find that there are anomalies in the feature space distribution of foundation models during pre-training process, and prompt tuning can help mitigate this negative impact. To explain this phenomenon, we also introduce a novel perspective to understand prompt tuning: \textbf{Prompt tuning is a distribution calibrator.} And we support it by analyzing patch-wise scaling and feature separation operations contained in EPT. Our experiments show that EPT outperforms several state-of-the-art fine-tuning methods by a significant margin on few-shot medical image classification tasks, and completes the fine-tuning process within highly competitive time, indicating EPT is an effective PEFT method. Our code will be released once accepted.

Title:

      GMT: A Robust Global Association Model for Multi-Target Multi-Camera Tracking

Authors: Huijie Fan, Tinghui Zhao, Qiang Wang, Baojie Fan, Yandong Tang, LianQing Liu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In the task of multi-target multi-camera (MTMC) tracking of pedestrians, the data association problem is a key issue and main challenge, especially with complications arising from camera movements, lighting variations, and obstructions. However, most MTMC models adopt two-step approaches, thus heavily depending on the results of the first-step tracking in practical applications. Moreover, the same targets crossing different cameras may exhibit significant appearance variations, which further increases the difficulty of cross-camera matching. To address the aforementioned issues, we propose a global online MTMC tracking model that addresses the dependency on the first tracking stage in two-step methods and enhances cross-camera matching. Specifically, we propose a transformer-based global MTMC association module to explore target associations across different cameras and frames, generating global trajectories directly. Additionally, to integrate the appearance and spatio-temporal features of targets, we propose a feature extraction and fusion module for MTMC tracking. This module enhances feature representation and establishes correlations between the features of targets across multiple cameras. To accommodate high scene diversity and complex lighting condition variations, we have established the VisionTrack dataset, which enables the development of models that are more generalized and robust to various environments. Our model demonstrates significant improvements over comparison methods on the VisionTrack dataset and others.

Title:

      SE(3)-Hyena Operator for Scalable Equivariant Learning

Authors: Artem Moskalev, Mangal Prakash, Rui Liao, Tommaso Mansi
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Modeling global geometric context while maintaining equivariance is crucial for accurate predictions in many fields such as biology, chemistry, or vision. Yet, this is challenging due to the computational demands of processing high-dimensional data at scale. Existing approaches such as equivariant self-attention or distance-based message passing, suffer from quadratic complexity with respect to sequence length, while localized methods sacrifice global information. Inspired by the recent success of state-space and long-convolutional models, in this work, we introduce SE(3)-Hyena operator, an equivariant long-convolutional model based on the Hyena operator. The SE(3)-Hyena captures global geometric context at sub-quadratic complexity while maintaining equivariance to rotations and translations. Evaluated on equivariant associative recall and n-body modeling, SE(3)-Hyena matches or outperforms equivariant self-attention while requiring significantly less memory and computational resources for long sequences. Our model processes the geometric context of 20k tokens x3.5 times faster than the equivariant transformer and allows x175 longer a context within the same memory budget.

Title:

      HGNET: A Hierarchical Feature Guided Network for Occupancy Flow Field Prediction

Authors: Zhan Chen, Chen Tang, Lu Xiong
Subjects: Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Predicting the motion of multiple traffic participants has always been one of the most challenging tasks in autonomous driving. The recently proposed occupancy flow field prediction method has shown to be a more effective and scalable representation compared to general trajectory prediction methods. However, in complex multi-agent traffic scenarios, it remains difficult to model the interactions among various factors and the dependencies among prediction outputs at different time steps. In view of this, we propose a transformer-based hierarchical feature guided network (HGNET), which can efficiently extract features of agents and map information from visual and vectorized inputs, modeling multimodal interaction relationships. Second, we design the Feature-Guided Attention (FGAT) module to leverage the potential guiding effects between different prediction targets, thereby improving prediction accuracy. Additionally, to enhance the temporal consistency and causal relationships of the predictions, we propose a Time Series Memory framework to learn the conditional distribution models of the prediction outputs at future time steps from multivariate time series. The results demonstrate that our model exhibits competitive performance, which ranks 3rd in the 2024 Waymo Occupancy and Flow Prediction Challenge.

Title:

      Investigating the potential of Sparse Mixtures-of-Experts for multi-domain neural machine translation

Authors: Nadezhda Chirkova, Vassilina Nikoulina, Jean-Luc Meunier, Alexandre Bérard
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We focus on multi-domain Neural Machine Translation, with the goal of developing efficient models which can handle data from various domains seen during training and are robust to domains unseen during training. We hypothesize that Sparse Mixture-of-Experts (SMoE) models are a good fit for this task, as they enable efficient model scaling, which helps to accommodate a variety of multi-domain data, and allow flexible sharing of parameters between domains, potentially enabling knowledge transfer between similar domains and limiting negative transfer. We conduct a series of experiments aimed at validating the utility of SMoE for the multi-domain scenario, and find that a straightforward width scaling of Transformer is a simpler and surprisingly more efficient approach in practice, and reaches the same performance level as SMoE. We also search for a better recipe for robustness of multi-domain systems, highlighting the importance of mixing-in a generic domain, i.e. Paracrawl, and introducing a simple technique, domain randomization.

Title:

      DaBiT: Depth and Blur informed Transformer for Joint Refocusing and Super-Resolution

Authors: Crispian Morris, Nantheera Anantrasirichai, Fan Zhang, David Bull
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In many real-world scenarios, recorded videos suffer from accidental focus blur, and while video deblurring methods exist, most specifically target motion blur. This paper introduces a framework optimised for the joint task of focal deblurring (refocusing) and video super-resolution (VSR). The proposed method employs novel map guided transformers, in addition to image propagation, to effectively leverage the continuous spatial variance of focal blur and restore the footage. We also introduce a flow re-focusing module to efficiently align relevant features between the blurry and sharp domains. Additionally, we propose a novel technique for generating synthetic focal blur data, broadening the model's learning capabilities to include a wider array of content. We have made a new benchmark dataset, DAVIS-Blur, available. This dataset, a modified extension of the popular DAVIS video segmentation set, provides realistic out-of-focus blur degradations as well as the corresponding blur maps. Comprehensive experiments on DAVIS-Blur demonstrate the superiority of our approach. We achieve state-of-the-art results with an average PSNR performance over 1.9dB greater than comparable existing video restoration methods. Our source code will be made available at this https URL

Title:

      Hypformer: Exploring Efficient Hyperbolic Transformer Fully in Hyperbolic Space

Authors: Menglin Yang, Harshit Verma, Delvin Ce Zhang, Jiahong Liu, Irwin King, Rex Ying
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Hyperbolic geometry have shown significant potential in modeling complex structured data, particularly those with underlying tree-like and hierarchical structures. Despite the impressive performance of various hyperbolic neural networks across numerous domains, research on adapting the Transformer to hyperbolic space remains limited. Previous attempts have mainly focused on modifying self-attention modules in the Transformer. However, these efforts have fallen short of developing a complete hyperbolic Transformer. This stems primarily from: (i) the absence of well-defined modules in hyperbolic space, including linear transformation layers, LayerNorm layers, activation functions, dropout operations, etc. (ii) the quadratic time complexity of the existing hyperbolic self-attention module w.r.t the number of input tokens, which hinders its scalability. To address these challenges, we propose, Hypformer, a novel hyperbolic Transformer based on the Lorentz model of hyperbolic geometry. In Hypformer, we introduce two foundational blocks that define the essential modules of the Transformer in hyperbolic space. Furthermore, we develop a linear self-attention mechanism in hyperbolic space, enabling hyperbolic Transformer to process billion-scale graph data and long-sequence inputs for the first time. Our experimental results confirm the effectiveness and efficiency of Hypformer across various datasets, demonstrating its potential as an effective and scalable solution for large-scale data representation and large models.

Title:

      Multi-State-Action Tokenisation in Decision Transformers for Multi-Discrete Action Spaces

Authors: Perusha Moodley, Pramod Kaushik, Dhillu Thambi, Mark Trovinger, Praveen Paruchuri, Xia Hong, Benjamin Rosman
Subjects: Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Decision Transformers, in their vanilla form, struggle to perform on image-based environments with multi-discrete action spaces. Although enhanced Decision Transformer architectures have been developed to improve performance, these methods have not specifically addressed this problem of multi-discrete action spaces which hampers existing Decision Transformer architectures from learning good representations. To mitigate this, we propose Multi-State Action Tokenisation (M-SAT), an approach for tokenising actions in multi-discrete action spaces that enhances the model's performance in such environments. Our approach involves two key changes: disentangling actions to the individual action level and tokenising the actions with auxiliary state information. These two key changes also improve individual action level interpretability and visibility within the attention layers. We demonstrate the performance gains of M-SAT on challenging ViZDoom environments with multi-discrete action spaces and image-based state spaces, including the Deadly Corridor and My Way Home scenarios, where M-SAT outperforms the baseline Decision Transformer without any additional data or heavy computational overheads. Additionally, we find that removing positional encoding does not adversely affect M-SAT's performance and, in some cases, even improves it.

Title:

      Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios

Authors: Juan Ignacio Alvarez-Trejos, Beltrán Labrador, Alicia Lozano-Diez
Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap. This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities, while maintaining their overlap handling strengths. To achieve this, we propose several methods for incorporating these embeddings along the acoustic features. Furthermore, we delve into an analysis of the correct handling of silence frames, the window length for extracting speaker embeddings and the transformer encoder size. The effectiveness of our proposed approach is thoroughly evaluated on the CallHome dataset for the two-speaker diarization task, with results that demonstrate a significant reduction in diarization error rates achieving a relative improvement of a 10.78% compared to the baseline end-to-end model.

Title:

      Gradient-based Class Weighting for Unsupervised Domain Adaptation in Dense Prediction Visual Tasks

Authors: Roberto Alcover-Couso, Marcos Escudero-Viñolo, Juan C. SanMiguel, Jesus Bescós
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In unsupervised domain adaptation (UDA), where models are trained on source data (e.g., synthetic) and adapted to target data (e.g., real-world) without target annotations, addressing the challenge of significant class imbalance remains an open issue. Despite considerable progress in bridging the domain gap, existing methods often experience performance degradation when confronted with highly imbalanced dense prediction visual tasks like semantic and panoptic segmentation. This discrepancy becomes especially pronounced due to the lack of equivalent priors between the source and target domains, turning class imbalanced techniques used for other areas (e.g., image classification) ineffective in UDA scenarios. This paper proposes a class-imbalance mitigation strategy that incorporates class-weights into the UDA learning losses, but with the novelty of estimating these weights dynamically through the loss gradient, defining a Gradient-based class weighting (GBW) learning. GBW naturally increases the contribution of classes whose learning is hindered by large-represented classes, and has the advantage of being able to automatically and quickly adapt to the iteration training outcomes, avoiding explicitly curricular learning patterns common in loss-weighing strategies. Extensive experimentation validates the effectiveness of GBW across architectures (convolutional and transformer), UDA strategies (adversarial, self-training and entropy minimization), tasks (semantic and panoptic segmentation), and datasets (GTA and Synthia). Analysing the source of advantage, GBW consistently increases the recall of low represented classes.

Title:

      TransferAttn: Transferable-guided Attention Is All You Need for Video Domain Adaptation

Authors: André Sacilotti, Samuel Felipe dos Santos, Nicu Sebe, Jurandy Almeida
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Unsupervised domain adaptation (UDA) in videos is a challenging task that remains not well explored compared to image-based UDA techniques. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video domain adaptation has still been little explored. Our key idea is to use the transformer layers as a feature encoder and incorporate spatial and temporal transferability relationships into the attention mechanism. A Transferable-guided Attention (TransferAttn) framework is then developed to exploit the capacity of the transformer to adapt cross-domain knowledge from different backbones. To improve the transferability of ViT, we introduce a novel and effective module named Domain Transferable-guided Attention Block~(DTAB). DTAB compels ViT to focus on the spatio-temporal transferability relationship among video frames by changing the self-attention mechanism to a transferability attention mechanism. Extensive experiments on UCF-HMDB, Kinetics-Gameplay, and Kinetics-NEC Drone datasets with different backbones, like ResNet101, I3D, and STAM, verify the effectiveness of TransferAttn compared with state-of-the-art approaches. Also, we demonstrate that DTAB yields performance gains when applied to other state-of-the-art transformer-based UDA methods from both video and image domains. The code will be made freely available.

Title:

      HyperLoader: Integrating Hypernetwork-Based LoRA and Adapter Layers into Multi-Task Transformers for Sequence Labelling

Authors: Jesus-German Ortiz-Barajas, Helena Gomez-Adorno, Thamar Solorio
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We present HyperLoader, a simple approach that combines different parameter-efficient fine-tuning methods in a multi-task setting. To achieve this goal, our model uses a hypernetwork to generate the weights of these modules based on the task, the transformer layer, and its position within this layer. Our method combines the benefits of multi-task learning by capturing the structure of all tasks while reducing the task interference problem by encapsulating the task-specific knowledge in the generated weights and the benefits of combining different parameter-efficient methods to outperform full-fine tuning. We provide empirical evidence that HyperLoader outperforms previous approaches in most datasets and obtains the best average performance across tasks in high-resource and low-resource scenarios.

Title:

      FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

Authors: Pratheba Selvaraju, Tianyu Ding, Tianyi Chen, Ilya Zharkov, Luming Liang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Diffusion transformers (DiT) have become the de facto choice for generating high-quality images and videos, largely due to their scalability, which enables the construction of larger models for enhanced performance. However, the increased size of these models leads to higher inference costs, making them less attractive for real-time applications. We present Fast-FORward CAching (FORA), a simple yet effective approach designed to accelerate DiT by exploiting the repetitive nature of the diffusion process. FORA implements a caching mechanism that stores and reuses intermediate outputs from the attention and MLP layers across denoising steps, thereby reducing computational overhead. This approach does not require model retraining and seamlessly integrates with existing transformer-based diffusion models. Experiments show that FORA can speed up diffusion transformers several times over while only minimally affecting performance metrics such as the IS Score and FID. By enabling faster processing with minimal trade-offs in quality, FORA represents a significant advancement in deploying diffusion transformers for real-time applications. Code will be made publicly available at: this https URL.

Title:

      Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting

Authors: Scott H. Hawley
Subjects: Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Recent years have witnessed significant progress in generative models for music, featuring diverse architectures that balance output quality, diversity, speed, and user control. This study explores a user-friendly graphical interface enabling the drawing of masked regions for inpainting by an Hourglass Diffusion Transformer (HDiT) model trained on MIDI piano roll images. To enhance note generation in specified areas, masked regions can be "repainted" with extra noise. The non-latent HDiTs linear scaling with pixel count allows efficient generation in pixel space, providing intuitive and interpretable controls such as masking throughout the network and removing the need to operate in compressed latent spaces such as those provided by pretrained autoencoders. We demonstrate that, in addition to inpainting of melodies, accompaniment, and continuations, the use of repainting can help increase note density yielding musical structures closely matching user specifications such as rising, falling, or diverging melody and/or accompaniment, even when these lie outside the typical training data distribution. We achieve performance on par with prior results while operating at longer context windows, with no autoencoder, and can enable complex geometries for inpainting masks, increasing the options for machine-assisted composers to control the generated music.

Title:

      KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

Authors: Jiayi Yuan, Hongyi Liu, Shaochen (Henry)Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Long context capability is a crucial competency for large language models (LLMs) as it mitigates the human struggle to digest long-form texts. This capability enables complex task-solving scenarios such as book summarization, code assistance, and many more tasks that are traditionally manpower-intensive. However, transformer-based LLMs face significant challenges with long context input due to the growing size of the KV cache and the intrinsic complexity of attending to extended inputs; where multiple schools of efficiency-driven approaches -- such as KV cache quantization, token dropping, prompt compression, linear-time sequence models, and hybrid architectures -- have been proposed to produce efficient yet long context-capable models. Despite these advancements, no existing work has comprehensively benchmarked these methods in a reasonably aligned environment. In this work, we fill this gap by providing a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks. Our work reveals numerous previously unknown phenomena and offers insights -- as well as a friendly workbench -- for the future development of long context-capable LLMs. The source code will be available at this https URL

Title:

      Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning

Authors: Yixiao Wang, Yifei Zhang, Mingxiao Huo, Ran Tian, Xiang Zhang, Yichen Xie, Chenfeng Xu, Pengliang Ji, Wei Zhan, Mingyu Ding, Masayoshi Tomizuka
Subjects: Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). By adopting Mixture of Experts (MoE) within a transformer-based diffusion policy, SDP selectively activates experts and skills, enabling efficient and task-specific learning without retraining the entire model. SDP not only reduces the burden of active parameters but also facilitates the seamless integration and reuse of experts across various tasks. Extensive experiments on diverse tasks in both simulations and real world show that SDP 1) excels in multitask scenarios with negligible increases in active parameters, 2) prevents forgetting in continual learning of new tasks, and 3) enables efficient task transfer, offering a promising solution for advanced robotic applications. Demos and codes can be found in this https URL.

Keyword: scene understanding

Title:

      ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding

Authors: Quang P.M. Pham, Khoi T.N. Nguyen, Lan C. Ngo, Truong Do, Truong Son Hy
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, multi-view 3D data. This work, to the best of our knowledge, is the first to implement an Equivariant Graph Neural Network in semantic scene graph generation from 3D point clouds for scene understanding. Our proposed method, ESGNN, outperforms existing state-of-the-art approaches, demonstrating a significant improvement in scene estimation with faster convergence. ESGNN demands low computational resources and is easy to implement from available frameworks, paving the way for real-time applications such as robotics and computer vision.

Title:

      PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction

Authors: Xuan Yu, Yili Liu, Chenrui Han, Sitong Mao, Shunbo Zhou, Rong Xiong, Yiyi Liao, Yue Wang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Panoptic reconstruction is a challenging task in 3D scene understanding. However, most existing methods heavily rely on pre-trained semantic segmentation models and known 3D object bounding boxes for 3D panoptic segmentation, which is not available for in-the-wild scenes. In this paper, we propose a novel zero-shot panoptic reconstruction method from RGB-D images of scenes. For zero-shot segmentation, we leverage open-vocabulary instance segmentation, but it has to face partial labeling and instance association challenges. We tackle both challenges by propagating partial labels with the aid of dense generalized features and building a 3D instance graph for associating 2D instance IDs. Specifically, we exploit partial labels to learn a classifier for generalized semantic features to provide complete labels for scenes with dense distilled features. Moreover, we formulate instance association as a 3D instance graph segmentation problem, allowing us to fully utilize the scene geometry prior and all 2D instance masks to infer global unique pseudo 3D instance ID. Our method outperforms state-of-the-art methods on the indoor dataset ScanNet V2 and the outdoor dataset KITTI-360, demonstrating the effectiveness of our graph segmentation method and reconstruction network.

Keyword: visual reasoning

Title:

      Visual Reasoning and Multi-Agent Approach in Multimodal Large Language Models (MLLMs): Solving TSP and mTSP Combinatorial Challenges

Authors: Mohammed Elhenawy, Ahmad Abutahoun, Taqwa I.Alhadidi, Ahmed Jaber, Huthaifa I. Ashqar, Shadi Jaradat, Ahmed Abdelhay, Sebastien Glaser, Andry Rakotonirainy
Subjects: Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Multimodal Large Language Models (MLLMs) harness comprehensive knowledge spanning text, images, and audio to adeptly tackle complex problems, including zero-shot in-context learning scenarios. This study explores the ability of MLLMs in visually solving the Traveling Salesman Problem (TSP) and Multiple Traveling Salesman Problem (mTSP) using images that portray point distributions on a two-dimensional plane. We introduce a novel approach employing multiple specialized agents within the MLLM framework, each dedicated to optimizing solutions for these combinatorial challenges. Our experimental investigation includes rigorous evaluations across zero-shot settings and introduces innovative multi-agent zero-shot in-context scenarios. The results demonstrated that both multi-agent models. Multi-Agent 1, which includes the Initializer, Critic, and Scorer agents, and Multi-Agent 2, which comprises only the Initializer and Critic agents; significantly improved solution quality for TSP and mTSP problems. Multi-Agent 1 excelled in environments requiring detailed route refinement and evaluation, providing a robust framework for sophisticated optimizations. In contrast, Multi-Agent 2, focusing on iterative refinements by the Initializer and Critic, proved effective for rapid decision-making scenarios. These experiments yield promising outcomes, showcasing the robust visual reasoning capabilities of MLLMs in addressing diverse combinatorial problems. The findings underscore the potential of MLLMs as powerful tools in computational optimization, offering insights that could inspire further advancements in this promising field. Project link: this https URL

Title:

      We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Authors: Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang
Subjects: Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at this https URL.

Jul 03 '24 02:07 DongZhouGu

arxiv-daily arxiv-daily copied to clipboard

New submissions for Tuesday, 2 July 2024 (showing 763 of 763 entries )

Keyword: human object interaction

Keyword: visual relation detection

Keyword: object detection

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Keyword: transformer

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Keyword: scene understanding

Title:

Title:

Keyword: visual reasoning

Title:

Title:

arxiv-daily
arxiv-daily copied to clipboard