arxiv-daily
arxiv-daily copied to clipboard
New submissions for Thursday, 19 September 2024 (showing 377 of 377 entries )
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
Title:
VALO: A Versatile Anytime Framework for LiDAR-based Object Detection Deep Neural Networks
- Authors: Ahmet Soyyigit, Shuochao Yao, Heechul Yun
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract This work addresses the challenge of adapting dynamic deadline requirements for LiDAR object detection deep neural networks (DNNs). The computing latency of object detection is critically important to ensure safe and efficient navigation. However, state-of-the-art LiDAR object detection DNNs often exhibit significant latency, hindering their real-time performance on resource-constrained edge platforms. Therefore, a tradeoff between detection accuracy and latency should be dynamically managed at runtime to achieve optimum results. In this paper, we introduce VALO (Versatile Anytime algorithm for LiDAR Object detection), a novel data-centric approach that enables anytime computing of 3D LiDAR object detection DNNs. VALO employs a deadline-aware scheduler to selectively process input regions, making execution time and accuracy tradeoffs without architectural modifications. Additionally, it leverages efficient forecasting of past detection results to mitigate possible loss of accuracy due to partial processing of input. Finally, it utilizes a novel input reduction technique within its detection heads to significantly accelerate execution without sacrificing accuracy. We implement VALO on state-of-the-art 3D LiDAR object detection networks, namely CenterPoint and VoxelNext, and demonstrate its dynamic adaptability to a wide range of time constraints while achieving higher accuracy than the prior state-of-the-art. Code is available athttps://github.com/CSL-KU/VALO}{this http URL.
Title:
Open-Set Semantic Uncertainty Aware Metric-Semantic Graph Matching
- Authors: Kurran Singh, John J. Leonard
- Subjects: Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Underwater object-level mapping requires incorporating visual foundation models to handle the uncommon and often previously unseen object classes encountered in marine scenarios. In this work, a metric of semantic uncertainty for open-set object detections produced by visual foundation models is calculated and then incorporated into an object-level uncertainty tracking framework. Object-level uncertainties and geometric relationships between objects are used to enable robust object-level loop closure detection for unknown object classes. The above loop closure detection problem is formulated as a graph-matching problem. While graph matching, in general, is NP-Complete, a solver for an equivalent formulation of the proposed graph matching problem as a graph editing problem is tested on multiple challenging underwater scenes. Results for this solver as well as three other solvers demonstrate that the proposed methods are feasible for real-time use in marine environments for the robust, open-set, multi-object, semantic-uncertainty-aware loop closure detection. Further experimental results on the KITTI dataset demonstrate that the method generalizes to large-scale terrestrial scenes.
Title:
RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework
- Authors: Xiaoyu Li, Peidong Li, Lijun Zhao, Dedong Liu, Jinghan Gao, Xian Wu, Yitao Wu, Dixiao Cui
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract 3D Multi-Object Tracking (MOT) obtains significant performance improvements with the rapid advancements in 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers results in detector-specific models, limiting their versatility. Moreover, current generic trackers overlook the unique features of multi-camera detectors, i.e., the unreliability of motion observations and the feasibility of visual information. To address these challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors. Following the Tracking-By-Detection framework, RockTrack is compatible with various off-the-shelf detectors. RockTrack incorporates a confidence-guided preprocessing module to extract reliable motion and image observations from distinct representation spaces from a single detector. These observations are then fused in an association module that leverages geometric and appearance cues to minimize mismatches. The resulting matches are propagated through a staged estimation process, forming the basis for heuristic noise modeling. Additionally, we introduce a novel appearance similarity metric for explicitly characterizing object affinities in multi-camera settings. RockTrack achieves state-of-the-art performance on the nuScenes vision-only tracking leaderboard with 59.1% AMOTA while demonstrating impressive computational efficiency.
Title:
Agglomerative Token Clustering
- Authors: Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e. without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
Title:
Applications of Knowledge Distillation in Remote Sensing: A Survey
- Authors: Yassine Himeur, Nour Aburaed, Omar Elharrouss, Iraklis Varlamis, Shadi Atalla, Wathiq Mansoor, Hussain Al Ahmad
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract With the ever-growing complexity of models in the field of remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling the transfer of knowledge from large, complex models to smaller, more efficient ones without significant loss in performance. This review article provides an extensive examination of KD and its innovative applications in RS. KD, a technique developed to transfer knowledge from a complex, often cumbersome model (teacher) to a more compact and efficient model (student), has seen significant evolution and application across various domains. Initially, we introduce the fundamental concepts and historical progression of KD methods. The advantages of employing KD are highlighted, particularly in terms of model compression, enhanced computational efficiency, and improved performance, which are pivotal for practical deployments in RS scenarios. The article provides a comprehensive taxonomy of KD techniques, where each category is critically analyzed to demonstrate the breadth and depth of the alternative options, and illustrates specific case studies that showcase the practical implementation of KD methods in RS tasks, such as instance segmentation and object detection. Further, the review discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions, providing a comprehensive overview for researchers and practitioners in the field of RS. Through this organization, the paper not only elucidates the current state of research in KD but also sets the stage for future research opportunities, thereby contributing significantly to both academic research and real-world applications.
Keyword: transformer
Title:
Continual Learning of Conjugated Visual Representations through Higher-order Motion Flows
- Authors: Simone Marullo, Matteo Tiezzi, Marco Gori, Stefano Melacci
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Learning with neural networks from a continuous stream of visual information presents several challenges due to the non-i.i.d. nature of the data. However, it also offers novel opportunities to develop representations that are consistent with the information flow. In this paper we investigate the case of unsupervised continual learning of pixel-wise features subject to multiple motion-induced constraints, therefore named motion-conjugated feature representations. Differently from existing approaches, motion is not a given signal (either ground-truth or estimated by external modules), but is the outcome of a progressive and autonomous learning process, occurring at various levels of the feature hierarchy. Multiple motion flows are estimated with neural networks and characterized by different levels of abstractions, spanning from traditional optical flow to other latent signals originating from higher-level features, hence called higher-order motions. Continuously learning to develop consistent multi-order flows and representations is prone to trivial solutions, which we counteract by introducing a self-supervised contrastive loss, spatially-aware and based on flow-induced similarity. We assess our model on photorealistic synthetic streams and real-world videos, comparing to pre-trained state-of-the art feature extractors (also based on Transformers) and to recent unsupervised learning models, significantly outperforming these alternatives.
Title:
Evaluation of pretrained language models on music understanding
- Authors: Yannis Vasilakis, Rachel Bittner, Johan Pauwels
- Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Music-text multimodal systems have enabled new approaches to Music Information Research (MIR) applications such as audio-to-text and text-to-audio retrieval, text-based song generation, and music captioning. Despite the reported success, little effort has been put into evaluating the musical knowledge of Large Language Models (LLM). In this paper, we demonstrate that LLMs suffer from 1) prompt sensitivity, 2) inability to model negation (e.g. 'rock song without guitar'), and 3) sensitivity towards the presence of specific words. We quantified these properties as a triplet-based accuracy, evaluating the ability to model the relative similarity of labels in a hierarchical ontology. We leveraged the Audioset ontology to generate triplets consisting of an anchor, a positive (relevant) label, and a negative (less relevant) label for the genre and instruments sub-tree. We evaluated the triplet-based musical knowledge for six general-purpose Transformer-based models. The triplets obtained through this methodology required filtering, as some were difficult to judge and therefore relatively uninformative for evaluation purposes. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.
Title:
Mamba Fusion: Learning Actions Through Questioning
- Authors: Zhikang Dong, Apoorva Beedu, Jason Sheinkopf, Irfan Essa
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto in vision-language training, they face challenges like quadratic computational complexity, high GPU memory usage, and difficulty with long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture information about actions from multiple perspectives within the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide critical information about actions, objects, and environmental context, leading to enhanced performance. As a result, MambaVL achieves state-of-the-art performance in action recognition on the Epic-Kitchens-100 dataset and outperforms baseline methods in action anticipation.
Title:
VertiEncoder: Self-Supervised Kinodynamic Representation Learning on Vertically Challenging Terrain
- Authors: Mohammad Nazeri, Aniket Datar, Anuj Pokhrel, Chenhui Pan, Garrett Warnell, Xuesu Xiao
- Subjects: Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We present VertiEncoder, a self-supervised representation learning approach for robot mobility on vertically challenging terrain. Using the same pre-training process, VertiEncoder can handle four different downstream tasks, including forward kinodynamics learning, inverse kinodynamics learning, behavior cloning, and patch reconstruction with a single representation. VertiEncoder uses a TransformerEncoder to learn the local context of its surroundings by random masking and next patch reconstruction. We show that VertiEncoder achieves better performance across all four different tasks compared to specialized End-to-End models with 77% fewer parameters. We also show VertiEncoder's comparable performance against state-of-the-art kinodynamic modeling and planning approaches in real-world robot deployment. These results underscore the efficacy of VertiEncoder in mitigating overfitting and fostering more robust generalization across diverse environmental contexts and downstream vehicle kinodynamic tasks.
Title:
Learning-accelerated A* Search for Risk-aware Path Planning
- Authors: Jun Xiang, Junfei Xie, Jun Chen
- Subjects: Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Safety is a critical concern for urban flights of autonomous Unmanned Aerial Vehicles. In populated environments, risk should be accounted for to produce an effective and safe path, known as risk-aware path planning. Risk-aware path planning can be modeled as a Constrained Shortest Path (CSP) problem, aiming to identify the shortest possible route that adheres to specified safety thresholds. CSP is NP-hard and poses significant computational challenges. Although many traditional methods can solve it accurately, all of them are very slow. Our method introduces an additional safety dimension to the traditional A* (called ASD A*), enabling A* to handle CSP. Furthermore, we develop a custom learning-based heuristic using transformer-based neural networks, which significantly reduces the computational load and improves the performance of the ASD A* algorithm. The proposed method is well-validated with both random and realistic simulation scenarios.
Title:
Bridging Domain Gap for Flight-Ready Spaceborne Vision
- Authors: Tae Ha Park, Simone D'Amico
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract This work presents Spacecraft Pose Network v3 (SPNv3), a Neural Network (NN) for monocular pose estimation of a known, non-cooperative target spacecraft. As opposed to existing literature, SPNv3 is designed and trained to be computationally efficient while providing robustness to spaceborne images that have not been observed during offline training and validation on the ground. These characteristics are essential to deploying NNs on space-grade edge devices. They are achieved through careful NN design choices, and an extensive trade-off analysis reveals features such as data augmentation, transfer learning and vision transformer architecture as a few of those that contribute to simultaneously maximizing robustness and minimizing computational overhead. Experiments demonstrate that the final SPNv3 can achieve state-of-the-art pose accuracy on hardware-in-the-loop images from a robotic testbed while having trained exclusively on computer-generated synthetic images, effectively bridging the domain gap between synthetic and real imagery. At the same time, SPNv3 runs well above the update frequency of modern satellite navigation filters when tested on a representative graphical processing unit system with flight heritage. Overall, SPNv3 is an efficient, flight-ready NN model readily applicable to a wide range of close-range rendezvous and proximity operations with target resident space objects. The code implementation of SPNv3 will be made publicly available.
Title:
Recurrent Interpolants for Probabilistic Time Series Prediction
- Authors: Yu Chen, Marin Biloš, Sarthak Mittal, Wei Deng, Kashif Rasul, Anderson Schneider
- Subjects: Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Sequential models such as recurrent neural networks or transformer-based models became \textit{de facto} tools for multivariate time series forecasting in a probabilistic fashion, with applications to a wide range of datasets, such as finance, biology, medicine, etc. Despite their adeptness in capturing dependencies, assessing prediction uncertainty, and efficiency in training, challenges emerge in modeling high-dimensional complex distributions and cross-feature dependencies. To tackle these issues, recent works delve into generative modeling by employing diffusion or flow-based models. Notably, the integration of stochastic differential equations or probability flow successfully extends these methods to probabilistic time series imputation and forecasting. However, scalability issues necessitate a computational-friendly framework for large-scale generative model-based predictions. This work proposes a novel approach by blending the computational efficiency of recurrent neural networks with the high-quality probabilistic modeling of the diffusion model, which addresses challenges and advances generative models' application in time series forecasting. Our method relies on the foundation of stochastic interpolants and the extension to a broader conditional generation framework with additional control features, offering insights for future developments in this dynamic field.
Title:
RoboMorph: In-Context Meta-Learning for Robot Dynamics Modeling
- Authors: Manuel Bianchi Bazzi, Asad Ali Shahid, Christopher Agia, John Alora, Marco Forgione, Dario Piga, Francesco Braghin, Marco Pavone, Loris Roveda
- Subjects: Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract The landscape of Deep Learning has experienced a major shift with the pervasive adoption of Transformer-based architectures, particularly in Natural Language Processing (NLP). Novel avenues for physical applications, such as solving Partial Differential Equations and Image Vision, have been explored. However, in challenging domains like robotics, where high non-linearity poses significant challenges, Transformer-based applications are scarce. While Transformers have been used to provide robots with knowledge about high-level tasks, few efforts have been made to perform system identification. This paper proposes a novel methodology to learn a meta-dynamical model of a high-dimensional physical system, such as the Franka robotic arm, using a Transformer-based architecture without prior knowledge of the system's physical parameters. The objective is to predict quantities of interest (end-effector pose and joint positions) given the torque signals for each joint. This prediction can be useful as a component for Deep Model Predictive Control frameworks in robotics. The meta-model establishes the correlation between torques and positions and predicts the output for the complete trajectory. This work provides empirical evidence of the efficacy of the in-context learning paradigm, suggesting future improvements in learning the dynamics of robotic systems without explicit knowledge of physical parameters. Code, videos, and supplementary materials can be found at project website. See this https URL
Title:
DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech
- Authors: Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li
- Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.
Title:
DocMamba: Efficient Document Pre-training with State Space Model
- Authors: Pengfei Hu, Zhenrong Zhang, Jiefeng Ma, Shuhang Liu, Jun Du, Jianshu Zhang
- Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mechanism's quadratic computational complexity hinders their efficiency and ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SORIE, while significantly improving speed and reducing memory usage. Notably, experiments on the HRDoc confirm DocMamba's potential for length extrapolation. The code will be available online.
Title:
Fusion in Context: A Multimodal Approach to Affective State Recognition
- Authors: Youssef Mohamed, Severin Lemaignan, Arzu Guneysu, Patric Jensfelt, Christian Smith
- Subjects: Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Accurate recognition of human emotions is a crucial challenge in affective computing and human-robot interaction (HRI). Emotional states play a vital role in shaping behaviors, decisions, and social interactions. However, emotional expressions can be influenced by contextual factors, leading to misinterpretations if context is not considered. Multimodal fusion, combining modalities like facial expressions, speech, and physiological signals, has shown promise in improving affect recognition. This paper proposes a transformer-based multimodal fusion approach that leverages facial thermal data, facial action units, and textual context information for context-aware emotion recognition. We explore modality-specific encoders to learn tailored representations, which are then fused using additive fusion and processed by a shared transformer encoder to capture temporal dependencies and interactions. The proposed method is evaluated on a dataset collected from participants engaged in a tangible tabletop Pacman game designed to induce various affective states. Our results demonstrate the effectiveness of incorporating contextual information and multimodal fusion for affective state recognition.
Title:
Haptic-ACT: Bridging Human Intuition with Compliant Robotic Manipulation via Immersive VR
- Authors: Kelin Li, Shubham M Wagh, Nitish Sharma, Saksham Bhadani, Wei Chen, Chang Liu, Petar Kormushev
- Subjects: Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Robotic manipulation is essential for the widespread adoption of robots in industrial and home settings and has long been a focus within the robotics community. Advances in artificial intelligence have introduced promising learning-based methods to address this challenge, with imitation learning emerging as particularly effective. However, efficiently acquiring high-quality demonstrations remains a challenge. In this work, we introduce an immersive VR-based teleoperation setup designed to collect demonstrations from a remote human user. We also propose an imitation learning framework called Haptic Action Chunking with Transformers (Haptic-ACT). To evaluate the platform, we conducted a pick-and-place task and collected 50 demonstration episodes. Results indicate that the immersive VR platform significantly reduces demonstrator fingertip forces compared to systems without haptic feedback, enabling more delicate manipulation. Additionally, evaluations of the Haptic-ACT framework in both the MuJoCo simulator and on a real robot demonstrate its effectiveness in teaching robots more compliant manipulation compared to the original ACT. Additional materials are available at this https URL.
Title:
Reinforcement Learning as an Improvement Heuristic for Real-World Production Scheduling
- Authors: Arthur Müller, Lukas Vollenkemper
- Subjects: Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract The integration of Reinforcement Learning (RL) with heuristic methods is an emerging trend for solving optimization problems, which leverages RL's ability to learn from the data generated during the search process. One promising approach is to train an RL agent as an improvement heuristic, starting with a suboptimal solution that is iteratively improved by applying small changes. We apply this approach to a real-world multiobjective production scheduling problem. Our approach utilizes a network architecture that includes Transformer encoding to learn the relationships between jobs. Afterwards, a probability matrix is generated from which pairs of jobs are sampled and then swapped to improve the solution. We benchmarked our approach against other heuristics using real data from our industry partner, demonstrating its superior performance.
Title:
Tracking Any Point with Frame-Event Fusion Network at High Frame Rate
- Authors: Jiaxiong Liu, Bo Wang, Zhen Tan, Jinpu Zhang, Hui Shen, Dewen Hu
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Tracking any point based on image frames is constrained by frame rates, leading to instability in high-speed scenarios and limited generalization in real-world applications. To overcome these limitations, we propose an image-event fusion point tracker, FE-TAP, which combines the contextual information from image frames with the high temporal resolution of events, achieving high frame rate and robust point tracking under various challenging conditions. Specifically, we designed an Evolution Fusion module (EvoFusion) to model the image generation process guided by events. This module can effectively integrate valuable information from both modalities operating at different frequencies. To achieve smoother point trajectories, we employed a transformer-based refinement strategy that updates the point's trajectories and features iteratively. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, particularly improving expected feature age by 24$%$ on EDS datasets. Finally, we qualitatively validated the robustness of our algorithm in real driving scenarios using our custom-designed high-resolution image-event synchronization device. Our source code will be released at this https URL.
Title:
Data Efficient Acoustic Scene Classification using Teacher-Informed Confusing Class Instruction
- Authors: Jin Jie Sean Yeo, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan
- Subjects: Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract In this technical report, we describe the SNTL-NTU team's submission for Task 1 Data-Efficient Low-Complexity Acoustic Scene Classification of the detection and classification of acoustic scenes and events (DCASE) 2024 challenge. Three systems are introduced to tackle training splits of different sizes. For small training splits, we explored reducing the complexity of the provided baseline model by reducing the number of base channels. We introduce data augmentation in the form of mixup to increase the diversity of training samples. For the larger training splits, we use FocusNet to provide confusing class information to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and baseline models trained on the original sampling rate of 44.1 kHz. We use Knowledge Distillation to distill the ensemble model to the baseline student model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile development dataset yielded the highest average testing accuracy of (62.21, 59.82, 56.81, 53.03, 47.97)% on split (100, 50, 25, 10, 5)% respectively over the three systems.
Title:
Towards Global Localization using Multi-Modal Object-Instance Re-Identification
- Authors: Aneesh Chavan, Vaibhav Agrawal, Vineeth Bhat, Sarthak Chittawar, Siddharth Srivastava, Chetan Arora, K Madhava Krishna
- Subjects: Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D datasets. Our approach demonstrates significant improvements in both object instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM-RGBD), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available.
Title:
Panoptic-Depth Forecasting
- Authors: Juana Valeria Hurtado, Riya Mohan, Abhinav Valada
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting the panoptic segmentation and depth maps of unobserved future frames, from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of forecasts in a coherent manner. Furthermore, we present two baselines and propose the novel PDcast architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of PDcast across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at this https URL.
Title:
On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery
- Authors: BW Sheffield, Jeffrey Ellen, Ben Whitmore
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor due to the complex and varied underwater environments. Historically, experts have manually interpreted SSS images, relying on conventional machine learning techniques with hand-crafted features. While Convolutional Neural Networks (CNNs) significantly advanced automated classification in this domain, they often fall short when dealing with diverse seafloor textures, such as rocky or ripple sand bottoms, where false positive rates may increase. Recently, Vision Transformers (ViTs) have shown potential in addressing these limitations by utilizing a self-attention mechanism to capture global information in image patches, offering more flexibility in processing spatial hierarchies. This paper rigorously compares the performance of ViT models alongside commonly used CNN architectures, such as ResNet and ConvNext, for binary classification tasks in SSS imagery. The dataset encompasses diverse geographical seafloor types and is balanced between the presence and absence of man-made objects. ViT-based models exhibit superior classification performance across f1-score, precision, recall, and accuracy metrics, although at the cost of greater computational resources. CNNs, with their inductive biases, demonstrate better computational efficiency, making them suitable for deployment in resource-constrained environments like underwater vehicles. Future research directions include exploring self-supervised learning for ViTs and multi-modal fusion to further enhance performance in challenging underwater environments.
Title:
PhysMamba: Efficient Remote Physiological Measurement with SlowFast Temporal Difference Mamba
- Authors: Chaoqi Luo, Yiping Xie, Zitong Yu
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Facial-video based Remote photoplethysmography (rPPG) aims at measuring physiological signals and monitoring heart activity without any contact, showing significant potential in various applications. Previous deep learning based rPPG measurement are primarily based on CNNs and Transformers. However, the limited receptive fields of CNNs restrict their ability to capture long-range spatio-temporal dependencies, while Transformers also struggle with modeling long video sequences with high complexity. Recently, the state space models (SSMs) represented by Mamba are known for their impressive performance on capturing long-range dependencies from long sequences. In this paper, we propose the PhysMamba, a Mamba-based framework, to efficiently represent long-range physiological dependencies from facial videos. Specifically, we introduce the Temporal Difference Mamba block to first enhance local dynamic differences and further model the long-range spatio-temporal context. Moreover, a dual-stream SlowFast architecture is utilized to fuse the multi-scale temporal features. Extensive experiments are conducted on three benchmark datasets to demonstrate the superiority and efficiency of PhysMamba. The codes are available at this https URL
Title:
Skill matching at scale: freelancer-project alignment for efficient multilingual candidate retrieval
- Authors: Warren Jouanneau, Marc Palyart, Emma Jouffroy
- Subjects: Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Finding the perfect match between a job proposal and a set of freelancers is not an easy task to perform at scale, especially in multiple languages. In this paper, we propose a novel neural retriever architecture that tackles this problem in a multilingual setting. Our method encodes project descriptions and freelancer profiles by leveraging pre-trained multilingual language models. The latter are used as backbone for a custom transformer architecture that aims to keep the structure of the profiles and project. This model is trained with a contrastive loss on historical data. Thanks to several experiments, we show that this approach effectively captures skill matching similarity and facilitates efficient matching, outperforming traditional methods.
Title:
SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with Mamba
- Authors: Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Chengfeng Zhou, Zhengjie Zhang, Xiaobo Li, Dahong Qian
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure initially designed for the treatment of early gastric cancer but is now widely used for various gastrointestinal lesions. Computer-assisted Surgery systems have played a crucial role in improving the precision and safety of ESD procedures, however, their effectiveness is limited by the accurate recognition of surgical phases. The intricate nature of ESD, with different lesion characteristics and tissue structures, presents challenges for real-time surgical phase recognition algorithms. Existing surgical phase recognition algorithms struggle to efficiently capture temporal contexts in video-based scenarios, leading to insufficient performance. To address these issues, we propose SPRMamba, a novel Mamba-based framework for ESD surgical phase recognition. SPRMamba leverages the strengths of Mamba for long-term temporal modeling while introducing the Scaled Residual TranMamba block to enhance the capture of fine-grained details, overcoming the limitations of traditional temporal models like Temporal Convolutional Networks and Transformers. Moreover, a Temporal Sample Strategy is introduced to accelerate the processing, which is essential for real-time phase recognition in clinical settings. Extensive testing on the ESD385 dataset and the cholecystectomy Cholec80 dataset demonstrates that SPRMamba surpasses existing state-of-the-art methods and exhibits greater robustness across various surgical phase recognition tasks.
Title:
JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation
- Authors: Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Dimitris Samaras
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.
Title:
LogoRA: Local-Global Representation Alignment for Robust Time Series Classification
- Authors: Huanyu Zhang, Yi-Fan Zhang, Zhang Zhang, Qingsong Wen, Liang Wang
- Subjects: Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Unsupervised domain adaptation (UDA) of time series aims to teach models to identify consistent patterns across various temporal scenarios, disregarding domain-specific differences, which can maintain their predictive accuracy and effectively adapt to new domains. However, existing UDA methods struggle to adequately extract and align both global and local features in time series data. To address this issue, we propose the Local-Global Representation Alignment framework (LogoRA), which employs a two-branch encoder, comprising a multi-scale convolutional branch and a patching transformer branch. The encoder enables the extraction of both local and global representations from time series. A fusion module is then introduced to integrate these representations, enhancing domain-invariant feature alignment from multi-scale perspectives. To achieve effective alignment, LogoRA employs strategies like invariant feature learning on the source domain, utilizing triplet loss for fine alignment and dynamic time warping-based feature alignment. Additionally, it reduces source-target domain gaps through adversarial training and per-class prototype alignment. Our evaluations on four time-series datasets demonstrate that LogoRA outperforms strong baselines by up to $12.52%$, showcasing its superiority in time series UDA tasks.
Title:
Expanding Expressivity in Transformer Models with M\"obiusAttention
- Authors: Anna-Maria Halacheva, Mojtaba Nayyeri, Steffen Staab
- Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Attention mechanisms and Transformer architectures have revolutionized Natural Language Processing (NLP) by enabling exceptional modeling of long-range dependencies and capturing intricate linguistic patterns. However, their inherent reliance on linear operations in the form of matrix multiplications limits their ability to fully capture inter-token relationships on their own. We propose MöbiusAttention, a novel approach that integrates Möbius transformations within the attention mechanism of Transformer-based models. Möbius transformations are non-linear operations in spaces over complex numbers with the ability to map between various geometries. By incorporating these properties, MöbiusAttention empowers models to learn more intricate geometric relationships between tokens and capture a wider range of information through complex-valued weight vectors. We build and pre-train a BERT and a RoFormer version enhanced with MöbiusAttention, which we then finetune on the GLUE benchmark. We evaluate empirically our approach against the baseline BERT and RoFormer models on a range of downstream tasks. Our approach compares favorably against the baseline models, even with smaller number of parameters suggesting the enhanced expressivity of MöbiusAttention. This research paves the way for exploring the potential of Möbius transformations in the complex projective space to enhance the expressivity and performance of foundation models.
Title:
LifeGPT: Topology-Agnostic Generative Pretrained Transformer Model for Cellular Automata
- Authors: Jaime A. Berkovich, Markus J. Buehler
- Subjects: Subjects: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Statistical Mechanics (cond-mat.stat-mech); Dynamical Systems (math.DS)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract The Game of Life (Life), a well known algorithm within the broader class of cellular automata (CA), exhibits complex emergent dynamics, with extreme sensitivity to initial conditions. Modeling and predicting such intricate behavior without explicit knowledge of the system's underlying topology presents a significant challenge, motivating the development of algorithms that can generalize across various grid configurations and boundary conditions. We develop a decoder-only generative pretrained transformer model to solve this problem, showing that our model can simulate Life on a toroidal grid with no prior knowledge on the size of the grid, or its periodic boundary conditions (LifeGPT). LifeGPT is topology-agnostic with respect to its training data and our results show that a GPT model is capable of capturing the deterministic rules of a Turing-complete system with near-perfect accuracy, given sufficiently diverse training data. We also introduce the idea of an `autoregressive autoregressor' to recursively implement Life using LifeGPT. Our results pave the path towards true universal computation within a large language model (LLM) framework, synthesizing of mathematical analysis with natural language processing, and probing AI systems for situational awareness about the evolution of such algorithms without ever having to compute them. Similar GPTs could potentially solve inverse problems in multicellular self-assembly by extracting CA-compatible rulesets from real-world biological systems to create new predictive models, which would have significant consequences for the fields of bioinspired materials, tissue engineering, and architected materials design.
Title:
Massively Multi-Person 3D Human Motion Forecasting with Scene Context
- Authors: Felix B Mueller, Julian Tanke, Juergen Gall
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information on the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion motion. Unlike previous models, our approach can model interactions between both widely varying numbers of people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study. Code is available at this https URL.
Title:
DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control
- Authors: Zichen Jeff Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, Lerrel Pinto
- Subjects: Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives, and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure its impact on downstream policy performance. Robot videos are best viewed at this https URL
Keyword: scene understanding
Title:
DAF-Net: A Dual-Branch Feature Decomposition Fusion Network with Domain Adaptive for Infrared and Visible Image Fusion
- Authors: Jian Xu, Xin He
- Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Infrared and visible image fusion aims to combine complementary information from both modalities to provide a more comprehensive scene understanding. However, due to the significant differences between the two modalities, preserving key features during the fusion process remains a challenge. To address this issue, we propose a dual-branch feature decomposition fusion network (DAF-Net) with domain adaptive, which introduces Multi-Kernel Maximum Mean Discrepancy (MK-MMD) into the base encoder and designs a hybrid kernel function suitable for infrared and visible image fusion. The base encoder built on the Restormer network captures global structural information while the detail encoder based on Invertible Neural Networks (INN) focuses on extracting detail texture information. By incorporating MK-MMD, the DAF-Net effectively aligns the latent feature spaces of visible and infrared images, thereby improving the quality of the fused images. Experimental results demonstrate that the proposed method outperforms existing techniques across multiple datasets, significantly enhancing both visual quality and fusion performance. The related Python code is available at this https URL.
Title:
SpotLight: Robotic Scene Understanding through Interaction and Affordance Detection
- Authors: Tim Engelbracht, René Zurbrügg, Marc Pollefeys, Hermann Blum, Zuria Bauer
- Subjects: Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Despite increasing research efforts on household robotics, robots intended for deployment in domestic settings still struggle with more complex tasks such as interacting with functional elements like drawers or light switches, largely due to limited task-specific understanding and interaction capabilities. These tasks require not only detection and pose estimation but also an understanding of the affordances these elements provide. To address these challenges and enhance robotic scene understanding, we introduce SpotLight: A comprehensive framework for robotic interaction with functional elements, specifically light switches. Furthermore, this framework enables robots to improve their environmental understanding through interaction. Leveraging VLM-based affordance prediction to estimate motion primitives for light switch interaction, we achieve up to 84% operation success in real world experiments. We further introduce a specialized dataset containing 715 images as well as a custom detection model for light switch detection. We demonstrate how the framework can facilitate robot learning through physical interaction by having the robot explore the environment and discover previously unknown relationships in a scene graph representation. Lastly, we propose an extension to the framework to accommodate other functional interactions such as swing doors, showcasing its flexibility. Videos and Code: this http URL
Title:
Towards Global Localization using Multi-Modal Object-Instance Re-Identification
- Authors: Aneesh Chavan, Vaibhav Agrawal, Vineeth Bhat, Sarthak Chittawar, Siddharth Srivastava, Chetan Arora, K Madhava Krishna
- Subjects: Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/
- Pdf link: https://arxiv.org/pdf/
- Abstract Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D datasets. Our approach demonstrates significant improvements in both object instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM-RGBD), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available.
Keyword: visual reasoning
There is no result