Showing new listings for Wednesday, 11 December 2024

Open DongZhouGu opened this issue 2 months ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Title:

      LUIEO: A Lightweight Model for Integrating Underwater Image Enhancement and Object Detection
  • Authors: Bin Li, Li Li, Zhenwei Zhang, Yuping Duan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Underwater optical images inevitably suffer from various degradation factors such as blurring, low contrast, and color distortion, which hinder the accuracy of object detection tasks. Due to the lack of paired underwater/clean images, most research methods adopt a strategy of first enhancing and then detecting, resulting in a lack of feature communication between the two learning tasks. On the other hand, due to the contradiction between the diverse degradation factors of underwater images and the limited number of samples, existing underwater enhancement methods are difficult to effectively enhance degraded images of unknown water bodies, thereby limiting the improvement of object detection accuracy. Therefore, most underwater target detection results are still displayed on degraded images, making it difficult to visually judge the correctness of the detection results. To address the above issues, this paper proposes a multi-task learning method that simultaneously enhances underwater images and improves detection accuracy. Compared with single-task learning, the integrated model allows for the dynamic adjustment of information communication and sharing between different tasks. Due to the fact that real underwater images can only provide annotated object labels, this paper introduces physical constraints to ensure that object detection tasks do not interfere with image enhancement tasks. Therefore, this article introduces a physical module to decompose underwater images into clean images, background light, and transmission images and uses a physical model to calculate underwater images for self-supervision. Numerical experiments demonstrate that the proposed model achieves satisfactory results in visual performance, object detection accuracy, and detection efficiency compared to state-of-the-art comparative methods.
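The self-supervision step described in the abstract above (decompose into a clean image, background light, and a transmission map, then recompose the underwater image) can be sketched as follows. The abstract does not state the physical model, so this sketch assumes the widely used simplified image formation model I = J * t + B * (1 - t); the function names and the toy values are hypothetical and are not taken from the paper.

```python
def synthesize_underwater(clean, transmission, background_light):
    """Recompose observed pixels as I = J * t + B * (1 - t), per pixel.

    Assumed model: the abstract only says "a physical model" is used.
    """
    return [j * t + background_light * (1.0 - t)
            for j, t in zip(clean, transmission)]


def self_supervised_loss(observed, clean, transmission, background_light):
    """Mean squared error between the observed and the recomposed image."""
    recomposed = synthesize_underwater(clean, transmission, background_light)
    return sum((o - r) ** 2 for o, r in zip(observed, recomposed)) / len(observed)


# Toy single-channel "image" of three pixels (values in [0, 1]).
clean = [0.2, 0.8, 0.5]
transmission = [1.0, 0.0, 0.5]  # 1.0 = no scattering, 0.0 = only background light
background_light = 0.4

observed = synthesize_underwater(clean, transmission, background_light)
loss = self_supervised_loss(observed, clean, transmission, background_light)
```

A perfect decomposition recomposes the observed image exactly, so the loss is zero; no clean ground-truth image is needed to compute it, which is the point of the self-supervision.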

Title:

      Benchmarking Vision-Based Object Tracking for USVs in Complex Maritime Environments
  • Authors: Muhayy Ud Din, Ahsan B. Bakht, Waseem Akram, Yihao Dong, Lakmal Seneviratne, Irfan Hussain
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Vision-based target tracking is crucial for unmanned surface vehicles (USVs) to perform tasks such as inspection, monitoring, and surveillance. However, real-time tracking in complex maritime environments is challenging due to dynamic camera movement, low visibility, and scale variation. Object detection methods combined with filtering techniques are commonly used for tracking, but they often lack robustness, particularly in the presence of camera motion and missed detections. Although advanced tracking methods have been proposed recently, their application in maritime scenarios is limited. To address this gap, this study proposes a vision-guided object-tracking framework for USVs, integrating state-of-the-art tracking algorithms with low-level control systems to enable precise tracking in dynamic maritime environments. We benchmarked the performance of seven distinct trackers, developed using advanced deep learning techniques such as Siamese Networks and Transformers, by evaluating them on both simulated and real-world maritime datasets. In addition, we evaluated the robustness of various control algorithms in conjunction with these tracking systems. The proposed framework was validated through simulations and real-world sea experiments, demonstrating its effectiveness in handling dynamic maritime conditions. The results show that SeqTrack, a Transformer-based tracker, performed best in adverse conditions, such as dust storms. Among the control algorithms evaluated, the linear quadratic regulator (LQR) controller demonstrated the most robust and smooth control, allowing for stable tracking of the USV.

Title:

      DSFEC: Efficient and Deployable Deep Radar Object Detection
  • Authors: Gayathri Dandugula, Santhosh Boddana, Sudesh Mirashi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Deploying radar object detection models on resource-constrained edge devices like the Raspberry Pi poses significant challenges due to the large size of the model and the limited computational power and memory of the Pi. In this work, we explore the efficiency of Depthwise Separable Convolutions in radar object detection networks and integrate them into our model. Additionally, we introduce a novel Feature Enhancement and Compression (FEC) module to the PointPillars feature encoder to further improve the model performance. With these innovations, we propose the DSFEC-L model and its two versions, which outperform the baseline (23.9 mAP of Car class, 20.72 GFLOPs) on the nuScenes dataset: (1) an efficient DSFEC-M model with a 14.6% performance improvement and a 60% reduction in GFLOPs; and (2) a deployable DSFEC-S model with a 3.76% performance improvement and a remarkable 78.5% reduction in GFLOPs. Despite marginal performance gains, our deployable model achieves an impressive 74.5% reduction in runtime on the Raspberry Pi compared to the baseline.
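As a rough illustration of why depthwise separable convolutions reduce cost, the sketch below compares weight counts for a standard convolution and its depthwise separable counterpart (biases ignored), and applies the GFLOPs reductions quoted in the abstract to the stated 20.72 GFLOPs baseline. The layer sizes (3x3 kernel, 64 input and 128 output channels) are hypothetical and are not taken from the paper; the resulting GFLOPs figures are simple arithmetic on the abstract's percentages, not reported values.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise (per-channel) conv plus a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def reduced_gflops(baseline, reduction_percent):
    """GFLOPs remaining after a given percentage reduction."""
    return baseline * (1.0 - reduction_percent / 100.0)


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels.
std = standard_conv_params(3, 64, 128)        # 73728 weights
dws = depthwise_separable_params(3, 64, 128)  # 8768 weights
ratio = dws / std                             # equals 1/c_out + 1/kernel**2

# Percentages quoted in the abstract, applied to the 20.72 GFLOPs baseline.
dsfec_m_gflops = reduced_gflops(20.72, 60.0)  # about 8.29
dsfec_s_gflops = reduced_gflops(20.72, 78.5)  # about 4.45
```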

Title:

      Enhancing 3D Object Detection in Autonomous Vehicles Based on Synthetic Virtual Environment Analysis
  • Authors: Vladislav Li, Ilias Siniosoglou, Thomai Karamitsou, Anastasios Lytos, Ioannis D. Moscholios, Sotirios K. Goudos, Jyoti S. Banerjee, Panagiotis Sarigiannidi, Vasileios Argyriou
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Autonomous Vehicles (AVs) use natural images and videos as input to understand the real world by overlaying and inferring digital elements, facilitating proactive detection in an effort to assure safety. A crucial aspect of this process is real-time, accurate object recognition through automatic scene analysis. While traditional methods primarily concentrate on 2D object detection, exploring 3D object detection, which involves projecting 3D bounding boxes into the three-dimensional environment, holds significance and can be notably enhanced using the AR ecosystem. This study examines an AI model's ability to deduce 3D bounding boxes in the context of real-time scene analysis while producing and evaluating the model's performance and processing time in the virtual domain, which is then applied to AVs. This work also employs a synthetic dataset that includes artificially generated images mimicking various environmental, lighting, and spatiotemporal states. This evaluation is oriented toward handling images featuring objects in diverse weather conditions, captured with varying camera settings. These variations pose more challenging detection and recognition scenarios, and the outcomes of this work can help achieve competitive results under most of the tested conditions.

Title:

      Making the Flow Glow -- Robot Perception under Severe Lighting Conditions using Normalizing Flow Gradients
  • Authors: Simon Kristoffersson Lind, Rudolph Triebel, Volker Krüger
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Modern robotic perception is highly dependent on neural networks. It is well known that neural network-based perception can be unreliable in real-world deployment, especially in difficult imaging conditions. Out-of-distribution detection is commonly proposed as a solution for ensuring reliability in real-world deployment. Previous work has shown that normalizing flow models can be used for out-of-distribution detection to improve reliability of robotic perception tasks. Specifically, camera parameters can be optimized with respect to the likelihood output from a normalizing flow, which allows a perception system to adapt to difficult vision scenarios. With this work we propose to use the absolute gradient values from a normalizing flow, which allows the perception system to optimize local regions rather than the whole image. By setting up a table top picking experiment with exceptionally difficult lighting conditions, we show that our method achieves a 60% higher success rate for an object detection task compared to previous methods.

Title:

      Multimodal Contextualized Support for Enhancing Video Retrieval System
  • Authors: Quoc-Bao Nguyen-Le, Thanh-Huy Le-Nguyen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Current video retrieval systems, especially those used in competitions, primarily focus on querying individual keyframes or images rather than encoding an entire clip or video segment. However, queries often describe an action or event over a series of frames, not a specific image. This results in insufficient information when analyzing a single frame, leading to less accurate query results. Moreover, extracting embeddings solely from images (keyframes) does not provide enough information for models to encode higher-level, more abstract insights inferred from the video. These models tend to only describe the objects present in the frame, lacking a deeper understanding. In this work, we propose a system that integrates the latest methodologies, introducing a novel pipeline that extracts multimodal data and incorporates information from multiple frames within a video, enabling the model to abstract higher-level information that captures latent meanings, focusing on what can be inferred from the video clip rather than just on object detection in a single image.

Keyword: transformer

Title:

      Feature Group Tabular Transformer: A Novel Approach to Traffic Crash Modeling and Causality Analysis
  • Authors: Oscar Lares, Hao Zhen, Jidong J. Yang
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Reliable and interpretable traffic crash modeling is essential for understanding causality and improving road safety. This study introduces a novel approach to predicting collision types by utilizing a comprehensive dataset fused from multiple sources, including weather data, crash reports, high-resolution traffic information, pavement geometry, and facility characteristics. Central to our approach is the development of a Feature Group Tabular Transformer (FGTT) model, which organizes disparate data into meaningful feature groups, represented as tokens. These group-based tokens serve as rich semantic components, enabling effective identification of collision patterns and interpretation of causal mechanisms. The FGTT model is benchmarked against widely used tree ensemble models, including Random Forest, XGBoost, and CatBoost, demonstrating superior predictive performance. Furthermore, model interpretation reveals key influential factors, providing fresh insights into the underlying causality of distinct crash types.

Title:

      Innovative Sentiment Analysis and Prediction of Stock Price Using FinBERT, GPT-4 and Logistic Regression: A Data-Driven Approach
  • Authors: Olamilekan Shobayo, Sidikat Adeyemi-Longe, Olusogo Popoola, Bayode Ogunleye
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST); Applications (stat.AP); Computation (stat.CO)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract This study explores the comparative performance of cutting-edge AI models, i.e., Finance Bidirectional Encoder Representations from Transformers (FinBERT), Generative Pre-trained Transformer (GPT-4), and Logistic Regression, for sentiment analysis and stock index prediction using financial news and the NGX All-Share Index data label. By leveraging advanced natural language processing models like GPT-4 and FinBERT, alongside a traditional machine learning model, Logistic Regression, we aim to classify market sentiment, generate sentiment scores, and predict market price movements. This research highlights global AI advancements in stock markets, showcasing how state-of-the-art language models can contribute to understanding complex financial data. The models were assessed using metrics such as accuracy, precision, recall, F1 score, and ROC AUC. Results indicate that Logistic Regression outperformed the more computationally intensive FinBERT and predefined approach of versatile GPT-4, with an accuracy of 81.83% and a ROC AUC of 89.76%. The GPT-4 predefined approach exhibited a lower accuracy of 54.19% but demonstrated strong potential in handling complex data. FinBERT, while offering more sophisticated analysis, was resource-demanding and yielded a moderate performance. Hyperparameter optimization using Optuna and cross-validation techniques ensured the robustness of the models. This study highlights the strengths and limitations of the practical applications of AI approaches in stock market prediction and presents Logistic Regression as the most efficient model for this task, with FinBERT and GPT-4 representing emerging tools with potential for future exploration and innovation in AI-driven financial analytics.

Title:

      GL-Fusion: Rethinking the Combination of Graph Neural Network and Large Language model
  • Authors: Haotong Yang, Xiyuan Wang, Qian Tao, Shuxian Hu, Zhouchen Lin, Muhan Zhang
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Recent research on integrating Large Language Models (LLMs) with Graph Neural Networks (GNNs) typically follows two approaches: LLM-centered models, which convert graph data into tokens for LLM processing, and GNN-centered models, which use LLMs to encode text features into node and edge representations for GNN input. LLM-centered models often struggle to capture graph structures effectively, while GNN-centered models compress variable-length textual data into fixed-size vectors, limiting their ability to understand complex semantics. Additionally, GNN-centered approaches require converting tasks into a uniform, manually-designed format, restricting them to classification tasks and preventing language output. To address these limitations, we introduce a new architecture that deeply integrates GNN with LLM, featuring three key innovations: (1) Structure-Aware Transformers, which incorporate GNN's message-passing capabilities directly into LLM's transformer layers, allowing simultaneous processing of textual and structural information and generating outputs from both GNN and LLM; (2) Graph-Text Cross-Attention, which processes full, uncompressed text from graph nodes and edges, ensuring complete semantic integration; and (3) GNN-LLM Twin Predictor, enabling LLM's flexible autoregressive generation alongside GNN's scalable one-pass prediction. GL-Fusion achieves outstanding performance on various tasks. Notably, it achieves state-of-the-art performance on OGBN-Arxiv and OGBG-Code2.

Title:

      Efficient user history modeling with amortized inference for deep learning recommendation models
  • Authors: Lars Hertel, Neil Daftary, Fedor Borisyuk, Aman Gupta, Rahul Mazumder
  • Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract We study user history modeling via Transformer encoders in deep learning recommendation models (DLRM). Such architectures can significantly improve recommendation quality, but usually incur high latency costs, necessitating infrastructure upgrades or very small Transformer models. An important part of user history modeling is early fusion of the candidate item, and various methods have been studied. We revisit early fusion and compare concatenation of the candidate to each history item against appending it to the end of the list as a separate item. Using the latter method allows us to reformulate the recently proposed amortized history inference algorithm M-FALCON \cite{zhai2024actions} for the case of DLRM models. We show via experimental results that appending with cross-attention performs on par with concatenation and that amortization significantly reduces inference costs. We conclude with results from deploying this model on the LinkedIn Feed and Ads surfaces, where amortization reduces latency by 30% compared to non-amortized inference.

Title:

      SphereUFormer: A U-Shaped Transformer for Spherical 360 Perception
  • Authors: Yaniv Benny, Lior Wolf
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract This paper proposes a novel method for omnidirectional 360° perception. Most common previous methods relied on equirectangular projection. This representation is easily applicable to 2D operation layers but introduces distortions into the image. Other methods attempted to remove the distortions by maintaining a sphere representation but relied on complicated convolution kernels that failed to show competitive results. In this work, we introduce a transformer-based architecture that, by incorporating a novel "Spherical Local Self-Attention" and other spherically-oriented modules, successfully operates in the spherical domain and outperforms the state-of-the-art in 360° perception benchmarks for depth estimation and semantic segmentation.

Title:

      Sequential Compression Layers for Efficient Federated Learning in Foundational Models
  • Authors: Navyansh Mahla, Sunny Gupta, Amit Sethi
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Federated Learning (FL) has gained popularity for fine-tuning large language models (LLMs) across multiple nodes, each with its own private data. While LoRA has been widely adopted for parameter-efficient federated fine-tuning, recent theoretical and empirical studies highlight its suboptimal performance in the federated learning context. In response, we propose a novel, simple, and more effective parameter-efficient fine-tuning method that does not rely on LoRA. Our approach introduces a small multi-layer perceptron (MLP) layer between two existing MLP layers, the up proj (the FFN projection layer following the self-attention module) and the down proj, within the feed-forward network of the transformer block. This solution addresses the bottlenecks associated with LoRA in federated fine-tuning and outperforms recent LoRA-based approaches, demonstrating superior performance for both language models and vision encoders.

Title:

      Static Key Attention in Vision
  • Authors: Zizhao Hu, Xiaolin Zhou, Mohammad Rostami
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract The success of vision transformers is widely attributed to the expressive power of their dynamically parameterized multi-head self-attention mechanism. We examine the impact of substituting the dynamic parameterized key with a static key within the standard attention mechanism in Vision Transformers. Our findings reveal that static key attention mechanisms can match or even exceed the performance of standard self-attention. Integrating static key attention modules into a Metaformer backbone, we find that it serves as a better intermediate stage in hierarchical hybrid architectures, balancing the strengths of depth-wise convolution and self-attention. Experiments on several vision tasks underscore the effectiveness of the static key mechanism, indicating that the typical two-step dynamic parameterization in attention can be streamlined to a single step without impacting performance under certain circumstances.
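A minimal sketch of the idea in the abstract above, under stated assumptions: in standard self-attention both queries and keys are computed from the input, whereas here the key matrix is a fixed parameter that does not depend on the input. The abstract does not give the exact formulation, so the shapes, the scaling by the square root of the key dimension, and all names (`static_key_attention`, `w_q`, `w_v`, `static_key`) are hypothetical illustrations rather than the paper's implementation.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def static_key_attention(x, w_q, w_v, static_key):
    """Attention whose keys are a fixed matrix instead of a projection of x.

    x: (n_tokens, d_model); w_q: (d_model, d_k); w_v: (d_model, d_v);
    static_key: (n_tokens, d_k), assumed to be a learned, input-independent
    parameter.
    """
    q = matmul(x, w_q)  # dynamic queries, computed from the input
    v = matmul(x, w_v)  # dynamic values, computed from the input
    d_k = len(static_key[0])
    scores = matmul(q, transpose(static_key))
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scores), v)


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
zero_key = [[0.0, 0.0], [0.0, 0.0]]  # all-zero keys give uniform attention
output = static_key_attention(tokens, identity, identity, zero_key)
```

With an all-zero key matrix every score is zero, so each token attends uniformly and the output is the mean of the value rows; a trained static key would instead encode fixed, input-independent attention targets.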

Title:

      Advancing clinical trial outcomes using deep learning and predictive modelling: bridging precision medicine and patient-centered care
  • Authors: Sydney Anuyah, Mallika K Singh, Hope Nyavor
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract The integration of artificial intelligence [AI] into clinical trials has revolutionized the process of drug development and personalized medicine. Among these advancements, deep learning and predictive modelling have emerged as transformative tools for optimizing clinical trial design, patient recruitment, and real-time monitoring. This study explores the application of deep learning techniques, such as convolutional neural networks [CNNs] and transformer-based models, to stratify patients, forecast adverse events, and personalize treatment plans. Furthermore, predictive modelling approaches, including survival analysis and time-series forecasting, are employed to predict trial outcomes, enhancing efficiency and reducing trial failure rates. To address challenges in analysing unstructured clinical data, such as patient notes and trial protocols, natural language processing [NLP] techniques are utilized for extracting actionable insights. A custom dataset comprising structured patient demographics, genomic data, and unstructured text is curated for training and validating these models. Key metrics, including precision, recall, and F1 scores, are used to evaluate model performance, while trade-offs between accuracy and computational efficiency are examined to identify the optimal model for clinical deployment. This research underscores the potential of AI-driven methods to streamline clinical trial workflows, improve patient-centric outcomes, and reduce costs associated with trial inefficiencies. The findings provide a robust framework for integrating predictive analytics into precision medicine, paving the way for more adaptive and efficient clinical trials. By bridging the gap between technological innovation and real-world applications, this study contributes to advancing the role of AI in healthcare, particularly in fostering personalized care and improving overall trial success rates.

Title:

      Enhancing radioisotope identification in gamma spectra with transfer learning
  • Authors: Peter Lalor
  • Subjects: Machine Learning (cs.LG); Nuclear Theory (nucl-th)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Machine learning methods in gamma spectroscopy have the potential to provide accurate, real-time classification of unknown radioactive samples. However, obtaining sufficient experimental training data is often prohibitively expensive and time-consuming, and models trained solely on synthetic data can struggle to generalize to the unpredictable range of real-world operating scenarios. In this work, we pretrain a model using physically derived synthetic data and subsequently leverage transfer learning techniques to fine-tune the model for a specific target domain. This paradigm enables us to embed physical principles during the pretraining step, thus requiring less data from the target domain compared to classical machine learning methods. Results of this analysis indicate that fine-tuned models significantly outperform those trained exclusively on synthetic data or solely on target-domain data, particularly in the intermediate data regime (${\approx} 10^4$ training samples). This conclusion is consistent across four different machine learning architectures (MLP, CNN, Transformer, and LSTM) considered in this study. This research serves as proof of concept for applying transfer learning techniques to application scenarios where access to experimental data is limited.

Title:

      Improving the Natural Language Inference robustness to hard dataset by data augmentation and preprocessing
  • Authors: Zijiang Yang
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Natural Language Inference (NLI) is the task of inferring whether the hypothesis can be justified by the given premise. Specifically, the hypothesis is classified into one of three labels (entailment, neutrality, and contradiction) given the premise. NLI has been well studied by previous researchers. A number of models, especially transformer-based ones, have achieved significant improvements on these tasks. However, it is reported that these models struggle with hard datasets. In particular, they perform much worse on unseen, out-of-distribution premises and hypotheses. They may not understand the semantic content and may instead learn spurious correlations. In this work, we propose data augmentation and preprocessing methods to address the word overlap, numerical reasoning, and length mismatch problems. These are general methods that do not rely on the distribution of the testing data, and they help improve the robustness of the models.

Title:

      Post-Training Statistical Calibration for Higher Activation Sparsity
  • Authors: Vui Seng Chua, Yujie Pan, Nilesh Jain
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5x additional LLM decoding speedup against CATS at iso model quality. SCAP effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability. The code is available at: this https URL.

Title:

      An Enhancement of CNN Algorithm for Rice Leaf Disease Image Classification in Mobile Applications
  • Authors: Kayne Uriel K. Rodrigo, Jerriane Hillary Heart S. Marcial, Samuel C. Brillo, Khatalyn E. Mata, Jonathan C. Morano
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract This study focuses on enhancing rice leaf disease image classification algorithms, which have traditionally relied on Convolutional Neural Network (CNN) models. We employed transfer learning with MobileViTV2_050 using ImageNet-1k weights, a lightweight model that integrates CNN's local feature extraction with Vision Transformers' global context learning through a separable self-attention mechanism. Our approach resulted in a significant 15.66% improvement in classification accuracy for MobileViTV2_050-A, our first enhanced model trained on the baseline dataset, achieving 93.14%. Furthermore, MobileViTV2_050-B, our second enhanced model trained on a broader rice leaf dataset, demonstrated a 22.12% improvement, reaching 99.6% test accuracy. Additionally, MobileViTV2-A attained an F1-score of 93% across four rice labels and a Receiver Operating Characteristic (ROC) curve ranging from 87% to 97%. In terms of resource consumption, our enhanced models reduced the total parameters of the baseline CNN model by up to 92.50%, from 14 million to 1.1 million. These results indicate that MobileViTV2_050 not only improves computational efficiency through its separable self-attention mechanism but also enhances global context learning. Consequently, it offers a lightweight and robust solution suitable for mobile deployment, advancing the interpretability and practicality of models in precision agriculture.

Title:

      A Review on the Applications of Transformer-based language models for Nucleotide Sequence Analysis
  • Authors: Nimisha Ghosh, Daniele Santoni, Indrajit Saha, Giovanni Felici
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract In recent times, Transformer-based language models have been making quite an impact in the field of natural language processing. As relevant parallels can be drawn between biological sequences and natural languages, the models used in NLP can be easily extended and adapted for various applications in bioinformatics. In this regard, this paper introduces the major developments of Transformer-based models in the recent past in the context of nucleotide sequences. We have reviewed and analysed a large number of application-based papers on this subject, giving evidence of the main characterizing features and of the different approaches that may be adopted to customize such powerful computational machines. We have also provided a structured description of the functioning of Transformers that may enable even first-time users to grasp the essence of such complex architectures. We believe this review will help the scientific community in understanding the various applications of Transformer-based language models to nucleotide sequences. This work will motivate readers to build on these methodologies to tackle various other problems in the field of bioinformatics.

Title:

      Comateformer: Combined Attention Transformer for Semantic Sentence Matching
  • Authors: Bo Li, Di Liang, Zixin Zhang
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Transformer-based models have made significant strides in semantic matching tasks by capturing connections between phrase pairs. However, to assess the relevance of sentence pairs, it is insufficient to just examine the general similarity between the sentences. It is crucial to also consider the tiny subtleties that differentiate them from each other. Regrettably, attention softmax operations in transformers tend to miss these subtle differences. To this end, in this work, we propose a novel semantic sentence matching model named Combined Attention Network based on Transformer model (Comateformer). In the Comateformer model, we design a novel transformer-based quasi-attention mechanism with compositional properties. Unlike traditional attention mechanisms that merely adjust the weights of input tokens, our proposed method learns how to combine, subtract, or resize specific vectors when building a representation. Moreover, our proposed approach builds on the intuition of similarity and dissimilarity (negative affinity) when calculating dual affinity scores. This allows for a more meaningful representation of relationships between sentences. To evaluate the performance of our proposed model, we conducted extensive experiments on ten public real-world datasets and robustness testing. Experimental results show that our method achieves consistent improvements.

Title:

      EchoIR: Advancing Image Restoration with Echo Upsampling and Bi-Level Optimization
  • Authors: Yuhan He, Yuchun He
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Image restoration represents a fundamental challenge in low-level vision, focusing on reconstructing high-quality images from their degraded counterparts. With the rapid advancement of deep learning technologies, transformer-based methods with pyramid structures have advanced the field by capturing long-range cross-scale spatial interaction. Despite their popularity, these methods suffer from the degradation of essential features during the upsampling process, which notably compromises restoration performance and results in suboptimal reconstruction outcomes. We introduce EchoIR, a UNet-like image restoration network with a bilateral learnable upsampling mechanism, to bridge this gap. Specifically, we propose the Echo-Upsampler, which optimizes the upsampling process by learning from the bilateral intermediate features of U-Net, the "Echo", aiming for a more refined restoration by minimizing the degradation during upsampling. In pursuit of a hierarchical model of image restoration and upsampling tasks, we propose the Approximated Sequential Bi-level Optimization (AS-BLO), an advanced bi-level optimization model establishing a relationship between upsampling learning and image restoration tasks. Extensive experiments against the state-of-the-art (SOTA) methods demonstrate that the proposed EchoIR surpasses the existing methods, achieving SOTA performance in image restoration tasks.

Title:

      ArtFormer: Controllable Generation of Diverse 3D Articulated Objects
  • Authors: Jiayi Su, Youhe Feng, Zheng Li, Jinhua Song, Yangfan He, Botao Ren, Botian Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract This paper presents a novel framework for modeling and conditional generation of 3D articulated objects. Troubled by flexibility-quality tradeoffs, existing methods are often limited to using predefined structures or retrieving shapes from static datasets. To address these challenges, we parameterize an articulated object as a tree of tokens and employ a transformer to generate both the object's high-level geometry code and its kinematic relations. Subsequently, each sub-part's geometry is further decoded using a signed-distance-function (SDF) shape prior, facilitating the synthesis of high-quality 3D shapes. Our approach enables the generation of diverse objects with high-quality geometry and a varying number of parts. Comprehensive experiments on conditional generation from text descriptions demonstrate the effectiveness and flexibility of our method.

Title:

      Reducing Traffic Wastage in Video Streaming via Bandwidth-Efficient Bitrate Adaptation
  • Authors: Hairong Su, Shibo Wang, Shusen Yang, Tianchi Huang, Xuebin Ren
  • Subjects: Networking and Internet Architecture (cs.NI); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Bitrate adaptation (also known as ABR) is a crucial technique to improve the quality of experience (QoE) for video streaming applications. However, existing ABR algorithms suffer from severe traffic wastage, which refers to the traffic cost of downloading the video segments that users do not finally consume, for example, due to early departure or video skipping. In this paper, we carefully formulate the dynamics of buffered data volume (BDV), a strongly correlated indicator of traffic wastage, which, to the best of our knowledge, is the first rigorous clarification of the effect of downloading plans on potential wastage. To reduce wastage while keeping a high QoE, we present a bandwidth-efficient bitrate adaptation algorithm (named BE-ABR), achieving consistently low BDV without distinct QoE losses. Specifically, we design a precise, time-aware transmission delay prediction model over the Transformer architecture, and develop a fine-grained buffer control scheme. Through extensive experiments conducted on emulated and real network environments including WiFi, 4G, and 5G, we demonstrate that BE-ABR performs well in both QoE and bandwidth savings, enabling a 60.87% wastage reduction and a comparable, or even better, QoE, compared to the state-of-the-art methods.

Title:

      HARP: Hesitation-Aware Reframing in Transformer Inference Pass
  • Authors: Romain Storaï, Seung-won Hwang
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract This paper aims to improve the performance of large language models by addressing the variable computational demands in inference steps, where some tokens require more computational resources than others. We present HARP, a simple modification to the "off-the-shelf" Transformer forward pass. Drawing from hesitation and the framing effect in decision-making, HARP selectively applies additional computation when the model encounters uncertainty during token generation. Our method mimics human cognitive processes by pausing at difficult decision points and reframing inputs for a different perspective. Unlike other approaches, HARP is model-agnostic, training-free, and easy to implement. We thoroughly evaluate our method across various downstream tasks and model sizes, demonstrating performance improvements of up to +5.16%. Notably, HARP achieves these gains while maintaining inference times twice as fast as beam search. Simple and yet with significant gains, HARP offers a practical solution for enhancing the performance of Transformer-based language models with minimal computational impact.
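
A rough sketch of the uncertainty-gated idea in the HARP abstract above: extra computation is spent only when the next-token distribution looks uncertain. The Shannon-entropy measure, the `threshold` value, the hypothetical `reframe` callback, and the simple averaging of the two distributions are all illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gated_step(
    probs: Sequence[float],
    reframe: Callable[[], Sequence[float]],
    threshold: float = 1.0,
) -> List[float]:
    """Return `probs` when confident, otherwise blend in a second pass.

    `reframe` is a hypothetical callback that reruns the model on a
    perturbed ("reframed") input and returns a distribution over the
    same vocabulary. Averaging the two is an assumption of this sketch.
    """
    if entropy(probs) <= threshold:
        return list(probs)  # confident: keep the single forward pass
    second = reframe()  # uncertain: spend extra computation
    return [(a + b) / 2.0 for a, b in zip(probs, second)]
```

With this gating, the second pass is paid for only at uncertain steps, which is one plausible reading of how such a scheme could stay cheaper than running beam search everywhere.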

Title:

      CoMA: Compositional Human Motion Generation with Multi-modal Agents
  • Authors: Shanlin Sun, Gabriel De Araujo, Jiaqi Xu, Shenghan Zhou, Hanwen Zhang, Ziheng Huang, Chenyu You, Xiaohui Xie
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract 3D human motion generation has seen substantial advancement in recent years. While state-of-the-art approaches have improved performance significantly, they still struggle with complex and detailed motions unseen in training data, largely due to the scarcity of motion datasets and the prohibitive cost of generating new training examples. To address these challenges, we introduce CoMA, an agent-based solution for complex human motion generation, editing, and comprehension. CoMA leverages multiple collaborative agents powered by large language and vision models, alongside a mask transformer-based motion generator featuring body part-specific encoders and codebooks for fine-grained control. Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction for improved quality. Evaluations on the HumanML3D dataset demonstrate competitive performance against state-of-the-art methods. Additionally, we create a set of context-rich, compositional, and long text prompts, where user studies show our method significantly outperforms existing approaches.

Title:

      Towards Predictive Communication with Brain-Computer Interfaces integrating Large Language Models
  • Authors: Andrea Caria
  • Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract This perspective article aims at providing an outline of the state of the art and future developments towards the integration of cutting-edge predictive language models with BCI. A synthetic overview of early and more recent linguistic models, from natural language processing (NLP) models to recent LLM, that to a varying extent improved predictive writing systems, is first provided. Second, a summary of previous BCI implementations integrating language models is presented. The few preliminary studies investigating the possible combination of LLM with BCI spellers to efficiently support fast communication and control are then described. Finally, current challenges and limitations towards the full integration of LLM with BCI systems are discussed. Recent investigations suggest that the combination of LLM with BCI might drastically improve human-computer interaction in patients with motor or language disorders as well as in healthy individuals. In particular, the pretrained autoregressive transformer models, such as GPT, that capitalize on parallelization, learning through pre-training and fine-tuning, promise a substantial improvement of BCI for communication with respect to previous systems incorporating simpler language models. Indeed, among various models, GPT-2 was shown to represent an excellent candidate for integration into BCI, although testing was only performed on simulated conversations and not on real BCI scenarios. Prospectively, the full integration of LLM with advanced BCI systems might lead to a big leap forward towards fast, efficient and user-adaptive neurotechnology.

Title:

      Benchmarking Vision-Based Object Tracking for USVs in Complex Maritime Environments
  • Authors: Muhayy Ud Din, Ahsan B. Bakht, Waseem Akram, Yihao Dong, Lakmal Seneviratne, Irfan Hussain
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Vision-based target tracking is crucial for unmanned surface vehicles (USVs) to perform tasks such as inspection, monitoring, and surveillance. However, real-time tracking in complex maritime environments is challenging due to dynamic camera movement, low visibility, and scale variation. Object detection methods combined with filtering techniques are commonly used for tracking, but they often lack robustness, particularly in the presence of camera motion and missed detections. Although advanced tracking methods have been proposed recently, their application in maritime scenarios is limited. To address this gap, this study proposes a vision-guided object-tracking framework for USVs, integrating state-of-the-art tracking algorithms with low-level control systems to enable precise tracking in dynamic maritime environments. We benchmarked the performance of seven distinct trackers, developed using advanced deep learning techniques such as Siamese Networks and Transformers, by evaluating them on both simulated and real-world maritime datasets. In addition, we evaluated the robustness of various control algorithms in conjunction with these tracking systems. The proposed framework was validated through simulations and real-world sea experiments, demonstrating its effectiveness in handling dynamic maritime conditions. The results show that SeqTrack, a Transformer-based tracker, performed best in adverse conditions, such as dust storms. Among the control algorithms evaluated, the linear quadratic regulator controller (LQR) demonstrated the most robust and smooth control, allowing for stable tracking of the USV.

Title:

      RLT4Rec: Reinforcement Learning Transformer for User Cold Start and Item Recommendation
  • Authors: Dilina Chandika Rajapakse, Douglas Leith
  • Subjects: Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract We introduce a new sequential transformer reinforcement learning architecture RLT4Rec and demonstrate that it achieves excellent performance in a range of item recommendation tasks. RLT4Rec uses a relatively simple transformer architecture that takes as input the user's (item,rating) history and outputs the next item to present to the user. Unlike existing RL approaches, there is no need to input a state observation or estimate. RLT4Rec handles new users and established users within the same consistent framework and automatically balances the "exploration" needed to discover the preferences of a new user with the "exploitation" that is more appropriate for established users. Training of RLT4Rec is robust and fast and is insensitive to the choice of training data, learning to generate "good" personalised sequences that the user tends to rate highly even when trained on "bad" data.

Title:

      Causal World Representation in the GPT Model
  • Authors: Raanan Y. Rohekar, Yaniv Gurwicz, Sungduk Yu, Vasudev Lal
  • Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Are generative pre-trained transformer (GPT) models only trained to predict the next token, or do they implicitly learn a world model from which a sequence is generated one token at a time? We examine this question by deriving a causal interpretation of the attention mechanism in GPT, and suggesting a causal world model that arises from this interpretation. Furthermore, we propose that GPT-models, at inference time, can be utilized for zero-shot causal structure learning for in-distribution sequences. Empirical evaluation is conducted in a controlled synthetic environment using the setup and rules of the Othello board game. A GPT, pre-trained on real-world games played with the intention of winning, is tested on synthetic data that only adheres to the game rules. We find that the GPT model tends to generate next moves that adhere to the game rules for sequences for which the attention mechanism encodes a causal structure with high confidence. In general, in cases for which the GPT model generates moves that do not adhere to the game rules, it also fails to capture any causal structure.

Title:

      Manta: Enhancing Mamba for Few-Shot Action Recognition of Long Sub-Sequence
  • Authors: Wenbo Huang, Jinghui Zhang, Guang Li, Lei Zhang, Shuoyuan Wang, Fang Dong, Jiahui Jin, Takahiro Ogawa, Miki Haseyama
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract In few-shot action recognition (FSAR), long sub-sequences of video naturally express entire actions more effectively. However, the computational complexity of mainstream Transformer-based methods limits their application. Recent Mamba demonstrates efficiency in modeling long sequences, but directly applying Mamba to FSAR overlooks the importance of local feature modeling and alignment. Moreover, long sub-sequences within the same class accumulate intra-class variance, which adversely impacts FSAR performance. To solve these challenges, we propose a Matryoshka MAmba and CoNtrasTive LeArning framework (Manta). Firstly, the Matryoshka Mamba introduces multiple Inner Modules to enhance local feature representation, rather than directly modeling global features. An Outer Module captures dependencies of timeline between these local features for implicit temporal alignment. Secondly, a hybrid contrastive learning paradigm, combining both supervised and unsupervised methods, is designed to mitigate the negative effects of intra-class variance accumulation. The Matryoshka Mamba and the hybrid contrastive learning paradigm operate in parallel branches within Manta, enhancing Mamba for FSAR of long sub-sequence. Manta achieves new state-of-the-art performance on prominent benchmarks, including SSv2, Kinetics, UCF101, and HMDB51. Extensive empirical studies prove that Manta significantly improves FSAR of long sub-sequence from multiple perspectives. The code is released at this https URL.

Title:

      ConfigX: Modular Configuration for Evolutionary Algorithms via Multitask Reinforcement Learning
  • Authors: Hongshu Guo, Zeyuan Ma, Jiacheng Chen, Yining Ma, Zhiguang Cao, Xinglin Zhang, Yue-Jiao Gong
  • Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Recent advances in Meta-learning for Black-Box Optimization (MetaBBO) have shown the potential of using neural networks to dynamically configure evolutionary algorithms (EAs), enhancing their performance and adaptability across various BBO instances. However, they are often tailored to a specific EA, which limits their generalizability and necessitates retraining or redesigns for different EAs and optimization problems. To address this limitation, we introduce ConfigX, a new paradigm of the MetaBBO framework that is capable of learning a universal configuration agent (model) for boosting diverse EAs. To achieve this, ConfigX first leverages a novel modularization system that enables the flexible combination of various optimization sub-modules to generate diverse EAs during training. Additionally, we propose a Transformer-based neural network to meta-learn a universal configuration policy through multitask reinforcement learning across a designed joint optimization task space. Extensive experiments verify that ConfigX, after large-scale pre-training, achieves robust zero-shot generalization to unseen tasks and outperforms state-of-the-art baselines. Moreover, ConfigX exhibits strong lifelong learning capabilities, allowing efficient adaptation to new tasks through fine-tuning. Our proposed ConfigX represents a significant step toward an automatic, all-purpose configuration agent for EAs.

Title:

      Anomaly detection using Diffusion-based methods
  • Authors: Aryan Bhosale, Samrat Mukherjee, Biplab Banerjee, Fabio Cuzzolin
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract This paper explores the utility of diffusion-based models for anomaly detection, focusing on their efficacy in identifying deviations in both compact and high-resolution datasets. Diffusion-based architectures, including Denoising Diffusion Probabilistic Models (DDPMs) and Diffusion Transformers (DiTs), are evaluated for their performance using reconstruction objectives. By leveraging the strengths of these models, this study benchmarks their performance against traditional anomaly detection methods such as Isolation Forests, One-Class SVMs, and COPOD. The results demonstrate the superior adaptability, scalability, and robustness of diffusion-based methods in handling complex real-world anomaly detection tasks. Key findings highlight the role of reconstruction error in enhancing detection accuracy and underscore the scalability of these models to high-dimensional datasets. Future directions include optimizing encoder-decoder architectures and exploring multi-modal datasets to further advance diffusion-based anomaly detection.
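
As a minimal illustration of the reconstruction-error scoring mentioned in the abstract above: each sample is scored by how poorly a model reconstructs it, and high scores are flagged. The `reconstruct` callable is a hypothetical stand-in for a trained diffusion model's noise-then-denoise pass, and the mean-squared-error score and fixed `threshold` are assumptions of this sketch rather than the paper's setup.

```python
from typing import Callable, List, Sequence


def reconstruction_scores(
    samples: Sequence[Sequence[float]],
    reconstruct: Callable[[Sequence[float]], Sequence[float]],
) -> List[float]:
    """Mean squared error between each sample and its reconstruction."""
    scores = []
    for x in samples:
        x_hat = reconstruct(x)  # hypothetical model call
        scores.append(sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x))
    return scores


def flag_anomalies(scores: Sequence[float], threshold: float) -> List[bool]:
    """Flag samples whose reconstruction error exceeds the threshold."""
    return [s > threshold for s in scores]
```

In practice the threshold would typically be chosen from scores on held-out normal data, for example a high quantile, rather than fixed by hand.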

Title:

      Scaling Sequential Recommendation Models with Transformers
  • Authors: Pablo Zivic, Hernan Vazquez, Jorge Sanchez
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Modeling user preferences has been mainly addressed by looking at users' interaction history with the different elements available in the system. Tailoring content to individual preferences based on historical data is the main goal of sequential recommendation. The nature of the problem, as well as the good performance observed across various domains, has motivated the use of the transformer architecture, which has proven effective in leveraging increasingly larger amounts of training data when accompanied by an increase in the number of model parameters. This scaling behavior has brought a great deal of attention, as it provides valuable guidance in the design and training of even larger models. Taking inspiration from the scaling laws observed in training large language models, we explore similar principles for sequential recommendation. We use the full Amazon Product Data dataset, which has only been partially explored in other studies, and reveal scaling behaviors similar to those found in language models. Compute-optimal training is possible but requires a careful analysis of the compute-performance trade-offs specific to the application. We also show that performance scaling translates to downstream tasks by fine-tuning larger pre-trained models on smaller task-specific domains. Our approach and findings provide a strategic roadmap for model training and deployment in real high-dimensional preference spaces, facilitating better training and inference efficiency. We hope this paper bridges the gap between the potential of transformers and the intrinsic complexities of high-dimensional sequential recommendation in real-world recommender systems. Code and models can be found at this https URL

Title:

      ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer
  • Authors: Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract The recent surge of interest in comprehensive multimodal models has necessitated the unification of diverse modalities. However, the unification suffers from disparate methodologies. Continuous visual generation necessitates the full-sequence diffusion-based approach, despite its divergence from the autoregressive modeling in the text domain. We posit that autoregressive modeling, i.e., predicting the future based on past deterministic experience, remains crucial in developing both a visual generation model and a potential unified multimodal model. In this paper, we explore an interpolation between the autoregressive modeling and full-parameters diffusion to model visual information. At its core, we present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer, where the block size of diffusion, i.e., the size of autoregressive units, can be flexibly adjusted to interpolate between token-wise autoregression and full-sequence diffusion. ACDiT is easy to implement, as simple as creating a Skip-Causal Attention Mask (SCAM) during training. During inference, the process iterates between diffusion denoising and autoregressive decoding that can make full use of KV-Cache. We verify the effectiveness of ACDiT on image and video generation tasks. We also demonstrate that, benefiting from autoregressive modeling, ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective. The analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT to be used in long-horizon visual generation tasks. These strengths make it promising as the backbone of future unified models.

Title:

      STIV: Scalable Text and Image Conditioned Video Generation
  • Authors: Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, Cha Chen, Yiran Fei, Yifan Jiang, Lezhi Li, Yizhou Sun, Kai-Wei Chang, Yinfei Yang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method, named STIV. Our framework integrates image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning via a joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view generation, and long video generation. With comprehensive ablation studies on T2I, T2V, and TI2V, STIV demonstrates strong performance, despite its simple design. An 8.7B model with 512 resolution achieves 83.1 on VBench T2V, surpassing both leading open and closed-source models like CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on VBench I2V task at 512 resolution. By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.

Title:

      FlashRNN: Optimizing Traditional RNNs on Modern Hardware
  • Authors: Korbinian Pöppel, Maximilian Beck, Sepp Hochreiter
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract While Transformers and other sequence-parallelizable neural network architectures seem like the current state of the art in sequence modeling, they specifically lack state-tracking capabilities. These are important for time-series tasks and logical reasoning. Traditional RNNs like LSTMs and GRUs, as well as modern variants like sLSTM, do have these capabilities at the cost of strictly sequential processing. While this is often seen as a strong limitation, we show how fast these networks can get with our hardware-optimization FlashRNN in Triton and CUDA, optimizing kernels to the register level on modern GPUs. We extend traditional RNNs with a parallelization variant that processes multiple RNNs of smaller hidden state in parallel, similar to the head-wise processing in Transformers. To enable flexibility on different GPU variants, we introduce a new optimization framework for hardware-internal cache sizes, memory and compute handling. It models the hardware in a setting using polyhedral-like constraints, including the notion of divisibility. This speeds up the solution process in our ConstrINT library for general integer constraint satisfaction problems (integer CSPs). We show that our kernels can achieve 50x speed-ups over a vanilla PyTorch implementation and allow 40x larger hidden sizes compared to our Triton implementation. Our open-source kernels and the optimization library are released here to boost research in the direction of state-tracking enabled RNNs and sequence modeling: this https URL

Title:

      From Slow Bidirectional to Fast Causal Video Generators
  • Authors: Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, Xun Huang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address this limitation by adapting a pretrained bidirectional diffusion transformer to a causal transformer that generates frames on-the-fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling a 50-step diffusion model into a 4-step generator. To enable stable and high-quality distillation, we introduce a student initialization scheme based on the teacher's ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. This approach effectively mitigates error accumulation in autoregressive generation, allowing long-duration video synthesis despite training on short clips. Our model supports fast streaming generation of high-quality videos at 9.4 FPS on a single GPU thanks to KV caching. Our approach also enables streaming video-to-video translation, image-to-video, and dynamic prompting in a zero-shot manner. We will release the code based on an open-source model in the future.

Title:

      Video Motion Transfer with Diffusion Transformers
  • Authors: Alexander Pondaven, Aliaksandr Siarohin, Sergey Tulyakov, Philip Torr, Fabio Pizzati
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We guide the latent denoising process in an optimization-based, training-free manner by optimizing latents with our AMF loss to generate videos reproducing the motion of the reference one. We also apply our optimization strategy to transformer positional embeddings, granting us a boost in zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods, outperforming all across multiple metrics and human evaluation.

Keyword: scene understanding

There is no result

Keyword: visual reasoning

Title:

      MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models
  • Authors: Sayak Chakrabarty, Souradip Pal
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/
  • Pdf link: https://arxiv.org/pdf/
  • Abstract This paper introduces Multiple Choice Reasoning via. Process of Elimination using Multi-Modal models, herein referred to as Multi-Modal Process of Elimination (MM-PoE). This novel methodology is engineered to augment the efficacy of Vision-Language Models (VLMs) in multiple-choice visual reasoning tasks. Diverging from conventional approaches that evaluate each option independently, MM-PoE employs a dual-step scoring paradigm that initially identifies and excludes implausible choices, subsequently concentrating on the most probable remaining options. This method emulates human test-taking strategies, where individuals typically eliminate clearly incorrect answers prior to selecting the optimal response. Our empirical evaluations, conducted across three benchmark datasets, reveal that MM-PoE significantly improves both zero-shot and few-shot performance of contemporary state-of-the-art VLMs. Critically, this approach not only broadens the application of the elimination process to multi-modal contexts but also allows few-shot experiments, thereby addressing two principal limitations concerning usage of PoE only in zero-shot settings and only with a language-only framework. As a result, MM-PoE not only refines the reasoning capabilities of VLMs but also broadens their applicability to complex visual question-answering scenarios. All code and documentation supporting our work are available at this https URL, enabling researchers and practitioners to easily integrate and further develop these techniques.
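
The two-step scoring described in the abstract above can be sketched roughly as follows: score every option, drop the implausible ones, then rescore only the survivors and pick the best. The `score_fn` callable is a hypothetical stand-in for a vision-language model's option scorer, and the "keep options scoring at or above the mean" rule is an assumption of this sketch, not necessarily the paper's elimination criterion.

```python
from typing import Callable, List, Sequence


def eliminate(scores: Sequence[float]) -> List[int]:
    """Indices of options scoring at or above the mean (assumed rule)."""
    mean = sum(scores) / len(scores)
    return [i for i, s in enumerate(scores) if s >= mean]


def process_of_elimination(
    options: Sequence[str],
    score_fn: Callable[[str, Sequence[str]], float],
) -> str:
    """Two-step selection: eliminate weak options, then rescore the rest.

    `score_fn(option, candidates)` is a hypothetical scorer that may
    condition on the current candidate set.
    """
    first = [score_fn(o, options) for o in options]  # step 1: score all
    remaining = [options[i] for i in eliminate(first)]
    second = [score_fn(o, remaining) for o in remaining]  # step 2: rescore
    best = max(range(len(remaining)), key=second.__getitem__)
    return remaining[best]
```

Because the highest-scoring option always meets the mean, the elimination step never empties the candidate set.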

DongZhouGu, Dec 11 '24 02:12