arxiv-daily
arxiv-daily copied to clipboard
New submissions for Mon, 19 Feb 24
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
CodaMal: Contrastive Domain Adaptation for Malaria Detection in Low-Cost Microscopes
- Authors: Ishan Rajendrakumar Dave, Tristan de Blegiers, Chen Chen, Mubarak Shah
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2402.10478
- Pdf link: https://arxiv.org/pdf/2402.10478
- Abstract Malaria is a major health issue worldwide, and its diagnosis requires scalable solutions that can work effectively with low-cost microscopes (LCM). Deep learning-based methods have shown success in computer-aided diagnosis from microscopic images. However, these methods need annotated images that show cells affected by malaria parasites and their life stages. Annotating images from LCM significantly increases the burden on medical experts compared to annotating images from high-cost microscopes (HCM). For this reason, a practical solution would be trained on HCM images which should generalize well on LCM images during testing. While earlier methods adopted a multi-stage learning process, they did not offer an end-to-end approach. In this work, we present an end-to-end learning framework, named CodaMal (Contrastive Domain Adpation for Malaria). In order to bridge the gap between HCM (training) and LCM (testing), we propose a domain adaptive contrastive loss. It reduces the domain shift by promoting similarity between the representations of HCM and its corresponding LCM image, without imposing an additional annotation burden. In addition, the training objective includes object detection objectives with carefully designed augmentations, ensuring the accurate detection of malaria parasites. On the publicly available large-scale M5-dataset, our proposed method shows a significant improvement of 16% over the state-of-the-art methods in terms of the mean average precision metric (mAP), provides 21x speed up during inference, and requires only half learnable parameters than the prior methods. Our code is publicly available.
AutoGPT+P: Affordance-based Task Planning with Large Language Models
- Authors: Timo Birr, Christoph Pohl, Abdelrahman Younes, Tamim Asfour
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.10778
- Pdf link: https://arxiv.org/pdf/2402.10778
- Abstract Recent advances in task planning leverage Large Language Models (LLMs) to improve generalizability by combining such models with classical planning algorithms to address their inherent limitations in reasoning capabilities. However, these approaches face the challenge of dynamically capturing the initial state of the task planning problem. To alleviate this issue, we propose AutoGPT+P, a system that combines an affordance-based scene representation with a planning system. Affordances encompass the action possibilities of an agent on the environment and objects present in it. Thus, deriving the planning domain from an affordance-based scene representation allows symbolic planning with arbitrary objects. AutoGPT+P leverages this representation to derive and execute a plan for a task specified by the user in natural language. In addition to solving planning tasks under a closed-world assumption, AutoGPT+P can also handle planning with incomplete information, e. g., tasks with missing objects by exploring the scene, suggesting alternatives, or providing a partial plan. The affordance-based scene representation combines object detection with an automatically generated object-affordance-mapping using ChatGPT. The core planning tool extends existing work by automatically correcting semantic and syntactic errors. Our approach achieves a success rate of 98%, surpassing the current 81% success rate of the current state-of-the-art LLM-based planning method SayCan on the SayCan instruction set. Furthermore, we evaluated our approach on our newly created dataset with 150 scenarios covering a wide range of complex tasks with missing objects, achieving a success rate of 79% on our dataset. The dataset and the code are publicly available at https://git.h2t.iar.kit.edu/birr/autogpt-p-standalone.
Keyword: transformer
UMAIR-FPS: User-aware Multi-modal Animation Illustration Recommendation Fusion with Painting Style
- Authors: Yan Kang, Hao Lin, Mingjian Yang, Shin-Jye Lee
- Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.10381
- Pdf link: https://arxiv.org/pdf/2402.10381
- Abstract The rapid advancement of high-quality image generation models based on AI has generated a deluge of anime illustrations. Recommending illustrations to users within massive data has become a challenging and popular task. However, existing anime recommendation systems have focused on text features but still need to integrate image features. In addition, most multi-modal recommendation research is constrained by tightly coupled datasets, limiting its applicability to anime illustrations. We propose the User-aware Multi-modal Animation Illustration Recommendation Fusion with Painting Style (UMAIR-FPS) to tackle these gaps. In the feature extract phase, for image features, we are the first to combine image painting style features with semantic features to construct a dual-output image encoder for enhancing representation. For text features, we obtain text embeddings based on fine-tuning Sentence-Transformers by incorporating domain knowledge that composes a variety of domain text pairs from multilingual mappings, entity relationships, and term explanation perspectives, respectively. In the multi-modal fusion phase, we novelly propose a user-aware multi-modal contribution measurement mechanism to weight multi-modal features dynamically according to user features at the interaction level and employ the DCN-V2 module to model bounded-degree multi-modal crosses effectively. UMAIR-FPS surpasses the stat-of-the-art baselines on large real-world datasets, demonstrating substantial performance enhancements.
Dynamic Patch-aware Enrichment Transformer for Occluded Person Re-Identification
- Authors: Xin Zhang, Keren Fu, Qijun Zhao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.10435
- Pdf link: https://arxiv.org/pdf/2402.10435
- Abstract Person re-identification (re-ID) continues to pose a significant challenge, particularly in scenarios involving occlusions. Prior approaches aimed at tackling occlusions have predominantly focused on aligning physical body features through the utilization of external semantic cues. However, these methods tend to be intricate and susceptible to noise. To address the aforementioned challenges, we present an innovative end-to-end solution known as the Dynamic Patch-aware Enrichment Transformer (DPEFormer). This model effectively distinguishes human body information from occlusions automatically and dynamically, eliminating the need for external detectors or precise image alignment. Specifically, we introduce a dynamic patch token selection module (DPSM). DPSM utilizes a label-guided proxy token as an intermediary to identify informative occlusion-free tokens. These tokens are then selected for deriving subsequent local part features. To facilitate the seamless integration of global classification features with the finely detailed local features selected by DPSM, we introduce a novel feature blending module (FBM). FBM enhances feature representation through the complementary nature of information and the exploitation of part diversity. Furthermore, to ensure that DPSM and the entire DPEFormer can effectively learn with only identity labels, we also propose a Realistic Occlusion Augmentation (ROA) strategy. This strategy leverages the recent advances in the Segment Anything Model (SAM). As a result, it generates occlusion images that closely resemble real-world occlusions, greatly enhancing the subsequent contrastive learning process. Experiments on occluded and holistic re-ID benchmarks signify a substantial advancement of DPEFormer over existing state-of-the-art approaches. The code will be made publicly available.
Can Transformers Predict Vibrations?
- Authors: Fusataka Kuniyoshi, Yoshihide Sawada
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.10511
- Pdf link: https://arxiv.org/pdf/2402.10511
- Abstract Highly accurate time-series vibration prediction is an important research issue for electric vehicles (EVs). EVs often experience vibrations when driving on rough terrains, known as torsional resonance. This resonance, caused by the interaction between motor and tire vibrations, puts excessive loads on the vehicle's drive shaft. However, current damping technologies only detect resonance after the vibration amplitude of the drive shaft torque reaches a certain threshold, leading to significant loads on the shaft at the time of detection. In this study, we propose a novel approach to address this issue by introducing Resoformer, a transformer-based model for predicting torsional resonance. Resoformer utilizes time-series of the motor rotation speed as input and predicts the amplitude of torsional vibration at a specified quantile occurring in the shaft after the input series. By calculating the attention between recursive and convolutional features extracted from the measured data points, Resoformer improves the accuracy of vibration forecasting. To evaluate the model, we use a vibration dataset called VIBES (Dataset for Forecasting Vibration Transition in EVs), consisting of 2,600 simulator-generated vibration sequences. Our experiments, conducted on strong baselines built on the VIBES dataset, demonstrate that Resoformer achieves state-of-the-art results. In conclusion, our study answers the question "Can Transformers Forecast Vibrations?" While traditional transformer architectures show low performance in forecasting torsional resonance waves, our findings indicate that combining recurrent neural network and temporal convolutional network using the transformer architecture improves the accuracy of long-term vibration forecasting.
Personalised Drug Identifier for Cancer Treatment with Transformers using Auxiliary Information
- Authors: Aishwarya Jayagopal, Hansheng Xue, Ziyang He, Robert J. Walsh, Krishna Kumar Hariprasannan, David Shao Peng Tan, Tuan Zea Tan, Jason J. Pitt, Anand D. Jeyasekharan, Vaibhav Rajan
- Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
- Arxiv link: https://arxiv.org/abs/2402.10551
- Pdf link: https://arxiv.org/pdf/2402.10551
- Abstract Cancer remains a global challenge due to its growing clinical and economic burden. Its uniquely personal manifestation, which makes treatment difficult, has fuelled the quest for personalized treatment strategies. Thus, genomic profiling is increasingly becoming part of clinical diagnostic panels. Effective use of such panels requires accurate drug response prediction (DRP) models, which are challenging to build due to limited labelled patient data. Previous methods to address this problem have used various forms of transfer learning. However, they do not explicitly model the variable length sequential structure of the list of mutations in such diagnostic panels. Further, they do not utilize auxiliary information (like patient survival) for model training. We address these limitations through a novel transformer based method, which surpasses the performance of state-of-the-art DRP models on benchmark data. We also present the design of a treatment recommendation system (TRS), which is currently deployed at the National University Hospital, Singapore and is being evaluated in a clinical trial.
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
- Authors: Chris Wendler, Veniamin Veselovsky, Giovanni Monea, Robert West
- Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
- Arxiv link: https://arxiv.org/abs/2402.10588
- Pdf link: https://arxiv.org/pdf/2402.10588
- Abstract We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language models function and the origins of linguistic bias. Focusing on the Llama-2 family of transformer models, our study uses carefully constructed non-English prompts with a unique correct single-token continuation. From layer to layer, transformers gradually map an input embedding of the final prompt token to an output embedding from which next-token probabilities are computed. Tracking intermediate embeddings through their high-dimensional space reveals three distinct phases, whereby intermediate embeddings (1) start far away from output token embeddings; (2) already allow for decoding a semantically correct next token in the middle layers, but give higher probability to its version in English than in the input language; (3) finally move into an input-language-specific region of the embedding space. We cast these results into a conceptual model where the three phases operate in "input space", "concept space", and "output space", respectively. Crucially, our evidence suggests that the abstract "concept space" lies closer to English than to other languages, which may have important consequences regarding the biases held by multilingual language models.
ContiFormer: Continuous-Time Transformer for Irregular Time Series Modeling
- Authors: Yuqi Chen, Kan Ren, Yansen Wang, Yuchen Fang, Weiwei Sun, Dongsheng Li
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.10635
- Pdf link: https://arxiv.org/pdf/2402.10635
- Abstract Modeling continuous-time dynamics on irregular time series is critical to account for data evolution and correlations that occur continuously. Traditional methods including recurrent neural networks or Transformer models leverage inductive bias via powerful neural architectures to capture complex patterns. However, due to their discrete characteristic, they have limitations in generalizing to continuous-time data paradigms. Though neural ordinary differential equations (Neural ODEs) and their variants have shown promising results in dealing with irregular time series, they often fail to capture the intricate correlations within these sequences. It is challenging yet demanding to concurrently model the relationship between input data points and capture the dynamic changes of the continuous-time system. To tackle this problem, we propose ContiFormer that extends the relation modeling of vanilla Transformer to the continuous-time domain, which explicitly incorporates the modeling abilities of continuous dynamics of Neural ODEs with the attention mechanism of Transformers. We mathematically characterize the expressive power of ContiFormer and illustrate that, by curated designs of function hypothesis, many Transformer variants specialized in irregular time series modeling can be covered as a special case of ContiFormer. A wide range of experiments on both synthetic and real-world datasets have illustrated the superior modeling capacities and prediction performance of ContiFormer on irregular time series data. The project link is https://seqml.github.io/contiformer/.
Linear Transformers with Learnable Kernel Functions are Better In-Context Models
- Authors: Yaroslav Aksenov, Nikita Balagansky, Sofia Maria Lo Cicero Vaina, Boris Shaposhnikov, Alexey Gorbatovski, Daniil Gavrilov
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2402.10644
- Pdf link: https://arxiv.org/pdf/2402.10644
- Abstract Advancing the frontier of subquadratic architectures for Language Models (LMs) is crucial in the rapidly evolving field of natural language processing. Current innovations, including State Space Models, were initially celebrated for surpassing Transformer performance on language modeling tasks. However, these models have revealed deficiencies in essential In-Context Learning capabilities - a domain where the Transformer traditionally shines. The Based model emerged as a hybrid solution, blending a Linear Transformer with a kernel inspired by the Taylor expansion of exponential functions, augmented by convolutional networks. Mirroring the Transformer's in-context adeptness, it became a strong contender in the field. In our work, we present a singular, elegant alteration to the Based kernel that amplifies its In-Context Learning abilities evaluated with the Multi-Query Associative Recall task and overall language modeling process, as demonstrated on the Pile dataset.
PointMamba: A Simple State Space Model for Point Cloud Analysis
- Authors: Dingkang Liang, Xin Zhou, Xinyu Wang, Xingkui Zhu, Wei Xu, Zhikang Zou, Xiaoqing Ye, Xiang Bai
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2402.10739
- Pdf link: https://arxiv.org/pdf/2402.10739
- Abstract Transformers have become one of the foundational architectures in point cloud analysis tasks due to their excellent global modeling ability. However, the attention mechanism has quadratic complexity and is difficult to extend to long sequence modeling due to limited computational resources and so on. Recently, state space models (SSM), a new family of deep sequence models, have presented great potential for sequence modeling in NLP tasks. In this paper, taking inspiration from the success of SSM in NLP, we propose PointMamba, a framework with global modeling and linear complexity. Specifically, by taking embedded point patches as input, we proposed a reordering strategy to enhance SSM's global modeling ability by providing a more logical geometric scanning order. The reordered point tokens are then sent to a series of Mamba blocks to causally capture the point cloud structure. Experimental results show our proposed PointMamba outperforms the transformer-based counterparts on different point cloud analysis datasets, while significantly saving about 44.3% parameters and 25% FLOPs, demonstrating the potential option for constructing foundational 3D vision models. We hope our PointMamba can provide a new perspective for point cloud analysis. The code is available at https://github.com/LMD0311/PointMamba.
In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss
- Authors: Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2402.10790
- Pdf link: https://arxiv.org/pdf/2402.10790
- Abstract This paper addresses the challenge of processing long documents using generative transformer models. To evaluate different approaches, we introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. Our evaluation, which includes benchmarks for GPT-4 and RAG, reveals that common methods are effective only for sequences up to $10^4$ elements. In contrast, fine-tuning GPT-2 with recurrent memory augmentations enables it to handle tasks involving up to $10^7$ elements. This achievement marks a substantial leap, as it is by far the longest input processed by any open neural network model to date, demonstrating a significant improvement in the processing capabilities for long sequences.
VATr++: Choose Your Words Wisely for Handwritten Text Generation
- Authors: Bram Vanherle, Vittorio Pippi, Silvia Cascianelli, Nick Michiels, Frank Van Reeth, Rita Cucchiara
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2402.10798
- Pdf link: https://arxiv.org/pdf/2402.10798
- Abstract Styled Handwritten Text Generation (HTG) has received significant attention in recent years, propelled by the success of learning-based solutions employing GANs, Transformers, and, preliminarily, Diffusion Models. Despite this surge in interest, there remains a critical yet understudied aspect - the impact of the input, both visual and textual, on the HTG model training and its subsequent influence on performance. This study delves deeper into a cutting-edge Styled-HTG approach, proposing strategies for input preparation and training regularization that allow the model to achieve better performance and generalize better. These aspects are validated through extensive analysis on several different settings and datasets. Moreover, in this work, we go beyond performance optimization and address a significant hurdle in HTG research - the lack of a standardized evaluation protocol. In particular, we propose a standardization of the evaluation protocol for HTG and conduct a comprehensive benchmarking of existing approaches. By doing so, we aim to establish a foundation for fair and meaningful comparisons between HTG strategies, fostering progress in the field.
Keyword: scene understanding
There is no result
Keyword: visual reasoning
There is no result