arxiv-daily
arxiv-daily copied to clipboard
New submissions for Fri, 26 Jan 24
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
MultiTest: Physical-Aware Object Insertion for Testing Multi-sensor Fusion Perception Systems
- Authors: Xinyu Gao, Zhijie Wang, Yang Feng, Lei Ma, Zhenyu Chen, Baowen Xu
- Subjects: Software Engineering (cs.SE)
- Arxiv link: https://arxiv.org/abs/2401.14314
- Pdf link: https://arxiv.org/pdf/2401.14314
- Abstract Multi-sensor fusion stands as a pivotal technique in addressing numerous safety-critical tasks and applications, e.g., self-driving cars and automated robotic arms. With the continuous advancement in data-driven artificial intelligence (AI), MSF's potential for sensing and understanding intricate external environments has been further amplified, bringing a profound impact on intelligent systems and specifically on their perception systems. Similar to traditional software, adequate testing is also required for AI-enabled MSF systems. Yet, existing testing methods primarily concentrate on single-sensor perception systems (e.g., image-/point cloud-based object detection systems). There remains a lack of emphasis on generating multi-modal test cases for MSF systems. To address these limitations, we design and implement MultiTest, a fitness-guided metamorphic testing method for complex MSF perception systems. MultiTest employs a physical-aware approach to synthesize realistic multi-modal object instances and insert them into critical positions of background images and point clouds. A fitness metric is designed to guide and boost the test generation process. We conduct extensive experiments with five SOTA perception systems to evaluate MultiTest from the perspectives of: (1) generated test cases' realism, (2) fault detection capabilities, and (3) performance improvement. The results show that MultiTest can generate realistic and modality-consistent test data and effectively detect hundreds of diverse faults of an MSF system under test. Moreover, retraining an MSF system on the test cases generated by MultiTest can improve the system's robustness.
UrbanGenAI: Reconstructing Urban Landscapes using Panoptic Segmentation and Diffusion Models
- Authors: Timo Kapsalis
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.14379
- Pdf link: https://arxiv.org/pdf/2401.14379
- Abstract In contemporary design practices, the integration of computer vision and generative artificial intelligence (genAI) represents a transformative shift towards more interactive and inclusive processes. These technologies offer new dimensions of image analysis and generation, which are particularly relevant in the context of urban landscape reconstruction. This paper presents a novel workflow encapsulated within a prototype application, designed to leverage the synergies between advanced image segmentation and diffusion models for a comprehensive approach to urban design. Our methodology encompasses the OneFormer model for detailed image segmentation and the Stable Diffusion XL (SDXL) diffusion model, implemented through ControlNet, for generating images from textual descriptions. Validation results indicated a high degree of performance by the prototype application, showcasing significant accuracy in both object detection and text-to-image generation. This was evidenced by superior Intersection over Union (IoU) and CLIP scores across iterative evaluations for various categories of urban landscape features. Preliminary testing included utilising UrbanGenAI as an educational tool enhancing the learning experience in design pedagogy, and as a participatory instrument facilitating community-driven urban planning. Early results suggested that UrbanGenAI not only advances the technical frontiers of urban landscape reconstruction but also provides significant pedagogical and participatory planning benefits. The ongoing development of UrbanGenAI aims to further validate its effectiveness across broader contexts and integrate additional features such as real-time feedback mechanisms and 3D modelling capabilities. Keywords: generative AI; panoptic image segmentation; diffusion models; urban landscape design; design pedagogy; co-design
Keyword: transformer
S2TPVFormer: Spatio-Temporal Tri-Perspective View for temporally coherent 3D Semantic Occupancy Prediction
- Authors: Sathira Silva, Savindu Bhashitha Wannigama, Roshan Ragel, Gihan Jayatilaka
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.13785
- Pdf link: https://arxiv.org/pdf/2401.13785
- Abstract Holistic understanding and reasoning in 3D scenes play a vital role in the success of autonomous driving systems. The evolution of 3D semantic occupancy prediction as a pretraining task for autonomous driving and robotic downstream tasks captures finer 3D details compared to methods like 3D detection. Existing approaches predominantly focus on spatial cues, often overlooking temporal cues. Query-based methods tend to converge on computationally intensive Voxel representation for encoding 3D scene information. This study introduces S2TPVFormer, an extension of TPVFormer, utilizing a spatiotemporal transformer architecture for coherent 3D semantic occupancy prediction. Emphasizing the importance of spatiotemporal cues in 3D scene perception, particularly in 3D semantic occupancy prediction, our work explores the less-explored realm of temporal cues. Leveraging Tri-Perspective View (TPV) representation, our spatiotemporal encoder generates temporally rich embeddings, improving prediction coherence while maintaining computational efficiency. To achieve this, we propose a novel Temporal Cross-View Hybrid Attention (TCVHA) mechanism, facilitating effective spatiotemporal information exchange across TPV views. Experimental evaluations on the nuScenes dataset demonstrate a substantial 3.1% improvement in mean Intersection over Union (mIoU) for 3D Semantic Occupancy compared to TPVFormer, confirming the effectiveness of the proposed S2TPVFormer in enhancing 3D scene perception.
Inverse Molecular Design with Multi-Conditional Diffusion Guidance
- Authors: Gang Liu, Jiaxin Xu, Tengfei Luo, Meng Jiang
- Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
- Arxiv link: https://arxiv.org/abs/2401.13858
- Pdf link: https://arxiv.org/pdf/2401.13858
- Abstract Inverse molecular design with diffusion models holds great potential for advancements in material and drug discovery. Despite success in unconditional molecule generation, integrating multiple properties such as synthetic score and gas permeability as condition constraints into diffusion models remains unexplored. We introduce multi-conditional diffusion guidance. The proposed Transformer-based denoising model has a condition encoder that learns the representations of numerical and categorical conditions. The denoising model, consisting of a structure encoder-decoder, is trained for denoising under the representation of conditions. The diffusion process becomes graph-dependent to accurately estimate graph-related noise in molecules, unlike the previous models that focus solely on the marginal distributions of atoms or bonds. We extensively validate our model for multi-conditional polymer and small molecule generation. Results demonstrate our superiority across metrics from distribution learning to condition control for molecular properties. An inverse polymer design task for gas separation with feedback from domain experts further demonstrates its practical utility.
A Survey of Deep Learning and Foundation Models for Time Series Forecasting
- Authors: John A. Miller, Mohammed Aldosari, Farah Saeed, Nasid Habib Barna, Subas Rana, I. Budak Arpinar, Ninghao Liu
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.13912
- Pdf link: https://arxiv.org/pdf/2401.13912
- Abstract Deep Learning has been successfully applied to many application domains, yet its advantages have been slow to emerge for time series forecasting. For example, in the well-known Makridakis (M) Competitions, hybrids of traditional statistical or machine learning techniques have only recently become the top performers. With the recent architectural advances in deep learning being applied to time series forecasting (e.g., encoder-decoders with attention, transformers, and graph neural networks), deep learning has begun to show significant advantages. Still, in the area of pandemic prediction, there remain challenges for deep learning models: the time series is not long enough for effective training, unawareness of accumulated scientific knowledge, and interpretability of the model. To this end, the development of foundation models (large deep learning models with extensive pre-training) allows models to understand patterns and acquire knowledge that can be applied to new related problems before extensive training data becomes available. Furthermore, there is a vast amount of knowledge available that deep learning models can tap into, including Knowledge Graphs and Large Language Models fine-tuned with scientific domain knowledge. There is ongoing research examining how to utilize or inject such knowledge into deep learning models. In this survey, several state-of-the-art modeling techniques are reviewed, and suggestions for further work are provided.
AM-SORT: Adaptable Motion Predictor with Historical Trajectory Embedding for Multi-Object Tracking
- Authors: Vitaliy Kim, Gunho Jung, Seong-Whan Lee
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.13950
- Pdf link: https://arxiv.org/pdf/2401.13950
- Abstract Many multi-object tracking (MOT) approaches, which employ the Kalman Filter as a motion predictor, assume constant velocity and Gaussian-distributed filtering noises. These assumptions render the Kalman Filter-based trackers effective in linear motion scenarios. However, these linear assumptions serve as a key limitation when estimating future object locations within scenarios involving non-linear motion and occlusions. To address this issue, we propose a motion-based MOT approach with an adaptable motion predictor, called AM-SORT, which adapts to estimate non-linear uncertainties. AM-SORT is a novel extension of the SORT-series trackers that supersedes the Kalman Filter with the transformer architecture as a motion predictor. We introduce a historical trajectory embedding that empowers the transformer to extract spatio-temporal features from a sequence of bounding boxes. AM-SORT achieves competitive performance compared to state-of-the-art trackers on DanceTrack, with 56.3 IDF1 and 55.6 HOTA. We conduct extensive experiments to demonstrate the effectiveness of our method in predicting non-linear movement under occlusions.
Dynamic Long-Term Time-Series Forecasting via Meta Transformer Networks
- Authors: Muhammad Anwar Ma'sum, MD Rasel Sarkar, Mahardhika Pratama, Savitha Ramasamy, Sreenatha Anavatti, Lin Liu, Habibullah, Ryszard Kowalczyk
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.13968
- Pdf link: https://arxiv.org/pdf/2401.13968
- Abstract A reliable long-term time-series forecaster is highly demanded in practice but comes across many challenges such as low computational and memory footprints as well as robustness against dynamic learning environments. This paper proposes Meta-Transformer Networks (MANTRA) to deal with the dynamic long-term time-series forecasting tasks. MANTRA relies on the concept of fast and slow learners where a collection of fast learners learns different aspects of data distributions while adapting quickly to changes. A slow learner tailors suitable representations to fast learners. Fast adaptations to dynamic environments are achieved using the universal representation transformer layers producing task-adapted representations with a small number of parameters. Our experiments using four datasets with different prediction lengths demonstrate the advantage of our approach with at least $3%$ improvements over the baseline algorithms for both multivariate and univariate settings. Source codes of MANTRA are publicly available in \url{https://github.com/anwarmaxsum/MANTRA}.
Cross-Domain Few-Shot Learning via Adaptive Transformer Networks
- Authors: Naeem Paeedeh, Mahardhika Pratama, Muhammad Anwar Ma'sum, Wolfgang Mayer, Zehong Cao, Ryszard Kowlczyk
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.13987
- Pdf link: https://arxiv.org/pdf/2401.13987
- Abstract Most few-shot learning works rely on the same domain assumption between the base and the target tasks, hindering their practical applications. This paper proposes an adaptive transformer network (ADAPTER), a simple but effective solution for cross-domain few-shot learning where there exist large domain shifts between the base task and the target task. ADAPTER is built upon the idea of bidirectional cross-attention to learn transferable features between the two domains. The proposed architecture is trained with DINO to produce diverse, and less biased features to avoid the supervision collapse problem. Furthermore, the label smoothing approach is proposed to improve the consistency and reliability of the predictions by also considering the predicted labels of the close samples in the embedding space. The performance of ADAPTER is rigorously evaluated in the BSCD-FSL benchmarks in which it outperforms prior arts with significant margins.
Diffusion-based Data Augmentation for Object Counting Problems
- Authors: Zhen Wang, Yuelei Li, Jia Wan, Nuno Vasconcelos
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.13992
- Pdf link: https://arxiv.org/pdf/2401.13992
- Abstract Crowd counting is an important problem in computer vision due to its wide range of applications in image understanding. Currently, this problem is typically addressed using deep learning approaches, such as Convolutional Neural Networks (CNNs) and Transformers. However, deep networks are data-driven and are prone to overfitting, especially when the available labeled crowd dataset is limited. To overcome this limitation, we have designed a pipeline that utilizes a diffusion model to generate extensive training data. We are the first to generate images conditioned on a location dot map (a binary dot map that specifies the location of human heads) with a diffusion model. We are also the first to use these diverse synthetic data to augment the crowd counting models. Our proposed smoothed density map input for ControlNet significantly improves ControlNet's performance in generating crowds in the correct locations. Also, Our proposed counting loss for the diffusion model effectively minimizes the discrepancies between the location dot map and the crowd images generated. Additionally, our innovative guidance sampling further directs the diffusion process toward regions where the generated crowd images align most accurately with the location dot map. Collectively, we have enhanced ControlNet's ability to generate specified objects from a location dot map, which can be used for data augmentation in various counting problems. Moreover, our framework is versatile and can be easily adapted to all kinds of counting problems. Extensive experiments demonstrate that our framework improves the counting performance on the ShanghaiTech, NWPU-Crowd, UCF-QNRF, and TRANCOS datasets, showcasing its effectiveness.
On the Feasibility of Simple Transformer for Dynamic Graph Modeling
- Authors: Yuxia Wu, Yuan Fang, Lizi Liao
- Subjects: Social and Information Networks (cs.SI)
- Arxiv link: https://arxiv.org/abs/2401.14009
- Pdf link: https://arxiv.org/pdf/2401.14009
- Abstract Dynamic graph modeling is crucial for understanding complex structures in web graphs, spanning applications in social networks, recommender systems, and more. Most existing methods primarily emphasize structural dependencies and their temporal changes. However, these approaches often overlook detailed temporal aspects or struggle with long-term dependencies. Furthermore, many solutions overly complicate the process by emphasizing intricate module designs to capture dynamic evolutions. In this work, we harness the strength of the Transformer's self-attention mechanism, known for adeptly handling long-range dependencies in sequence modeling. Our approach offers a simple Transformer model tailored for dynamic graph modeling without complex modifications. We re-conceptualize dynamic graphs as a sequence modeling challenge and introduce an innovative temporal alignment technique. This technique not only captures the inherent temporal evolution patterns within dynamic graphs but also streamlines the modeling process of their evolution. As a result, our method becomes versatile, catering to an array of applications. Our model's effectiveness is underscored through rigorous experiments on four real-world datasets from various sectors, solidifying its potential in dynamic graph modeling.
(Chat)GPT v BERT: Dawn of Justice for Semantic Change Detection
- Authors: Francesco Periti, Haim Dubossarsky, Nina Tahmasebi
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2401.14040
- Pdf link: https://arxiv.org/pdf/2401.14040
- Abstract In the universe of Natural Language Processing, Transformer-based language models like BERT and (Chat)GPT have emerged as lexical superheroes with great power to solve open research problems. In this paper, we specifically focus on the temporal problem of semantic change, and evaluate their ability to solve two diachronic extensions of the Word-in-Context (WiC) task: TempoWiC and HistoWiC. In particular, we investigate the potential of a novel, off-the-shelf technology like ChatGPT (and GPT) 3.5 compared to BERT, which represents a family of models that currently stand as the state-of-the-art for modeling semantic change. Our experiments represent the first attempt to assess the use of (Chat)GPT for studying semantic change. Our results indicate that ChatGPT performs significantly worse than the foundational GPT version. Furthermore, our results demonstrate that (Chat)GPT achieves slightly lower performance than BERT in detecting long-term changes but performs significantly worse in detecting short-term changes.
Vivim: a Video Vision Mamba for Medical Video Object Segmentation
- Authors: Yijun Yang, Zhaohu Xing, Lei Zhu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.14168
- Pdf link: https://arxiv.org/pdf/2401.14168
- Abstract Traditional convolutional neural networks have a limited receptive field while transformer-based networks are mediocre in constructing long-term dependency from the perspective of computational complexity. Such the bottleneck poses a significant challenge when processing long video sequences in video analysis tasks. Very recently, the state space models (SSMs) with efficient hardware-aware designs, famous by Mamba, have exhibited impressive achievements in long sequence modeling, which facilitates the development of deep neural networks on many vision tasks. To better capture available cues in video frames, this paper presents a generic Video Vision Mamba-based framework for medical video object segmentation tasks, named Vivim. Our Vivim can effectively compress the long-term spatiotemporal representation into sequences at varying scales by our designed Temporal Mamba Block. Compared to existing video-level Transformer-based methods, our model maintains excellent segmentation results with better speed performance. Extensive experiments on the breast US dataset demonstrate the effectiveness and efficiency of our Vivim. The code for Vivim is available at: https://github.com/scott-yjyang/Vivim.
Friendly Attacks to Improve Channel Coding Reliability
- Authors: Anastasiia Kurmukova, Deniz Gunduz
- Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.14184
- Pdf link: https://arxiv.org/pdf/2401.14184
- Abstract This paper introduces a novel approach called "friendly attack" aimed at enhancing the performance of error correction channel codes. Inspired by the concept of adversarial attacks, our method leverages the idea of introducing slight perturbations to the neural network input, resulting in a substantial impact on the network's performance. By introducing small perturbations to fixed-point modulated codewords before transmission, we effectively improve the decoder's performance without violating the input power constraint. The perturbation design is accomplished by a modified iterative fast gradient method. This study investigates various decoder architectures suitable for computing gradients to obtain the desired perturbations. Specifically, we consider belief propagation (BP) for LDPC codes; the error correcting code transformer, BP and neural BP (NBP) for polar codes, and neural BCJR for convolutional codes. We demonstrate that the proposed friendly attack method can improve the reliability across different channels, modulations, codes, and decoders. This method allows us to increase the reliability of communication with a legacy receiver by simply modifying the transmitted codeword appropriately.
Producing Plankton Classifiers that are Robust to Dataset Shift
- Authors: Cheng Chen, Sreenath Kyathanahally, Marta Reyes, Stefanie Merkli, Ewa Merz, Emanuele Francazi, Marvin Hoege, Francesco Pomati, Marco Baity-Jesi
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.14256
- Pdf link: https://arxiv.org/pdf/2401.14256
- Abstract Modern plankton high-throughput monitoring relies on deep learning classifiers for species recognition in water ecosystems. Despite satisfactory nominal performances, a significant challenge arises from Dataset Shift, which causes performances to drop during deployment. In our study, we integrate the ZooLake dataset with manually-annotated images from 10 independent days of deployment, serving as test cells to benchmark Out-Of-Dataset (OOD) performances. Our analysis reveals instances where classifiers, initially performing well in In-Dataset conditions, encounter notable failures in practical scenarios. For example, a MobileNet with a 92% nominal test accuracy shows a 77% OOD accuracy. We systematically investigate conditions leading to OOD performance drops and propose a preemptive assessment method to identify potential pitfalls when classifying new data, and pinpoint features in OOD images that adversely impact classification. We present a three-step pipeline: (i) identifying OOD degradation compared to nominal test performance, (ii) conducting a diagnostic analysis of degradation causes, and (iii) providing solutions. We find that ensembles of BEiT vision transformers, with targeted augmentations addressing OOD robustness, geometric ensembling, and rotation-based test-time augmentation, constitute the most robust model, which we call BEsT model. It achieves an 83% OOD accuracy, with errors concentrated on container classes. Moreover, it exhibits lower sensitivity to dataset shift, and reproduces well the plankton abundances. Our proposed pipeline is applicable to generic plankton classifiers, contingent on the availability of suitable test cells. By identifying critical shortcomings and offering practical procedures to fortify models against dataset shift, our study contributes to the development of more reliable plankton classification technologies.
Transformers and Cortical Waves: Encoders for Pulling In Context Across Time
- Authors: Lyle Muller, Patricia S. Churchland, Terrence J. Sejnowski
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.14267
- Pdf link: https://arxiv.org/pdf/2401.14267
- Abstract The capabilities of transformer networks such as ChatGPT and other Large Language Models (LLMs) have captured the world's attention. The crucial computational mechanism underlying their performance relies on transforming a complete input sequence - for example, all the words in a sentence into a long "encoding vector" - that allows transformers to learn long-range temporal dependencies in naturalistic sequences. Specifically, "self-attention" applied to this encoding vector enhances temporal context in transformers by computing associations between pairs of words in the input sequence. We suggest that waves of neural activity, traveling across single cortical regions or across multiple regions at the whole-brain scale, could implement a similar encoding principle. By encapsulating recent input history into a single spatial pattern at each moment in time, cortical waves may enable temporal context to be extracted from sequences of sensory inputs, the same computational principle used in transformers.
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
- Authors: Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.14405
- Pdf link: https://arxiv.org/pdf/2401.14405
- Abstract We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway - given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models so that data of the target modality can be processed by both models. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and task-specific head as usual but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any inference costs. On the image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT.
Keyword: scene understanding
There is no result
Keyword: visual reasoning
There is no result