arxiv-daily
New submissions for Thu, 25 Jan 24
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
AMANet: Advancing SAR Ship Detection with Adaptive Multi-Hierarchical Attention Network
- Authors: Xiaolin Ma, Junkai Cheng, Aihua Li, Yuhua Zhang, Zhilong Lin
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.13214
- Pdf link: https://arxiv.org/pdf/2401.13214
- Abstract Recently, methods based on deep learning have been successfully applied to ship detection for synthetic aperture radar (SAR) images. Despite the development of numerous ship detection methodologies, detecting small and coastal ships remains a significant challenge due to the limited features and clutter in coastal environments. For that, a novel adaptive multi-hierarchical attention module (AMAM) is proposed to learn multi-scale features and adaptively aggregate salient features from various feature layers, even in complex environments. Specifically, we first fuse information from adjacent feature layers to enhance the detection of smaller targets, thereby achieving multi-scale feature enhancement. Then, to filter out the adverse effects of complex backgrounds, we dissect the previously fused multi-level features on the channel, individually excavate the salient regions, and adaptively amalgamate features originating from different channels. Thirdly, we present a novel adaptive multi-hierarchical attention network (AMANet) by embedding the AMAM between the backbone network and the feature pyramid network (FPN). Besides, the AMAM can be readily inserted between different frameworks to improve object detection. Lastly, extensive experiments on two large-scale SAR ship detection datasets demonstrate that our AMANet method is superior to state-of-the-art methods.
PLATE: A perception-latency aware estimator
- Authors: Rodrigo Aldana-López, Rosario Aragüés, Carlos Sagüés
- Subjects: Systems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
- Arxiv link: https://arxiv.org/abs/2401.13596
- Pdf link: https://arxiv.org/pdf/2401.13596
- Abstract Target tracking is a popular problem with many potential applications. There has been a lot of effort on improving the quality of the detection of targets using cameras through different techniques. In general, with higher computational effort applied, i.e., a longer perception-latency, a better detection accuracy is obtained. However, it is not always useful to apply the longest perception-latency allowed, particularly when the environment does not require it and when the computational resources are shared among other tasks. In this work, we propose a new Perception-LATency aware Estimator (PLATE), which uses different perception configurations at different moments in time in order to optimize a certain performance measure. This measure takes into account a perception-latency and accuracy trade-off, aiming for a good compromise between quality and resource usage. Compared to other heuristic frame-skipping techniques, PLATE comes with a formal complexity and optimality analysis. The advantages of PLATE are verified by several experiments, including an evaluation over a standard benchmark with real data and using state-of-the-art deep learning object detection methods for the perception stage.
Keyword: transformer
TCE at Qur'an QA 2023 Shared Task: Low Resource Enhanced Transformer-based Ensemble Approach for Qur'anic QA
- Authors: Mohammed Alaa Elkomy, Amany Sarhan
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.13060
- Pdf link: https://arxiv.org/pdf/2401.13060
- Abstract In this paper, we present our approach to tackle Qur'an QA 2023 shared tasks A and B. To address the challenge of low-resourced training data, we rely on transfer learning together with a voting ensemble to improve prediction stability across multiple runs. Additionally, we employ different architectures and learning mechanisms for a range of Arabic pre-trained transformer-based models for both tasks. To identify unanswerable questions, we propose using a thresholding mechanism. Our top-performing systems greatly surpass the baseline performance on the hidden split, achieving a MAP score of 25.05% for task A and a partial Average Precision (pAP) of 57.11% for task B.
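A minimal sketch of how a voting ensemble can be combined with a confidence threshold for unanswerable questions, as the abstract describes at a high level. The function name `ensemble_vote`, the `NO_ANSWER` sentinel, and the default threshold of 0.5 are hypothetical and not taken from the paper; the authors' actual voting and thresholding rules may differ.

```python
from collections import Counter

NO_ANSWER = None  # hypothetical sentinel for "question is unanswerable"


def ensemble_vote(run_predictions, run_confidences, threshold=0.5):
    """Majority vote across runs, abstaining when confidence is low.

    run_predictions: one predicted answer per training run.
    run_confidences: one confidence score in [0, 1] per run.
    threshold: assumed cut-off below which the question is treated
               as unanswerable (an illustrative value).
    """
    votes = Counter(run_predictions)
    winner, _ = votes.most_common(1)[0]
    # Average the confidence of the runs that agreed with the winner.
    scores = [c for p, c in zip(run_predictions, run_confidences) if p == winner]
    mean_conf = sum(scores) / len(scores)
    return winner if mean_conf >= threshold else NO_ANSWER
```

Averaging only the confidences of the agreeing runs is one of several plausible design choices; thresholding the top score of a single model would serve the same purpose.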
PlaceFormer: Transformer-based Visual Place Recognition using Multi-Scale Patch Selection and Fusion
- Authors: Shyam Sundar Kannan, Byung-Cheol Min
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2401.13082
- Pdf link: https://arxiv.org/pdf/2401.13082
- Abstract Visual place recognition is a challenging task in the field of computer vision, and autonomous robotics and vehicles, which aims to identify a location or a place from visual inputs. Contemporary methods in visual place recognition employ convolutional neural networks and utilize every region within the image for the place recognition task. However, the presence of dynamic and distracting elements in the image may impact the effectiveness of the place recognition process. Therefore, it is meaningful to focus on task-relevant regions of the image for improved recognition. In this paper, we present PlaceFormer, a novel transformer-based approach for visual place recognition. PlaceFormer employs patch tokens from the transformer to create global image descriptors, which are then used for image retrieval. To re-rank the retrieved images, PlaceFormer merges the patch tokens from the transformer to form multi-scale patches. Utilizing the transformer's self-attention mechanism, it selects patches that correspond to task-relevant areas in an image. These selected patches undergo geometric verification, generating similarity scores across different patch sizes. Subsequently, spatial scores from each patch size are fused to produce a final similarity score. This score is then used to re-rank the images initially retrieved using global image descriptors. Extensive experiments on benchmark datasets demonstrate that PlaceFormer outperforms several state-of-the-art methods in terms of accuracy and computational efficiency, requiring less time and memory.
Gravity-Informed Deep Learning Framework for Predicting Ship Traffic Flow and Invasion Risk of Non-Indigenous Species via Ballast Water Discharge
- Authors: Ruixin Song, Gabriel Spadon, Sarah Bailey, Ronald Pelot, Stan Matwin, Amilcar Soares
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Applications (stat.AP)
- Arxiv link: https://arxiv.org/abs/2401.13098
- Pdf link: https://arxiv.org/pdf/2401.13098
- Abstract Invasive species in water bodies pose a major threat to the environment and biodiversity globally. Due to increased transportation and trade, non-native species have been introduced to new environments, causing damage to ecosystems and leading to economic losses in agriculture, forestry, and fisheries. Therefore, there is a pressing need for risk assessment and management techniques to mitigate the impact of these invasions. This study aims to develop a new physics-inspired model to forecast maritime shipping traffic and thus inform risk assessment of invasive species spread through global transportation networks. Inspired by the gravity model for international trades, our model considers various factors that influence the likelihood and impact of vessel activities, such as shipping flux density, distance between ports, trade flow, and centrality measures of transportation hubs. Additionally, by analyzing the risk network of invasive species, we provide a comprehensive framework for assessing the invasion threat level given a pair of origin and destination. Accordingly, this paper introduces transformers to gravity models to rebuild the short- and long-term dependencies that make the risk analysis feasible. Thus, we introduce a physics-inspired framework that achieves an 89% segmentation accuracy for existing and non-existing trajectories and an 84.8% accuracy for the number of vessels flowing between key port areas, representing more than 10% improvement over the traditional deep-gravity model. Along these lines, this research contributes to a better understanding of invasive species risk assessment. It allows policymakers, conservationists, and stakeholders to prioritize management actions by identifying high-risk invasion pathways. Besides, our model is versatile and can include new data sources, making it suitable for assessing species invasion risks in a changing global landscape.
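For orientation, the classical gravity model that the abstract cites as inspiration can be sketched as follows. This is the textbook form only, in which flow grows with the "masses" of the two endpoints and decays with distance; the parameter names `g` and `beta` and their defaults are assumptions for illustration, and the sketch does not reproduce the paper's transformer-based model.

```python
def gravity_flow(mass_origin, mass_dest, distance, g=1.0, beta=2.0):
    """Classical gravity-model flow estimate between two ports.

    mass_origin, mass_dest: non-negative size proxies for each port
        (for example, shipping flux or trade volume).
    distance: positive distance between the ports.
    g: scaling constant (assumed value).
    beta: distance-decay exponent (assumed value).
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return g * mass_origin * mass_dest / distance ** beta
```

For example, `gravity_flow(10, 20, 5)` evaluates to 8.0, and halving `beta` to 1.0 raises the estimate to 40.0, which shows how strongly the decay exponent shapes predicted traffic.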
Analyzing COVID-19 Vaccination Sentiments in Nigerian Cyberspace: Insights from a Manually Annotated Twitter Dataset
- Authors: Ibrahim Said Ahmad, Lukman Jibril Aliyu, Abubakar Auwal Khalid, Saminu Muhammad Aliyu, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Bala Mairiga Abduljalil, Bello Shehu Bello, Amina Imam Abubakar
- Subjects: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
- Arxiv link: https://arxiv.org/abs/2401.13133
- Pdf link: https://arxiv.org/pdf/2401.13133
- Abstract Numerous successes have been achieved in combating the COVID-19 pandemic, initially using various precautionary measures like lockdowns, social distancing, and the use of face masks. More recently, various vaccinations have been developed to aid in the prevention or reduction of the severity of the COVID-19 infection. Despite the effectiveness of the precautionary measures and the vaccines, there are several controversies that are massively shared on social media platforms like Twitter. In this paper, we explore the use of state-of-the-art transformer-based language models to study people's acceptance of vaccines in Nigeria. We developed a novel dataset by crawling multi-lingual tweets using relevant hashtags and keywords. Our analysis and visualizations revealed that most tweets expressed neutral sentiments about COVID-19 vaccines, with some individuals expressing positive views, and there was no strong preference for specific vaccine types, although Moderna received slightly more positive sentiment. We also found out that fine-tuning a pre-trained LLM with an appropriate dataset can yield competitive results, even if the LLM was not initially pre-trained on the specific language of that dataset.
Enhancing cross-domain detection: adaptive class-aware contrastive transformer
- Authors: Ziru Zeng, Yue Ding, Hongtao Lu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.13264
- Pdf link: https://arxiv.org/pdf/2401.13264
- Abstract Recently, the detection transformer has gained substantial attention for its inherent minimal post-processing requirement. However, this paradigm relies on abundant training data, yet in the context of the cross-domain adaptation, insufficient labels in the target domain exacerbate issues of class imbalance and model performance degradation. To address these challenges, we propose a novel class-aware cross domain detection transformer based on the adversarial learning and mean-teacher framework. First, considering the inconsistencies between the classification and regression tasks, we introduce an IoU-aware prediction branch and exploit the consistency of classification and location scores to filter and reweight pseudo labels. Second, we devise a dynamic category threshold refinement to adaptively manage model confidence. Third, to alleviate the class imbalance, an instance-level class-aware contrastive learning module is presented to encourage the generation of discriminative features for each class, particularly benefiting minority classes. Experimental results across diverse domain-adaptive scenarios validate our method's effectiveness in improving performance and alleviating class imbalance issues, which outperforms the state-of-the-art transformer based methods.
TraKDis: A Transformer-based Knowledge Distillation Approach for Visual Reinforcement Learning with Application to Cloth Manipulation
- Authors: Wei Chen, Nicolas Rojas
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2401.13362
- Pdf link: https://arxiv.org/pdf/2401.13362
- Abstract Approaching robotic cloth manipulation using reinforcement learning based on visual feedback is appealing as robot perception and control can be learned simultaneously. However, major challenges arise from the intricate dynamics of cloth and the high dimensionality of the corresponding states, which undermine the practicality of the idea. To tackle these issues, we propose TraKDis, a novel Transformer-based Knowledge Distillation approach that decomposes the visual reinforcement learning problem into two distinct stages. In the first stage, a privileged agent is trained, which possesses complete knowledge of the cloth state information. This privileged agent acts as a teacher, providing valuable guidance and training signals for subsequent stages. The second stage involves a knowledge distillation procedure, where the knowledge acquired by the privileged agent is transferred to a vision-based agent by leveraging pre-trained state estimation and weight initialization. TraKDis demonstrates better performance when compared to state-of-the-art RL techniques, showing a higher performance of 21.9%, 13.8%, and 8.3% in cloth folding tasks in simulation. Furthermore, to validate robustness, we evaluate the agent in a noisy environment; the results indicate its ability to handle and adapt to environmental uncertainties effectively. Real robot experiments are also conducted to showcase the efficiency of our method in real-world scenarios.
Analysis and implementation of the Buck-Boost Modified Series Forward converter applied to photovoltaic systems
- Authors: David Lopez del Moral, Andres Barrado, Marina Sanz, Antonio Lazaro, Pablo Zumel
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2401.13464
- Pdf link: https://arxiv.org/pdf/2401.13464
- Abstract The mismatching phenomenon is one of the main issues in photovoltaic (PV) applications. It could reduce the generated power of a string when a PV panel has different performances from the other PV panels connected to the same string. Distributed Maximum Power Point Tracking (DMPPT) architectures are one of the most promising solutions to overcome the drawbacks associated with mismatching phenomena in PV applications. In this kind of architecture, a DC-DC module integrated converter (MIC) manages each PV panel, isolating it from the rest of the PV panels, for harvesting the maximum available power from the Sun. Due to the high number of DC-DC converters used in a grid-tied PV installation, the most desired MIC requirements are high efficiency, low cost and the capability of voltage step-up and step-down. This paper proposes the Buck-Boost Modified Series Forward (BBMSF) converter as a good candidate to be applied in DMPPT architectures. A complete analysis of the BBMSF converter is carried out, including the steady-state analysis as well as the small signal analysis in continuous conduction mode. The main advantages of the BBMSF converter are its step-up and step-down voltage transfer function; a higher simplicity, since it only includes a single controlled switch; the soft switching characteristics in all the diodes and MOSFET, reaching in some cases ZVS and ZCS, and yielding high efficiencies; the use of an autotransformer, with better performances than a typical Forward transformer; and the good dynamic performance, like the Forward converter ones. The theoretical analyses are validated through the experimental results in a 225 W BBMSF prototype designed and built under the requirements of a 100 kW grid-tied PV installation, achieving an efficiency up to 93.6%.
LDCA: Local Descriptors with Contextual Augmentation for Few-Shot Learning
- Authors: Maofa Wang, Bingchen Yan
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.13499
- Pdf link: https://arxiv.org/pdf/2401.13499
- Abstract Few-shot image classification has emerged as a key challenge in the field of computer vision, highlighting the capability to rapidly adapt to new tasks with minimal labeled data. Existing methods predominantly rely on image-level features or local descriptors, often overlooking the holistic context surrounding these descriptors. In this work, we introduce a novel approach termed "Local Descriptor with Contextual Augmentation (LDCA)". Specifically, this method bridges the gap between local and global understanding uniquely by leveraging an adaptive global contextual enhancement module. This module incorporates a visual transformer, endowing local descriptors with contextual awareness capabilities, ranging from broad global perspectives to intricate surrounding nuances. By doing so, LDCA transcends traditional descriptor-based approaches, ensuring each local feature is interpreted within its larger visual narrative. Extensive experiments underscore the efficacy of our method, showing a maximal absolute improvement of 20% over the next-best on fine-grained classification datasets, thus demonstrating significant advancements in few-shot classification tasks.
Learning Representations for Clustering via Partial Information Discrimination and Cross-Level Interaction
- Authors: Hai-Xin Zhang, Dong Huang, Hua-Bao Ling, Guang-Yu Zhang, Wei-jun Sun, Zi-hao Wen
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.13503
- Pdf link: https://arxiv.org/pdf/2401.13503
- Abstract In this paper, we present a novel deep image clustering approach termed PICI, which enforces the partial information discrimination and the cross-level interaction in a joint learning framework. In particular, we leverage a Transformer encoder as the backbone, through which the masked image modeling with two paralleled augmented views is formulated. After deriving the class tokens from the masked images by the Transformer encoder, three partial information learning modules are further incorporated, including the PISD module for training the auto-encoder via masked image reconstruction, the PICD module for employing two levels of contrastive learning, and the CLI module for mutual interaction between the instance-level and cluster-level subspaces. Extensive experiments have been conducted on six real-world image datasets, which demonstrate the superior clustering performance of the proposed PICI approach over the state-of-the-art deep clustering approaches. The source code is available at https://github.com/Regan-Zhang/PICI.
TPRF: A Transformer-based Pseudo-Relevance Feedback Model for Efficient and Effective Retrieval
- Authors: Chuting Yu, Hang Li, Ahmed Mourad, Bevan Koopman, Guido Zuccon
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2401.13509
- Pdf link: https://arxiv.org/pdf/2401.13509
- Abstract This paper considers Pseudo-Relevance Feedback (PRF) methods for dense retrievers in a resource constrained environment such as that of cheap cloud instances or embedded systems (e.g., smartphones and smartwatches), where memory and CPU are limited and GPUs are not present. For this, we propose a transformer-based PRF method (TPRF), which has a much smaller memory footprint and faster inference time compared to other deep language models that employ PRF mechanisms, with a marginal effectiveness loss. TPRF learns how to effectively combine the relevance feedback signals from dense passage representations. Specifically, TPRF provides a mechanism for modelling relationships and weights between the query and the relevance feedback signals. The method is agnostic to the specific dense representation used and thus can be generally applied to any dense retriever.
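To make the idea of pseudo-relevance feedback over dense representations concrete, here is a small sketch of the classical Rocchio-style update, which shifts the query vector toward the centroid of the top-ranked passage vectors. This is a swapped-in baseline, not TPRF itself: the abstract states that TPRF learns the combination with a transformer, whereas the fixed weights `alpha` and `beta` below are assumed values chosen for illustration.

```python
def rocchio_prf(query_vec, feedback_vecs, alpha=1.0, beta=0.5):
    """Rocchio-style pseudo-relevance feedback on dense vectors.

    query_vec: dense query embedding as a list of floats.
    feedback_vecs: embeddings of the top-ranked passages, which are
        assumed to be relevant (the "pseudo" in pseudo-relevance).
    alpha, beta: fixed interpolation weights (illustrative values).
    """
    if not feedback_vecs:
        return list(query_vec)
    dim = len(query_vec)
    # Centroid of the feedback passage embeddings.
    centroid = [
        sum(vec[i] for vec in feedback_vecs) / len(feedback_vecs)
        for i in range(dim)
    ]
    # Move the query toward the centroid.
    return [alpha * q + beta * c for q, c in zip(query_vec, centroid)]
```

The refined vector is then used for a second retrieval pass. Because the update touches only vectors, it works with any dense retriever, which mirrors the representation-agnostic property the abstract claims for TPRF.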
Analysis, design, and implementation of the AFZ converter applied to photovoltaic systems
- Authors: David Lopez del Moral, Andres Barrado, Marina Sanz, Antonio Lazaro, Pablo Zumel
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2401.13546
- Pdf link: https://arxiv.org/pdf/2401.13546
- Abstract Grid-tied photovoltaic (PV) installations with Distributed Maximum Power Point Tracking (DMPPT) architectures include a DC-DC Module Integrated Converter (MIC) for managing each PV panel, isolating it from the others, reducing the mismatching effect and maximizing the harvested power. In this paper, the Autotransformer Forward converter with type-Zeta resonant reset (AFZ) is proposed as a DMPPT architecture MIC candidate. The main characteristics of the AFZ converter are the high versatility due to its voltage step-up and step-down capability; the use of an optimized autotransformer with only two windings, reducing the complexity and power losses of this component; the good dynamic performances, like the Forward converter ones; the low number of components and the simplicity and high feasibility associated with the use of just one active switch. Besides, soft switching transitions are achieved thanks to the autotransformer type-Zeta resonant reset. The steady-state theoretical analysis, considering the effect of the autotransformer leakage inductance, is presented. The converter is also studied in the frequency domain, obtaining the small-signal transfer functions. A design procedure based on the requirements of a 100 kW grid-tied photovoltaic installation is described, yielding a 225 W prototype with efficiencies up to 95.6%. Experimental results validate the theoretical analysis.
SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation
- Authors: Zhaohu Xing, Tian Ye, Yijun Yang, Guang Liu, Lei Zhu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2401.13560
- Pdf link: https://arxiv.org/pdf/2401.13560
- Abstract The Transformer architecture has shown a remarkable ability in modeling global relationships. However, it poses a significant computational challenge when processing high-dimensional medical images. This hinders its development and widespread adoption in this task. Mamba, as a State Space Model (SSM), recently emerged as a notable approach to modeling long-range dependencies in sequences, excelling in the natural language processing field with its remarkable memory efficiency and computational speed. Inspired by its success, we introduce SegMamba, a novel 3D medical image Segmentation Mamba model, designed to effectively capture long-range dependencies within whole volume features at every scale. Our SegMamba, in contrast to Transformer-based methods, excels in whole volume feature modeling from a state space model standpoint, maintaining superior processing speed, even with volume features at a resolution of $64 \times 64 \times 64$. Comprehensive experiments on the BraTS2023 dataset demonstrate the effectiveness and efficiency of our SegMamba. The code for SegMamba is available at: https://github.com/ge-xing/SegMamba
Inadequacy of common stochastic neural networks for reliable clinical decision support
- Authors: Adrian Lindenmeyer, Malte Blattmann, Stefan Franke, Thomas Neumuth, Daniel Schneider
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.13657
- Pdf link: https://arxiv.org/pdf/2401.13657
- Abstract Widespread adoption of AI for medical decision making is still hindered due to ethical and safety-related concerns. For AI-based decision support systems in healthcare settings it is paramount to be reliable and trustworthy. Common deep learning approaches, however, have the tendency towards overconfidence under data shift. Such inappropriate extrapolation beyond evidence-based scenarios may have dire consequences. This highlights the importance of reliable estimation of local uncertainty and its communication to the end user. While stochastic neural networks have been heralded as a potential solution to these issues, this study investigates their actual reliability in clinical applications. We centered our analysis on the exemplary use case of mortality prediction for ICU hospitalizations using EHR from the MIMIC3 study. For predictions on the EHR time series, Encoder-Only Transformer models were employed. Stochasticity of model functions was achieved by incorporating common methods such as Bayesian neural network layers and model ensembles. Our models achieve state-of-the-art performance in terms of discrimination (AUC ROC: 0.868 ± 0.011, AUC PR: 0.554 ± 0.034) and calibration on the mortality prediction benchmark. However, epistemic uncertainty is critically underestimated by the selected stochastic deep learning methods. A heuristic proof for the responsible collapse of the posterior distribution is provided. Our findings reveal the inadequacy of commonly used stochastic deep learning approaches to reliably recognize OoD samples. In both methods, unsubstantiated model confidence is not prevented due to strongly biased functional posteriors, rendering them inappropriate for reliable clinical decision support. This highlights the need for approaches with more strictly enforced or inherent distance-awareness to known data points, e.g., using kernel-based techniques.
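As background for the term "epistemic uncertainty", the sketch below shows one common way to estimate it from an ensemble of binary classifiers: the mutual information between the prediction and the ensemble member, computed as the entropy of the mean prediction minus the mean entropy of the individual predictions. Whether the paper uses this exact measure is an assumption; the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli distribution with parameter p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def epistemic_uncertainty(member_probs):
    """Disagreement-based epistemic uncertainty for one input.

    member_probs: predicted positive-class probability from each
        ensemble member (or each stochastic forward pass).
    """
    mean_p = sum(member_probs) / len(member_probs)
    total = binary_entropy(mean_p)  # total predictive uncertainty
    aleatoric = sum(binary_entropy(p) for p in member_probs) / len(member_probs)
    return total - aleatoric  # remainder is attributed to model disagreement
```

When all members output the same probability the estimate is zero regardless of how far the input lies from the training data, which illustrates the failure mode the abstract describes: members that agree on out-of-distribution samples report no epistemic uncertainty.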
MambaByte: Token-free Selective State Space Model
- Authors: Junxiong Wang, Tushaar Gangavarapu, Jing Nathan Yan, Alexander M Rush
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.13660
- Pdf link: https://arxiv.org/pdf/2401.13660
- Abstract Token-free language models learn directly from raw bytes and remove the bias of subword tokenization. Operating on bytes, however, results in significantly longer sequences, and standard autoregressive Transformers scale poorly in such settings. We experiment with MambaByte, a token-free adaptation of the Mamba state space model, trained autoregressively on byte sequences. Our experiments indicate the computational efficiency of MambaByte compared to other byte-level models. We also find MambaByte to be competitive with and even outperform state-of-the-art subword Transformers. Furthermore, owing to linear scaling in length, MambaByte benefits from fast inference compared to Transformers. Our findings establish the viability of MambaByte in enabling token-free language modeling.
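A small illustration of what "token-free" input looks like in practice: the text is mapped to its raw UTF-8 bytes, so the vocabulary is fixed at 256 symbols but sequences become longer than under subword tokenization. The helper names are hypothetical, and the sketch covers only the input encoding, not the Mamba model.

```python
def byte_tokens(text):
    """Encode text as a sequence of integers in [0, 255] (raw UTF-8 bytes)."""
    return list(text.encode("utf-8"))


def bytes_to_text(tokens):
    """Invert byte_tokens by decoding the byte values back to a string."""
    return bytes(tokens).decode("utf-8")


sample = "Token-free models read bytes."
ids = byte_tokens(sample)
# One id per byte: for ASCII text the sequence is as long as the string,
# compared with only four whitespace-separated words.
print(len(ids), len(sample.split()))
```

Non-ASCII characters expand to several bytes each (for instance, "é" becomes two ids), which lengthens sequences further and is part of why the abstract emphasizes linear scaling in sequence length.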
Keyword: scene understanding
Digital Divides in Scene Recognition: Uncovering Socioeconomic Biases in Deep Learning Systems
- Authors: Michelle R. Greene, Mariam Josyula, Wentao Si, Jennifer A. Hart
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2401.13097
- Pdf link: https://arxiv.org/pdf/2401.13097
- Abstract Computer-based scene understanding has influenced fields ranging from urban planning to autonomous vehicle performance, yet little is known about how well these technologies work across social differences. We investigate the biases of deep convolutional neural networks (dCNNs) in scene classification, using nearly one million images from global and US sources, including user-submitted home photographs and Airbnb listings. We applied statistical models to quantify the impact of socioeconomic indicators such as family income, Human Development Index (HDI), and demographic factors from public data sources (CIA and US Census) on dCNN performance. Our analyses revealed significant socioeconomic bias, where pretrained dCNNs demonstrated lower classification accuracy, lower classification confidence, and a higher tendency to assign labels that could be offensive when applied to homes (e.g., "ruin", "slum"), especially in images from homes with lower socioeconomic status (SES). This trend is consistent across two datasets of international images and within the diverse economic and racial landscapes of the United States. This research contributes to understanding biases in computer vision, emphasizing the need for more inclusive and representative training datasets. By mitigating the bias in the computer vision pipelines, we can ensure fairer and more equitable outcomes for applied computer vision, including home valuation and smart home security systems. There is urgency in addressing these biases, which can significantly impact critical decisions in urban development and resource allocation. Our findings also motivate the development of AI systems that better understand and serve diverse communities, moving towards technology that equitably benefits all sectors of society.
Keyword: visual reasoning
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
- Authors: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2401.13311
- Pdf link: https://arxiv.org/pdf/2401.13311
- Abstract Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide qualitative analysis which provides a robust framework for future advancements in the LMM design. https://con-textual.github.io/