arxiv-daily icon indicating copy to clipboard operation
arxiv-daily copied to clipboard

New submissions for Mon, 11 Dec 23

Open DongZhouGu opened this issue 1 year ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks

  • Authors: Shuli Jiang, Swanand Ravindra Kadhe, Yi Zhou, Ling Cai, Nathalie Baracaldo
  • Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2312.04748
  • Pdf link: https://arxiv.org/pdf/2312.04748
  • Abstract Growing applications of large language models (LLMs) trained by a third party raise serious concerns on the security vulnerability of LLMs.It has been demonstrated that malicious actors can covertly exploit these vulnerabilities in LLMs through poisoning attacks aimed at generating undesirable outputs. While poisoning attacks have received significant attention in the image domain (e.g., object detection), and classification tasks, their implications for generative models, particularly in the realm of natural language generation (NLG) tasks, remain poorly understood. To bridge this gap, we perform a comprehensive exploration of various poisoning techniques to assess their effectiveness across a range of generative tasks. Furthermore, we introduce a range of metrics designed to quantify the success and stealthiness of poisoning attacks specifically tailored to NLG tasks. Through extensive experiments on multiple NLG tasks, LLMs and datasets, we show that it is possible to successfully poison an LLM during the fine-tuning stage using as little as 1% of the total tuning data samples. Our paper presents the first systematic approach to comprehend poisoning attacks targeting NLG tasks considering a wide range of triggers and attack settings. We hope our findings will assist the AI security community in devising appropriate defenses against such threats.

Unify Change Point Detection and Segment Classification in a Regression Task for Transportation Mode Identification

  • Authors: Rongsong Li, Xin Pei
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2312.04821
  • Pdf link: https://arxiv.org/pdf/2312.04821
  • Abstract Identifying travelers' transportation modes is important in transportation science and location-based services. It's appealing for researchers to leverage GPS trajectory data to infer transportation modes with the popularity of GPS-enabled devices, e.g., smart phones. Existing studies frame this problem as classification task. The dominant two-stage studies divide the trip into single-one mode segments first and then categorize these segments. The over segmentation strategy and inevitable error propagation bring difficulties to classification stage and make optimizing the whole system hard. The recent one-stage works throw out trajectory segmentation entirely to avoid these by directly conducting point-wise classification for the trip, whereas leaving predictions dis-continuous. To solve above-mentioned problems, inspired by YOLO and SSD in object detection, we propose to reframe change point detection and segment classification as a unified regression task instead of the existing classification task. We directly regress coordinates of change points and classify associated segments. In this way, our method divides the trip into segments under a supervised manner and leverage more contextual information, obtaining predictions with high accuracy and continuity. Two frameworks, TrajYOLO and TrajSSD, are proposed to solve the regression task and various feature extraction backbones are exploited. Exhaustive experiments on GeoLife dataset show that the proposed method has competitive overall identification accuracy of 0.853 when distinguishing five modes: walk, bike, bus, car, train. As for change point detection, our method increases precision at the cost of drop in recall. All codes are available at https://github.com/RadetzkyLi/TrajYOLO-SSD.

SiCP: Simultaneous Individual and Cooperative Perception for 3D Object Detection in Connected and Automated Vehicles

  • Authors: Deyuan Qu, Qi Chen, Tianyu Bai, Andy Qin, Hongsheng Lu, Heng Fan, Song Fu, Qing Yang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2312.04822
  • Pdf link: https://arxiv.org/pdf/2312.04822
  • Abstract Cooperative perception for connected and automated vehicles is traditionally achieved through the fusion of feature maps from two or more vehicles. However, the absence of feature maps shared from other vehicles can lead to a significant decline in object detection performance for cooperative perception models compared to standalone 3D detection models. This drawback impedes the adoption of cooperative perception as vehicle resources are often insufficient to concurrently employ two perception models. To tackle this issue, we present Simultaneous Individual and Cooperative Perception (SiCP), a generic framework that supports a wide range of the state-of-the-art standalone perception backbones and enhances them with a novel Dual-Perception Network (DP-Net) designed to facilitate both individual and cooperative perception. In addition to its lightweight nature with only 0.13M parameters, DP-Net is robust and retains crucial gradient information during feature map fusion. As demonstrated in a comprehensive evaluation on the OPV2V dataset, thanks to DP-Net, SiCP surpasses state-of-the-art cooperative perception solutions while preserving the performance of standalone perception solutions.

Design and Implementation of Automatic Assisted Aiming System For Robomaster EP Based on YOLOv5

  • Authors: Junjia Qin, Kangli Xu
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2312.05055
  • Pdf link: https://arxiv.org/pdf/2312.05055
  • Abstract In the crucial stages of the Robomaster Youth Championship, the Robomaster EP Robot must operate exclusively on autonomous algorithms to remain competitive. Target recognition and automatic assisted aiming are indispensable for the EP robot. In this study, we use YOLOv5 for multi-object detection to identify the Robomaster EP Robot and its armor. Additionally, we integrate the DeepSORT algorithm for vehicle identification and tracking. As a result, we introduce a refined YOLOv5-based system that allows the robot to recognize and aim at multiple targets simultaneously. To ensure precise tracking, we use a PID controller with Feedforward Enhancement and an FIR controller paired with a Kalman filter. This setup enables quick gimbal movement towards the target and predicts its next position, optimizing potential damage during motion. Our proposed system enhances the robot's accuracy in targeting armor, improving its competitive performance.

Keyword: transformer

On Sarcasm Detection with OpenAI GPT-based Models

  • Authors: Montgomery Gole, Williams-Paul Nwadiugwu, Andriy Miranskyy
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2312.04642
  • Pdf link: https://arxiv.org/pdf/2312.04642
  • Abstract Sarcasm is a form of irony that requires readers or listeners to interpret its intended meaning by considering context and social cues. Machine learning classification models have long had difficulty detecting sarcasm due to its social complexity and contradictory nature. This paper explores the applications of the Generative Pretrained Transformer (GPT) models, including GPT-3, InstructGPT, GPT-3.5, and GPT-4, in detecting sarcasm in natural language. It tests fine-tuned and zero-shot models of different sizes and releases. The GPT models were tested on the political and balanced (pol-bal) portion of the popular Self-Annotated Reddit Corpus (SARC 2.0) sarcasm dataset. In the fine-tuning case, the largest fine-tuned GPT-3 model achieves accuracy and $F_1$-score of 0.81, outperforming prior models. In the zero-shot case, one of GPT-4 models yields an accuracy of 0.70 and $F_1$-score of 0.75. Other models score lower. Additionally, a model's performance may improve or deteriorate with each release, highlighting the need to reassess performance after each release.

Self-Supervised Behavior Cloned Transformers are Path Crawlers for Text Games

  • Authors: Ruoyao Wang, Peter Jansen
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2312.04657
  • Pdf link: https://arxiv.org/pdf/2312.04657
  • Abstract In this work, we introduce a self-supervised behavior cloning transformer for text games, which are challenging benchmarks for multi-step reasoning in virtual environments. Traditionally, Behavior Cloning Transformers excel in such tasks but rely on supervised training data. Our approach auto-generates training data by exploring trajectories (defined by common macro-action sequences) that lead to reward within the games, while determining the generality and utility of these trajectories by rapidly training small models then evaluating their performance on unseen development games. Through empirical analysis, we show our method consistently uncovers generalizable training data, achieving about 90% performance of supervised systems across three benchmark text games.

Is Feedback All You Need? Leveraging Natural Language Feedback in Goal-Conditioned Reinforcement Learning

  • Authors: Sabrina McCallum, Max Taylor-Davies, Stefano V. Albrecht, Alessandro Suglia
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2312.04736
  • Pdf link: https://arxiv.org/pdf/2312.04736
  • Abstract Despite numerous successes, the field of reinforcement learning (RL) remains far from matching the impressive generalisation power of human behaviour learning. One possible way to help bridge this gap be to provide RL agents with richer, more human-like feedback expressed in natural language. To investigate this idea, we first extend BabyAI to automatically generate language feedback from the environment dynamics and goal condition success. Then, we modify the Decision Transformer architecture to take advantage of this additional signal. We find that training with language feedback either in place of or in addition to the return-to-go or goal descriptions improves agents' generalisation performance, and that agents can benefit from feedback even when this is only available during training, but not at inference.

User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

  • Authors: Xuan Wang, Guanhong Wang, Wenhao Chai, Jiayu Zhou, Gaoang Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2312.04793
  • Pdf link: https://arxiv.org/pdf/2312.04793
  • Abstract Image captioning bridges the gap between vision and language by automatically generating natural language descriptions for images. Traditional image captioning methods often overlook the preferences and characteristics of users. Personalized image captioning solves this problem by incorporating user prior knowledge into the model, such as writing styles and preferred vocabularies. Most existing methods emphasize the user context fusion process by memory networks or transformers. However, these methods ignore the distinct domains of each dataset. Therefore, they need to update the entire caption model parameters when meeting new samples, which is time-consuming and calculation-intensive. To address this challenge, we propose a novel personalized image captioning framework that leverages user context to consider personality factors. Additionally, our framework utilizes the prefix-tuning paradigm to extract knowledge from a frozen large language model, reducing the gap between different language domains. Specifically, we employ CLIP to extract the visual features of an image and align the semantic space using a query-guided mapping network. By incorporating the transformer layer, we merge the visual features with the user's contextual prior knowledge to generate informative prefixes. Moreover, we employ GPT-2 as the frozen large language model. With a small number of parameters to be trained, our model performs efficiently and effectively. Our model outperforms existing baseline models on Instagram and YFCC100M datasets across five evaluation metrics, demonstrating its superiority, including twofold improvements in metrics such as BLEU-4 and CIDEr.

HuRef: HUman-REadable Fingerprint for Large Language Models

  • Authors: Boyi Zeng, Chenghu Zhou, Xinbing Wang, Zhouhan Lin
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2312.04828
  • Pdf link: https://arxiv.org/pdf/2312.04828
  • Abstract Protecting the copyright of large language models (LLMs) has become crucial due to their resource-intensive training and accompanying carefully designed licenses. However, identifying the original base model of an LLM is challenging due to potential parameter alterations through fine-tuning or continued pretraining. In this study, we introduce HuRef, a human-readable fingerprint for LLMs that uniquely identifies the base model without exposing model parameters or interfering with training. We first observe that the vector direction of LLM parameters remains stable after the model has converged during pretraining, showing negligible perturbations through subsequent training steps, including continued pretraining, supervised fine-tuning (SFT), and RLHF, which makes it a sufficient condition to identify the base model. The necessity is validated by continuing to train an LLM with an extra term to drive away the model parameters' direction and the model becomes damaged. However, this direction is vulnerable to simple attacks like dimension permutation or matrix rotation, which significantly change it without affecting performance. To address this, leveraging the Transformer structure, we systematically analyze potential attacks and define three invariant terms that identify an LLM's base model. We make these invariant terms human-readable by mapping them to a Gaussian vector using a convolutional encoder and then converting it into a natural image with StyleGAN2. Our method generates a dog image as an identity fingerprint for an LLM, where the dog's appearance strongly indicates the LLM's base model. Experimental results across various LLMs demonstrate the effectiveness of our method, the generated dog image remains invariant to different training steps, including SFT, RLHF, or even continued pretraining with augmented vocabulary in a new language.

Adapting Vision Transformer for Efficient Change Detection

  • Authors: Yang Zhao, Yuxiang Zhang, Yanni Dong, Bo Du
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2312.04869
  • Pdf link: https://arxiv.org/pdf/2312.04869
  • Abstract Most change detection models based on vision transformers currently follow a "pretraining then fine-tuning" strategy. This involves initializing the model weights using large scale classification datasets, which can be either natural images or remote sensing images. However, fully tuning such a model requires significant time and resources. In this paper, we propose an efficient tuning approach that involves freezing the parameters of the pretrained image encoder and introducing additional training parameters. Through this approach, we have achieved competitive or even better results while maintaining extremely low resource consumption across six change detection benchmarks. For example, training time on LEVIR-CD, a change detection benchmark, is only half an hour with 9 GB memory usage, which could be very convenient for most researchers. Additionally, the decoupled tuning framework can be extended to any pretrained model for semantic change detection and multi temporal change detection as well. We hope that our proposed approach will serve as a part of foundational model to inspire more unified training approaches on change detection in the future.

Illicit Darkweb Classification via Natural-language Processing: Classifying Illicit Content of Webpages based on Textual Information

  • Authors: Giuseppe Cascavilla, Gemma Catolino, Mirella Sangiovanni
  • Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY)
  • Arxiv link: https://arxiv.org/abs/2312.04944
  • Pdf link: https://arxiv.org/pdf/2312.04944
  • Abstract This work aims at expanding previous works done in the context of illegal activities classification, performing three different steps. First, we created a heterogeneous dataset of 113995 onion sites and dark marketplaces. Then, we compared pre-trained transferable models, i.e., ULMFit (Universal Language Model Fine-tuning), Bert (Bidirectional Encoder Representations from Transformers), and RoBERTa (Robustly optimized BERT approach) with a traditional text classification approach like LSTM (Long short-term memory) neural networks. Finally, we developed two illegal activities classification approaches, one for illicit content on the Dark Web and one for identifying the specific types of drugs. Results show that Bert obtained the best approach, classifying the dark web's general content and the types of Drugs with 96.08% and 91.98% of accuracy.

MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness

  • Authors: Xiaoyun Xu, Shujian Yu, Jingzheng Wu, Stjepan Picek
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2312.04960
  • Pdf link: https://arxiv.org/pdf/2312.04960
  • Abstract Vision Transformers (ViTs) achieve superior performance on various tasks compared to convolutional neural networks (CNNs), but ViTs are also vulnerable to adversarial attacks. Adversarial training is one of the most successful methods to build robust CNN models. Thus, recent works explored new methodologies for adversarial training of ViTs based on the differences between ViTs and CNNs, such as better training strategies, preventing attention from focusing on a single block, or discarding low-attention embeddings. However, these methods still follow the design of traditional supervised adversarial training, limiting the potential of adversarial training on ViTs. This paper proposes a novel defense method, MIMIR, which aims to build a different adversarial training methodology by utilizing Masked Image Modeling at pre-training. We create an autoencoder that accepts adversarial examples as input but takes the clean examples as the modeling target. Then, we create a mutual information (MI) penalty following the idea of the Information Bottleneck. Among the two information source inputs and corresponding adversarial perturbation, the perturbation information is eliminated due to the constraint of the modeling target. Next, we provide a theoretical analysis of MIMIR using the bounds of the MI penalty. We also design two adaptive attacks when the adversary is aware of the MIMIR defense and show that MIMIR still performs well. The experimental results show that MIMIR improves (natural and adversarial) accuracy on average by 4.19% on CIFAR-10 and 5.52% on ImageNet-1K, compared to baselines. On Tiny-ImageNet, we obtained improved natural accuracy of 2.99% on average and comparable adversarial accuracy. Our code and trained models are publicly available\footnote{\url{https://anonymous.4open.science/r/MIMIR-5444/README.md}}.

DeepFidelity: Perceptual Forgery Fidelity Assessment for Deepfake Detection

  • Authors: Chunlei Peng, Huiqing Guo, Decheng Liu, Nannan Wang, Ruimin Hu, Xinbo Gao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2312.04961
  • Pdf link: https://arxiv.org/pdf/2312.04961
  • Abstract Deepfake detection refers to detecting artificially generated or edited faces in images or videos, which plays an essential role in visual information security. Despite promising progress in recent years, Deepfake detection remains a challenging problem due to the complexity and variability of face forgery techniques. Existing Deepfake detection methods are often devoted to extracting features by designing sophisticated networks but ignore the influence of perceptual quality of faces. Considering the complexity of the quality distribution of both real and fake faces, we propose a novel Deepfake detection framework named DeepFidelity to adaptively distinguish real and fake faces with varying image quality by mining the perceptual forgery fidelity of face images. Specifically, we improve the model's ability to identify complex samples by mapping real and fake face data of different qualities to different scores to distinguish them in a more detailed way. In addition, we propose a network structure called Symmetric Spatial Attention Augmentation based vision Transformer (SSAAFormer), which uses the symmetry of face images to promote the network to model the geographic long-distance relationship at the shallow level and augment local features. Extensive experiments on multiple benchmark datasets demonstrate the superiority of the proposed method over state-of-the-art methods.

Out of Context: How important is Local Context in Neural Program Repair?

  • Authors: Julian Aron Prenner, Romain Robbes
  • Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2312.04986
  • Pdf link: https://arxiv.org/pdf/2312.04986
  • Abstract Deep learning source code models have been applied very successfully to the problem of automated program repair. One of the standing issues is the small input window of current models which often cannot fully fit the context code required for a bug fix (e.g., method or class declarations of a project). Instead, input is often restricted to the local context, that is, the lines below and above the bug location. In this work we study the importance of this local context on repair success: how much local context is needed?; is context before or after the bug location more important? how is local context tied to the bug type? To answer these questions we train and evaluate Transformer models in many different local context configurations on three datasets and two programming languages. Our results indicate that overall repair success increases with the size of the local context (albeit not for all bug types) and confirm the common practice that roughly 50-60% of the input window should be used for context leading the bug. Our results are not only relevant for researchers working on Transformer-based APR tools but also for benchmark and dataset creators who must decide what and how much context to include in their datasets.

Kraken: enabling joint trajectory prediction by utilizing Mode Transformer and Greedy Mode Processing

  • Authors: Daniil S. Antonenko, Stepan Konev, Yuriy Biktairov, Boris Yangel
  • Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2312.05144
  • Pdf link: https://arxiv.org/pdf/2312.05144
  • Abstract Accurate and reliable motion prediction is essential for safe urban autonomy. The most prominent motion prediction approaches are based on modeling the distribution of possible future trajectories of each actor in autonomous system's vicinity. These "independent" marginal predictions might be accurate enough to properly describe casual driving situations where the prediction target is not likely to interact with other actors. They are, however, inadequate for modeling interactive situations where the actors' future trajectories are likely to intersect. To mitigate this issue we propose Kraken -- a real-time trajectory prediction model capable of approximating pairwise interactions between the actors as well as producing accurate marginal predictions. Kraken relies on a simple Greedy Mode Processing technique allowing it to convert a factorized prediction for a pair of agents into a physically-plausible joint prediction. It also utilizes the Mode Transformer module to increase the diversity of predicted trajectories and make the joint prediction more informative. We evaluate Kraken on Waymo Motion Prediction challenge where it held the first place in the Interaction leaderboard and the second place in the Motion leaderboard in October 2021.

Reconstructing Hands in 3D with Transformers

  • Authors: Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, Jitendra Malik
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2312.05251
  • Pdf link: https://arxiv.org/pdf/2312.05251
  • Abstract We present an approach that can reconstruct hands in 3D from monocular input. Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based architecture and can analyze hands with significantly increased accuracy and robustness compared to previous work. The key to HaMeR's success lies in scaling up both the data used for training and the capacity of the deep network for hand reconstruction. For training data, we combine multiple datasets that contain 2D or 3D hand annotations. For the deep model, we use a large scale Vision Transformer architecture. Our final model consistently outperforms the previous baselines on popular 3D hand pose benchmarks. To further evaluate the effect of our design in non-controlled settings, we annotate existing in-the-wild datasets with 2D hand keypoint annotations. On this newly collected dataset of annotations, HInt, we demonstrate significant improvements over existing baselines. We make our code, data and models available on the project website: https://geopavlakos.github.io/hamer/.

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

DongZhouGu avatar Dec 11 '23 02:12 DongZhouGu