arxiv-daily
arxiv-daily copied to clipboard
New submissions for Thu, 8 Dec 22
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
There is no result
Keyword: transformer
Fine-Grained Emotional Paraphrasing along Emotion Gradients
- Authors: Justin Xie
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2212.03297
- Pdf link: https://arxiv.org/pdf/2212.03297
- Abstract Paraphrase generation, a.k.a. paraphrasing, is a common and important task in natural language processing. Emotional paraphrasing, which changes the emotion embodied in a piece of text while preserving its meaning, has many potential applications, e.g., moderating online dialogues and preventing cyberbullying. We introduce a new task of fine-grained emotional paraphrasing along emotion gradients, that is, altering the emotional intensities of the paraphrases in fine grain following smooth variations in affective dimensions while preserving the meanings of the originals. We propose a framework for addressing this task by fine-tuning text-to-text Transformers through multi-task training. We enhance several widely used paraphrasing corpus by annotating the input and target texts with their fine-grained emotion labels. With these labels, fine-tuning text-to-text Transformers on these corpus entails multi-task training. Evaluations of the fine-tuned Transformers on separate test sets show that including fine-grained emotion labels in the paraphrase task significantly improve the chance of obtaining high-quality paraphrases of the desired emotions, i.e., more than doubling the number of exact matches of desired emotions while achieving consistently better scores in paraphrase metrics such as BLEU, ROGUE, and METEOR.
Semantically Enhanced Global Reasoning for Semantic Segmentation
- Authors: Mir Rayat Imtiaz Hossain, Leonid Sigal, James J. Little
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2212.03338
- Pdf link: https://arxiv.org/pdf/2212.03338
- Abstract Recent advances in pixel-level tasks (e.g., segmentation) illustrate the benefit of long-range interactions between aggregated region-based representations that can enhance local features. However, such pixel-to-region associations and the resulting representation, which often take the form of attention, cannot model the underlying semantic structure of the scene (e.g., individual objects and, by extension, their interactions). In this work, we take a step toward addressing this limitation. Specifically, we propose an architecture where we learn to project image features into latent region representations and perform global reasoning across them, using a transformer, to produce contextualized and scene-consistent representations that are then fused with original pixel-level features. Our design enables the latent regions to represent semantically meaningful concepts, by ensuring that activated regions are spatially disjoint and unions of such regions correspond to connected object segments. The resulting semantic global reasoning (SGR) is end-to-end trainable and can be combined with any semantic segmentation framework and backbone. Combining SGR with DeepLabV3 results in a semantic segmentation performance that is competitive to the state-of-the-art, while resulting in more semantically interpretable and diverse region representations, which we show can effectively transfer to detection and instance segmentation. Further, we propose a new metric that allows us to measure the semantics of representations at both the object class and instance level.
Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer
- Authors: Shenghan Su, Lin Gu, Ziteng Cui, Yue Yang, Jingjing Shen, Hiroaki Yamane, Zenghui Zhang, Tatsuya Harada
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2212.03434
- Pdf link: https://arxiv.org/pdf/2212.03434
- Abstract The long-standing theory that a colour-naming system evolves under the dual pressure of efficient communication and perceptual mechanism is supported by more and more linguistic studies including the analysis of four decades' diachronic data from the Nafaanra language. This inspires us to explore whether artificial intelligence could evolve and discover a similar colour-naming system via optimising the communication efficiency represented by high-level recognition performance. Here, we propose a novel colour quantisation transformer, CQFormer, that quantises colour space while maintaining the accuracy of machine recognition on the quantised images. Given an RGB image, Annotation Branch maps it into an index map before generating the quantised image with a colour palette, meanwhile the Palette Branch utilises a key-point detection way to find proper colours in palette among whole colour space. By interacting with colour annotation, CQFormer is able to balance both the machine vision accuracy and colour perceptual structure such as distinct and stable colour distribution for discovered colour system. Very interestingly, we even observe the consistent evolution pattern between our artificial colour system and basic colour terms across human languages. Besides, our colour quantisation method also offers an efficient quantisation method that effectively compresses the image storage while maintaining a high performance in high-level recognition tasks such as classification and detection. Extensive experiments demonstrate the superior performance of our method with extremely low bit-rate colours. We will release the source code soon.
Learning Continuous Depth Representation via Geometric Spatial Aggregator
- Authors: Xiaohang Wang, Xuanhong Chen, Bingbing Ni, Zhengyan Tong, Hang Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2212.03499
- Pdf link: https://arxiv.org/pdf/2212.03499
- Abstract Depth map super-resolution (DSR) has been a fundamental task for 3D computer vision. While arbitrary scale DSR is a more realistic setting in this scenario, previous approaches predominantly suffer from the issue of inefficient real-numbered scale upsampling. To explicitly address this issue, we propose a novel continuous depth representation for DSR. The heart of this representation is our proposed Geometric Spatial Aggregator (GSA), which exploits a distance field modulated by arbitrarily upsampled target gridding, through which the geometric information is explicitly introduced into feature aggregation and target generation. Furthermore, bricking with GSA, we present a transformer-style backbone named GeoDSR, which possesses a principled way to construct the functional mapping between local coordinates and the high-resolution output results, empowering our model with the advantage of arbitrary shape transformation ready to help diverse zooming demand. Extensive experimental results on standard depth map benchmarks, e.g., NYU v2, have demonstrated that the proposed framework achieves significant restoration gain in arbitrary scale depth map super-resolution compared with the prior art. Our codes are available at https://github.com/nana01219/GeoDSR.
Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data
- Authors: Matthias Zeller, Jens Behley, Michael Heidingsfeld, Cyrill Stachniss
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2212.03690
- Pdf link: https://arxiv.org/pdf/2212.03690
- Abstract Scene understanding is crucial for autonomous robots in dynamic environments for making future state predictions, avoiding collisions, and path planning. Camera and LiDAR perception made tremendous progress in recent years, but face limitations under adverse weather conditions. To leverage the full potential of multi-modal sensor suites, radar sensors are essential for safety critical tasks and are already installed in most new vehicles today. In this paper, we address the problem of semantic segmentation of moving objects in radar point clouds to enhance the perception of the environment with another sensor modality. Instead of aggregating multiple scans to densify the point clouds, we propose a novel approach based on the self-attention mechanism to accurately perform sparse, single-scan segmentation. Our approach, called Gaussian Radar Transformer, includes the newly introduced Gaussian transformer layer, which replaces the softmax normalization by a Gaussian function to decouple the contribution of individual points. To tackle the challenge of the transformer to capture long-range dependencies, we propose our attentive up- and downsampling modules to enlarge the receptive field and capture strong spatial relations. We compare our approach to other state-of-the-art methods on the RadarScenes data set and show superior segmentation quality in diverse environments, even without exploiting temporal information.
Transformer-Based Named Entity Recognition for French Using Adversarial Adaptation to Similar Domain Corpora
- Authors: Arjun Choudhry, Pankaj Gupta, Inder Khatri, Aaryan Gupta, Maxime Nicol, Marie-Jean Meurs, Dinesh Kumar Vishwakarma
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2212.03692
- Pdf link: https://arxiv.org/pdf/2212.03692
- Abstract Named Entity Recognition (NER) involves the identification and classification of named entities in unstructured text into predefined classes. NER in languages with limited resources, like French, is still an open problem due to the lack of large, robust, labelled datasets. In this paper, we propose a transformer-based NER approach for French using adversarial adaptation to similar domain or general corpora for improved feature extraction and better generalization. We evaluate our approach on three labelled datasets and show that our adaptation framework outperforms the corresponding non-adaptive models for various combinations of transformer models, source datasets and target corpora.
Intent Recognition in Conversational Recommender Systems
- Authors: Sahar Moradizeyveh
- Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2212.03721
- Pdf link: https://arxiv.org/pdf/2212.03721
- Abstract Any organization needs to improve their products, services, and processes. In this context, engaging with customers and understanding their journey is essential. Organizations have leveraged various techniques and technologies to support customer engagement, from call centres to chatbots and virtual agents. Recently, these systems have used Machine Learning (ML) and Natural Language Processing (NLP) to analyze large volumes of customer feedback and engagement data. The goal is to understand customers in context and provide meaningful answers across various channels. Despite multiple advances in Conversational Artificial Intelligence (AI) and Recommender Systems (RS), it is still challenging to understand the intent behind customer questions during the customer journey. To address this challenge, in this paper, we study and analyze the recent work in Conversational Recommender Systems (CRS) in general and, more specifically, in chatbot-based CRS. We introduce a pipeline to contextualize the input utterances in conversations. We then take the next step towards leveraging reverse feature engineering to link the contextualized input and learning model to support intent recognition. Since performance evaluation is achieved based on different ML models, we use transformer base models to evaluate the proposed approach using a labelled dialogue dataset (MSDialogue) of question-answering interactions between information seekers and answer providers.
Learning-To-Embed: Adopting Transformer based models for E-commerce Products Representation Learning
- Authors: Lakshya Kumar, Sreekanth Vempati
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2212.03725
- Pdf link: https://arxiv.org/pdf/2212.03725
- Abstract Learning low-dimensional representation for large number of products present in an e-commerce catalogue plays a vital role as they are helpful in tasks like product ranking, product recommendation, finding similar products, modelling user-behaviour etc. Recently, a lot of tasks in the NLP field are getting tackled using the Transformer based models and these deep models are widely applicable in the industries setting to solve various problems. With this motivation, we apply transformer based model for learning contextual representation of products in an e-commerce setting. In this work, we propose a novel approach of pre-training transformer based model on a users generated sessions dataset obtained from a large fashion e-commerce platform to obtain latent product representation. Once pre-trained, we show that the low-dimension representation of the products can be obtained given the product attributes information as a textual sentence. We mainly pre-train BERT, RoBERTa, ALBERT and XLNET variants of transformer model and show a quantitative analysis of the products representation obtained from these models with respect to Next Product Recommendation(NPR) and Content Ranking(CR) tasks. For both the tasks, we collect an evaluation data from the fashion e-commerce platform and observe that XLNET model outperform other variants with a MRR of 0.5 for NPR and NDCG of 0.634 for CR. XLNET model also outperforms the Word2Vec based non-transformer baseline on both the downstream tasks. To the best of our knowledge, this is the first and novel work for pre-training transformer based models using users generated sessions data containing products that are represented with rich attributes information for adoption in e-commerce setting. These models can be further fine-tuned in order to solve various downstream tasks in e-commerce, thereby eliminating the need to train a model from scratch.
A Transformer-Based User Satisfaction Prediction for Proactive Interaction Mechanism in DuerOS
- Authors: Wei Shen, Xiaonan He, Chuheng Zhang, Xuyun Zhang, Jian XIe
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2212.03817
- Pdf link: https://arxiv.org/pdf/2212.03817
- Abstract Recently, spoken dialogue systems have been widely deployed in a variety of applications, serving a huge number of end-users. A common issue is that the errors resulting from noisy utterances, semantic misunderstandings, or lack of knowledge make it hard for a real system to respond properly, possibly leading to an unsatisfactory user experience. To avoid such a case, we consider a proactive interaction mechanism where the system predicts the user satisfaction with the candidate response before giving it to the user. If the user is not likely to be satisfied according to the prediction, the system will ask the user a suitable question to determine the real intent of the user instead of providing the response directly. With such an interaction with the user, the system can give a better response to the user. Previous models that predict the user satisfaction are not applicable to DuerOS which is a large-scale commercial dialogue system. They are based on hand-crafted features and thus can hardly learn the complex patterns lying behind millions of conversations and temporal dependency in multiple turns of the conversation. Moreover, they are trained and evaluated on the benchmark datasets with adequate labels, which are expensive to obtain in a commercial dialogue system. To face these challenges, we propose a pipeline to predict the user satisfaction to help DuerOS decide whether to ask for clarification in each turn. Specifically, we propose to first generate a large number of weak labels and then train a transformer-based model to predict the user satisfaction with these weak labels. Empirically, we deploy and evaluate our model on DuerOS, and observe a 19% relative improvement on the accuracy of user satisfaction prediction and 2.3% relative improvement on user experience.
Teaching Matters: Investigating the Role of Supervision in Vision Transformers
- Authors: Matthew Walmer, Saksham Suri, Kamal Gupta, Abhinav Shrivastava
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2212.03862
- Pdf link: https://arxiv.org/pdf/2212.03862
- Abstract Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, it is not well explored how varied their behavior is under different learning paradigms. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models. Finally, we show how the "best" layer for a given task varies by both supervision method and task, further demonstrating the differing order of information processing in ViTs.
Keyword: scene understanding
LWSIS: LiDAR-guided Weakly Supervised Instance Segmentation for Autonomous Driving
- Authors: Xiang Li, Junbo Yin, Botian Shi, Yikang Li, Ruigang Yang, Jianbin Shen
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2212.03504
- Pdf link: https://arxiv.org/pdf/2212.03504
- Abstract Image instance segmentation is a fundamental research topic in autonomous driving, which is crucial for scene understanding and road safety. Advanced learning-based approaches often rely on the costly 2D mask annotations for training. In this paper, we present a more artful framework, LiDAR-guided Weakly Supervised Instance Segmentation (LWSIS), which leverages the off-the-shelf 3D data, i.e., Point Cloud, together with the 3D boxes, as natural weak supervisions for training the 2D image instance segmentation models. Our LWSIS not only exploits the complementary information in multimodal data during training, but also significantly reduces the annotation cost of the dense 2D masks. In detail, LWSIS consists of two crucial modules, Point Label Assignment (PLA) and Graph-based Consistency Regularization (GCR). The former module aims to automatically assign the 3D point cloud as 2D point-wise labels, while the latter further refines the predictions by enforcing geometry and appearance consistency of the multimodal data. Moreover, we conduct a secondary instance segmentation annotation on the nuScenes, named nuInsSeg, to encourage further research on multimodal perception tasks. Extensive experiments on the nuInsSeg, as well as the large-scale Waymo, show that LWSIS can substantially improve existing weakly supervised segmentation models by only involving 3D data during training. Additionally, LWSIS can also be incorporated into 3D object detectors like PointPainting to boost the 3D detection performance for free. The code and dataset are available at https://github.com/Serenos/LWSIS.
Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data
- Authors: Matthias Zeller, Jens Behley, Michael Heidingsfeld, Cyrill Stachniss
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2212.03690
- Pdf link: https://arxiv.org/pdf/2212.03690
- Abstract Scene understanding is crucial for autonomous robots in dynamic environments for making future state predictions, avoiding collisions, and path planning. Camera and LiDAR perception made tremendous progress in recent years, but face limitations under adverse weather conditions. To leverage the full potential of multi-modal sensor suites, radar sensors are essential for safety critical tasks and are already installed in most new vehicles today. In this paper, we address the problem of semantic segmentation of moving objects in radar point clouds to enhance the perception of the environment with another sensor modality. Instead of aggregating multiple scans to densify the point clouds, we propose a novel approach based on the self-attention mechanism to accurately perform sparse, single-scan segmentation. Our approach, called Gaussian Radar Transformer, includes the newly introduced Gaussian transformer layer, which replaces the softmax normalization by a Gaussian function to decouple the contribution of individual points. To tackle the challenge of the transformer to capture long-range dependencies, we propose our attentive up- and downsampling modules to enlarge the receptive field and capture strong spatial relations. We compare our approach to other state-of-the-art methods on the RadarScenes data set and show superior segmentation quality in diverse environments, even without exploiting temporal information.
Keyword: visual reasoning
There is no result