arxiv-daily
New submissions for Fri, 26 Aug 22
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window
- Authors: Mocho Go, Hideyuki Tachibana
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2208.11718
- Pdf link: https://arxiv.org/pdf/2208.11718
- Abstract Following its success in the language domain, the self-attention mechanism (transformer) has been adopted in the vision domain and has recently achieved great success. Additionally, as another stream, the multi-layer perceptron (MLP) is also being explored in the vision domain. These architectures, other than traditional CNNs, have been attracting attention recently, and many methods have been proposed. As one that combines parameter efficiency and performance with locality and hierarchy in image recognition, we propose gSwin, which merges the two streams: Swin Transformer and (multi-head) gMLP. We show that gSwin can achieve better accuracy than Swin Transformer on three vision tasks (image classification, object detection and semantic segmentation) with a smaller model size.
Bridging the View Disparity of Radar and Camera Features for Multi-modal Fusion 3D Object Detection
- Authors: Taohua Zhou, Yining Shi, Junjie Chen, Kun Jiang, Mengmeng Yang, Diange Yang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.12079
- Pdf link: https://arxiv.org/pdf/2208.12079
- Abstract Environmental perception with multi-modal fusion of radar and camera is crucial in autonomous driving to increase accuracy, completeness, and robustness. This paper focuses on how to utilize millimeter-wave (MMW) radar and camera sensor fusion for 3D object detection. A novel method that realizes feature-level fusion under bird's-eye view (BEV) for a better feature representation is proposed. First, radar features are augmented with temporal accumulation and sent to a temporal-spatial encoder for radar feature extraction. Meanwhile, multi-scale 2D image features that adapt to various spatial scales are obtained by an image backbone and neck model. Then, image features are transformed to BEV with the designed view transformer. In addition, this work fuses the multi-modal features with a two-stage fusion model, whose stages are called point fusion and ROI fusion, respectively. Finally, a detection head regresses object categories and 3D locations. Experimental results demonstrate that the proposed method achieves state-of-the-art performance under the most important detection metrics, mean average precision (mAP) and nuScenes detection score (NDS), on the challenging nuScenes dataset.
Anytime-Lidar: Deadline-aware 3D Object Detection
- Authors: Ahmet Soyyigit, Shuochao Yao, Heechul Yun
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2208.12181
- Pdf link: https://arxiv.org/pdf/2208.12181
- Abstract In this work, we present a novel scheduling framework enabling anytime perception for deep neural network (DNN) based 3D object detection pipelines. We focus on the computationally expensive region proposal network (RPN) and per-category multi-head detector components, which are common in 3D object detection pipelines, and make them deadline-aware. We propose a scheduling algorithm that intelligently selects a subset of the components to make an effective time and accuracy trade-off on the fly. We minimize the accuracy loss from skipping some of the neural network sub-components by projecting previously detected objects onto the current scene through estimations. We apply our approach to a state-of-the-art 3D object detection network, PointPillars, and evaluate its performance on Jetson Xavier AGX using the nuScenes dataset. Compared to the baselines, our approach significantly improves the network's accuracy under various deadline constraints.
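The deadline-aware idea of running only a subset of components can be illustrated with a small sketch. This is not the paper's scheduling algorithm: it is a plain greedy heuristic, and the `Head` class, the head names, the costs and the gains are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Head:
    name: str       # hypothetical detection-head label
    cost_ms: float  # assumed execution time of this head
    gain: float     # assumed accuracy contribution if the head runs


def select_heads(heads, budget_ms):
    """Greedily pick heads by gain per millisecond until the budget is spent."""
    chosen, remaining = [], budget_ms
    ranked = sorted(heads, key=lambda h: h.gain / h.cost_ms, reverse=True)
    for head in ranked:
        if head.cost_ms <= remaining:
            chosen.append(head.name)
            remaining -= head.cost_ms
    return chosen


# Hypothetical per-category heads with made-up costs and gains.
heads = [
    Head("car", 4.0, 0.40),
    Head("pedestrian", 3.0, 0.24),
    Head("bicycle", 2.0, 0.10),
    Head("barrier", 5.0, 0.15),
]

print(select_heads(heads, 8.0))   # ['car', 'pedestrian']
print(select_heads(heads, 20.0))  # all four heads fit
```

A tighter deadline simply shrinks `budget_ms`, so fewer heads run. The abstract describes covering skipped components by projecting earlier detections onto the current scene, which this sketch does not model.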
Keyword: transformer
Ontology-Driven Self-Supervision for Adverse Childhood Experiences Identification Using Social Media Datasets
- Authors: Jinge Wu, Rowena Smith, Honghan Wu
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2208.11701
- Pdf link: https://arxiv.org/pdf/2208.11701
- Abstract Adverse Childhood Experiences (ACEs) are defined as a collection of highly stressful, and potentially traumatic, events or circumstances that occur throughout childhood and/or adolescence. They have been shown to be associated with increased risks of mental health diseases or other abnormal behaviours later in life. However, the identification of ACEs from textual data with Natural Language Processing (NLP) is challenging because (a) there are no NLP-ready ACE ontologies; (b) there are few resources available for machine learning, necessitating data annotation from clinical experts; (c) annotations by domain experts are costly, and a large number of documents is needed to support large machine learning models. In this paper, we present an ontology-driven self-supervised approach (deriving concept embeddings using an auto-encoder from baseline NLP results) for producing a publicly available resource that would support large-scale machine learning (e.g., training transformer-based large language models) on a social media corpus. This resource as well as the proposed approach are intended to help the community train transferable NLP models for effectively surfacing ACEs in low-resource scenarios like NLP on clinical notes within Electronic Health Records. The resource, including a list of ACE ontology terms, ACE concept embeddings and the NLP-annotated corpus, is available at https://github.com/knowlab/ACE-NLP.
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window
- Authors: Mocho Go, Hideyuki Tachibana
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2208.11718
- Pdf link: https://arxiv.org/pdf/2208.11718
- Abstract Following its success in the language domain, the self-attention mechanism (transformer) has been adopted in the vision domain and has recently achieved great success. Additionally, as another stream, the multi-layer perceptron (MLP) is also being explored in the vision domain. These architectures, other than traditional CNNs, have been attracting attention recently, and many methods have been proposed. As one that combines parameter efficiency and performance with locality and hierarchy in image recognition, we propose gSwin, which merges the two streams: Swin Transformer and (multi-head) gMLP. We show that gSwin can achieve better accuracy than Swin Transformer on three vision tasks (image classification, object detection and semantic segmentation) with a smaller model size.
Addressing Token Uniformity in Transformers via Singular Value Transformation
- Authors: Hanqi Yan, Lin Gui, Wenjie Li, Yulan He
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2208.11790
- Pdf link: https://arxiv.org/pdf/2208.11790
- Abstract Token uniformity is commonly observed in transformer-based models, in which different tokens share a large proportion of similar information after going through multiple stacked self-attention layers in a transformer. In this paper, we propose to use the distribution of singular values of the outputs of each transformer layer to characterise the phenomenon of token uniformity and empirically illustrate that a less skewed singular value distribution can alleviate the 'token uniformity' problem. Based on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that apart from alleviating token uniformity, the transformation function should preserve the local neighbourhood structure in the original embedding space. Our proposed singular value transformation function is applied to a range of transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT, and improved performance is observed in semantic textual similarity evaluation and a range of GLUE tasks. Our source code is available at https://github.com/hanqi-qi/tokenUni.git.
Unbiased Multi-Modality Guidance for Image Inpainting
- Authors: Yongsheng Yu, Dawei Du, Libo Zhang, Tiejian Luo
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.11844
- Pdf link: https://arxiv.org/pdf/2208.11844
- Abstract Image inpainting is an ill-posed problem of recovering missing or damaged image content based on incomplete images with masks. Previous works usually predict auxiliary structures (e.g., edges, segmentation and contours) to help fill in visually realistic patches in a multi-stage fashion. However, imprecise auxiliary priors may yield biased inpainted results. Besides, some methods are time-consuming because they are implemented as multiple stages of complex neural networks. To solve these issues, we develop an end-to-end multi-modality guided transformer network, including one inpainting branch and two auxiliary branches for semantic segmentation and edge textures. Within each transformer block, the proposed multi-scale spatial-aware attention module can learn the multi-modal structural features efficiently via auxiliary denormalization. Different from previous methods relying on direct guidance from biased priors, our method enriches semantically consistent context in an image based on discriminative interplay information from multiple modalities. Comprehensive experiments on several challenging image inpainting datasets show that our method achieves state-of-the-art performance and handles various regular/irregular masks efficiently.
Adaptive Perception Transformer for Temporal Action Localization
- Authors: Yizheng Ouyang, Tianjin Zhang, Weibo Gu, Hongfa Wang, Liming Wang, Xiaojie Guo
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.11908
- Pdf link: https://arxiv.org/pdf/2208.11908
- Abstract Temporal action localization aims to predict the boundary and category of each action instance in untrimmed long videos. Most previous methods based on anchors or proposals neglect the global-local context interaction in entire video sequences. Besides, their multi-stage designs cannot generate action boundaries and categories directly. To address the above issues, this paper proposes a novel end-to-end model, called adaptive perception transformer (AdaPerFormer for short). Specifically, AdaPerFormer explores a dual-branch multi-head self-attention mechanism. One branch takes care of the global perception attention, which can model entire video sequences and aggregate globally relevant contexts, while the other branch concentrates on the local convolutional shift to aggregate intra-frame and inter-frame information through our bidirectional shift operation. The end-to-end nature produces the boundaries and categories of video actions without extra steps. Extensive experiments together with ablation studies are provided to reveal the effectiveness of our design. Our method achieves state-of-the-art accuracy on the THUMOS14 dataset (65.8% in terms of mAP@0.5, 42.6% mAP@0.7, and 62.7% mAP@Avg), and obtains competitive performance on the ActivityNet-1.3 dataset with an average mAP of 36.1%. The code and models are available at https://github.com/SouperO/AdaPerFormer.
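Temporal action localization benchmarks commonly score a predicted segment by its temporal IoU with a ground-truth segment. An mAP at a given threshold counts a prediction as correct when that IoU reaches the threshold. A minimal sketch of the overlap test follows. The function names and the example segments are illustrative assumptions, not taken from the AdaPerFormer code.

```python
def temporal_iou(pred, truth):
    """IoU of two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred, truth, threshold):
    """A prediction matches when its temporal IoU reaches the threshold."""
    return temporal_iou(pred, truth) >= threshold


# Hypothetical predicted and ground-truth action segments.
pred, truth = (2.0, 8.0), (4.0, 10.0)
print(temporal_iou(pred, truth))  # 0.5
print(is_hit(pred, truth, 0.5))   # True
print(is_hit(pred, truth, 0.7))   # False
```

The same prediction can count as correct at a loose threshold and incorrect at a strict one, which is why results are reported at several thresholds and as an average.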
Learning to Construct 3D Building Wireframes from 3D Line Clouds
- Authors: Yicheng Luo, Jing Ren, Xuefei Zhe, Di Kang, Yajing Xu, Peter Wonka, Linchao Bao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.11948
- Pdf link: https://arxiv.org/pdf/2208.11948
- Abstract Line clouds, though under-investigated in previous work, potentially encode more compact structural information of buildings than point clouds extracted from multi-view images. In this work, we propose the first network to process line clouds for building wireframe abstraction. The network takes a line cloud as input, i.e., a nonstructural and unordered set of 3D line segments extracted from multi-view images, and outputs a 3D wireframe of the underlying building, which consists of a sparse set of 3D junctions connected by line segments. We observe that a line patch, i.e., a group of neighboring line segments, encodes sufficient contour information to predict the existence and even the 3D position of a potential junction, as well as the likelihood of connectivity between two query junctions. We therefore introduce a two-layer Line-Patch Transformer to extract junctions and connectivities from sampled line patches to form a 3D building wireframe model. We also introduce a synthetic dataset of multi-view images with ground-truth 3D wireframes. We extensively demonstrate that our reconstructed 3D wireframe models significantly improve upon multiple baseline building reconstruction methods.
Bridging the View Disparity of Radar and Camera Features for Multi-modal Fusion 3D Object Detection
- Authors: Taohua Zhou, Yining Shi, Junjie Chen, Kun Jiang, Mengmeng Yang, Diange Yang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.12079
- Pdf link: https://arxiv.org/pdf/2208.12079
- Abstract Environmental perception with multi-modal fusion of radar and camera is crucial in autonomous driving to increase accuracy, completeness, and robustness. This paper focuses on how to utilize millimeter-wave (MMW) radar and camera sensor fusion for 3D object detection. A novel method that realizes feature-level fusion under bird's-eye view (BEV) for a better feature representation is proposed. First, radar features are augmented with temporal accumulation and sent to a temporal-spatial encoder for radar feature extraction. Meanwhile, multi-scale 2D image features that adapt to various spatial scales are obtained by an image backbone and neck model. Then, image features are transformed to BEV with the designed view transformer. In addition, this work fuses the multi-modal features with a two-stage fusion model, whose stages are called point fusion and ROI fusion, respectively. Finally, a detection head regresses object categories and 3D locations. Experimental results demonstrate that the proposed method achieves state-of-the-art performance under the most important detection metrics, mean average precision (mAP) and nuScenes detection score (NDS), on the challenging nuScenes dataset.
Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling
- Authors: Rui Wang, Zuxuan Wu, Dongdong Chen, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Luowei Zhou, Lu Yuan, Yu-Gang Jiang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.12257
- Pdf link: https://arxiv.org/pdf/2208.12257
- Abstract Transformer-based models have achieved top performance on major video recognition benchmarks. Benefiting from the self-attention mechanism, these models show a stronger ability to model long-range dependencies than CNN-based models. However, significant computation overheads, resulting from the quadratic complexity of self-attention on top of a tremendous number of tokens, limit the use of existing video transformers in applications with limited resources like mobile devices. In this paper, we extend Mobile-Former to Video Mobile-Former, which decouples the video architecture into lightweight 3D-CNNs for local context modeling and Transformer modules for global interaction modeling in a parallel fashion. To avoid the significant computational cost incurred by computing self-attention between the large number of local patches in videos, we propose to use very few global tokens (e.g., 6) for a whole video in Transformers to exchange information with 3D-CNNs through a cross-attention mechanism. Through efficient global spatial-temporal modeling, Video Mobile-Former significantly improves the video recognition performance of alternative lightweight baselines, and outperforms other efficient CNN-based models in the low FLOP regime from 500M to 6G total FLOPs on various video recognition tasks. It is worth noting that Video Mobile-Former is the first Transformer-based video model that constrains the computational budget within 1G FLOPs.
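The cost argument in this abstract can be made concrete by counting attention-score entries. The sketch below is back-of-the-envelope arithmetic, not a FLOP count for the actual model: the clip size (8 frames of 14 x 14 patches) is a hypothetical choice, the two-way exchange between local features and global tokens is an assumption, and per-head and per-channel factors are ignored. Only the 6 global tokens come from the abstract.

```python
def self_attention_pairs(num_tokens):
    """Entries in the score matrix when every token attends to every token."""
    return num_tokens * num_tokens


def cross_attention_pairs(num_local, num_global):
    """Entries when local features and global tokens exchange information.

    Assumes two cross-attention passes (local -> global and global -> local).
    """
    return 2 * num_local * num_global


# Hypothetical clip: 8 frames, each split into 14 x 14 patches.
num_local = 8 * 14 * 14  # 1568 local tokens
num_global = 6           # global token count taken from the abstract's example

print(self_attention_pairs(num_local))               # 2458624
print(cross_attention_pairs(num_local, num_global))  # 18816
```

Under these assumptions the cross-attention exchange touches roughly 130 times fewer token pairs than full self-attention, and the gap widens as the number of local tokens grows.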
Pix4Point: Image Pretrained Transformers for 3D Point Cloud Understanding
- Authors: Guocheng Qian, Xingdi Zhang, Abdullah Hamdi, Bernard Ghanem
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2208.12259
- Pdf link: https://arxiv.org/pdf/2208.12259
- Abstract Pure Transformer models have achieved impressive success in natural language processing and computer vision. However, one limitation with Transformers is their need for large training data. In the realm of 3D point clouds, the availability of large datasets is a challenge, which exacerbates the issue of training Transformers for 3D tasks. In this work, we empirically study and investigate the effect of utilizing knowledge from a large number of images for point cloud understanding. We formulate a pipeline dubbed Pix4Point that allows harnessing pretrained Transformers in the image domain to improve downstream point cloud tasks. This is achieved by a modality-agnostic pure Transformer backbone with the help of tokenizer and decoder layers specialized in the 3D domain. Using image-pretrained Transformers, we observe significant performance gains of Pix4Point on the tasks of 3D point cloud classification, part segmentation, and semantic segmentation on ScanObjectNN, ShapeNetPart, and S3DIS benchmarks, respectively. Our code and models are available at: https://github.com/guochengqian/Pix4Point.
Keyword: scene understanding
There is no result
Keyword: visual reasoning
There is no result