arxiv-daily New submissions for Mon, 15 Jan 24

New submissions for Mon, 15 Jan 24

Open DongZhouGu opened this issue 1 year ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

FRED: Towards a Full Rotation-Equivariance in Aerial Image Object Detection

Authors: Chanho Lee, Jinsu Son, Hyounguk Shon, Yunho Jeon, Junmo Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.06159
Pdf link: https://arxiv.org/pdf/2401.06159
Abstract Rotation-equivariance is an essential yet challenging property in oriented object detection. While general object detectors naturally leverage robustness to spatial shifts due to the translation-equivariance of the conventional CNNs, achieving rotation-equivariance remains an elusive goal. Current detectors deploy various alignment techniques to derive rotation-invariant features, but still rely on high capacity models and heavy data augmentation with all possible rotations. In this paper, we introduce a Fully Rotation-Equivariant Oriented Object Detector (FRED), whose entire process from the image to the bounding box prediction is strictly equivariant. Specifically, we decouple the invariant task (object classification) and the equivariant task (object localization) to achieve end-to-end equivariance. We represent the bounding box as a set of rotation-equivariant vectors to implement rotation-equivariant localization. Moreover, we utilized these rotation-equivariant vectors as offsets in the deformable convolution, thereby enhancing the existing advantages of spatial adaptation. Leveraging full rotation-equivariance, our FRED demonstrates higher robustness to image-level rotation compared to existing methods. Furthermore, we show that FRED is one step closer to non-axis aligned learning through our experiments. Compared to state-of-the-art methods, our proposed method delivers comparable performance on DOTA-v1.0 and outperforms by 1.5 mAP on DOTA-v1.5, all while significantly reducing the model parameters to 16%.

YOLO-Former: YOLO Shakes Hand With ViT

Authors: Javad Khoramdel, Ahmad Moori, Yasamin Borhani, Armin Ghanbarzadeh, Esmaeil Najafi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.06244
Pdf link: https://arxiv.org/pdf/2401.06244
Abstract The proposed YOLO-Former method seamlessly integrates the ideas of transformer and YOLOv4 to create a highly accurate and efficient object detection system. The method leverages the fast inference speed of YOLOv4 and incorporates the advantages of the transformer architecture through the integration of convolutional attention and transformer modules. The results demonstrate the effectiveness of the proposed approach, with a mean average precision (mAP) of 85.76% on the Pascal VOC dataset, while maintaining high prediction speed with a frame rate of 10.85 frames per second. The contribution of this work lies in the demonstration of how the innovative combination of these two state-of-the-art techniques can lead to further improvements in the field of object detection.

Improving the Detection of Small Oriented Objects in Aerial Images

Authors: Chandler Timm C. Doloriel, Rhandley D. Cajote
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.06503
Pdf link: https://arxiv.org/pdf/2401.06503
Abstract Small oriented objects that represent tiny pixel-area in large-scale aerial images are difficult to detect due to their size and orientation. Existing oriented aerial detectors have shown promising results but are mainly focused on orientation modeling with less regard to the size of the objects. In this work, we proposed a method to accurately detect small oriented objects in aerial images by enhancing the classification and regression tasks of the oriented object detection model. We designed the Attention-Points Network consisting of two losses: Guided-Attention Loss (GALoss) and Box-Points Loss (BPLoss). GALoss uses an instance segmentation mask as ground-truth to learn the attention features needed to improve the detection of small objects. These attention features are then used to predict box points for BPLoss, which determines the points' position relative to the target oriented bounding box. Experimental results show the effectiveness of our Attention-Points Network on a standard oriented aerial dataset with small object instances (DOTA-v1.5) and on a maritime-related dataset (HRSC2016). The code is publicly available.

Robustness-Aware 3D Object Detection in Autonomous Driving: A Review and Outlook

Authors: Ziying Song, Lin Liu, Feiyang Jia, Yadan Luo, Guoxin Zhang, Lei Yang, Li Wang, Caiyan Jia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.06542
Pdf link: https://arxiv.org/pdf/2401.06542
Abstract In the realm of modern autonomous driving, the perception system is indispensable for accurately assessing the state of the surrounding environment, thereby enabling informed prediction and planning. Key to this system is 3D object detection methods, that utilize vehicle-mounted sensors such as LiDAR and cameras to identify the size, category, and location of nearby objects. Despite the surge in 3D object detection methods aimed at enhancing detection precision and efficiency, there is a gap in the literature that systematically examines their resilience against environmental variations, noise, and weather changes. This study emphasizes the importance of robustness, alongside accuracy and latency, in evaluating perception systems under practical scenarios. Our work presents an extensive survey of camera-based, LiDAR-based, and multimodal 3D object detection algorithms, thoroughly evaluating their trade-off between accuracy, latency, and robustness, particularly on datasets like KITTI-C and nuScenes-C to ensure fair comparisons. Among these,multimodal 3D detection approaches exhibit superior robustness and a novel taxonomy is introduced to reorganize its literature for enhanced clarity. This survey aims to offer a more practical perspective on the current capabilities and constraints of 3D object detection algorithms in real-world applications, thus steering future research towards robustness-centric advancements

Embedded Planogram Compliance Control System

Authors: M. Erkin Yücel, Serkan Topaloğlu, Cem Ünsalan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.06690
Pdf link: https://arxiv.org/pdf/2401.06690
Abstract The retail sector presents several open and challenging problems that could benefit from advanced pattern recognition and computer vision techniques. One such critical challenge is planogram compliance control. In this study, we propose a complete embedded system to tackle this issue. Our system consists of four key components as image acquisition and transfer via stand-alone embedded camera module, object detection via computer vision and deep learning methods working on single board computers, planogram compliance control method again working on single board computers, and energy harvesting and power management block to accompany the embedded camera modules. The image acquisition and transfer block is implemented on the ESP-EYE camera module. The object detection block is based on YOLOv5 as the deep learning method and local feature extraction. We implement these methods on Raspberry Pi 4, NVIDIA Jetson Orin Nano, and NVIDIA Jetson AGX Orin as single board computers. The planogram compliance control block utilizes sequence alignment through a modified Needleman-Wunsch algorithm. This block is also working along with the object detection block on the same single board computers. The energy harvesting and power management block consists of solar and RF energy harvesting modules with suitable battery pack for operation. We tested the proposed embedded planogram compliance control system on two different datasets to provide valuable insights on its strengths and weaknesses. The results show that our method achieves F1 scores of 0.997 and 1.0 in object detection and planogram compliance control blocks, respectively. Furthermore, we calculated that the complete embedded system can work in stand-alone form up to two years based on battery. This duration can be further extended with the integration of the proposed solar and RF energy harvesting options.

Keyword: transformer

YOLO-Former: YOLO Shakes Hand With ViT

Authors: Javad Khoramdel, Ahmad Moori, Yasamin Borhani, Armin Ghanbarzadeh, Esmaeil Najafi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.06244
Pdf link: https://arxiv.org/pdf/2401.06244
Abstract The proposed YOLO-Former method seamlessly integrates the ideas of transformer and YOLOv4 to create a highly accurate and efficient object detection system. The method leverages the fast inference speed of YOLOv4 and incorporates the advantages of the transformer architecture through the integration of convolutional attention and transformer modules. The results demonstrate the effectiveness of the proposed approach, with a mean average precision (mAP) of 85.76% on the Pascal VOC dataset, while maintaining high prediction speed with a frame rate of 10.85 frames per second. The contribution of this work lies in the demonstration of how the innovative combination of these two state-of-the-art techniques can lead to further improvements in the field of object detection.

The Pulse of Mood Online: Unveiling Emotional Reactions in a Dynamic Social Media Landscape

Authors: Siyi Guo, Zihao He, Ashwin Rao, Fred Morstatter, Jeffrey Brantingham, Kristina Lerman
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2401.06275
Pdf link: https://arxiv.org/pdf/2401.06275
Abstract The rich and dynamic information environment of social media provides researchers, policy makers, and entrepreneurs with opportunities to learn about social phenomena in a timely manner. However, using these data to understand social behavior is difficult due to heterogeneity of topics and events discussed in the highly dynamic online information environment. To address these challenges, we present a method for systematically detecting and measuring emotional reactions to offline events using change point detection on the time series of collective affect, and further explaining these reactions using a transformer-based topic model. We demonstrate the utility of the method by successfully detecting major and smaller events on three different datasets, including (1) a Los Angeles Tweet dataset between Jan. and Aug. 2020, in which we revealed the complex psychological impact of the BlackLivesMatter movement and the COVID-19 pandemic, (2) a dataset related to abortion rights discussions in USA, in which we uncovered the strong emotional reactions to the overturn of Roe v. Wade and state abortion bans, and (3) a dataset about the 2022 French presidential election, in which we discovered the emotional and moral shift from positive before voting to fear and criticism after voting. The capability of our method allows for better sensing and monitoring of population's reactions during crises using online data.

Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention

Authors: Xingyu Zhou, Leheng Zhang, Xiaorui Zhao, Keze Wang, Leida Li, Shuhang Gu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.06312
Pdf link: https://arxiv.org/pdf/2401.06312
Abstract Recently, Vision Transformer has achieved great success in recovering missing details in low-resolution sequences, i.e., the video super-resolution (VSR) task.Despite its superiority in VSR accuracy, the heavy computational burden as well as the large memory footprint hinder the deployment of Transformer-based VSR models on constrained devices.In this paper, we address the above issue by proposing a novel feature-level masked processing framework: VSR with Masked Intra and inter frame Attention (MIA-VSR).The core of MIA-VSR is leveraging feature-level temporal continuity between adjacent frames to reduce redundant computations and make more rational use of previously enhanced SR features. Concretely, we propose an intra-frame and inter-frame attention block which takes the respective roles of past features and input features into consideration and only exploits previously enhanced features to provide supplementary information. In addition, an adaptive block-wise mask prediction module is developed to skip unimportant computations according to feature similarity between adjacent frames. We conduct detailed ablation studies to validate our contributions and compare the proposed method with recent state-of-the-art VSR approaches. The experimental results demonstrate that MIA-VSR improves the memory and computation efficiency over state-of-the-art methods, without trading off PSNR accuracy. The code is available at https://github.com/LabShuHangGU/MIA-VSR.

A Temporal-Spectral Fusion Transformer with Subject-specific Adapter for Enhancing RSVP-BCI Decoding

Authors: Xujin Li, Wei Wei, Shuang Qiu, Huiguang He
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.06340
Pdf link: https://arxiv.org/pdf/2401.06340
Abstract The Rapid Serial Visual Presentation (RSVP)-based Brain-Computer Interface (BCI) is an efficient technology for target retrieval using electroencephalography (EEG) signals. The performance improvement of traditional decoding methods relies on a substantial amount of training data from new test subjects, which increases preparation time for BCI systems. Several studies introduce data from existing subjects to reduce the dependence of performance improvement on data from new subjects, but their optimization strategy based on adversarial learning with extensive data increases training time during the preparation procedure. Moreover, most previous methods only focus on the single-view information of EEG signals, but ignore the information from other views which may further improve performance. To enhance decoding performance while reducing preparation time, we propose a Temporal-Spectral fusion transformer with Subject-specific Adapter (TSformer-SA). Specifically, a cross-view interaction module is proposed to facilitate information transfer and extract common representations across two-view features extracted from EEG temporal signals and spectrogram images. Then, an attention-based fusion module fuses the features of two views to obtain comprehensive discriminative features for classification. Furthermore, a multi-view consistency loss is proposed to maximize the feature similarity between two views of the same EEG signal. Finally, we propose a subject-specific adapter to rapidly transfer the knowledge of the model trained on data from existing subjects to decode data from new subjects. Experimental results show that TSformer-SA significantly outperforms comparison methods and achieves outstanding performance with limited training data from new subjects. This facilitates efficient decoding and rapid deployment of BCI systems in practical use.

Hyper-STTN: Social Group-aware Spatial-Temporal Transformer Network for Human Trajectory Prediction with Hypergraph Reasoning

Authors: Weizheng Wang, Le Mao, Baijian Yang, Guohua Chen, Byung-Cheol Min
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.06344
Pdf link: https://arxiv.org/pdf/2401.06344
Abstract Predicting crowded intents and trajectories is crucial in varouls real-world applications, including service robots and autonomous vehicles. Understanding environmental dynamics is challenging, not only due to the complexities of modeling pair-wise spatial and temporal interactions but also the diverse influence of group-wise interactions. To decode the comprehensive pair-wise and group-wise interactions in crowded scenarios, we introduce Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. In Hyper-STTN, crowded group-wise correlations are constructed using a set of multi-scale hypergraphs with varying group sizes, captured through random-walk robability-based hypergraph spectral convolution. Additionally, a spatial-temporal transformer is adapted to capture pedestrians' pair-wise latent interactions in spatial-temporal dimensions. These heterogeneous group-wise and pair-wise are then fused and aligned though a multimodal transformer network. Hyper-STTN outperformes other state-of-the-art baselines and ablation models on 5 real-world pedestrian motion datasets.

UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer

Authors: Ji Liu, Dehua Tang, Yuanxian Huang, Li Zhang, Xiaocheng Zeng, Dong Li, Mingjie Lu, Jinzhang Peng, Yu Wang, Fan Jiang, Lu Tian, Ashish Sirasao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.06426
Pdf link: https://arxiv.org/pdf/2401.06426
Abstract Traditional channel-wise pruning methods by reducing network channels struggle to effectively prune efficient CNN models with depth-wise convolutional layers and certain efficient modules, such as popular inverted residual blocks. Prior depth pruning methods by reducing network depths are not suitable for pruning some efficient models due to the existence of some normalization layers. Moreover, finetuning subnet by directly removing activation layers would corrupt the original model weights, hindering the pruned model from achieving high performance. To address these issues, we propose a novel depth pruning method for efficient models. Our approach proposes a novel block pruning strategy and progressive training method for the subnet. Additionally, we extend our pruning method to vision transformer models. Experimental results demonstrate that our method consistently outperforms existing depth pruning methods across various pruning configurations. We obtained three pruned ConvNeXtV1 models with our method applying on ConvNeXtV1, which surpass most SOTA efficient models with comparable inference performance. Our method also achieves state-of-the-art pruning performance on the vision transformer model.

Swin Transformer-Based CSI Feedback for Massive MIMO

Authors: Jiaming Cheng, Wei Chen, Jialong Xu, Yiran Guo, Lun Li, Bo Ai
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2401.06435
Pdf link: https://arxiv.org/pdf/2401.06435
Abstract For massive multiple-input multiple-output systems in the frequency division duplex (FDD) mode, accurate downlink channel state information (CSI) is required at the base station (BS). However, the increasing number of transmit antennas aggravates the feedback overhead of CSI. Recently, deep learning (DL) has shown considerable potential to reduce CSI feedback overhead. In this paper, we propose a Swin Transformer-based autoencoder network called SwinCFNet for the CSI feedback task. In particular, the proposed method can effectively capture the long-range dependence information of CSI. Moreover, we explore the impact of the number of Swin Transformer blocks and the dimension of feature channels on the performance of SwinCFNet. Experimental results show that SwinCFNet significantly outperforms other DL-based methods with comparable model sizes, especially for the outdoor scenario.

Improving Graph Convolutional Networks with Transformer Layer in social-based items recommendation

Authors: Thi Linh Hoang, Tuan Dung Pham, Viet Cuong Ta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2401.06436
Pdf link: https://arxiv.org/pdf/2401.06436
Abstract In this work, we have proposed an approach for improving the GCN for predicting ratings in social networks. Our model is expanded from the standard model with several layers of transformer architecture. The main focus of the paper is on the encoder architecture for node embedding in the network. Using the embedding layer from the graph-based convolution layer, the attention mechanism could rearrange the feature space to get a more efficient embedding for the downstream task. The experiments showed that our proposed architecture achieves better performance than GCN on the traditional link prediction task.

An investigation of structures responsible for gender bias in BERT and DistilBERT

Authors: Thibaud Leteno, Antoine Gourru, Charlotte Laclau, Christophe Gravier
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.06495
Pdf link: https://arxiv.org/pdf/2401.06495
Abstract In recent years, large Transformer-based Pre-trained Language Models (PLM) have changed the Natural Language Processing (NLP) landscape, by pushing the performance boundaries of the state-of-the-art on a wide variety of tasks. However, this performance gain goes along with an increase in complexity, and as a result, the size of such models (up to billions of parameters) represents a constraint for their deployment on embedded devices or short-inference time tasks. To cope with this situation, compressed models emerged (e.g. DistilBERT), democratizing their usage in a growing number of applications that impact our daily lives. A crucial issue is the fairness of the predictions made by both PLMs and their distilled counterparts. In this paper, we propose an empirical exploration of this problem by formalizing two questions: (1) Can we identify the neural mechanism(s) responsible for gender bias in BERT (and by extension DistilBERT)? (2) Does distillation tend to accentuate or mitigate gender bias (e.g. is DistilBERT more prone to gender bias than its uncompressed version, BERT)? Our findings are the following: (I) one cannot identify a specific layer that produces bias; (II) every attention head uniformly encodes bias; except in the context of underrepresented classes with a high imbalance of the sensitive attribute; (III) this subset of heads is different as we re-fine tune the network; (IV) bias is more homogeneously produced by the heads in the distilled model.

Domain Adaptation for Time series Transformers using One-step fine-tuning

Authors: Subina Khanal, Seshu Tirupathi, Giulio Zizzo, Ambrish Rawat, Torben Bach Pedersen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.06524
Pdf link: https://arxiv.org/pdf/2401.06524
Abstract The recent breakthrough of Transformers in deep learning has drawn significant attention of the time series community due to their ability to capture long-range dependencies. However, like other deep learning models, Transformers face limitations in time series prediction, including insufficient temporal understanding, generalization challenges, and data shift issues for the domains with limited data. Additionally, addressing the issue of catastrophic forgetting, where models forget previously learned information when exposed to new data, is another critical aspect that requires attention in enhancing the robustness of Transformers for time series tasks. To address these limitations, in this paper, we pre-train the time series Transformer model on a source domain with sufficient data and fine-tune it on the target domain with limited data. We introduce the \emph{One-step fine-tuning} approach, adding some percentage of source domain data to the target domains, providing the model with diverse time series instances. We then fine-tune the pre-trained model using a gradual unfreezing technique. This helps enhance the model's performance in time series prediction for domains with limited data. Extensive experimental results on two real-world datasets show that our approach improves over the state-of-the-art baselines by 4.35% and 11.54% for indoor temperature and wind power prediction, respectively.

Multimodal Learning for detecting urban functional zones using remote sensing image and multi-semantic information

Authors: Chuanji Shi, Yingying Zhang, Jiaotuan Wang, Qiqi Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.06550
Pdf link: https://arxiv.org/pdf/2401.06550
Abstract Urban area-of-interest (AOI) refers to an integrated urban functional zone with defined boundaries. The rapid development of urban commerce has resulted in an increased demand for more precise requirements in defining AOIs. However, existing research primarily concentrates on broad AOI mining for urban planning or regional economic analysis, failing to cater to the precise requirements of mobile Internet online-to-offline businesses. These businesses necessitate accuracy down to a specific community, school, or hospital. In this paper, we propose an end-to-end multimodal deep learning algorithm for detecting AOI fence polygon using remote sensing images and multi-semantics reference information. We then evaluate its timeliness through a cascaded module that incorporates dynamic human mobility and logistics address information. Specifically, we begin by selecting a point-of-interest (POI) of specific category, and use it to recall corresponding remote sensing images, nearby POIs, road nodes, human mobility, and logistics addresses to build a multimodal detection model based on transformer encoder-decoder architecture, titled AOITR. In the model, in addition to the remote sensing images, multi-semantic information including core POI and road nodes is embedded and reorganized as the query content part for the transformer decoder to generate the AOI polygon. Meanwhile, relatively dynamic distribution features of human mobility, nearby POIs, and logistics addresses are used for AOI reliability evaluation through a cascaded feedforward network. The experimental results demonstrate that our algorithm significantly outperforms two existing methods.

Mapping Transformer Leveraged Embeddings for Cross-Lingual Document Representation

Authors: Tsegaye Misikir Tashu, Eduard-Raul Kontos, Matthia Sabatelli, Matias Valdenegro-Toro
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.06583
Pdf link: https://arxiv.org/pdf/2401.06583
Abstract Recommendation systems, for documents, have become tools to find relevant content on the Web. However, these systems have limitations when it comes to recommending documents in languages different from the query language, which means they might overlook resources in non-native languages. This research focuses on representing documents across languages by using Transformer Leveraged Document Representations (TLDRs) that are mapped to a cross-lingual domain. Four multilingual pre-trained transformer models (mBERT, mT5 XLM RoBERTa, ErnieM) were evaluated using three mapping methods across 20 language pairs representing combinations of five selected languages of the European Union. Metrics like Mate Retrieval Rate and Reciprocal Rank were used to measure the effectiveness of mapped TLDRs compared to non-mapped ones. The results highlight the power of cross-lingual representations achieved through pre-trained transformers and mapping approaches suggesting a promising direction for expanding beyond language connections, between two specific languages.

Ada-Retrieval: An Adaptive Multi-Round Retrieval Paradigm for Sequential Recommendations

Authors: Lei Li, Jianxun Lian, Xiao Zhou, Xing Xie
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.06633
Pdf link: https://arxiv.org/pdf/2401.06633
Abstract Retrieval models aim at selecting a small set of item candidates which match the preference of a given user. They play a vital role in large-scale recommender systems since subsequent models such as rankers highly depend on the quality of item candidates. However, most existing retrieval models employ a single-round inference paradigm, which may not adequately capture the dynamic nature of user preferences and stuck in one area in the item space. In this paper, we propose Ada-Retrieval, an adaptive multi-round retrieval paradigm for recommender systems that iteratively refines user representations to better capture potential candidates in the full item space. Ada-Retrieval comprises two key modules: the item representation adapter and the user representation adapter, designed to inject context information into items' and users' representations. The framework maintains a model-agnostic design, allowing seamless integration with various backbone models such as RNNs or Transformers. We perform experiments on three widely used public datasets, incorporating five powerful sequential recommenders as backbone models. Our results demonstrate that Ada-Retrieval significantly enhances the performance of various base models, with consistent improvements observed across different datasets. Our code and data are publicly available at: https://github.com/ll0ruc/Ada-Retrieval.

Scalable 3D Panoptic Segmentation With Superpoint Graph Clustering

Authors: Damien Robert, Hugo Raguet, Loic Landrieu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.06704
Pdf link: https://arxiv.org/pdf/2401.06704
Abstract We introduce a highly efficient method for panoptic segmentation of large 3D point clouds by redefining this task as a scalable graph clustering problem. This approach can be trained using only local auxiliary tasks, thereby eliminating the resource-intensive instance-matching step during training. Moreover, our formulation can easily be adapted to the superpoint paradigm, further increasing its efficiency. This allows our model to process scenes with millions of points and thousands of objects in a single inference. Our method, called SuperCluster, achieves a new state-of-the-art panoptic segmentation performance for two indoor scanning datasets: $50.1$ PQ ($+7.8$) for S3DIS Area~5, and $58.7$ PQ ($+25.2$) for ScanNetV2. We also set the first state-of-the-art for two large-scale mobile mapping benchmarks: KITTI-360 and DALES. With only $209$k parameters, our model is over $30$ times smaller than the best-competing method and trains up to $15$ times faster. Our code and pretrained models are available at https://github.com/drprojects/superpoint_transformer.

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

Jan 15 '24 02:01 DongZhouGu

arxiv-daily arxiv-daily copied to clipboard

New submissions for Mon, 15 Jan 24

Keyword: human object interaction

Keyword: visual relation detection

Keyword: object detection

FRED: Towards a Full Rotation-Equivariance in Aerial Image Object Detection

YOLO-Former: YOLO Shakes Hand With ViT

Improving the Detection of Small Oriented Objects in Aerial Images

Robustness-Aware 3D Object Detection in Autonomous Driving: A Review and Outlook

Embedded Planogram Compliance Control System

Keyword: transformer

YOLO-Former: YOLO Shakes Hand With ViT

The Pulse of Mood Online: Unveiling Emotional Reactions in a Dynamic Social Media Landscape

Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention

A Temporal-Spectral Fusion Transformer with Subject-specific Adapter for Enhancing RSVP-BCI Decoding

Hyper-STTN: Social Group-aware Spatial-Temporal Transformer Network for Human Trajectory Prediction with Hypergraph Reasoning

UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer

Swin Transformer-Based CSI Feedback for Massive MIMO

Improving Graph Convolutional Networks with Transformer Layer in social-based items recommendation

An investigation of structures responsible for gender bias in BERT and DistilBERT

Domain Adaptation for Time series Transformers using One-step fine-tuning

Multimodal Learning for detecting urban functional zones using remote sensing image and multi-semantic information

Mapping Transformer Leveraged Embeddings for Cross-Lingual Document Representation

Ada-Retrieval: An Adaptive Multi-Round Retrieval Paradigm for Sequential Recommendations

Scalable 3D Panoptic Segmentation With Superpoint Graph Clustering

Keyword: scene understanding

Keyword: visual reasoning

arxiv-daily
arxiv-daily copied to clipboard