arxiv-daily
arxiv-daily copied to clipboard
New submissions for Wed, 19 Oct 22
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
Deformably-Scaled Transposed Convolution
- Authors: Stefano B. Blumberg, Daniele Raví, Mou-Cheng Xu, Matteo Figini, Iasonas Kokkinos, Daniel C. Alexander
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
- Arxiv link: https://arxiv.org/abs/2210.09446
- Pdf link: https://arxiv.org/pdf/2210.09446
- Abstract Transposed convolution is crucial for generating high-resolution outputs, yet has received little attention compared to convolution layers. In this work we revisit transposed convolution and introduce a novel layer that allows us to place information in the image selectively and choose the `stroke breadth' at which the image is synthesized, whilst incurring a small additional parameter cost. For this we introduce three ideas: firstly, we regress offsets to the positions where the transpose convolution results are placed; secondly we broadcast the offset weight locations over a learnable neighborhood; and thirdly we use a compact parametrization to share weights and restrict offsets. We show that simply substituting upsampling operators with our novel layer produces substantial improvements across tasks as diverse as instance segmentation, object detection, semantic segmentation, generative image modeling, and 3D magnetic resonance image enhancement, while outperforming all existing variants of transposed convolutions. Our novel layer can be used as a drop-in replacement for 2D and 3D upsampling operators and the code will be publicly available.
You Can't See Me: Physical Removal Attacks on LiDAR-based Autonomous Vehicles Driving Frameworks
- Authors: Yulong Cao, S. Hrushikesh Bhupathiraju, Pirouz Naghavi, Takeshi Sugawara, Z. Morley Mao, Sara Rampazzi
- Subjects: Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2210.09482
- Pdf link: https://arxiv.org/pdf/2210.09482
- Abstract Autonomous Vehicles (AVs) increasingly use LiDAR-based object detection systems to perceive other vehicles and pedestrians on the road. While existing attacks on LiDAR-based autonomous driving architectures focus on lowering the confidence score of AV object detection models to induce obstacle misdetection, our research discovers how to leverage laser-based spoofing techniques to selectively remove the LiDAR point cloud data of genuine obstacles at the sensor level before being used as input to the AV perception. The ablation of this critical LiDAR information causes autonomous driving obstacle detectors to fail to identify and locate obstacles and, consequently, induces AVs to make dangerous automatic driving decisions. In this paper, we present a method invisible to the human eye that hides objects and deceives autonomous vehicles' obstacle detectors by exploiting inherent automatic transformation and filtering processes of LiDAR sensor data integrated with autonomous driving frameworks. We call such attacks Physical Removal Attacks (PRA), and we demonstrate their effectiveness against three popular AV obstacle detectors (Apollo, Autoware, PointPillars), and we achieve 45{\deg} attack capability. We evaluate the attack impact on three fusion models (Frustum-ConvNet, AVOD, and Integrated-Semantic Level Fusion) and the consequences on the driving decision using LGSVL, an industry-grade simulator. In our moving vehicle scenarios, we achieve a 92.7% success rate removing 90% of a target obstacle's cloud points. Finally, we demonstrate the attack's success against two popular defenses against spoofing and object hiding attacks and discuss two enhanced defense strategies to mitigate our attack.
Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection
- Authors: Xin Li, Botian Shi, Yuenan Hou, Xingjiao Wu, Tianlong Ma, Yikang Li, Liang He
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09615
- Pdf link: https://arxiv.org/pdf/2210.09615
- Abstract Multi-modal 3D object detection has been an active research topic in autonomous driving. Nevertheless, it is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels. Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels. These fusion approaches often suffer from severe information loss, thus causing sub-optimal performance. To address these problems, we construct the homogeneous structure between the point cloud and images to avoid projective information loss by transforming the camera features into the LiDAR 3D space. In this paper, we propose a homogeneous multi-modal feature fusion and interaction method (HMFI) for 3D object detection. Specifically, we first design an image voxel lifter module (IVLM) to lift 2D image features into the 3D space and generate homogeneous image voxel features. Then, we fuse the voxelized point cloud features with the image features from different regions by introducing the self-attention based query fusion mechanism (QFM). Next, we propose a voxel feature interaction module (VFIM) to enforce the consistency of semantic information from identical objects in the homogeneous point cloud and image voxel representations, which can provide object-level alignment guidance for cross-modal feature fusion and strengthen the discriminative ability in complex backgrounds. We conduct extensive experiments on the KITTI and Waymo Open Dataset, and the proposed HMFI achieves better performance compared with the state-of-the-art multi-modal methods. Particularly, for the 3D detection of cyclist on the KITTI benchmark, HMFI surpasses all the published algorithms by a large margin.
Object Recognition in Different Lighting Conditions at Various Angles by Deep Learning Method
- Authors: Imran Khan Mirani, Chen Tianhua, Malak Abid Ali Khan, Syed Muhammad Aamir, Waseef Menhaj
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09618
- Pdf link: https://arxiv.org/pdf/2210.09618
- Abstract Existing computer vision and object detection methods strongly rely on neural networks and deep learning. This active research area is used for applications such as autonomous driving, aerial photography, protection, and monitoring. Futuristic object detection methods rely on rectangular, boundary boxes drawn over an object to accurately locate its location. The modern object recognition algorithms, however, are vulnerable to multiple factors, such as illumination, occlusion, viewing angle, or camera rotation as well as cost. Therefore, deep learning-based object recognition will significantly increase the recognition speed and compatible external interference. In this study, we use convolutional neural networks (CNN) to recognize items, the neural networks have the advantages of end-to-end, sparse relation, and sharing weights. This article aims to classify the name of the various object based on the position of an object's detected box. Instead, under different distances, we can get recognition results with different confidence. Through this study, we find that this model's accuracy through recognition is mainly influenced by the proportion of objects and the number of samples. When we have a small proportion of an object on camera, then we get higher recognition accuracy; if we have a much small number of samples, we can get greater accuracy in recognition. The epidemic has a great impact on the world economy where designing a cheaper object recognition system is the need of time.
1st Place Solutions for the UVO Challenge 2022
- Authors: Jiajun Zhang, Boyu Chen, Zhilong Ji, Jinfeng Bai, Zonghai Hu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09629
- Pdf link: https://arxiv.org/pdf/2210.09629
- Abstract This paper describes the approach we have taken in the challenge. We still adopted the two-stage scheme same as the last champion, that is, detection first and segmentation followed. We trained more powerful detector and segmentor separately. Besides, we also perform pseudo-label training on the test set, based on student-teacher framework and end-to-end transformer based object detection. The method ranks first on the 2nd Unidentified Video Objects (UVO) challenge, achieving AR@100 of 46.8, 64.7 and 32.2 in the limited data frame track, unlimited data frame track and video track respectively.
Virtual Reality via Object Poses and Active Learning: Realizing Telepresence Robots with Aerial Manipulation Capabilities
- Authors: Jongseok Lee, Ribin Balachandran, Konstantin Kondak, Andre Coelho, Marco De Stefano, Matthias Humt, Jianxiang Feng, Tamim Asfour, Rudolph Triebel
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.09678
- Pdf link: https://arxiv.org/pdf/2210.09678
- Abstract This article presents a novel telepresence system for advancing aerial manipulation in dynamic and unstructured environments. The proposed system not only features a haptic device, but also a virtual reality (VR) interface that provides real-time 3D displays of the robot's workspace as well as a haptic guidance to its remotely located operator. To realize this, multiple sensors namely a LiDAR, cameras and IMUs are utilized. For processing of the acquired sensory data, pose estimation pipelines are devised for industrial objects of both known and unknown geometries. We further propose an active learning pipeline in order to increase the sample efficiency of a pipeline component that relies on Deep Neural Networks (DNNs) based object detection. All these algorithms jointly address various challenges encountered during the execution of perception tasks in industrial scenarios. In the experiments, exhaustive ablation studies are provided to validate the proposed pipelines. Methodologically, these results commonly suggest how an awareness of the algorithms' own failures and uncertainty ("introspection") can be used tackle the encountered problems. Moreover, outdoor experiments are conducted to evaluate the effectiveness of the overall system in enhancing aerial manipulation capabilities. In particular, with flight campaigns over days and nights, from spring to winter, and with different users and locations, we demonstrate over 70 robust executions of pick-and-place, force application and peg-in-hole tasks with the DLR cable-Suspended Aerial Manipulator (SAM). As a result, we show the viability of the proposed system in future industrial applications.
Real-Time Multi-Modal Semantic Fusion on Unmanned Aerial Vehicles with Label Propagation for Cross-Domain Adaptation
- Authors: Simon Bultmann, Jan Quenzel, Sven Behnke
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2210.09739
- Pdf link: https://arxiv.org/pdf/2210.09739
- Abstract Unmanned aerial vehicles (UAVs) equipped with multiple complementary sensors have tremendous potential for fast autonomous or remote-controlled semantic scene analysis, e.g., for disaster examination. Here, we propose a UAV system for real-time semantic inference and fusion of multiple sensor modalities. Semantic segmentation of LiDAR scans and RGB images, as well as object detection on RGB and thermal images, run online onboard the UAV computer using lightweight CNN architectures and embedded inference accelerators. We follow a late fusion approach where semantic information from multiple sensor modalities augments 3D point clouds and image segmentation masks while also generating an allocentric semantic map. Label propagation on the semantic map allows for sensor-specific adaptation with cross-modality and cross-domain supervision. Our system provides augmented semantic images and point clouds with $\approx$ 9 Hz. We evaluate the integrated system in real-world experiments in an urban environment and at a disaster test site.
A Tri-Layer Plugin to Improve Occluded Detection
- Authors: Guanqi Zhan, Weidi Xie, Andrew Zisserman
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.10046
- Pdf link: https://arxiv.org/pdf/2210.10046
- Abstract Detecting occluded objects still remains a challenge for state-of-the-art object detectors. The objective of this work is to improve the detection for such objects, and thereby improve the overall performance of a modern object detector. To this end we make the following four contributions: (1) We propose a simple 'plugin' module for the detection head of two-stage object detectors to improve the recall of partially occluded objects. The module predicts a tri-layer of segmentation masks for the target object, the occluder and the occludee, and by doing so is able to better predict the mask of the target object. (2) We propose a scalable pipeline for generating training data for the module by using amodal completion of existing object detection and instance segmentation training datasets to establish occlusion relationships. (3) We also establish a COCO evaluation dataset to measure the recall performance of partially occluded and separated objects. (4) We show that the plugin module inserted into a two-stage detector can boost the performance significantly, by only fine-tuning the detection head, and with additional improvements if the entire architecture is fine-tuned. COCO results are reported for Mask R-CNN with Swin-T or Swin-S backbones, and Cascade Mask R-CNN with a Swin-B backbone.
Keyword: transformer
Deepfake Text Detection: Limitations and Opportunities
- Authors: Jiameng Pu, Zain Sarwar, Sifat Muhammad Abdullah, Abdullah Rehman, Yoonjin Kim, Parantapa Bhattacharya, Mobin Javed, Bimal Viswanath
- Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.09421
- Pdf link: https://arxiv.org/pdf/2210.09421
- Abstract Recent advances in generative models for language have enabled the creation of convincing synthetic text or deepfake text. Prior work has demonstrated the potential for misuse of deepfake text to mislead content consumers. Therefore, deepfake text detection, the task of discriminating between human and machine-generated text, is becoming increasingly critical. Several defenses have been proposed for deepfake text detection. However, we lack a thorough understanding of their real-world applicability. In this paper, we collect deepfake text from 4 online services powered by Transformer-based tools to evaluate the generalization ability of the defenses on content in the wild. We develop several low-cost adversarial attacks, and investigate the robustness of existing defenses against an adaptive attacker. We find that many defenses show significant degradation in performance under our evaluation scenarios compared to their original claimed performance. Our evaluation shows that tapping into the semantic information in the text content is a promising approach for improving the robustness and generalization performance of deepfake text detection schemes.
CAN-BERT do it? Controller Area Network Intrusion Detection System based on BERT Language Model
- Authors: Natasha Alkhatib, Maria Mushtaq, Hadi Ghauch, Jean-Luc Danger
- Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
- Arxiv link: https://arxiv.org/abs/2210.09439
- Pdf link: https://arxiv.org/pdf/2210.09439
-
Abstract
Due to the rising number of sophisticated customer functionalities, electronic control units (ECUs) are increasingly integrated into modern automotive systems. However, the high connectivity between the in-vehicle and the external networks paves the way for hackers who could exploit in-vehicle network protocols' vulnerabilities. Among these protocols, the Controller Area Network (CAN), known as the most widely used in-vehicle networking technology, lacks encryption and authentication mechanisms, making the communications delivered by distributed ECUs insecure. Inspired by the outstanding performance of bidirectional encoder representations from transformers (BERT) for improving many natural language processing tasks, we propose in this paper
CAN-BERT", a deep learning based network intrusion detection system, to detect cyber attacks on CAN bus protocol. We show that the BERT model can learn the sequence of arbitration identifiers (IDs) in the CAN bus for anomaly detection using the
masked language model" unsupervised training objective. The experimental results on theCar Hacking: Attack \& Defense Challenge 2020" dataset show that
CAN-BERT" outperforms state-of-the-art approaches. In addition to being able to identify in-vehicle intrusions in real-time within 0.8 ms to 3 ms w.r.t CAN ID sequence length, it can also detect a wide variety of cyberattacks with an F1-score of between 0.81 and 0.99.
Using Bottleneck Adapters to Identify Cancer in Clinical Notes under Low-Resource Constraints
- Authors: Omid Rohanian, Hannah Jauncey, Mohammadmahdi Nouriborji, Bronner P. Gonçalves, Christiana Kartsonaki, ISARIC Clinical Characterisation Group, Laura Merson, David Clifton
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2210.09440
- Pdf link: https://arxiv.org/pdf/2210.09440
- Abstract Processing information locked within clinical health records is a challenging task that remains an active area of research in biomedical NLP. In this work, we evaluate a broad set of machine learning techniques ranging from simple RNNs to specialised transformers such as BioBERT on a dataset containing clinical notes along with a set of annotations indicating whether a sample is cancer-related or not. Furthermore, we specifically employ efficient fine-tuning methods from NLP, namely, bottleneck adapters and prompt tuning, to adapt the models to our specialised task. Our evaluations suggest that fine-tuning a frozen BERT model pre-trained on natural language and with bottleneck adapters outperforms all other strategies, including full fine-tuning of the specialised BioBERT model. Based on our findings, we suggest that using bottleneck adapters in low-resource situations with limited access to labelled data or processing capacity could be a viable strategy in biomedical text mining. The code used in the experiments are going to be made available at https://github.com/omidrohanian/bottleneck-adapters.
Track Targets by Dense Spatio-Temporal Position Encoding
- Authors: Jinkun Cao, Hao Wu, Kris Kitani
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2210.09455
- Pdf link: https://arxiv.org/pdf/2210.09455
- Abstract In this work, we propose a novel paradigm to encode the position of targets for target tracking in videos using transformers. The proposed paradigm, Dense Spatio-Temporal (DST) position encoding, encodes spatio-temporal position information in a pixel-wise dense fashion. The provided position encoding provides location information to associate targets across frames beyond appearance matching by comparing objects in two bounding boxes. Compared to the typical transformer positional encoding, our proposed encoding is applied to the 2D CNN features instead of the projected feature vectors to avoid losing positional information. Moreover, the designed DST encoding can represent the location of a single-frame object and the evolution of the location of the trajectory among frames uniformly. Integrated with the DST encoding, we build a transformer-based multi-object tracking model. The model takes a video clip as input and conducts the target association in the clip. It can also perform online inference by associating existing trajectories with objects from the new-coming frames. Experiments on video multi-object tracking (MOT) and multi-object tracking and segmentation (MOTS) datasets demonstrate the effectiveness of the proposed DST position encoding.
Token Merging: Your ViT But Faster
- Authors: Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09461
- Pdf link: https://arxiv.org/pdf/2210.09461
- Abstract We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.
A Hybrid System of Sound Event Detection Transformer and Frame-wise Model for DCASE 2022 Task 4
- Authors: Yiming Li, Zhifang Guo, Zhirong Ye, Xiangdong Wang, Hong Liu, Yueliang Qian, Rui Tao, Long Yan, Kazushige Ouchi
- Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2210.09529
- Pdf link: https://arxiv.org/pdf/2210.09529
- Abstract In this paper, we describe in detail our system for DCASE 2022 Task4. The system combines two considerably different models: an end-to-end Sound Event Detection Transformer (SEDT) and a frame-wise model, Metric Learning and Focal Loss CNN (MLFL-CNN). The former is an event-wise model which learns event-level representations and predicts sound event categories and boundaries directly, while the latter is based on the widely adopted frame-classification scheme, under which each frame is classified into event categories and event boundaries are obtained by post-processing such as thresholding and smoothing. For SEDT, self-supervised pre-training using unlabeled data is applied, and semi-supervised learning is adopted by using an online teacher, which is updated from the student model using the Exponential Moving Average (EMA) strategy and generates reliable pseudo labels for weakly-labeled and unlabeled data. For the frame-wise model, the ICT-TOSHIBA system of DCASE 2021 Task 4 is used. Experimental results show that the hybrid system considerably outperforms either individual model and achieves psds1 of 0.420 and psds2 of 0.783 on the validation set without external data. The code is available at https://github.com/965694547/Hybrid-system-of-frame-wise-model-and-SEDT.
Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation
- Authors: Ruijun Li, Weihua Li, Yi Yang, Hanyu Wei, Jianhua Jiang, Quan Bai
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.09549
- Pdf link: https://arxiv.org/pdf/2210.09549
- Abstract Recently, diffusion models have been proven to perform remarkably well in text-to-image synthesis tasks in a number of studies, immediately presenting new study opportunities for image generation. Google's Imagen follows this research trend and outperforms DALLE2 as the best model for text-to-image generation. However, Imagen merely uses a T5 language model for text processing, which cannot ensure learning the semantic information of the text. Furthermore, the Efficient UNet leveraged by Imagen is not the best choice in image processing. To address these issues, we propose the Swinv2-Imagen, a novel text-to-image diffusion model based on a Hierarchical Visual Transformer and a Scene Graph incorporating a semantic layout. In the proposed model, the feature vectors of entities and relationships are extracted and involved in the diffusion model, effectively improving the quality of generated images. On top of that, we also introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which can address the problems stemming from the CNN convolution operations. Extensive experiments are conducted to evaluate the performance of the proposed model by using three real-world datasets, i.e., MSCOCO, CUB and MM-CelebA-HQ. The experimental results show that the proposed Swinv2-Imagen model outperforms several popular state-of-the-art methods.
ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design
- Authors: Haoran You, Zhanyi Sun, Huihong Shi, Zhongzhi Yu, Yang Zhao, Yongan Zhang, Chaojian Li, Baopu Li, Yingyan Lin
- Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09573
- Pdf link: https://arxiv.org/pdf/2210.09573
- Abstract Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns; while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x, 86.0x, 10.1x, and 6.8x over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively.
Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs
- Authors: Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, Gennady Pekhimenko
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
- Arxiv link: https://arxiv.org/abs/2210.09603
- Pdf link: https://arxiv.org/pdf/2210.09603
- Abstract As deep learning models nowadays are widely adopted by both cloud services and edge devices, the latency of deep learning model inferences becomes crucial to provide efficient model serving. However, it is challenging to develop efficient tensor programs for deep learning operators due to the high complexity of modern accelerators (e.g., NVIDIA GPUs and Google TPUs) and the rapidly growing number of operators. Deep learning compilers, such as Apache TVM, adopt declarative scheduling primitives to lower the bar of developing tensor programs. However, we show that this approach is insufficient to cover state-of-the-art tensor program optimizations (e.g., double buffering). In this paper, we propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering directly in the tensor programs. This new approach greatly enriches the expressible optimizations by allowing developers to manipulate tensor programs at a much finer granularity (e.g., allowing program statement-level optimizations). We call the proposed method the task-mapping-oriented programming paradigm. With the proposed paradigm, we implement a deep learning compiler - Hidet. Extensive experiments on modern convolution and transformer models show that Hidet outperforms state-of-the-art DNN inference framework, ONNX Runtime, and compiler, TVM equipped with scheduler AutoTVM and Ansor, by up to 1.48x (1.22x on average) with enriched optimizations. It also reduces the tuning time by 20x and 11x compared with AutoTVM and Ansor, respectively.
1st Place Solutions for the UVO Challenge 2022
- Authors: Jiajun Zhang, Boyu Chen, Zhilong Ji, Jinfeng Bai, Zonghai Hu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09629
- Pdf link: https://arxiv.org/pdf/2210.09629
- Abstract This paper describes the approach we have taken in the challenge. We still adopted the two-stage scheme same as the last champion, that is, detection first and segmentation followed. We trained more powerful detector and segmentor separately. Besides, we also perform pseudo-label training on the test set, based on student-teacher framework and end-to-end transformer based object detection. The method ranks first on the 2nd Unidentified Video Objects (UVO) challenge, achieving AR@100 of 46.8, 64.7 and 32.2 in the limited data frame track, unlimited data frame track and video track respectively.
Decoupling Features in Hierarchical Propagation for Video Object Segmentation
- Authors: Zongxin Yang, Yi Yang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09782
- Pdf link: https://arxiv.org/pdf/2210.09782
- Abstract This paper focuses on developing a more effective method of hierarchical propagation for semi-supervised Video Object Segmentation (VOS). Based on vision transformers, the recently-developed Associating Objects with Transformers (AOT) approach introduces hierarchical propagation into VOS and has shown promising results. The hierarchical propagation can gradually propagate information from past frames to the current frame and transfer the current frame feature from object-agnostic to object-specific. However, the increase of object-specific information will inevitably lead to the loss of object-agnostic visual information in deep propagation layers. To solve such a problem and further facilitate the learning of visual embeddings, this paper proposes a Decoupling Features in Hierarchical Propagation (DeAOT) approach. Firstly, DeAOT decouples the hierarchical propagation of object-agnostic and object-specific embeddings by handling them in two independent branches. Secondly, to compensate for the additional computation from dual-branch propagation, we propose an efficient module for constructing hierarchical propagation, i.e., Gated Propagation Module, which is carefully designed with single-head attention. Extensive experiments show that DeAOT significantly outperforms AOT in both accuracy and efficiency. On YouTube-VOS, DeAOT can achieve 86.0% at 22.4fps and 82.0% at 53.4fps. Without test-time augmentations, we achieve new state-of-the-art performance on four benchmarks, i.e., YouTube-VOS (86.2%), DAVIS 2017 (86.2%), DAVIS 2016 (92.9%), and VOT 2020 (0.622). Project page: https://github.com/z-x-yang/AOT.
Overlap-guided Gaussian Mixture Models for Point Cloud Registration
- Authors: Guofeng Mei, Fabio Poiesi, Cristiano Saltori, Jian Zhang, Elisa Ricci, Nicu Sebe
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09836
- Pdf link: https://arxiv.org/pdf/2210.09836
- Abstract Probabilistic 3D point cloud registration methods have shown competitive performance in overcoming noise, outliers, and density variations. However, registering point cloud pairs in the case of partial overlap is still a challenge. This paper proposes a novel overlap-guided probabilistic registration approach that computes the optimal transformation from matched Gaussian Mixture Model (GMM) parameters. We reformulate the registration problem as the problem of aligning two Gaussian mixtures such that a statistical discrepancy measure between the two corresponding mixtures is minimized. We introduce a Transformer-based detection module to detect overlapping regions, and represent the input point clouds using GMMs by guiding their alignment through overlap scores computed by this detection module. Experiments show that our method achieves superior registration accuracy and efficiency than state-of-the-art methods when handling point clouds with partial overlap and different densities on synthetic and real-world datasets. https://github.com/gfmei/ogmm
Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging
- Authors: Ayyoob Imani, Silvia Severini, Masoud Jalili Sabet, François Yvon, Hinrich Schütze
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2210.09840
- Pdf link: https://arxiv.org/pdf/2210.09840
- Abstract Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state-of-the-art for unsupervised POS tagging of low-resource languages.
Multimodal Image Fusion based on Hybrid CNN-Transformer and Non-local Cross-modal Attention
- Authors: Yu Yuan, Jiaqi Wu, Zhongliang Jing, Henry Leung, Han Pan
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09847
- Pdf link: https://arxiv.org/pdf/2210.09847
- Abstract The fusion of images taken by heterogeneous sensors helps to enrich the information and improve the quality of imaging. In this article, we present a hybrid model consisting of a convolutional encoder and a Transformer-based decoder to fuse multimodal images. In the encoder, a non-local cross-modal attention block is proposed to capture both local and global dependencies of multiple source images. A branch fusion module is designed to adaptively fuse the features of the two branches. We embed a Transformer module with linear complexity in the decoder to enhance the reconstruction capability of the proposed network. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method by comparing it with existing state-of-the-art fusion models. The source code of our work is available at https://github.com/pandayuanyu/HCFusion.
Sequence and Circle: Exploring the Relationship Between Patches
- Authors: Zhengyang Yu, Jochen Triesch
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09871
- Pdf link: https://arxiv.org/pdf/2210.09871
- Abstract The vision transformer (ViT) has achieved state-of-the-art results in various vision tasks. It utilizes a learnable position embedding (PE) mechanism to encode the location of each image patch. However, it is presently unclear if this learnable PE is really necessary and what its benefits are. This paper explores two alternative ways of encoding the location of individual patches that exploit prior knowledge about their spatial arrangement. One is called the sequence relationship embedding (SRE), and the other is called the circle relationship embedding(CRE). Among them, the SRE considers all patches to be in order, and adjacent patches have the same interval distance. The CRE considers the central patch as the center of the circle and measures the distance of the remaining patches from the center based on the four neighborhoods principle. Multiple concentric circles with different radii combine different patches. Finally, we implemented these two relations on three classic ViTs and tested them on four popular datasets. Experiments show that SRE and CRE can replace PE to reduce the random learnable parameters while achieving the same performance. Combining SRE or CRE with PE gets better performance than only using PE.
Number-Adaptive Prototype Learning for 3D Point Cloud Semantic Segmentation
- Authors: Yangheng Zhao, Jun Wang, Xiaolong Li, Yue Hu, Ce Zhang, Yanfeng Wang, Siheng Chen
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09948
- Pdf link: https://arxiv.org/pdf/2210.09948
- Abstract 3D point cloud semantic segmentation is one of the fundamental tasks for 3D scene understanding and has been widely used in the metaverse applications. Many recent 3D semantic segmentation methods learn a single prototype (classifier weights) for each semantic class, and classify 3D points according to their nearest prototype. However, learning only one prototype for each class limits the model's ability to describe the high variance patterns within a class. Instead of learning a single prototype for each class, in this paper, we propose to use an adaptive number of prototypes to dynamically describe the different point patterns within a semantic class. With the powerful capability of vision transformer, we design a Number-Adaptive Prototype Learning (NAPL) model for point cloud semantic segmentation. To train our NAPL model, we propose a simple yet effective prototype dropout training strategy, which enables our model to adaptively produce prototypes for each class. The experimental results on SemanticKITTI dataset demonstrate that our method achieves 2.3% mIoU improvement over the baseline model based on the point-wise classification paradigm.
Transfer-learning for video classification: Video Swin Transformer on multiple domains
- Authors: Daniel Oliveira, David Martins de Matos
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2210.09969
- Pdf link: https://arxiv.org/pdf/2210.09969
- Abstract The computer vision community has seen a shift from convolutional-based to pure transformer architectures for both image and video tasks. Training a transformer from zero for these tasks usually requires a lot of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification which achieves state-of-the-art results in accuracy and efficiency on several datasets. In this paper, we aim to understand if VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, namely FCVID and Something-Something using a transfer learning approach from Kinetics-400, which requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show an 85% top-1 accuracy on FCVID without retraining the whole model which is equal to the state-of-the-art for the dataset and a 21% accuracy on Something-Something. The experiments also suggest that the performance of the VST decreases on average when the video duration increases which seems to be a consequence of a design choice of the model. From the results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are from the same type as the classes used to train the model. We observed this effect when we performed transfer-learning from Kinetics-400 to FCVID, where most datasets target mostly objects. On the other hand, if the classes are not from the same type, then the accuracy after the transfer-learning approach is expected to be poor. We observed this effect when we performed transfer-learning from Kinetics-400, where the classes represent mostly objects, to Something-Something, where the classes represent mostly actions.
Phenomenological Model of Superconducting Optoelectronic Loop Neurons
- Authors: Jeffrey M. Shainline, Bryce A. Primavera, Saeed Khan
- Subjects: Neural and Evolutionary Computing (cs.NE); Disordered Systems and Neural Networks (cond-mat.dis-nn); Adaptation and Self-Organizing Systems (nlin.AO)
- Arxiv link: https://arxiv.org/abs/2210.09976
- Pdf link: https://arxiv.org/pdf/2210.09976
- Abstract Superconducting optoelectronic loop neurons are a class of circuits potentially conducive to networks for large-scale artificial cognition. These circuits employ superconducting components including single-photon detectors, Josephson junctions, and transformers to achieve neuromorphic functions. To date, all simulations of loop neurons have used first-principles circuit analysis to model the behavior of synapses, dendrites, and neurons. These circuit models are computationally inefficient and leave opaque the relationship between loop neurons and other complex systems. Here we introduce a modeling framework that captures the behavior of the relevant synaptic, dendritic, and neuronal circuits at a phenomenological level without resorting to full circuit equations. Within this compact model, each dendrite is discovered to obey a single nonlinear leaky-integrator ordinary differential equation, while a neuron is modeled as a dendrite with a thresholding element and an additional feedback mechanism for establishing a refractory period. A synapse is modeled as a single-photon detector coupled to a dendrite, where the response of the single-photon detector follows a closed-form expression. We quantify the accuracy of the phenomenological model relative to circuit simulations and find that the approach reduces computational time by a factor of ten thousand while maintaining accuracy of one part in ten thousand. We demonstrate the use of the model with several basic examples. The net increase in computational efficiency enables future simulation of large networks, while the formulation provides a connection to a large body of work in applied mathematics, computational neuroscience, and physical systems such as spin glasses.
Post-hoc analysis of Arabic transformer models
- Authors: Ahmed Abdelali, Nadir Durrani, Fahim Dalvi, Hassan Sajjad
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2210.09990
- Pdf link: https://arxiv.org/pdf/2210.09990
- Abstract Arabic is a Semitic language which is widely spoken with many dialects. Given the success of pre-trained language models, many transformer models trained on Arabic and its dialects have surfaced. While there have been an extrinsic evaluation of these models with respect to downstream NLP tasks, no work has been carried out to analyze and compare their internal representations. We probe how linguistic information is encoded in the transformer models, trained on different Arabic dialects. We perform a layer and neuron analysis on the models using morphological tagging tasks for different dialects of Arabic and a dialectal identification task. Our analysis enlightens interesting findings such as: i) word morphology is learned at the lower and middle layers, ii) while syntactic dependencies are predominantly captured at the higher layers, iii) despite a large overlap in their vocabulary, the MSA-based models fail to capture the nuances of Arabic dialects, iv) we found that neurons in embedding layers are polysemous in nature, while the neurons in middle layers are exclusive to specific properties
From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data
- Authors: Zichen Jeff Cui, Yibin Wang, Nur Muhammad (Mahi)Shafiullah, Lerrel Pinto
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.10047
- Pdf link: https://arxiv.org/pdf/2210.10047
- Abstract While large-scale sequence modeling from offline data has led to impressive performance gains in natural language and image generation, directly translating such ideas to robotics has been challenging. One critical reason for this is that uncurated robot demonstration data, i.e. play data, collected from non-expert human demonstrators are often noisy, diverse, and distributionally multi-modal. This makes extracting useful, task-centric behaviors from such data a difficult generative modeling problem. In this work, we present Conditional Behavior Transformers (C-BeT), a method that combines the multi-modal generation ability of Behavior Transformer with future-conditioned goal specification. On a suite of simulated benchmark tasks, we find that C-BeT improves upon prior state-of-the-art work in learning from play data by an average of 45.7%. Further, we demonstrate for the first time that useful task-centric behaviors can be learned on a real-world robot purely from play data without any task labels or reward information. Robot videos are best viewed on our project website: https://play-to-policy.github.io
Keyword: scene understanding
Pareto Manifold Learning: Tackling multiple tasks via ensembles of single-task models
- Authors: Nikolaos Dimitriadis, Pascal Frossard, François Fleuret
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2210.09759
- Pdf link: https://arxiv.org/pdf/2210.09759
- Abstract In Multi-Task Learning, tasks may compete and limit the performance achieved on each other rather than guiding the optimization trajectory to a common solution, superior to its single-task counterparts. There is often not a single solution that is optimal for all tasks, leading practitioners to balance tradeoffs between tasks' performance, and to resort to optimality in the Pareto sense. Current Multi-Task Learning methodologies either completely neglect this aspect of functional diversity, and produce one solution in the Pareto Front predefined by their optimization schemes, or produce diverse but discrete solutions, each requiring a separate training run. In this paper, we conjecture that there exist Pareto Subspaces, i.e., weight subspaces where multiple optimal functional solutions lie. We propose Pareto Manifold Learning, an ensembling method in weight space that is able to discover such a parameterization and produces a continuous Pareto Front in a single training run, allowing practitioners to modulate the performance on each task during inference on the fly. We validate the proposed method on a diverse set of multi-task learning benchmarks, ranging from image classification to tabular datasets and scene understanding, and show that Pareto Manifold Learning outperforms state-of-the-art algorithms.
Number-Adaptive Prototype Learning for 3D Point Cloud Semantic Segmentation
- Authors: Yangheng Zhao, Jun Wang, Xiaolong Li, Yue Hu, Ce Zhang, Yanfeng Wang, Siheng Chen
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2210.09948
- Pdf link: https://arxiv.org/pdf/2210.09948
- Abstract 3D point cloud semantic segmentation is one of the fundamental tasks for 3D scene understanding and has been widely used in the metaverse applications. Many recent 3D semantic segmentation methods learn a single prototype (classifier weights) for each semantic class, and classify 3D points according to their nearest prototype. However, learning only one prototype for each class limits the model's ability to describe the high variance patterns within a class. Instead of learning a single prototype for each class, in this paper, we propose to use an adaptive number of prototypes to dynamically describe the different point patterns within a semantic class. With the powerful capability of vision transformer, we design a Number-Adaptive Prototype Learning (NAPL) model for point cloud semantic segmentation. To train our NAPL model, we propose a simple yet effective prototype dropout training strategy, which enables our model to adaptively produce prototypes for each class. The experimental results on SemanticKITTI dataset demonstrate that our method achieves 2.3% mIoU improvement over the baseline model based on the point-wise classification paradigm.
Keyword: visual reasoning
There is no result