arxiv-daily New submissions for Thu, 11 Aug 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Machine Learning with DBOS

Authors: Robert Redmond, Nathan W. Weckwerth, Brian S. Xia, Qian Li, Peter Kraft, Deeptaanshu Kumar, Çağatay Demiralp, Michael Stonebraker
Subjects: Cryptography and Security (cs.CR); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2208.05101
Pdf link: https://arxiv.org/pdf/2208.05101
Abstract We recently proposed a new cluster operating system stack, DBOS, centered on a DBMS. DBOS enables unique support for ML applications by encapsulating ML code within stored procedures, centralizing ancillary ML data, providing security built into the underlying DBMS, co-locating ML code and data, and tracking data and workflow provenance. Here we demonstrate a subset of these benefits around two ML applications. We first show that image classification and object detection models using GPUs can be served as DBOS stored procedures with performance competitive to existing systems. We then present a 1D CNN trained to detect anomalies in HTTP requests on DBOS-backed web services, achieving SOTA results. We use this model to develop an interactive anomaly detection system and evaluate it through qualitative user feedback, demonstrating its usefulness as a proof of concept for future work to develop learned real-time security services on top of DBOS.

Automatic Camera Control and Directing with an Ultra-High-Definition Collaborative Recording System

Authors: Bram Vanherle, Tim Vervoort, Nick Michiels, Philippe Bekaert
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2208.05213
Pdf link: https://arxiv.org/pdf/2208.05213
Abstract Capturing an event from multiple camera angles can give a viewer the most complete and interesting picture of that event. To be suitable for broadcasting, a human director needs to decide what to show at each point in time. This can become cumbersome with an increasing number of camera angles. The introduction of omnidirectional or wide-angle cameras has allowed for events to be captured more completely, making it even more difficult for the director to pick a good shot. In this paper, a system is presented that, given multiple ultra-high resolution video streams of an event, can generate a visually pleasing sequence of shots that manages to follow the relevant action of an event. Due to the algorithm being general purpose, it can be applied to most scenarios that feature humans. The proposed method allows for online processing when real-time broadcasting is required, as well as offline processing when the quality of the camera operation is the priority. Object detection is used to detect humans and other objects of interest in the input streams. Detected persons of interest, along with a set of rules based on cinematic conventions, are used to determine which video stream to show and what part of that stream is virtually framed. The user can provide a number of settings that determine how these rules are interpreted. The system is able to handle input from different wide-angle video streams by removing lens distortions. Using a user study it is shown, for a number of different scenarios, that the proposed automated director is able to capture an event with aesthetically pleasing video compositions and human-like shot switching behavior.

A Fresh Perspective on DNN Accelerators by Performing Holistic Analysis Across Paradigms

Authors: Tom Glint, Chandan Kumar Jha, Manu Awasthi, Joycee Mekie
Subjects: Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2208.05294
Pdf link: https://arxiv.org/pdf/2208.05294
Abstract Traditional computers with von Neumann architecture are unable to meet the latency and scalability challenges of Deep Neural Network (DNN) workloads. Various DNN accelerators based on Conventional compute Hardware Accelerator (CHA), Near-Data-Processing (NDP) and Processing-in-Memory (PIM) paradigms have been proposed to meet these challenges. Our goal in this work is to perform a rigorous comparison among the state-of-the-art accelerators from these DNN accelerator paradigms. We use unique layers from MobileNet, ResNet, BERT, and DLRM of the MLPerf Inference benchmark for our analysis. The detailed models are based on hardware-realized state-of-the-art designs. We observe that for memory-intensive Fully Connected Layer (FCL) DNNs, the NDP-based accelerator is 10.6x faster than the state-of-the-art CHA and 39.9x faster than the PIM-based accelerator for inferencing. For compute-intensive image classification and object detection DNNs, the state-of-the-art CHA is ~10x faster than NDP and ~2000x faster than the PIM-based accelerator for inferencing. PIM-based accelerators are suitable for DNN applications where energy is a constraint (~2.7x and ~21x lower energy for CNN and FCL applications, respectively, than conventional ASIC systems). Further, we identify architectural changes (such as increasing memory bandwidth, buffer reorganization) that can increase throughput (up to linear increase) and lower energy (up to linear decrease) for ML applications with a detailed sensitivity analysis of relevant components in CHA, NDP and PIM based accelerators.

Keyword: transformer

Attention Hijacking in Trojan Transformers

Authors: Weimin Lyu, Songzhu Zheng, Tengfei Ma, Haibin Ling, Chao Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2208.04946
Pdf link: https://arxiv.org/pdf/2208.04946
Abstract Trojan attacks pose a severe threat to AI systems. Recent works on Transformer models received explosive popularity and the self-attentions are now indisputable. This raises a central question: Can we reveal the Trojans through attention mechanisms in BERTs and ViTs? In this paper, we investigate the attention hijacking pattern in Trojan AIs, i.e., the trigger token "kidnaps" the attention weights when a specific trigger is present. We observe the consistent attention hijacking pattern in Trojan Transformers from both Natural Language Processing (NLP) and Computer Vision (CV) domains. This intriguing property helps us to understand the Trojan mechanism in BERTs and ViTs. We also propose an Attention-Hijacking Trojan Detector (AHTD) to discriminate the Trojan AIs from the clean ones.

CoViT: Real-time phylogenetics for the SARS-CoV-2 pandemic using Vision Transformers

Authors: Zuher Jahshan, Leonid Yavits
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2208.05004
Pdf link: https://arxiv.org/pdf/2208.05004
Abstract Real-time viral genome detection, taxonomic classification and phylogenetic analysis are critical for efficient tracking and control of viral pandemics such as Covid-19. However, the unprecedented and still growing amounts of viral genome data create a computational bottleneck, which effectively prevents real-time pandemic tracking. We are attempting to alleviate this bottleneck by modifying and applying Vision Transformer, a recently developed neural network model for image recognition, to taxonomic classification and placement of viral genomes, such as SARS-CoV-2. Our solution, CoViT, places newly acquired samples onto the tree of SARS-CoV-2 lineages. One of the two potential placements returned by CoViT is the true one with a probability of 99.0%. The probability of the correct placement being found among five potential placements generated by CoViT is 99.8%. The placement time is 1.45ms per individual genome running on NVIDIA's GeForce RTX 2080 Ti GPU. We make CoViT available to the research community through GitHub: https://github.com/zuherJahshan/covit.
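The 99.0% and 99.8% figures in the abstract above read as top-2 and top-5 accuracy: a sample counts as correct if the true lineage appears anywhere among the k highest-ranked placements. The following is a minimal sketch of that metric, not the CoViT implementation. The function name `top_k_accuracy`, the lineage names, and the scores are all made up for illustration; only the 1.45 ms per-genome time comes from the abstract.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true label is among the k highest-scoring candidates."""
    hits = 0
    for sample_scores, truth in zip(scores, true_labels):
        # Rank candidate labels from highest to lowest score.
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Hypothetical toy data: three samples, three candidate lineages each.
scores = [
    {"B.1.1.7": 0.70, "B.1.617.2": 0.20, "BA.1": 0.10},
    {"B.1.1.7": 0.40, "B.1.617.2": 0.35, "BA.1": 0.25},
    {"B.1.1.7": 0.50, "B.1.617.2": 0.30, "BA.1": 0.20},
]
true_labels = ["B.1.1.7", "B.1.617.2", "BA.1"]

top1 = top_k_accuracy(scores, true_labels, k=1)
top2 = top_k_accuracy(scores, true_labels, k=2)

# Throughput implied by the reported 1.45 ms placement time per genome.
genomes_per_second = 1000 / 1.45

print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
print(f"about {genomes_per_second:.0f} genomes per second")
```

Widening k can only raise the score, which is consistent with the reported top-5 figure being higher than the top-2 figure.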

Collaborative Feature Maps of Networks and Hosts for AI-driven Intrusion Detection

Authors: Jinxin Liu, Murat Simsek, Burak Kantarci, Mehran Bagheri, Petar Djukic
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2208.05085
Pdf link: https://arxiv.org/pdf/2208.05085
Abstract Intrusion Detection Systems (IDS) are critical security mechanisms that protect against a wide variety of network threats and malicious behaviors on networks or hosts. As both Network-based IDS (NIDS) and Host-based IDS (HIDS) have been widely investigated, this paper aims to present a Combined Intrusion Detection System (CIDS) that integrates network and host data in order to improve IDS performance. Due to the scarcity of datasets that include both network packet and host data, we present a novel CIDS dataset formation framework that can handle log files from a variety of operating systems and align log entities with network flows. A new CIDS dataset named SCVIC-CIDS-2021 is derived from the meta-data of the well-known benchmark dataset, CIC-IDS-2018, by utilizing the proposed framework. Furthermore, a transformer-based deep learning model named CIDS-Net is proposed that can take network flow and host features as inputs and outperform baseline models that rely on network flow features only. Experimental results to evaluate the proposed CIDS-Net under the SCVIC-CIDS-2021 dataset support the hypothesis for the benefits of combining host and flow features as the proposed CIDS-Net can improve the macro F1 score of baseline solutions by 6.36% (up to 99.89%).
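The macro F1 score quoted in the abstract above is the unweighted mean of the per-class F1 scores, so rare attack classes weigh as much as the common benign class. Below is a minimal sketch of the standard definition, assuming plain label lists. The function name `macro_f1` and the toy labels are illustrative and are not taken from the paper or the SCVIC-CIDS-2021 dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); the guard is only defensive.
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical toy labels: four benign flows and two attack flows.
y_true = ["benign", "benign", "benign", "benign", "attack", "attack"]
y_pred = ["benign", "benign", "benign", "attack", "attack", "benign"]

score = macro_f1(y_true, y_pred)
print(f"macro F1: {score:.3f}")
```

In this toy case plain accuracy is 4/6, while the macro F1 is pulled down by the weaker attack class (per-class F1 of 0.75 and 0.5).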

Ghost-free High Dynamic Range Imaging with Context-aware Transformer

Authors: Zhen Liu, Yinglong Wang, Bing Zeng, Shuaicheng Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2208.05114
Pdf link: https://arxiv.org/pdf/2208.05114
Abstract High dynamic range (HDR) deghosting algorithms aim to generate ghost-free HDR images with realistic details. Restricted by the locality of the receptive field, existing CNN-based methods are typically prone to producing ghosting artifacts and intensity distortions in the presence of large motion and severe saturation. In this paper, we propose a novel Context-Aware Vision Transformer (CA-ViT) for ghost-free high dynamic range imaging. The CA-ViT is designed as a dual-branch architecture, which can jointly capture both global and local dependencies. Specifically, the global branch employs a window-based Transformer encoder to model long-range object movements and intensity variations to solve ghosting. For the local branch, we design a local context extractor (LCE) to capture short-range image features and use the channel attention mechanism to select informative local details across the extracted features to complement the global branch. By incorporating the CA-ViT as basic components, we further build the HDR-Transformer, a hierarchical network to reconstruct high-quality ghost-free HDR images. Extensive experiments on three benchmark datasets show that our approach outperforms state-of-the-art methods qualitatively and quantitatively with considerably reduced computational budgets. Codes are available at https://github.com/megvii-research/HDR-Transformer

Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization

Authors: Zhengang Li, Mengshu Sun, Alec Lu, Haoyu Ma, Geng Yuan, Yanyue Xie, Hao Tang, Yanyu Li, Miriam Leeser, Zhangyang Wang, Xue Lin, Zhenman Fang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2208.05163
Pdf link: https://arxiv.org/pdf/2208.05163
Abstract Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demand impose urgent needs for new hardware accelerator design methodology. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, this is the first FPGA-based ViT acceleration framework exploring model quantization. Compared with state-of-the-art ViT quantization work (algorithmic approach only without hardware acceleration), our quantization achieves 0.47% to 1.36% higher Top-1 accuracy under the same bit-width. Compared with the 32-bit floating-point baseline FPGA accelerator, our accelerator achieves around 5.6x improvement on the frame rate (i.e., 56.8 FPS vs. 10.0 FPS) with 0.71% accuracy drop on ImageNet dataset for DeiT-base.

Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer

Authors: Zhipeng Luo, Changqing Zhou, Liang Pan, Gongjie Zhang, Tianrui Liu, Yueru Luo, Haiyu Zhao, Ziwei Liu, Shijian Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2208.05216
Pdf link: https://arxiv.org/pdf/2208.05216
Abstract With the prevalence of LiDAR sensors in autonomous driving, 3D object tracking has received increasing attention. In a point cloud sequence, 3D object tracking aims to predict the location and orientation of an object in consecutive frames given an object template. Motivated by the success of transformers, we propose Point Tracking TRansformer (PTTR), which efficiently predicts high-quality 3D tracking results in a coarse-to-fine manner with the help of transformer operations. PTTR consists of three novel designs. 1) Instead of random sampling, we design Relation-Aware Sampling to preserve relevant points to the given template during subsampling. 2) We propose a Point Relation Transformer for effective feature aggregation and feature matching between the template and search region. 3) Based on the coarse tracking results, we employ a novel Prediction Refinement Module to obtain the final refined prediction through local feature pooling. In addition, motivated by the favorable properties of the Bird's-Eye View (BEV) of point clouds in capturing object motion, we further design a more advanced framework named PTTR++, which incorporates both the point-wise view and BEV representation to exploit their complementary effect in generating high-quality tracking results. PTTR++ substantially boosts the tracking performance on top of PTTR with low computational overhead. Extensive experiments over multiple datasets show that our proposed approaches achieve superior 3D tracking accuracy and efficiency.

Multi-scale Feature Aggregation for Crowd Counting

Authors: Xiaoheng Jiang, Xinyi Wu, Hisham Cholakkal, Rao Muhammad Anwer, Jiale Cao, Mingliang Xu, Bing Zhou, Yanwei Pang, Fahad Shahbaz Khan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2208.05256
Pdf link: https://arxiv.org/pdf/2208.05256
Abstract Convolutional Neural Network (CNN) based crowd counting methods have achieved promising results in the past few years. However, the scale variation problem is still a huge challenge for accurate count estimation. In this paper, we propose a multi-scale feature aggregation network (MSFANet) that can alleviate this problem to some extent. Specifically, our approach consists of two feature aggregation modules: the short aggregation (ShortAgg) and the skip aggregation (SkipAgg). The ShortAgg module aggregates the features of the adjacent convolution blocks. Its purpose is to make features with different receptive fields fused gradually from the bottom to the top of the network. The SkipAgg module directly propagates features with small receptive fields to features with much larger receptive fields. Its purpose is to promote the fusion of features with small and large receptive fields. Especially, the SkipAgg module introduces the local self-attention features from the Swin Transformer blocks to incorporate rich spatial information. Furthermore, we present a local-and-global based counting loss by considering the non-uniform crowd distribution. Extensive experiments on four challenging datasets (ShanghaiTech dataset, UCF_CC_50 dataset, UCF-QNRF Dataset, WorldExpo'10 dataset) demonstrate the proposed easy-to-implement MSFANet can achieve promising results when compared with the previous state-of-the-art approaches.

Arbitrary Point Cloud Upsampling with Spherical Mixture of Gaussians

Authors: Anthony Dell'Eva, Marco Orsingher, Massimo Bertozzi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2208.05274
Pdf link: https://arxiv.org/pdf/2208.05274
Abstract Generating dense point clouds from sparse raw data benefits downstream 3D understanding tasks, but existing models are limited to a fixed upsampling ratio or to a short range of integer values. In this paper, we present APU-SMOG, a Transformer-based model for Arbitrary Point cloud Upsampling (APU). The sparse input is firstly mapped to a Spherical Mixture of Gaussians (SMOG) distribution, from which an arbitrary number of points can be sampled. Then, these samples are fed as queries to the Transformer decoder, which maps them back to the target surface. Extensive qualitative and quantitative evaluations show that APU-SMOG outperforms state-of-the-art fixed-ratio methods, while effectively enabling upsampling with any scaling factor, including non-integer values, with a single trained model. The code will be made available.

Multi-task Active Learning for Pre-trained Transformer-based Models

Authors: Guy Rotman, Roi Reichart
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2208.05379
Pdf link: https://arxiv.org/pdf/2208.05379
Abstract Multi-task learning, in which several tasks are jointly learned by a single model, allows NLP models to share information from multiple annotations and may facilitate better predictions when the tasks are inter-related. This technique, however, requires annotating the same text with multiple annotation schemes which may be costly and laborious. Active learning (AL) has been demonstrated to optimize annotation processes by iteratively selecting unlabeled examples whose annotation is most valuable for the NLP model. Yet, multi-task active learning (MT-AL) has not been applied to state-of-the-art pre-trained Transformer-based NLP models. This paper aims to close this gap. We explore various multi-task selection criteria in three realistic multi-task scenarios, reflecting different relations between the participating tasks, and demonstrate the effectiveness of multi-task compared to single-task selection. Our results suggest that MT-AL can be effectively used in order to minimize annotation efforts for multi-task NLP models.

Keyword: scene understanding

The Relative Importance of Depth Cues and Semantic Edges for Indoor Mobility Using Simulated Prosthetic Vision in Immersive Virtual Reality

Authors: Alex Rasla, Michael Beyeler
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2208.05066
Pdf link: https://arxiv.org/pdf/2208.05066
Abstract Visual neuroprostheses (bionic eyes) have the potential to treat degenerative eye diseases that often result in low vision or complete blindness. These devices rely on an external camera to capture the visual scene, which is then translated frame-by-frame into an electrical stimulation pattern that is sent to the implant in the eye. To highlight more meaningful information in the scene, recent studies have tested the effectiveness of deep-learning based computer vision techniques, such as depth estimation to highlight nearby obstacles (DepthOnly mode) and semantic edge detection to outline important objects in the scene (EdgesOnly mode). However, nobody has attempted to combine the two, either by presenting them together (EdgesAndDepth) or by giving the user the ability to flexibly switch between them (EdgesOrDepth). Here, we used a neurobiologically inspired model of simulated prosthetic vision (SPV) in an immersive virtual reality (VR) environment to test the relative importance of semantic edges and relative depth cues to support the ability to avoid obstacles and identify objects. We found that participants were significantly better at avoiding obstacles using depth-based cues as opposed to relying on edge information alone, and that roughly half the participants preferred the flexibility to switch between modes (EdgesOrDepth). This study highlights the relative importance of depth cues for SPV mobility and is an important first step towards a visual neuroprosthesis that uses computer vision to improve a user's scene understanding.

RWSeg: Cross-graph Competing Random Walks for Weakly Supervised 3D Instance Segmentation

Authors: Shichao Dong, Ruibo Li, Jiacheng Wei, Fayao Liu, Guosheng Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2208.05110
Pdf link: https://arxiv.org/pdf/2208.05110
Abstract Instance segmentation on 3D point clouds has been attracting increasing attention due to its wide applications, especially in scene understanding areas. However, most existing methods require training data to be fully annotated. Manually preparing ground-truth labels at point-level is very cumbersome and labor-intensive. To address this issue, we propose a novel weakly supervised method RWSeg that only requires labeling one object with one point. With these sparse weak labels, we introduce a unified framework with two branches to propagate semantic and instance information respectively to unknown regions, using self-attention and random walk. Furthermore, we propose a Cross-graph Competing Random Walks (CGCRW) algorithm which encourages competition among different instance graphs to resolve ambiguities in closely placed objects and improve the performance on instance assignment. RWSeg can generate qualitative instance-level pseudo labels. Experimental results on ScanNet-v2 and S3DIS datasets show that our approach achieves comparable performance with fully-supervised methods and outperforms previous weakly-supervised methods by large margins. This is the first work that bridges the gap between weak and full supervision in the area.

Keyword: visual reasoning

There is no result

Aug 11 '22 03:08 DongZhouGu
