Paper-Daily-Notice
New submissions for Thu, 7 Jul 22
Keyword: SLAM
There is no result
Keyword: odometry
RoVaR: Robust Multi-agent Tracking through Dual-layer Diversity in Visual and RF Sensor Fusion
- Authors: Mallesham Dasari, Ramanujan K Sheshadri, Karthikeyan Sundaresan, Samir R. Das
- Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2207.02792
- Pdf link: https://arxiv.org/pdf/2207.02792
- Abstract The plethora of sensors in our commodity devices provides a rich substrate for sensor-fused tracking. Yet, today's solutions are unable to deliver robust and highly accurate tracking across multiple agents in practical, everyday environments - a feature central to the future of immersive and collaborative applications. This can be attributed to the limited scope of diversity leveraged by these fusion solutions, preventing them from catering to the multiple dimensions of accuracy, robustness (diverse environmental conditions) and scalability (multiple agents) simultaneously. In this work, we take an important step towards this goal by introducing the notion of dual-layer diversity to the problem of sensor fusion in multi-agent tracking. We demonstrate that the fusion of complementary tracking modalities, namely passive/relative tracking (e.g., visual odometry) and active/absolute tracking (e.g., infrastructure-assisted RF localization), offers a key first layer of diversity that brings scalability, while the second layer of diversity lies in the methodology of fusion, where we bring together the complementary strengths of algorithmic (for robustness) and data-driven (for accuracy) approaches. RoVaR is an embodiment of such a dual-layer diversity approach that intelligently attends to cross-modal information using algorithmic and data-driven techniques that jointly share the burden of accurately tracking multiple agents in the wild. Extensive evaluations reveal RoVaR's multi-dimensional benefits in terms of tracking accuracy (median of 15 cm), robustness (in unseen environments), and light weight (it runs in real time on mobile platforms such as the Jetson Nano/TX2), enabling practical multi-agent immersive applications in everyday environments.
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: lidar
There is no result
Keyword: loop detection
There is no result
Keyword: nerf
SNeRF: Stylized Neural Implicit Representations for 3D Scenes
- Authors: Thu Nguyen-Phuoc, Feng Liu, Lei Xiao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02363
- Pdf link: https://arxiv.org/pdf/2207.02363
- Abstract This paper presents a stylized novel view synthesis method. Applying state-of-the-art stylization methods to novel views frame by frame often causes jittering artifacts due to the lack of cross-view consistency. Therefore, this paper investigates 3D scene stylization that provides a strong inductive bias for consistent novel view synthesis. Specifically, we adopt the emerging neural radiance fields (NeRF) as our choice of 3D scene representation for their capability to render high-quality novel views for a variety of scenes. However, as rendering a novel view from a NeRF requires a large number of samples, training a stylized NeRF requires a large amount of GPU memory that goes beyond an off-the-shelf GPU capacity. We introduce a new training method to address this problem by alternating the NeRF and stylization optimization steps. Such a method enables us to make full use of our hardware memory capacity to both generate images at higher resolution and adopt more expressive image style transfer methods. Our experiments show that our method produces stylized NeRFs for a wide range of content, including indoor, outdoor and dynamic scenes, and synthesizes high-quality novel views with cross-view consistency.
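The alternating-optimization idea can be illustrated with a minimal, hypothetical sketch (toy model and loss functions invented for illustration; this is not the authors' code): only one of the two objectives is backpropagated per step, which is what keeps peak GPU memory bounded.

```python
import torch

# Toy stand-ins for a radiance field and a style objective; purely illustrative.
nerf = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
opt = torch.optim.Adam(nerf.parameters(), lr=1e-3)

def reconstruction_loss(model, batch):
    xyz, rgb = batch                      # sampled points and target colors
    return ((model(xyz) - rgb) ** 2).mean()

def style_loss(model, batch):
    xyz, _ = batch                        # placeholder for an image-space style objective
    return model(xyz).abs().mean()

data = [(torch.rand(128, 3), torch.rand(128, 3)) for _ in range(10)]

for step, batch in enumerate(data):
    # Alternate: even steps fit the scene, odd steps push the stylization objective,
    # so only one objective's graph is held in memory per step.
    loss = reconstruction_loss(nerf, batch) if step % 2 == 0 else style_loss(nerf, batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```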
VMRF: View Matching Neural Radiance Fields
- Authors: Jiahui Zhang, Fangneng Zhan, Rongliang Wu, Yingchen Yu, Wenqing Zhang, Bai Song, Xiaoqin Zhang, Shijian Lu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02621
- Pdf link: https://arxiv.org/pdf/2207.02621
- Abstract Neural Radiance Fields (NeRF) have demonstrated very impressive performance in novel view synthesis via implicitly modelling 3D representations from multi-view 2D images. However, most existing studies train NeRF models with either reasonable camera pose initialization or manually-crafted camera pose distributions which are often unavailable or hard to acquire in various real-world data. We design VMRF, an innovative view matching NeRF that enables effective NeRF training without requiring prior knowledge in camera poses or camera pose distributions. VMRF introduces a view matching scheme, which exploits unbalanced optimal transport to produce a feature transport plan for mapping a rendered image with randomly initialized camera pose to the corresponding real image. With the feature transport plan as the guidance, a novel pose calibration technique is designed which rectifies the initially randomized camera poses by predicting relative pose transformations between the pair of rendered and real images. Extensive experiments over a number of synthetic and real datasets show that the proposed VMRF outperforms the state-of-the-art qualitatively and quantitatively by large margins.
Keyword: mapping
AI-enhanced iterative solvers for accelerating the solution of large scale parametrized linear systems of equations
- Authors: Stefanos Nikolopoulos, Ioannis Kalogeris, Vissarion Papadopoulos, George Stavroulakis
- Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2207.02543
- Pdf link: https://arxiv.org/pdf/2207.02543
- Abstract Recent advances in the field of machine learning open a new era in high performance computing. Applications of machine learning algorithms for the development of accurate and cost-efficient surrogates of complex problems have already attracted major attention from scientists. Despite their powerful approximation capabilities, however, surrogates cannot produce the 'exact' solution to the problem. To address this issue, this paper exploits up-to-date ML tools and delivers customized iterative solvers of linear equation systems, capable of solving large-scale parametrized problems at any desired level of accuracy. Specifically, the proposed approach consists of the following two steps. At first, a reduced set of model evaluations is performed and the corresponding solutions are used to establish an approximate mapping from the problem's parametric space to its solution space using deep feedforward neural networks and convolutional autoencoders. This mapping serves as a means to obtain very accurate initial predictions of the system's response to new query points at negligible computational cost. Subsequently, an iterative solver inspired by the Algebraic Multigrid method in combination with Proper Orthogonal Decomposition, termed POD-2G, is developed that successively refines the initial predictions towards the exact system solutions. The application of POD-2G as a standalone solver or as a preconditioner in the context of preconditioned conjugate gradient methods is demonstrated on several numerical examples of large scale systems, with the results indicating its superiority over conventional iterative solution schemes.
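As a rough, generic illustration of the first step (a surrogate prediction used as the initial guess of an iterative solver; the POD-2G refinement itself is not reproduced here, and the "surrogate" below is simulated by perturbing the exact solution), warm-starting a standard conjugate-gradient solve looks like this:

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)          # symmetric positive definite test system
b = rng.standard_normal(n)
x_true = np.linalg.solve(A, b)

# Stand-in for a neural-network surrogate: a prediction with ~1% relative error
# (simulated here by perturbing the exact solution).
x_surrogate = x_true * (1.0 + 0.01 * rng.standard_normal(n))

iters = {"cold": 0, "warm": 0}
def counter(key):
    def cb(xk):
        iters[key] += 1
    return cb

x_cold, _ = cg(A, b, atol=1e-10, callback=counter("cold"))                  # zero initial guess
x_warm, _ = cg(A, b, x0=x_surrogate, atol=1e-10, callback=counter("warm"))  # surrogate warm start
print(iters)   # the warm-started solve typically converges in fewer iterations
```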
VMRF: View Matching Neural Radiance Fields
- Authors: Jiahui Zhang, Fangneng Zhan, Rongliang Wu, Yingchen Yu, Wenqing Zhang, Bai Song, Xiaoqin Zhang, Shijian Lu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02621
- Pdf link: https://arxiv.org/pdf/2207.02621
- Abstract Neural Radiance Fields (NeRF) have demonstrated very impressive performance in novel view synthesis via implicitly modelling 3D representations from multi-view 2D images. However, most existing studies train NeRF models with either reasonable camera pose initialization or manually-crafted camera pose distributions which are often unavailable or hard to acquire in various real-world data. We design VMRF, an innovative view matching NeRF that enables effective NeRF training without requiring prior knowledge in camera poses or camera pose distributions. VMRF introduces a view matching scheme, which exploits unbalanced optimal transport to produce a feature transport plan for mapping a rendered image with randomly initialized camera pose to the corresponding real image. With the feature transport plan as the guidance, a novel pose calibration technique is designed which rectifies the initially randomized camera poses by predicting relative pose transformations between the pair of rendered and real images. Extensive experiments over a number of synthetic and real datasets show that the proposed VMRF outperforms the state-of-the-art qualitatively and quantitatively by large margins.
Deep Learning approach for Classifying Trusses and Runners of Strawberries
- Authors: Jakub Pomykala, Francisco de Lemos, Isibor Kennedy Ihianle, David Ada Adama, Pedro Machado
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2207.02721
- Pdf link: https://arxiv.org/pdf/2207.02721
- Abstract The use of artificial intelligence in the agricultural sector has been growing at a rapid rate to automate farming activities. Emergent farming technologies focus on mapping and classification of plants, fruits, diseases, and soil types. Although assisted harvesting and pruning applications using deep learning algorithms are in the early development stages, there is a demand for solutions to automate such processes. This paper proposes the use of Deep Learning for the classification of trusses and runners of strawberry plants using semantic segmentation and dataset augmentation. The proposed approach is based on the use of noise (i.e. Gaussian, Speckle, Poisson and Salt-and-Pepper) to artificially augment the dataset, compensating for the low number of data samples and increasing the overall classification performance. The results are evaluated using the mean average of precision, recall and F1 score. The proposed approach achieved 91%, 95% and 92% on precision, recall and F1 score, respectively, for truss detection using the ResNet101 with dataset augmentation utilising Salt-and-Pepper noise; and 83%, 53% and 65% on precision, recall and F1 score, respectively, for truss detection using the ResNet50 with dataset augmentation utilising Poisson noise.
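A minimal sketch of this style of noise-based augmentation (noise strengths are arbitrary choices, not the paper's settings):

```python
import numpy as np

def augment(img, kind, rng=None):
    """Return a noisy copy of a float image in [0, 1]; `kind` selects the noise model."""
    if rng is None:
        rng = np.random.default_rng()
    img = img.astype(np.float64)
    if kind == "gaussian":
        out = img + rng.normal(0.0, 0.05, img.shape)
    elif kind == "speckle":
        out = img * (1.0 + rng.normal(0.0, 0.1, img.shape))      # multiplicative noise
    elif kind == "poisson":
        out = rng.poisson(img * 255.0) / 255.0                   # shot noise on pixel counts
    elif kind == "salt_pepper":
        out = img.copy()
        mask = rng.random(img.shape)
        out[mask < 0.01] = 0.0                                   # pepper
        out[mask > 0.99] = 1.0                                   # salt
    else:
        raise ValueError(kind)
    return np.clip(out, 0.0, 1.0)

image = np.random.rand(64, 64, 3)
augmented = [augment(image, k) for k in ("gaussian", "speckle", "poisson", "salt_pepper")]
```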
Keyword: localization
Leveraging Trajectory Prediction for Pedestrian Video Anomaly Detection
- Authors: Asiegbu Miracle Kanu-Asiegbu, Ram Vasudevan, Xiaoxiao Du
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02279
- Pdf link: https://arxiv.org/pdf/2207.02279
- Abstract Video anomaly detection is a core problem in vision. Correctly detecting and identifying anomalous behaviors in pedestrians from video data will enable safety-critical applications such as surveillance, activity monitoring, and human-robot interaction. In this paper, we propose to leverage trajectory localization and prediction for unsupervised pedestrian anomaly event detection. Different from previous reconstruction-based approaches, our proposed framework relies on the prediction errors of normal and abnormal pedestrian trajectories to detect anomalies spatially and temporally. We present experimental results on real-world benchmark datasets on varying timescales and show that our proposed trajectory-predictor-based anomaly detection pipeline is effective and efficient at identifying anomalous activities of pedestrians in videos. Code will be made available at https://github.com/akanuasiegbu/Leveraging-Trajectory-Prediction-for-Pedestrian-Video-Anomaly-Detection.
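The underlying idea, scoring a trajectory by how badly a predictor's forecast misses the observed future, can be sketched with a trivial constant-velocity predictor; the paper uses learned trajectory predictors, so this is only a conceptual stand-in:

```python
import numpy as np

def constant_velocity_forecast(past, horizon):
    """Extrapolate the last observed velocity; `past` has shape (T, 2)."""
    v = past[-1] - past[-2]
    return past[-1] + np.arange(1, horizon + 1)[:, None] * v

def anomaly_score(past, future):
    pred = constant_velocity_forecast(past, len(future))
    return float(np.linalg.norm(pred - future, axis=1).mean())   # mean displacement error

past = np.stack([np.arange(8), np.arange(8)], axis=1).astype(float)
normal_future = past[-1] + np.arange(1, 5)[:, None] * 1.0        # keeps walking straight
abnormal_future = past[-1] + np.array([[3, -2], [6, -5], [9, -9], [12, -14]], dtype=float)

threshold = 1.0
for name, fut in [("normal", normal_future), ("abnormal", abnormal_future)]:
    s = anomaly_score(past, fut)
    print(name, round(s, 2), "anomalous" if s > threshold else "ok")
```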
White Matter Tracts are Point Clouds: Neuropsychological Score Prediction and Critical Region Localization via Geometric Deep Learning
- Authors: Yuqian Chen, Fan Zhang, Chaoyi Zhang, Tengfei Xue, Leo R. Zekelman, Jianzhong He, Yang Song, Nikos Makris, Yogesh Rathi, Alexandra J. Golby, Weidong Cai, Lauren J. O'Donnell
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02402
- Pdf link: https://arxiv.org/pdf/2207.02402
- Abstract White matter tract microstructure has been shown to influence neuropsychological scores of cognitive performance. However, prediction of these scores from white matter tract data has not been attempted. In this paper, we propose a deep-learning-based framework for neuropsychological score prediction using microstructure measurements estimated from diffusion magnetic resonance imaging (dMRI) tractography, focusing on predicting performance on a receptive vocabulary assessment task based on a critical fiber tract for language, the arcuate fasciculus (AF). We directly utilize information from all points in a fiber tract, without the need to average data along the fiber as is traditionally required by diffusion MRI tractometry methods. Specifically, we represent the AF as a point cloud with microstructure measurements at each point, enabling adoption of point-based neural networks. We improve prediction performance with the proposed Paired-Siamese Loss that utilizes information about differences between continuous neuropsychological scores. Finally, we propose a Critical Region Localization (CRL) algorithm to localize informative anatomical regions containing points with strong contributions to the prediction results. Our method is evaluated on data from 806 subjects from the Human Connectome Project dataset. Results demonstrate superior neuropsychological score prediction performance compared to baseline methods. We discover that critical regions in the AF are strikingly consistent across subjects, with the highest number of strongly contributing points located in frontal cortical regions (i.e., the rostral middle frontal, pars opercularis, and pars triangularis), which are strongly implicated as critical areas for language processes.
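One plausible reading of a pairwise loss on score differences (an assumption about its form, not the authors' exact definition) is to penalize the mismatch between predicted and ground-truth score differences for pairs of subjects:

```python
import torch

def paired_difference_loss(pred_a, pred_b, target_a, target_b):
    # Match the *difference* of predicted scores to the difference of true scores,
    # so the model is supervised on relative ordering as well as absolute values.
    return torch.mean(((pred_a - pred_b) - (target_a - target_b)) ** 2)

pred_a, pred_b = torch.tensor([95.0, 88.0]), torch.tensor([90.0, 92.0])
true_a, true_b = torch.tensor([96.0, 85.0]), torch.tensor([89.0, 93.0])
print(paired_difference_loss(pred_a, pred_b, true_a, true_b))
```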
GAMa: Cross-view Video Geo-localization
- Authors: Shruti Vyas, Chen Chen, Mubarak Shah
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2207.02431
- Pdf link: https://arxiv.org/pdf/2207.02431
- Abstract The existing work in cross-view geo-localization is based on images where a ground panorama is matched to an aerial image. In this work, we focus on ground videos instead of images, which provide additional contextual cues that are important for this task. There are no existing datasets for this problem, therefore we propose the GAMa dataset, a large-scale dataset with ground videos and corresponding aerial images. We also propose a novel approach to solve this problem. At the clip level, a short video clip is matched with the corresponding aerial image and is later used to get video-level geo-localization of a long video. Moreover, we propose a hierarchical approach to further improve the clip-level geo-localization. It is a challenging dataset, with unaligned data and a limited field of view, and our proposed method achieves a Top-1 recall rate of 19.4% and 45.1% @1.0 mile. Code and dataset are available at the following link: https://github.com/svyas23/GAMa.
GLENet: Boosting 3D Object Detectors with Generative Label Uncertainty Estimation
- Authors: Yifan Zhang, Qijian Zhang, Zhiyu Zhu, Junhui Hou, Yixuan Yuan
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02466
- Pdf link: https://arxiv.org/pdf/2207.02466
- Abstract The inherent ambiguity in ground-truth annotations of 3D bounding boxes caused by occlusions, signal missing, or manual annotation errors can confuse deep 3D object detectors during training, thus deteriorating the detection accuracy. However, existing methods overlook such issues to some extent and treat the labels as deterministic. In this paper, we propose GLENet, a generative label uncertainty estimation framework adapted from conditional variational autoencoders, to model the one-to-many relationship between a typical 3D object and its potential ground-truth bounding boxes with latent variables. The label uncertainty generated by GLENet is a plug-and-play module and can be conveniently integrated into existing deep 3D detectors to build probabilistic detectors and supervise the learning of the localization uncertainty. Besides, we propose an uncertainty-aware quality estimator architecture in probabilistic detectors to guide the training of IoU-branch with predicted localization uncertainty. We incorporate the proposed methods into various popular base 3D detectors and observe that their performance is significantly boosted to the current state-of-the-art over the Waymo Open dataset and KITTI dataset.
PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning
- Authors: Yifan Lu, Ziqi Zhang, Yuxin Chen, Chunfeng Yuan, Bing Li, Weiming Hu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02583
- Pdf link: https://arxiv.org/pdf/2207.02583
- Abstract The task of Dense Video Captioning (DVC) aims to generate captions with timestamps for multiple events in one video. Semantic information plays an important role for both localization and description of DVC. We present a semantic-assisted dense video captioning model based on the encoding-decoding framework. In the encoding stage, we design a concept detector to extract semantic information, which is then fused with multi-modal visual features to sufficiently represent the input video. In the decoding stage, we design a classification head, paralleled with the localization and captioning heads, to provide semantic supervision. Our method achieves significant improvements on the YouMakeup dataset under DVC evaluation metrics and achieves high performance in the Makeup Dense Video Captioning (MDVC) task of PIC 4th Challenge.
Team PKU-WICT-MIPL PIC Makeup Temporal Video Grounding Challenge 2022 Technical Report
- Authors: Minghang Zheng, Dejie Yang, Zhongjie Ye, Ting Lei, Yuxin Peng, Yang Liu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02687
- Pdf link: https://arxiv.org/pdf/2207.02687
- Abstract In this technical report, we briefly introduce the solutions of our team 'PKU-WICT-MIPL' for the PIC Makeup Temporal Video Grounding (MTVG) Challenge in ACM-MM 2022. Given an untrimmed makeup video and a step query, the MTVG task aims to localize a temporal moment of the target makeup step in the video. To tackle this task, we propose a phrase relationship mining framework to exploit the temporal localization relationship relevant to the fine-grained phrase and the whole sentence. Besides, we propose to constrain the localization results of different step sentence queries to not overlap with each other through a dynamic programming algorithm. The experimental results demonstrate the effectiveness of our method. Our final submission ranked 2nd on the leaderboard, with only a 0.55% gap from the first.
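The non-overlap constraint can be enforced with a small dynamic program. The sketch below is a generic formulation under the assumption that the step queries occur in temporal order (not necessarily the team's exact algorithm): it picks one candidate segment per query so that segments do not overlap and the total score is maximal.

```python
def assign_non_overlapping(candidates):
    """candidates[q] = list of (start, end, score) for query q, queries in temporal order.
    Returns the best total score and one chosen segment per query (or None if infeasible)."""
    NEG = float("-inf")
    Q = len(candidates)
    best = [[NEG] * len(c) for c in candidates]   # best[q][i]: best total ending with candidate i
    back = [[None] * len(c) for c in candidates]
    for i, (_, _, s) in enumerate(candidates[0]):
        best[0][i] = s
    for q in range(1, Q):
        for i, (start, _, s) in enumerate(candidates[q]):
            for j, (_, prev_end, _) in enumerate(candidates[q - 1]):
                if prev_end <= start and best[q - 1][j] + s > best[q][i]:
                    best[q][i] = best[q - 1][j] + s
                    back[q][i] = j
    i = max(range(len(candidates[-1])), key=lambda k: best[-1][k])
    if best[-1][i] == NEG:
        return None
    picks = []
    for q in range(Q - 1, -1, -1):
        picks.append(candidates[q][i])
        i = back[q][i]
    return sum(p[2] for p in picks), picks[::-1]

steps = [[(0, 5, 0.9), (2, 6, 0.95)], [(4, 9, 0.8), (6, 11, 0.7)]]
print(assign_non_overlapping(steps))   # picks (2, 6) then (6, 11): non-overlapping, max score
```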
RoVaR: Robust Multi-agent Tracking through Dual-layer Diversity in Visual and RF Sensor Fusion
- Authors: Mallesham Dasari, Ramanujan K Sheshadri, Karthikeyan Sundaresan, Samir R. Das
- Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2207.02792
- Pdf link: https://arxiv.org/pdf/2207.02792
- Abstract The plethora of sensors in our commodity devices provides a rich substrate for sensor-fused tracking. Yet, today's solutions are unable to deliver robust and highly accurate tracking across multiple agents in practical, everyday environments - a feature central to the future of immersive and collaborative applications. This can be attributed to the limited scope of diversity leveraged by these fusion solutions, preventing them from catering to the multiple dimensions of accuracy, robustness (diverse environmental conditions) and scalability (multiple agents) simultaneously. In this work, we take an important step towards this goal by introducing the notion of dual-layer diversity to the problem of sensor fusion in multi-agent tracking. We demonstrate that the fusion of complementary tracking modalities, namely passive/relative tracking (e.g., visual odometry) and active/absolute tracking (e.g., infrastructure-assisted RF localization), offers a key first layer of diversity that brings scalability, while the second layer of diversity lies in the methodology of fusion, where we bring together the complementary strengths of algorithmic (for robustness) and data-driven (for accuracy) approaches. RoVaR is an embodiment of such a dual-layer diversity approach that intelligently attends to cross-modal information using algorithmic and data-driven techniques that jointly share the burden of accurately tracking multiple agents in the wild. Extensive evaluations reveal RoVaR's multi-dimensional benefits in terms of tracking accuracy (median of 15 cm), robustness (in unseen environments), and light weight (it runs in real time on mobile platforms such as the Jetson Nano/TX2), enabling practical multi-agent immersive applications in everyday environments.
Keyword: transformer
Video-based Surgical Skills Assessment using Long term Tool Tracking
- Authors: Mona Fathollahi, Mohammad Hasan Sarhan, Ramon Pena, Lela DiMonte, Anshu Gupta, Aishani Ataliwala, Jocelyn Barker
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02247
- Pdf link: https://arxiv.org/pdf/2207.02247
- Abstract Mastering the technical skills required to perform surgery is an extremely challenging task. Video-based assessment allows surgeons to receive feedback on their technical skills to facilitate learning and development. Currently, this feedback comes primarily from manual video review, which is time-intensive and limits the feasibility of tracking a surgeon's progress over many cases. In this work, we introduce a motion-based approach to automatically assess surgical skills from surgical case video feed. The proposed pipeline first tracks surgical tools reliably to create motion trajectories and then uses those trajectories to predict surgeon technical skill levels. The tracking algorithm employs a simple yet effective re-identification module that improves ID-switch compared to other state-of-the-art methods. This is critical for creating reliable tool trajectories when instruments regularly move on- and off-screen or are periodically obscured. The motion-based classification model employs a state-of-the-art self-attention transformer network to capture short- and long-term motion patterns that are essential for skill evaluation. The proposed method is evaluated on an in-vivo (Cholec80) dataset where an expert-rated GOALS skill assessment of the Calot Triangle Dissection is used as a quantitative skill measure. We compare transformer-based skill assessment with traditional machine learning approaches using the proposed and state-of-the-art tracking. Our result suggests that using motion trajectories from reliable tracking methods is beneficial for assessing surgeon skills based solely on video streams.
Array Camera Image Fusion using Physics-Aware Transformers
- Authors: Qian Huang, Minghao Hu, David Jones Brady
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2207.02250
- Pdf link: https://arxiv.org/pdf/2207.02250
- Abstract We demonstrate a physics-aware transformer for feature-based data fusion from cameras with diverse resolution, color spaces, focal planes, focal lengths, and exposure. We also demonstrate a scalable solution for synthetic training data generation for the transformer using open-source computer graphics software. We demonstrate image synthesis on arrays with diverse spectral responses, instantaneous field of view and frame rate.
OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers
- Authors: Jialun Pei, Tianyang Cheng, Deng-Ping Fan, He Tang, Chuanbo Chen, Luc Van Gool
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02255
- Pdf link: https://arxiv.org/pdf/2207.02255
- Abstract We present OSFormer, the first one-stage transformer framework for camouflaged instance segmentation (CIS). OSFormer is based on two key designs. First, we design a location-sensing transformer (LST) to obtain the location label and instance-aware parameters by introducing the location-guided queries and the blend-convolution feedforward network. Second, we develop a coarse-to-fine fusion (CFF) to merge diverse context information from the LST encoder and CNN backbone. Coupling these two components enables OSFormer to efficiently blend local features and long-range context dependencies for predicting camouflaged instances. Compared with two-stage frameworks, our OSFormer reaches 41% AP and achieves good convergence efficiency without requiring enormous training data, i.e., only 3,040 samples under 60 epochs. Code link: https://github.com/PJLallen/OSFormer.
Ultra-Low-Bitrate Speech Coding with Pretrained Transformers
- Authors: Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund
- Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2207.02262
- Pdf link: https://arxiv.org/pdf/2207.02262
- Abstract Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. As such, we use a pretrained Transformer in tandem with a convolutional encoder, which is trained end-to-end with a quantizer and a generative adversarial net decoder. Our numerical experiments show that supplementing the convolutional encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of $600\,\mathrm{bps}$ that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. Subjective human evaluations suggest that the quality of the resulting codec is comparable or better than that of conventional codecs operating at three to four times the rate.
Weakly Supervised Grounding for VQA in Vision-Language Transformers
- Authors: Aisha Urooj Khan, Hilde Kuehne, Chuang Gan, Niels Da Vitoria Lobo, Mubarak Shah
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02334
- Pdf link: https://arxiv.org/pdf/2207.02334
- Abstract Transformers for visual-language representation learning have been getting a lot of interest and have shown tremendous performance on visual question answering (VQA) and grounding. But most systems that show good performance on those tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this limitation, the following paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers. The approach leverages capsules by grouping each visual token in the visual encoder and uses activations from language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach on the challenging GQA as well as the VQA-HAT dataset for VQA grounding. Our experiments show that: while removing the information of masked objects from standard transformer architectures leads to a significant drop in performance, the integration of capsules significantly improves the grounding ability of such systems and provides new state-of-the-art results compared to other approaches in the field.
Multi-Label Retinal Disease Classification using Transformers
- Authors: M. A. Rodriguez, H. AlMarzouqi, P. Liatsis (Department of Electrical Engineering and Computer Science, Khalifa University)
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2207.02335
- Pdf link: https://arxiv.org/pdf/2207.02335
- Abstract Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. In this research, a novel multi-label classification system is proposed for the detection of multiple retinal diseases, using fundus images collected from a variety of sources. First, a new multi-label retinal disease dataset, the MuReD dataset, is constructed, using a number of publicly available datasets for fundus disease classification. Next, a sequence of post-processing steps is applied to ensure the quality of the image data and the range of diseases, present in the dataset. For the first time in fundus multi-label disease classification, a transformer-based model optimized through extensive experimentation is used for image analysis and decision making. Numerous experiments are performed to optimize the configuration of the proposed system. It is shown that the approach performs better than state-of-the-art works on the same task by 7.9% and 8.1% in terms of AUC score for disease detection and disease classification, respectively. The obtained results further support the potential applications of transformer-based architectures in the medical imaging field.
Generalization to translation shifts: a study in architectures and augmentations
- Authors: Suriya Gunasekar
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2207.02349
- Pdf link: https://arxiv.org/pdf/2207.02349
- Abstract We provide a detailed evaluation of various image classification architectures (convolutional, vision transformer, and fully connected MLP networks) and data augmentation techniques towards generalization to large spatial translation shifts. We make the following observations: (a) In the absence of data augmentation, all architectures, including convolutional networks, suffer degradation in performance when evaluated on translated test distributions. Understandably, both the in-distribution accuracy as well as degradation to shifts is significantly worse for non-convolutional architectures. (b) Across all architectures, even a minimal augmentation of $4$ pixel random crop improves the robustness of performance to much larger magnitude shifts of up to $1/4$ of image size ($8$-$16$ pixels) in the test data -- suggesting a form of meta generalization from augmentation. For non-convolutional architectures, while the absolute accuracy is still low, we see dramatic improvements in robustness to large translation shifts. (c) With a sufficiently advanced augmentation pipeline ($4$ pixel crop + RandAugment + Erasing + MixUp), all architectures can be trained to have competitive performance, both in terms of in-distribution accuracy as well as generalization to large translation shifts.
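For reference, a minimal torchvision sketch of the two augmentation settings discussed above (image size, probabilities and the MixUp alpha are assumptions, not the paper's exact configuration):

```python
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# Minimal 4-pixel random-crop pipeline plus the stronger variants mentioned above.
basic = transforms.Compose([
    transforms.RandomCrop(32, padding=4),        # the "4 pixel random crop"
    transforms.ToTensor(),
])
advanced = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandAugment(),                    # requires torchvision >= 0.11
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),
])

def mixup(x, y, alpha=0.2):
    """Batch-level MixUp: convex combination of inputs and (float, e.g. one-hot) targets."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

img = Image.fromarray(np.random.randint(0, 255, (32, 32, 3), dtype=np.uint8))
print(advanced(img).shape)   # torch.Size([3, 32, 32])
```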
3DG-STFM: 3D Geometric Guided Student-Teacher Feature Matching
- Authors: Runyu Mao, Chen Bai, Yatong An, Fengqing Zhu, Cheng Lu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02375
- Pdf link: https://arxiv.org/pdf/2207.02375
- Abstract We tackle the essential task of finding dense visual correspondences between a pair of images. This is a challenging problem due to various factors such as poor texture, repetitive patterns, illumination variation, and motion blur in practical scenarios. In contrast to methods that use dense correspondence ground-truths as direct supervision for local feature matching training, we train 3DG-STFM: a multi-modal matching model (Teacher) to enforce the depth consistency under 3D dense correspondence supervision and transfer the knowledge to 2D unimodal matching model (Student). Both teacher and student models consist of two transformer-based matching modules that obtain dense correspondences in a coarse-to-fine manner. The teacher model guides the student model to learn RGB-induced depth information for the matching purpose on both coarse and fine branches. We also evaluate 3DG-STFM on a model compression task. To the best of our knowledge, 3DG-STFM is the first student-teacher learning method for the local feature matching task. The experiments show that our method outperforms state-of-the-art methods on indoor and outdoor camera pose estimations, and homography estimation problems. Code is available at: https://github.com/Ryan-prime/3DG-STFM.
Swin Deformable Attention U-Net Transformer (SDAUT) for Explainable Fast MRI
- Authors: Jiahao Huang, Xiaodan Xing, Zhifan Gao, Guang Yang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2207.02390
- Pdf link: https://arxiv.org/pdf/2207.02390
- Abstract Fast MRI aims to reconstruct a high fidelity image from partially observed measurements. Exuberant development in fast MRI using deep learning has been witnessed recently. Meanwhile, novel deep learning paradigms, e.g., Transformer based models, are fast-growing in natural language processing and have been promptly developed for computer vision and medical image analysis due to their prominent performance. Nevertheless, due to the complexity of the Transformer, the application to fast MRI may not be straightforward. The main obstacle is that the computational cost of the self-attention layer, which is the core part of the Transformer, can be expensive for high resolution MRI inputs. In this study, we propose a new Transformer architecture for solving fast MRI that couples the Shifted Windows Transformer with U-Net to reduce the network complexity. We incorporate deformable attention to construe the explainability of our reconstruction model. We empirically demonstrate that our method achieves consistently superior performance on the fast MRI task. Besides, compared to state-of-the-art Transformer models, our method has fewer network parameters while revealing explainability. The code is publicly available at https://github.com/ayanglab/SDAUT.
Compute Cost Amortized Transformer for Streaming ASR
- Authors: Yi Xie, Jonathan Macoskey, Martin Radfar, Feng-Ju Chang, Brian King, Ariya Rastrow, Athanasios Mouchtaris, Grant P. Strimel
- Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2207.02393
- Pdf link: https://arxiv.org/pdf/2207.02393
- Abstract We present a streaming, Transformer-based end-to-end automatic speech recognition (ASR) architecture which achieves efficient neural inference through compute cost amortization. Our architecture creates sparse computation pathways dynamically at inference time, resulting in selective use of compute resources throughout decoding, enabling significant reductions in compute with minimal impact on accuracy. The fully differentiable architecture is trained end-to-end with an accompanying lightweight arbitrator mechanism operating at the frame-level to make dynamic decisions on each input while a tunable loss function is used to regularize the overall level of compute against predictive performance. We report empirical results from experiments using the compute amortized Transformer-Transducer (T-T) model conducted on LibriSpeech data. Our best model can achieve a 60% compute cost reduction with only a 3% relative word error rate (WER) increase.
Aspect-Based Sentiment Analysis using Local Context Focus Mechanism with DeBERTa
- Authors: Tianyu Zhao, Junping Du, Zhe Xu, Ang Li, Zeli Guan
- Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2207.02424
- Pdf link: https://arxiv.org/pdf/2207.02424
- Abstract Text sentiment analysis, also known as opinion mining, is research on the calculation of people's views, evaluations, attitudes and emotions expressed by entities. Text sentiment analysis can be divided into text-level sentiment analysis, sentence-level sentiment analysis and aspect-level sentiment analysis. Aspect-Based Sentiment Analysis (ABSA) is a fine-grained task in the field of sentiment analysis, which aims to predict the polarity of aspects. Research on pre-trained neural models has significantly improved the performance of many natural language processing tasks. In recent years, pre-trained models (PTMs) have been applied to ABSA. This raises the question of whether PTMs contain sufficient syntactic information for ABSA. In this paper, we explore the recent DeBERTa model (Decoding-enhanced BERT with disentangled attention) to solve the Aspect-Based Sentiment Analysis problem. DeBERTa is a kind of neural language model based on the transformer, which uses self-supervised learning to pre-train on a large number of original text corpora. Based on the Local Context Focus (LCF) mechanism, by integrating the DeBERTa model, we propose a multi-task learning model for aspect-based sentiment analysis. The experimental results on the most commonly used laptop and restaurant datasets of SemEval-2014 and the ACL Twitter dataset show that the LCF mechanism with DeBERTa achieves a significant improvement.
Transformers are Adaptable Task Planners
- Authors: Vidhi Jain, Yixin Lin, Eric Undersander, Yonatan Bisk, Akshara Rai
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2207.02442
- Pdf link: https://arxiv.org/pdf/2207.02442
- Abstract Every home is different, and every person likes things done in their particular way. Therefore, home robots of the future need to both reason about the sequential nature of day-to-day tasks and generalize to user's preferences. To this end, we propose a Transformer Task Planner(TTP) that learns high-level actions from demonstrations by leveraging object attribute-based representations. TTP can be pre-trained on multiple preferences and shows generalization to unseen preferences using a single demonstration as a prompt in a simulated dishwasher loading task. Further, we demonstrate real-world dish rearrangement using TTP with a Franka Panda robotic arm, prompted using a single human demonstration.
Gender Biases and Where to Find Them: Exploring Gender Bias in Pre-Trained Transformer-based Language Models Using Movement Pruning
- Authors: Przemyslaw Joniak, Akiko Aizawa
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2207.02463
- Pdf link: https://arxiv.org/pdf/2207.02463
- Abstract Language model debiasing has emerged as an important field of study in the NLP community. Numerous debiasing techniques have been proposed, but bias ablation remains an unaddressed issue. We demonstrate a novel framework for inspecting bias in pre-trained transformer-based language models via movement pruning. Given a model and a debiasing objective, our framework finds a subset of the model containing less bias than the original model. We implement our framework by pruning the model while fine-tuning it on the debiasing objective. Only the pruning scores are optimized - parameters coupled with the model's weights that act as gates. We experiment with pruning attention heads, an important building block of transformers: we prune square blocks, as well as establish a new way of pruning the entire heads. Lastly, we demonstrate the usage of our framework using gender bias, and based on our findings, we propose an improvement to an existing debiasing method. Additionally, we re-discover a bias-performance trade-off: the better the model performs, the more bias it contains.
Pure Transformers are Powerful Graph Learners
- Authors: Jinwoo Kim, Tien Dat Nguyen, Seonwoo Min, Sungjun Cho, Moontae Lee, Honglak Lee, Seunghoon Hong
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2207.02505
- Pdf link: https://arxiv.org/pdf/2207.02505
- Abstract We show that standard Transformers without graph-specific modifications can lead to promising results in graph learning both in theory and practice. Given a graph, we simply treat all nodes and edges as independent tokens, augment them with token embeddings, and feed them to a Transformer. With an appropriate choice of token embeddings, we prove that this approach is theoretically at least as expressive as an invariant graph network (2-IGN) composed of equivariant linear layers, which is already more expressive than all message-passing Graph Neural Networks (GNN). When trained on a large-scale graph dataset (PCQM4Mv2), our method coined Tokenized Graph Transformer (TokenGT) achieves significantly better results compared to GNN baselines and competitive results compared to Transformer variants with sophisticated graph-specific inductive bias. Our implementation is available at https://github.com/jw9730/tokengt.
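The central recipe, treating every node and every edge as an ordinary token with a type embedding and feeding the whole sequence to a plain Transformer encoder, can be sketched as follows (toy dimensions; the paper's orthonormal node-identifier embeddings are omitted for brevity):

```python
import torch
import torch.nn as nn

class TinyGraphTokenTransformer(nn.Module):
    def __init__(self, feat_dim, d_model=64, n_classes=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.type_emb = nn.Embedding(2, d_model)          # 0 = node token, 1 = edge token
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, node_feats, edge_feats):
        # Concatenate node tokens and edge tokens into one sequence, add type embeddings.
        tokens = torch.cat([self.proj(node_feats), self.proj(edge_feats)], dim=1)
        types = torch.cat([torch.zeros(node_feats.shape[1], dtype=torch.long),
                           torch.ones(edge_feats.shape[1], dtype=torch.long)])
        tokens = tokens + self.type_emb(types)
        h = self.encoder(tokens)                          # full self-attention over all tokens
        return self.head(h.mean(dim=1))                   # mean-pool for a graph-level output

node_feats = torch.randn(1, 5, 8)     # batch of 1 graph, 5 nodes, 8-dim features
edge_feats = torch.randn(1, 7, 8)     # 7 edges with 8-dim features
print(TinyGraphTokenTransformer(feat_dim=8)(node_feats, edge_feats).shape)  # torch.Size([1, 2])
```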
Learning to Diversify for Product Question Generation
- Authors: Haggai Roitman, Uriel Singer, Yotam Eshel, Alexander Nus, Eliyahu Kiperwasser
- Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2207.02534
- Pdf link: https://arxiv.org/pdf/2207.02534
- Abstract We address the product question generation task. For a given product description, our goal is to generate questions that reflect potential user information needs that are either missing or not well covered in the description. Moreover, we wish to cover diverse user information needs that may span a multitude of product types. To this end, we first show how the T5 pre-trained Transformer encoder-decoder model can be fine-tuned for the task. Yet, while the T5-generated questions have a reasonable quality compared to the state-of-the-art method for the task (KPCNet), many of these questions are still too general, resulting in sub-optimal global question diversity. As an alternative, we propose a novel learning-to-diversify (LTD) fine-tuning approach that allows enriching the language learned by the underlying Transformer model. Our empirical evaluation shows that using our approach significantly improves the global diversity of the underlying Transformer model, while preserving, as much as possible, its generation relevance.
Transformers discover an elementary calculation system exploiting local attention and grid-like problem representation
- Authors: Samuel Cognolato, Alberto Testolin
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2207.02536
- Pdf link: https://arxiv.org/pdf/2207.02536
- Abstract Mathematical reasoning is one of the most impressive achievements of human intellect but remains a formidable challenge for artificial intelligence systems. In this work we explore whether modern deep learning architectures can learn to solve a symbolic addition task by discovering effective arithmetic procedures. Although the problem might seem trivial at first glance, generalizing arithmetic knowledge to operations involving a higher number of terms, possibly composed by longer sequences of digits, has proven extremely challenging for neural networks. Here we show that universal transformers equipped with local attention and adaptive halting mechanisms can learn to exploit an external, grid-like memory to carry out multi-digit addition. The proposed model achieves remarkable accuracy even when tested with problems requiring extrapolation outside the training distribution; most notably, it does so by discovering human-like calculation strategies such as place value alignment.
Simple and Efficient Heterogeneous Graph Neural Network
- Authors: Xiaocheng Yang, Mingyu Yan, Shirui Pan, Xiaochun Ye, Dongrui Fan
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2207.02547
- Pdf link: https://arxiv.org/pdf/2207.02547
- Abstract Heterogeneous graph neural networks (HGNNs) deliver the powerful capability to embed rich structural and semantic information of a heterogeneous graph into low-dimensional node representations. Existing HGNNs usually learn to embed information using a hierarchical attention mechanism and repeated neighbor aggregation, suffering from unnecessary complexity and redundant computation. This paper proposes Simple and Efficient Heterogeneous Graph Neural Network (SeHGNN), which reduces this excess complexity by avoiding overused node-level attention within the same relation and pre-computing the neighbor aggregation in the pre-processing stage. Unlike previous work, SeHGNN utilizes a light-weight parameter-free neighbor aggregator to learn structural information for each metapath, and a transformer-based semantic aggregator to combine semantic information across metapaths for the final embedding of each node. As a result, SeHGNN offers a simple network structure, high prediction accuracy, and fast training speed. Extensive experiments on five real-world heterogeneous graphs demonstrate the superiority of SeHGNN over the state of the art in both accuracy and training speed. Codes are available at https://github.com/ICT-GIMLab/SeHGNN.
Cascaded Deep Hybrid Models for Multistep Household Energy Consumption Forecasting
- Authors: Lyes Saad Saoud, Hasan AlMarzouqi, Ramy Hussein
- Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
- Arxiv link: https://arxiv.org/abs/2207.02589
- Pdf link: https://arxiv.org/pdf/2207.02589
- Abstract Sustainability requires increased energy efficiency with minimal waste. Future power systems should thus provide high levels of flexibility in controlling energy consumption. Precise projections of future energy demand/load at the aggregate and individual site levels are of great importance for decision makers and professionals in the energy industry. Forecasting energy loads has become more advantageous for energy providers and customers, allowing them to establish an efficient production strategy to satisfy demand. This study introduces two hybrid cascaded models for forecasting multistep household power consumption at different resolutions. The first model integrates Stationary Wavelet Transform (SWT), as an efficient signal preprocessing technique, with Convolutional Neural Networks and Long Short-Term Memory (LSTM). The second hybrid model combines SWT with a self-attention based neural network architecture named the transformer. The major constraint of using time-frequency analysis methods such as SWT in multistep energy forecasting problems is that they require sequential signals, making signal reconstruction problematic in multistep forecasting applications. The cascaded models can efficiently address this problem by using the recursive outputs. Experimental results show that the proposed hybrid models achieve superior prediction performance compared to existing multistep power consumption prediction methods. The results will pave the way for more accurate and reliable forecasting of household power consumption.
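The SWT preprocessing step is a standard operation; a minimal PyWavelets sketch of decomposing a load series into sub-bands before any forecaster sees it (wavelet and level are arbitrary choices here):

```python
import numpy as np
import pywt

# Synthetic load series; SWT requires the length to be divisible by 2**level.
load = np.sin(np.linspace(0, 20 * np.pi, 512)) + 0.1 * np.random.randn(512)

level = 2
coeffs = pywt.swt(load, wavelet="db4", level=level)   # list of (approx, detail) pairs per level

# Each sub-band keeps the original length, so they can be stacked as model inputs.
bands = np.stack([c for pair in coeffs for c in pair])   # shape: (2 * level, 512)
print(bands.shape)

# The decomposition is invertible, which is what makes reconstruction of forecasts possible.
reconstructed = pywt.iswt(coeffs, wavelet="db4")
print(np.abs(reconstructed - load).max())   # ~0: near machine precision
```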
FAST-VQA: Efficient End-to-end Video Quality Assessment with Fragment Sampling
- Authors: Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
- Arxiv link: https://arxiv.org/abs/2207.02595
- Pdf link: https://arxiv.org/pdf/2207.02595
- Abstract Current deep video quality assessment (VQA) methods usually incur high computational costs when evaluating high-resolution videos. This cost hinders them from learning better video-quality-related representations via end-to-end training. Existing approaches typically consider naive sampling to reduce the computational cost, such as resizing and cropping. However, they obviously corrupt quality-related information in videos and are thus not optimal for learning good representations for VQA. Therefore, there is an eager need to design a new quality-retained sampling scheme for VQA. In this paper, we propose Grid Mini-patch Sampling (GMS), which allows consideration of local quality by sampling patches at their raw resolution and covers global quality with contextual relations via mini-patches sampled in uniform grids. These mini-patches are spliced and aligned temporally, named as fragments. We further build the Fragment Attention Network (FANet), specially designed to accommodate fragments as inputs. Consisting of fragments and FANet, the proposed FrAgment Sample Transformer for VQA (FAST-VQA) enables efficient end-to-end deep VQA and learns effective video-quality-related representations. It improves state-of-the-art accuracy by around 10% while reducing FLOPs by 99.5% on 1080P high-resolution videos. The newly learned video-quality-related representations can also be transferred into smaller VQA datasets, boosting performance in these scenarios. Extensive experiments show that FAST-VQA has good performance on inputs of various resolutions while retaining high efficiency. We publish our code at https://github.com/timothyhtimothy/FAST-VQA.
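A minimal sketch of the grid mini-patch idea (grid and patch sizes are assumptions in the spirit of the description): split a frame into a uniform grid, take one raw-resolution mini-patch from each cell, and splice the patches into a small fragment.

```python
import numpy as np

def grid_minipatch_sample(frame, grid=7, patch=32, rng=np.random.default_rng(0)):
    """Sample one raw-resolution patch per grid cell and splice them into a fragment."""
    H, W, C = frame.shape
    cell_h, cell_w = H // grid, W // grid          # assumes each cell is larger than `patch`
    fragment = np.empty((grid * patch, grid * patch, C), dtype=frame.dtype)
    for gy in range(grid):
        for gx in range(grid):
            y0 = gy * cell_h + rng.integers(0, cell_h - patch)
            x0 = gx * cell_w + rng.integers(0, cell_w - patch)
            fragment[gy * patch:(gy + 1) * patch, gx * patch:(gx + 1) * patch] = \
                frame[y0:y0 + patch, x0:x0 + patch]
    return fragment

frame = np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8)   # one 1080p frame
print(grid_minipatch_sample(frame).shape)    # (224, 224, 3): local detail kept at raw resolution
```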
YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
- Authors: Chien-Yao Wang, Alexey Bochkovskiy, Hong-Yuan Mark Liao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02696
- Pdf link: https://arxiv.org/pdf/2207.02696
- Abstract YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy, 56.8% AP, among all known real-time object detectors with 30 FPS or higher on GPU V100. The YOLOv7-E6 object detector (56 FPS V100, 55.9% AP) outperforms both the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS A100, 53.9% AP) by 509% in speed and 2% in accuracy, and the convolutional-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy. YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B and many other object detectors in speed and accuracy. Moreover, we train YOLOv7 only on the MS COCO dataset from scratch without using any other datasets or pre-trained weights. Source code is released at https://github.com/WongKinYiu/yolov7.
Scaling Private Deep Learning with Low-Rank and Sparse Gradients
- Authors: Ryuichi Ito, Seng Pei Liew, Tsubasa Takahashi, Yuya Sasaki, Makoto Onizuka
- Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2207.02699
- Pdf link: https://arxiv.org/pdf/2207.02699
- Abstract Applying Differentially Private Stochastic Gradient Descent (DPSGD) to training modern, large-scale neural networks such as transformer-based models is a challenging task, as the magnitude of noise added to the gradients at each iteration scales with model dimension, hindering the learning capability significantly. We propose a unified framework, $\textsf{LSG}$, that fully exploits the low-rank and sparse structure of neural networks to reduce the dimension of gradient updates, and hence alleviate the negative impacts of DPSGD. The gradient updates are first approximated with a pair of low-rank matrices. Then, a novel strategy is utilized to sparsify the gradients, resulting in low-dimensional, less noisy updates that are yet capable of retaining the performance of neural networks. Empirical evaluation on natural language processing and computer vision tasks shows that our method outperforms other state-of-the-art baselines.
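A rough illustration of the two compression ideas applied to a single weight-matrix gradient (rank and sparsity level are arbitrary; the DP-SGD clipping and noise addition that the paper combines this with are omitted):

```python
import torch

def lowrank_then_sparsify(grad, rank=4, keep_ratio=0.05):
    """Approximate a gradient matrix with a rank-k factorization, then keep only
    the largest-magnitude entries of that approximation (illustrative stand-in)."""
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    lowrank = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]      # rank-k approximation
    k = max(1, int(keep_ratio * lowrank.numel()))
    thresh = lowrank.abs().flatten().topk(k).values.min()
    return torch.where(lowrank.abs() >= thresh, lowrank, torch.zeros_like(lowrank))

g = torch.randn(256, 512)
compressed = lowrank_then_sparsify(g)
print((compressed != 0).float().mean().item())    # fraction of entries kept (~0.05)
```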
Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction
- Authors: Johan Broberg, Maria Bånkestad, Erik Ylipää
- Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
- Arxiv link: https://arxiv.org/abs/2207.02724
- Pdf link: https://arxiv.org/pdf/2207.02724
- Abstract Molecular property prediction is essential in chemistry, especially for drug discovery applications. However, available molecular property data is often limited, encouraging the transfer of information from related data. Transfer learning has had a tremendous impact in fields like Computer Vision and Natural Language Processing, signaling its potential in molecular property prediction. We present a pre-training procedure for molecular representation learning using reaction data and use it to pre-train a SMILES Transformer. We fine-tune and evaluate the pre-trained model on 12 molecular property prediction tasks from MoleculeNet within physical chemistry, biophysics, and physiology and show a statistically significant positive effect on 5 of the 12 tasks compared to a non-pre-trained baseline model.
STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding
- Authors: Zihang Lin, Chaolei Tan, Jian-Fang Hu, Zhi Jin, Tiancai Ye, Wei-Shi Zheng
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02756
- Pdf link: https://arxiv.org/pdf/2207.02756
- Abstract In this technical report, we introduce our solution to human-centric spatio-temporal video grounding task. We propose a concise and effective framework named STVGFormer, which models spatiotemporal visual-linguistic dependencies with a static branch and a dynamic branch. The static branch performs cross-modal understanding in a single frame and learns to localize the target object spatially according to intra-frame visual cues like object appearances. The dynamic branch performs cross-modal understanding across multiple frames. It learns to predict the starting and ending time of the target moment according to dynamic visual cues like motions. Both the static and dynamic branches are designed as cross-modal transformers. We further design a novel static-dynamic interaction block to enable the static and dynamic branches to transfer useful and complementary information from each other, which is shown to be effective to improve the prediction on hard cases. Our proposed method achieved 39.6% vIoU and won the first place in the HC-STVG track of the 4th Person in Context Challenge.
Cross-receptive Focused Inference Network for Lightweight Image Super-Resolution
- Authors: Wenjie Li, Juncheng Li, Guangwei Gao, Jiantao Zhou, Jian Yang, Guo-Jun Qi
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02796
- Pdf link: https://arxiv.org/pdf/2207.02796
- Abstract With the development of deep learning, single image super-resolution (SISR) has achieved significant breakthroughs. Recently, methods to enhance the performance of SISR networks based on global feature interactions have been proposed. However, the capabilities of neurons that need to adjust their function in response to the context dynamically are neglected. To address this issue, we propose a lightweight Cross-receptive Focused Inference Network (CFIN), a hybrid network composed of a Convolutional Neural Network (CNN) and a Transformer. Specifically, a novel Cross-receptive Field Guide Transformer (CFGT) is designed to adaptively modify the network weights by using modulated convolution kernels combined with local representative semantic information. In addition, a CNN-based Cross-scale Information Aggregation Module (CIAM) is proposed to make the model better focused on potentially practical information and improve the efficiency of the Transformer stage. Extensive experiments show that our proposed CFIN is a lightweight and efficient SISR model, which can achieve a good balance between computational cost and model performance.
Delving into Sequential Patches for Deepfake Detection
- Authors: Jiazhi Guan, Hang Zhou, Zhibin Hong, Errui Ding, Jingdong Wang, Chengbin Quan, Youjian Zhao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02803
- Pdf link: https://arxiv.org/pdf/2207.02803
- Abstract Recent advances in face forgery techniques produce nearly visually untraceable deepfake videos, which could be leveraged with malicious intentions. As a result, researchers have been devoted to deepfake detection. Previous studies have identified the importance of local low-level cues and temporal information in the pursuit of generalizing well across deepfake methods; however, they still suffer from robustness problems against post-processing. In this work, we propose the Local- & Temporal-aware Transformer-based Deepfake Detection (LTTD) framework, which adopts a local-to-global learning protocol with a particular focus on the valuable temporal information within local sequences. Specifically, we propose a Local Sequence Transformer (LST), which models the temporal consistency on sequences of restricted spatial regions, where low-level information is hierarchically enhanced with shallow layers of learned 3D filters. Based on the local temporal embeddings, we then achieve the final classification in a global contrastive way. Extensive experiments on popular datasets validate that our approach effectively spots local forgery cues and achieves state-of-the-art performance.
Keyword: autonomous driving
BiPOCO: Bi-Directional Trajectory Prediction with Pose Constraints for Pedestrian Anomaly Detection
- Authors: Asiegbu Miracle Kanu-Asiegbu, Ram Vasudevan, Xiaoxiao Du
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2207.02281
- Pdf link: https://arxiv.org/pdf/2207.02281
- Abstract We present BiPOCO, a Bi-directional trajectory predictor with POse COnstraints, for detecting anomalous activities of pedestrians in videos. In contrast to prior work based on feature reconstruction, our work identifies pedestrian anomalous events by forecasting their future trajectories and comparing the predictions with their expectations. We introduce a set of novel compositional pose-based losses with our predictor and leverage the prediction errors of each body joint for pedestrian anomaly detection. Experimental results show that our BiPOCO approach can detect anomalous pedestrian activities with a high detection rate (up to 87.0%) and that incorporating pose constraints helps distinguish normal and anomalous poses in prediction. This work extends the current literature on using prediction-based methods for anomaly detection and can benefit safety-critical applications such as autonomous driving and surveillance. Code is available at https://github.com/akanuasiegbu/BiPOCO.