Paper-Daily-Notice
New submissions for Thu, 31 Mar 22
Keyword: SLAM
Indoor SLAM Using a Foot-mounted IMU and the local Magnetic Field
- Authors: Mostafa Osman, Frida Viset, Manon Kok
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link:
- Pdf link:
- Abstract In this paper, a simultaneous localization and mapping (SLAM) algorithm for tracking the motion of a pedestrian with a foot-mounted inertial measurement unit (IMU) is proposed. The algorithm uses two maps, namely, a motion map and a magnetic field map. The motion map captures typical motion patterns of pedestrians in buildings that are constrained by e.g. corridors and doors. The magnetic map models local magnetic field anomalies in the environment using a Gaussian process (GP) model and uses them as position information. These maps are used in a Rao-Blackwellized particle filter (RBPF) to correct the pedestrian position and orientation estimates from the pedestrian dead-reckoning (PDR). The PDR is computed using an extended Kalman filter with zero-velocity updates (ZUPT-EKF). The algorithm is validated using real experimental sequences and the results show the efficacy of the algorithm in localizing pedestrians in indoor environments.
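The zero-velocity idea behind the PDR step can be illustrated with a small sketch. The abstract only states that a ZUPT-EKF is used, so everything below is an assumption made for illustration: the function names, the thresholds, and the hard velocity reset, which stands in for the EKF measurement update and is not the authors' filter.

```python
import math

GRAVITY = 9.81  # m/s^2


def is_stationary(accel, gyro, accel_tol=0.4, gyro_tol=0.2):
    """Return True when an IMU sample looks like a stance phase.

    accel is (ax, ay, az) in m/s^2 and gyro is (gx, gy, gz) in rad/s.
    The tolerances are illustrative assumptions, not tuned values.
    """
    accel_norm = math.sqrt(sum(a * a for a in accel))
    gyro_norm = math.sqrt(sum(g * g for g in gyro))
    return abs(accel_norm - GRAVITY) < accel_tol and gyro_norm < gyro_tol


def integrate_velocity(forward_accels, stationary_flags, dt):
    """Integrate 1D forward acceleration, zeroing velocity during stance."""
    velocity = 0.0
    velocities = []
    for accel, stationary in zip(forward_accels, stationary_flags):
        velocity += accel * dt
        if stationary:
            # Hard reset: a simplification of the zero-velocity update.
            velocity = 0.0
        velocities.append(velocity)
    return velocities
```

Resetting the velocity whenever the foot is judged to be on the ground keeps integration drift from accumulating across steps, which is why foot-mounted IMUs suit this kind of dead-reckoning.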
Keyword: Visual inertial
There is no result
Keyword: livox
There is no result
Keyword: loam
There is no result
Keyword: Visual inertial odometry
There is no result
Keyword: lidar
Sensor Data Validation and Driving Safety in Autonomous Driving Systems
- Authors: Jindi Zhang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link:
- Pdf link:
- Abstract Autonomous driving technology has drawn a lot of attention due to its fast development and extremely high commercial values. The recent technological leap of autonomous driving can be primarily attributed to the progress in the environment perception. Good environment perception provides accurate high-level environment information which is essential for autonomous vehicles to make safe and precise driving decisions and strategies. Moreover, such progress in accurate environment perception would not be possible without deep learning models and advanced onboard sensors, such as optical sensors (LiDARs and cameras), radars, GPS. However, the advanced sensors and deep learning models are prone to recently invented attack methods. For example, LiDARs and cameras can be compromised by optical attacks, and deep learning models can be attacked by adversarial examples. The attacks on advanced sensors and deep learning models can largely impact the accuracy of the environment perception, posing great threats to the safety and security of autonomous vehicles. In this thesis, we study the detection methods against the attacks on onboard sensors and the linkage between attacked deep learning models and driving safety for autonomous vehicles. To detect the attacks, redundant data sources can be exploited, since information distortions caused by attacks in victim sensor data result in inconsistency with the information from other redundant sources. To study the linkage between attacked deep learning models and driving safety...
Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data
- Authors: Corentin Sautier, Gilles Puy, Spyros Gidaris, Alexandre Boulch, Andrei Bursuc, Renaud Marlet
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link:
- Pdf link:
- Abstract Segmenting or detecting objects in sparse Lidar point clouds are two important tasks in autonomous driving to allow a vehicle to act safely in its 3D environment. The best performing methods in 3D semantic segmentation or object detection rely on a large amount of annotated data. Yet annotating 3D Lidar data for these tasks is tedious and costly. In this context, we propose a self-supervised pre-training method for 3D perception models that is tailored to autonomous driving data. Specifically, we leverage the availability of synchronized and calibrated image and Lidar sensors in autonomous driving setups for distilling self-supervised pre-trained image representations into 3D models. Hence, our method does not require any point cloud nor image annotations. The key ingredient of our method is the use of superpixels which are used to pool 3D point features and 2D pixel features in visually similar regions. We then train a 3D network on the self-supervised task of matching these pooled point features with the corresponding pooled image pixel features. The advantages of contrasting regions obtained by superpixels are that: (1) grouping together pixels and points of visually coherent regions leads to a more meaningful contrastive task that produces features well adapted to 3D semantic segmentation and 3D object detection; (2) all the different regions have the same weight in the contrastive loss regardless of the number of 3D points sampled in these regions; (3) it mitigates the noise produced by incorrect matching of points and pixels due to occlusions between the different sensors. Extensive experiments on autonomous driving datasets demonstrate the ability of our image-to-Lidar distillation strategy to produce 3D representations that transfer well on semantic segmentation and object detection tasks.
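The superpixel pooling described in the abstract can be sketched as a group-by-and-average step. This is a minimal illustration, not the authors' implementation: the function name and the list-based feature representation are assumptions, and mean pooling is assumed because the abstract does not name the pooling operator.

```python
from collections import defaultdict


def pool_by_superpixel(features, superpixel_ids):
    """Average the feature vectors that fall in the same superpixel.

    features is a list of equal-length float lists (one per point or pixel);
    superpixel_ids gives the superpixel index of each feature.
    """
    sums = {}
    counts = defaultdict(int)
    for feat, sp in zip(features, superpixel_ids):
        if sp not in sums:
            sums[sp] = [0.0] * len(feat)
        sums[sp] = [s + f for s, f in zip(sums[sp], feat)]
        counts[sp] += 1
    # One pooled vector per superpixel, whatever the number of members.
    return {sp: [s / counts[sp] for s in total] for sp, total in sums.items()}
```

Because each superpixel yields exactly one pooled vector, every region contributes equally to a contrastive loss regardless of how many 3D points it contains, which corresponds to advantage (2) in the abstract.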
Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object Tracking
- Authors: Guangming Wang, Chensheng Peng, Jinpeng Zhang, Hesheng Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link:
- Pdf link:
- Abstract Multiple object tracking (MOT) is a significant task in achieving autonomous driving. Traditional works attempt to complete this task, either based on point clouds (PC) collected by LiDAR, or based on images captured from cameras. However, relying on one single sensor is not robust enough, because it might fail during the tracking process. On the other hand, feature fusion from multiple modalities contributes to the improvement of accuracy. As a result, new techniques based on different sensors integrating features from multiple modalities are being developed. Texture information from RGB cameras and 3D structure information from LiDAR have respective advantages under different circumstances. However, it's not easy to achieve effective feature fusion because of completely distinct information modalities. Previous fusion methods usually fuse the top-level features after the backbones extract the features from different modalities. In this paper, we first introduce PointNet++ to obtain multi-scale deep representations of point cloud to make it adaptive to our proposed Interactive Feature Fusion between multi-scale features of images and point clouds. Specifically, through multi-scale interactive query and fusion between pixel-level and point-level features, our method can obtain more distinguishing features to improve the performance of multiple object tracking. Besides, we explore the effectiveness of pre-training on each single modality and fine-tuning on the fusion-based model. The experimental results demonstrate that our method can achieve good performance on the KITTI benchmark and outperform other approaches that do not use multi-scale feature fusion. Moreover, the ablation studies indicate the effectiveness of multi-scale feature fusion and pre-training on single modality.
Keyword: loop detection
There is no result
Keyword: autonomous driving
Learning to Detect Mobile Objects from LiDAR Scans Without Labels
- Authors: Yurong You, Katie Z Luo, Cheng Perng Phoo, Wei-Lun Chao, Wen Sun, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link:
- Pdf link:
- Abstract Current 3D object detectors for autonomous driving are almost entirely trained on human-annotated data. Although of high quality, the generation of such data is laborious and costly, restricting them to a few specific locations and object types. This paper proposes an alternative approach entirely based on unlabeled data, which can be collected cheaply and in abundance almost everywhere on earth. Our approach leverages several simple common sense heuristics to create an initial set of approximate seed labels. For example, relevant traffic participants are generally not persistent across multiple traversals of the same route, do not fly, and are never under ground. We demonstrate that these seed labels are highly effective to bootstrap a surprisingly accurate detector through repeated self-training without a single human annotated label.
Sensor Data Validation and Driving Safety in Autonomous Driving Systems
- Authors: Jindi Zhang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link:
- Pdf link:
- Abstract Autonomous driving technology has drawn a lot of attention due to its fast development and extremely high commercial values. The recent technological leap of autonomous driving can be primarily attributed to the progress in the environment perception. Good environment perception provides accurate high-level environment information which is essential for autonomous vehicles to make safe and precise driving decisions and strategies. Moreover, such progress in accurate environment perception would not be possible without deep learning models and advanced onboard sensors, such as optical sensors (LiDARs and cameras), radars, GPS. However, the advanced sensors and deep learning models are prone to recently invented attack methods. For example, LiDARs and cameras can be compromised by optical attacks, and deep learning models can be attacked by adversarial examples. The attacks on advanced sensors and deep learning models can largely impact the accuracy of the environment perception, posing great threats to the safety and security of autonomous vehicles. In this thesis, we study the detection methods against the attacks on onboard sensors and the linkage between attacked deep learning models and driving safety for autonomous vehicles. To detect the attacks, redundant data sources can be exploited, since information distortions caused by attacks in victim sensor data result in inconsistency with the information from other redundant sources. To study the linkage between attacked deep learning models and driving safety...
Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data
- Authors: Corentin Sautier, Gilles Puy, Spyros Gidaris, Alexandre Boulch, Andrei Bursuc, Renaud Marlet
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link:
- Pdf link:
- Abstract Segmenting or detecting objects in sparse Lidar point clouds are two important tasks in autonomous driving to allow a vehicle to act safely in its 3D environment. The best performing methods in 3D semantic segmentation or object detection rely on a large amount of annotated data. Yet annotating 3D Lidar data for these tasks is tedious and costly. In this context, we propose a self-supervised pre-training method for 3D perception models that is tailored to autonomous driving data. Specifically, we leverage the availability of synchronized and calibrated image and Lidar sensors in autonomous driving setups for distilling self-supervised pre-trained image representations into 3D models. Hence, our method does not require any point cloud nor image annotations. The key ingredient of our method is the use of superpixels which are used to pool 3D point features and 2D pixel features in visually similar regions. We then train a 3D network on the self-supervised task of matching these pooled point features with the corresponding pooled image pixel features. The advantages of contrasting regions obtained by superpixels are that: (1) grouping together pixels and points of visually coherent regions leads to a more meaningful contrastive task that produces features well adapted to 3D semantic segmentation and 3D object detection; (2) all the different regions have the same weight in the contrastive loss regardless of the number of 3D points sampled in these regions; (3) it mitigates the noise produced by incorrect matching of points and pixels due to occlusions between the different sensors. Extensive experiments on autonomous driving datasets demonstrate the ability of our image-to-Lidar distillation strategy to produce 3D representations that transfer well on semantic segmentation and object detection tasks.
Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object Tracking
- Authors: Guangming Wang, Chensheng Peng, Jinpeng Zhang, Hesheng Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link:
- Pdf link:
- Abstract Multiple object tracking (MOT) is a significant task in achieving autonomous driving. Traditional works attempt to complete this task, either based on point clouds (PC) collected by LiDAR, or based on images captured from cameras. However, relying on one single sensor is not robust enough, because it might fail during the tracking process. On the other hand, feature fusion from multiple modalities contributes to the improvement of accuracy. As a result, new techniques based on different sensors integrating features from multiple modalities are being developed. Texture information from RGB cameras and 3D structure information from LiDAR have respective advantages under different circumstances. However, it's not easy to achieve effective feature fusion because of completely distinct information modalities. Previous fusion methods usually fuse the top-level features after the backbones extract the features from different modalities. In this paper, we first introduce PointNet++ to obtain multi-scale deep representations of point cloud to make it adaptive to our proposed Interactive Feature Fusion between multi-scale features of images and point clouds. Specifically, through multi-scale interactive query and fusion between pixel-level and point-level features, our method can obtain more distinguishing features to improve the performance of multiple object tracking. Besides, we explore the effectiveness of pre-training on each single modality and fine-tuning on the fusion-based model. The experimental results demonstrate that our method can achieve good performance on the KITTI benchmark and outperform other approaches that do not use multi-scale feature fusion. Moreover, the ablation studies indicate the effectiveness of multi-scale feature fusion and pre-training on single modality.
Keyword: mapping
Learning to Collide: Recommendation System Model Compression with Learned Hash Functions
- Authors: Benjamin Ghaemmaghami, Mustafa Ozdal, Rakesh Komuravelli, Dmitriy Korchev, Dheevatsa Mudigere, Krishnakumar Nair, Maxim Naumov
- Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
- Arxiv link:
- Pdf link:
- Abstract A key characteristic of deep recommendation models is the immense memory requirements of their embedding tables. These embedding tables can often reach hundreds of gigabytes which increases hardware requirements and training cost. A common technique to reduce model size is to hash all of the categorical variable identifiers (ids) into a smaller space. This hashing reduces the number of unique representations that must be stored in the embedding table; thus decreasing its size. However, this approach introduces collisions between semantically dissimilar ids that degrade model quality. We introduce an alternative approach, Learned Hash Functions, which instead learns a new mapping function that encourages collisions between semantically similar ids. We derive this learned mapping from historical data and embedding access patterns. We experiment with this technique on a production model and find that a mapping informed by the combination of access frequency and a learned low dimension embedding is the most effective. We demonstrate a small improvement relative to the hashing trick and other collision related compression techniques. This is ongoing work that explores the impact of categorical id collisions on recommendation model quality and how those collisions may be controlled to improve model performance.
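The baseline hashing trick that the abstract contrasts against can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, and a plain modulo stands in for whatever hash function a production system would use. The learned mapping proposed in the paper is not shown here, since the abstract does not specify it.

```python
def hashed_row(category_id, table_size):
    """Map a raw categorical id to a row of a smaller embedding table.

    Modulo is an assumed stand-in for a real hash function.
    """
    return category_id % table_size


def collisions(ids, table_size):
    """Return the table rows that are shared by more than one id."""
    buckets = {}
    for category_id in ids:
        buckets.setdefault(hashed_row(category_id, table_size), []).append(category_id)
    return {row: members for row, members in buckets.items() if len(members) > 1}
```

With `table_size=4`, the ids 3, 7, and 11 all land in row 3 and therefore share one embedding vector whether or not they are semantically related. A learned mapping would replace `hashed_row` so that the ids sharing a row are chosen to be similar.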
Indoor SLAM Using a Foot-mounted IMU and the local Magnetic Field
- Authors: Mostafa Osman, Frida Viset, Manon Kok
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link:
- Pdf link:
- Abstract In this paper, a simultaneous localization and mapping (SLAM) algorithm for tracking the motion of a pedestrian with a foot-mounted inertial measurement unit (IMU) is proposed. The algorithm uses two maps, namely, a motion map and a magnetic field map. The motion map captures typical motion patterns of pedestrians in buildings that are constrained by e.g. corridors and doors. The magnetic map models local magnetic field anomalies in the environment using a Gaussian process (GP) model and uses them as position information. These maps are used in a Rao-Blackwellized particle filter (RBPF) to correct the pedestrian position and orientation estimates from the pedestrian dead-reckoning (PDR). The PDR is computed using an extended Kalman filter with zero-velocity updates (ZUPT-EKF). The algorithm is validated using real experimental sequences and the results show the efficacy of the algorithm in localizing pedestrians in indoor environments.
Tampered VAE for Improved Satellite Image Time Series Classification
- Authors: Xin Cai, Yaxin Bi, Peter Nicholl
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link:
- Pdf link:
- Abstract The unprecedented availability of spatial and temporal high-resolution satellite image time series (SITS) for crop type mapping is believed to necessitate deep learning architectures to accommodate challenges arising from both dimensions. Recent state-of-the-art deep learning models have shown promising results by stacking spatial and temporal encoders. However, we show that a Pyramid Time-Series Transformer (PTST) that operates solely on the temporal dimension, i.e., neglecting the spatial dimension, can produce superior results with a drastic reduction in GPU memory consumption and easy extensibility. Furthermore, we augment it to perform semi-supervised learning by proposing a classification-friendly VAE framework that introduces clustering mechanisms into latent space and can promote linear separability therein. Consequently, a few principal axes of the latent space can explain the majority of variance in raw data. Meanwhile, the VAE framework with proposed tweaks can maintain classification performance competitive with its purely discriminative counterpart when only 40% of labelled data is used. We hope the proposed framework can serve as a baseline for crop classification with SITS for its modularity and simplicity.
Spline-Based Space-Time Finite Element Approach for Fluid-Structure Interaction Problems With a Focus on Fully Enclosed Domains
- Authors: Michel Make, Thomas Spenke, Norbert Hosters, Marek Behr
- Subjects: Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
- Arxiv link:
- Pdf link:
- Abstract Non-Uniform Rational B-Spline (NURBS) surfaces are commonly used within Computer-Aided Design (CAD) tools to represent geometric objects. When using isogeometric analysis (IGA), it is possible to use such NURBS geometries for numerical analysis directly. Analyzing fluid flows, however, requires complex three-dimensional geometries to represent flow domains. Defining a parametrization of such volumetric domains using NURBS can be challenging and is still an ongoing topic in the IGA community. With the recently developed NURBS-enhanced finite element method (NEFEM), the favorable geometric characteristics of NURBS are used within a standard finite element method. This is achieved by enhancing the elements touching the boundary by using the NURBS geometry itself. In the current work, a new variation of NEFEM is introduced, which is suitable for three-dimensional space-time finite element formulations. The proposed method makes use of a new mapping which results in a non-Cartesian formulation suitable for fluid-structure interaction (FSI). This is demonstrated by combining the method with an IGA formulation in a strongly-coupled partitioned framework for solving FSI problems. The framework yields a fully spline-based representation of the fluid-structure interface through a single NURBS. The coupling conditions at the fluid-structure interface are enforced through a Robin-Neumann type coupling scheme. This scheme is particularly useful when considering incompressible fluids in fully Dirichlet-bounded and curved problems, as it satisfies the incompressibility constraint on the fluid for each step within the coupling procedure. The accuracy and performance of the introduced spline-based space-time finite element approach and its use within the proposed coupled FSI framework are demonstrated using a series of two- and three-dimensional benchmark problems.
Multi-Robot Active Mapping via Neural Bipartite Graph Matching
- Authors: Kai Ye, Siyan Dong, Qingnan Fan, He Wang, Li Yi, Fei Xia, Jue Wang, Baoquan Chen
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
- Arxiv link:
- Pdf link:
- Abstract We study the problem of multi-robot active mapping, which aims for complete scene map construction in minimum time steps. The key to this problem lies in the goal position estimation to enable more efficient robot movements. Previous approaches either choose the frontier as the goal position via a myopic solution that hinders the time efficiency, or maximize the long-term value via reinforcement learning to directly regress the goal position, but do not guarantee the complete map construction. In this paper, we propose a novel algorithm, namely NeuralCoMapping, which takes advantage of both approaches. We reduce the problem to bipartite graph matching, which establishes the node correspondences between two graphs, denoting robots and frontiers. We introduce a multiplex graph neural network (mGNN) that learns the neural distance to fill the affinity matrix for more effective graph matching. We optimize the mGNN with a differentiable linear assignment layer by maximizing the long-term values that favor time efficiency and map completeness via reinforcement learning. We compare our algorithm with several state-of-the-art multi-robot active mapping approaches and adapted reinforcement-learning baselines. Experimental results demonstrate the superior performance and exceptional generalization ability of our algorithm on various indoor scenes and unseen numbers of robots, when only trained with 9 indoor scenes.
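The reduction to bipartite matching can be illustrated with a toy robot-to-frontier assignment. This sketch makes several assumptions: the cost matrix is hand-written rather than filled by a learned neural distance, the function name is hypothetical, and a brute-force search replaces the differentiable linear assignment layer mentioned in the abstract. It is only practical for a handful of robots and frontiers.

```python
from itertools import permutations


def assign_frontiers(cost):
    """Assign each robot a distinct frontier with minimum total cost.

    cost[r][f] is the cost for robot r to reach frontier f. Assumes there
    are at least as many frontiers as robots. Returns (assignment, total),
    where assignment[r] is the frontier index chosen for robot r.
    """
    n_robots = len(cost)
    n_frontiers = len(cost[0])
    best_total = float("inf")
    best = None
    # Enumerate every way to give the robots distinct frontiers.
    for perm in permutations(range(n_frontiers), n_robots):
        total = sum(cost[r][f] for r, f in enumerate(perm))
        if total < best_total:
            best_total, best = total, perm
    return list(best), best_total
```

For the cost matrix `[[4.0, 1.0, 3.0], [2.0, 0.5, 5.0]]`, a greedy choice would send both robots toward frontier 1, while the matching sends robot 0 to frontier 1 and robot 1 to frontier 0 for a total cost of 3.0.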
Keyword: localization
Neural Inertial Localization
- Authors: Sachini Herath, David Caruso, Chen Liu, Yufan Chen, Yasutaka Furukawa
- Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link:
- Pdf link:
- Abstract This paper proposes the inertial localization problem, the task of estimating the absolute location from a sequence of inertial sensor measurements. This is an exciting and unexplored area of indoor localization research, where we present a rich dataset with 53 hours of inertial sensor data and the associated ground truth locations. We developed a solution, dubbed neural inertial localization (NILoc), which 1) uses a neural inertial navigation technique to turn inertial sensor history into a sequence of velocity vectors; then 2) employs a transformer-based neural architecture to find the device location from the sequence of velocities. We only use an IMU sensor, which is energy efficient and privacy preserving compared to WiFi, cameras, and other data sources. Our approach is significantly faster and achieves competitive results even compared with state-of-the-art methods that require a floorplan and run 20 to 30 times slower. We share our code, model and data at
Indoor SLAM Using a Foot-mounted IMU and the local Magnetic Field
- Authors: Mostafa Osman, Frida Viset, Manon Kok
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link:
- Pdf link:
- Abstract In this paper, a simultaneous localization and mapping (SLAM) algorithm for tracking the motion of a pedestrian with a foot-mounted inertial measurement unit (IMU) is proposed. The algorithm uses two maps, namely, a motion map and a magnetic field map. The motion map captures typical motion patterns of pedestrians in buildings that are constrained by e.g. corridors and doors. The magnetic map models local magnetic field anomalies in the environment using a Gaussian process (GP) model and uses them as position information. These maps are used in a Rao-Blackwellized particle filter (RBPF) to correct the pedestrian position and orientation estimates from the pedestrian dead-reckoning (PDR). The PDR is computed using an extended Kalman filter with zero-velocity updates (ZUPT-EKF). The algorithm is validated using real experimental sequences and the results show the efficacy of the algorithm in localizing pedestrians in indoor environments.
SeqTR: A Simple yet Universal Network for Visual Grounding
- Authors: Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, Rongrong Ji
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link:
- Pdf link:
- Abstract In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES). The canonical paradigms for visual grounding often require substantial expertise in designing network architectures and loss functions, making them hard to generalize across tasks. To simplify and unify the modeling, we cast visual grounding as a point prediction problem conditioned on image and text inputs, where either the bounding box or binary mask is represented as a sequence of discrete coordinate tokens. Under this paradigm, visual grounding tasks are unified in our SeqTR network without task-specific branches or heads, e.g., the convolutional mask decoder for RES, which greatly reduces the complexity of multi-task modeling. In addition, SeqTR also shares the same optimization objective for all tasks with a simple cross-entropy loss, further reducing the complexity of deploying hand-crafted loss functions. Experiments on five benchmark datasets demonstrate that the proposed SeqTR outperforms (or is on par with) the existing state-of-the-arts, proving that a simple yet universal approach for visual grounding is indeed feasible.
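Representing a bounding box as a sequence of discrete coordinate tokens can be sketched as uniform quantization. The abstract does not give the tokenization details, so the bin count, the `(x1, y1, x2, y2)` coordinate order, and both function names are assumptions made for illustration.

```python
def box_to_tokens(box, image_size, num_bins=1000):
    """Quantize (x1, y1, x2, y2) pixel coordinates into integer tokens."""
    width, height = image_size
    scales = (width, height, width, height)
    tokens = []
    for value, scale in zip(box, scales):
        # Normalize to [0, 1], then map to a bin index in [0, num_bins - 1].
        normalized = min(max(value / scale, 0.0), 1.0)
        tokens.append(min(int(normalized * num_bins), num_bins - 1))
    return tokens


def tokens_to_box(tokens, image_size, num_bins=1000):
    """Map tokens back to pixel coordinates at the centre of each bin."""
    width, height = image_size
    scales = (width, height, width, height)
    return [(t + 0.5) / num_bins * s for t, s in zip(tokens, scales)]
```

Once coordinates are tokens, predicting a box becomes a classification over `num_bins` classes at each sequence position, which is what lets a single cross-entropy loss cover the task. The round trip loses at most one bin width of precision per coordinate.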
CMMD: Cross-Metric Multi-Dimensional Root Cause Analysis
- Authors: Shifu Yan, Caihua Shan, Wenyi Yang, Bixiong Xu, Dongsheng Li, Lili Qiu, Jie Tong, Qi Zhang
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link:
- Pdf link:
- Abstract In large-scale online services, crucial metrics, a.k.a., key performance indicators (KPIs), are monitored periodically to check their running statuses. Generally, KPIs are aggregated along multiple dimensions and derived by complex calculations among fundamental metrics from the raw data. Once abnormal KPI values are observed, root cause analysis (RCA) can be applied to identify the reasons for anomalies, so that we can troubleshoot quickly. Recently, several automatic RCA techniques were proposed to localize the related dimensions (or a combination of dimensions) to explain the anomalies. However, their analyses are limited to the data on the abnormal metric and ignore the data of other metrics which may be also related to the anomalies, leading to imprecise or even incorrect root causes. To this end, we propose a cross-metric multi-dimensional root cause analysis method, named CMMD, which consists of two key components: 1) relationship modeling, which utilizes graph neural network (GNN) to model the unknown complex calculation among metrics and aggregation function among dimensions from historical data; 2) root cause localization, which adopts the genetic algorithm to efficiently and effectively dive into the raw data and localize the abnormal dimension(s) once the KPI anomalies are detected. Experiments on synthetic datasets, public datasets and online production environment demonstrate the superiority of our proposed CMMD method compared with baselines. Currently, CMMD is running as an online service in Microsoft Azure.
PseCo: Pseudo Labeling and Consistency Training for Semi-Supervised Object Detection
- Authors: Gang Li, Xiang Li, Yujie Wang, Shanshan Zhang, Yichao Wu, Ding Liang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link:
- Pdf link:
- Abstract In this paper, we delve into two key techniques in Semi-Supervised Object Detection (SSOD), namely pseudo labeling and consistency training. We observe that these two techniques currently neglect some important properties of object detection, hindering efficient learning on unlabeled data. Specifically, for pseudo labeling, existing works only focus on the classification score yet fail to guarantee the localization precision of pseudo boxes; for consistency training, the widely adopted random-resize training only considers the label-level consistency but misses the feature-level one, which also plays an important role in ensuring the scale invariance. To address the problems incurred by noisy pseudo boxes, we design Noisy Pseudo box Learning (NPL) that includes Prediction-guided Label Assignment (PLA) and Positive-proposal Consistency Voting (PCV). PLA relies on model predictions to assign labels, making it robust to even coarse pseudo boxes, while PCV leverages the regression consistency of positive proposals to reflect the localization quality of pseudo boxes. Furthermore, in consistency training, we propose Multi-view Scale-invariant Learning (MSL) that includes mechanisms of both label- and feature-level consistency, where feature consistency is achieved by aligning shifted feature pyramids between two images with identical content but varied scales. On COCO benchmark, our method, termed PSEudo labeling and COnsistency training (PseCo), outperforms the SOTA (Soft Teacher) by 2.0, 1.8, 2.0 points under 1%, 5%, and 10% labelling ratios, respectively. It also significantly improves the learning efficiency for SSOD, e.g., PseCo halves the training time of the SOTA approach but achieves even better performance.
TubeDETR: Spatio-Temporal Video Grounding with Transformers
- Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link:
- Pdf link:
- Abstract We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks. Code and trained models are publicly available at
An Improved Lightweight YOLOv5 Model Based on Attention Mechanism for Face Mask Detection
- Authors: Sheng Xu
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link:
- Pdf link:
- Abstract Coronavirus 2019 has brought severe challenges to social stability and public health worldwide. One effective way of curbing the epidemic is to require people to wear masks in public places and monitor mask-wearing states by utilizing suitable automatic detectors. However, existing deep learning based models struggle to simultaneously achieve the requirements of both high precision and real-time performance. To solve this problem, we propose an improved lightweight face mask detector based on YOLOv5, which can achieve an excellent balance of precision and speed. Firstly, a novel backbone, ShuffleCANet, which combines the ShuffleNetV2 network with the Coordinate Attention mechanism, is proposed. Then we use BiFPN as the feature fusion neck. Furthermore, we replace the loss function of localization with α-CIoU to obtain higher-quality anchors. Some valuable strategies such as data augmentation, adaptive image scaling, and anchor cluster operation are also utilized. Experimental results show the performance and effectiveness of the proposed model. On the basis of the original YOLOv5 model, our work increases the inference speed by 28.3% while still improving the precision by 0.58% on the AIZOO face mask dataset. It achieves a mean average precision of 95.2%, which is 4.4% higher than the baseline and is also more accurate compared with other existing models.