arxiv-daily icon indicating copy to clipboard operation
arxiv-daily copied to clipboard

New submissions for Thu, 25 Aug 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Comparison of Object Detection Algorithms for Street-level Objects

  • Authors: Martinus Grady Naftali, Jason Sebastian Sulistyawan, Kelvin Julian
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.11315
  • Pdf link: https://arxiv.org/pdf/2208.11315
  • Abstract Object detection for street-level objects can be applied to various use cases, from car and traffic detection to the self-driving car system. Therefore, finding the best object detection algorithm is essential to apply it effectively. Many object detection algorithms have been released, and many have compared object detection algorithms, but few have compared the latest algorithms, such as YOLOv5, primarily which focus on street-level objects. This paper compares various one-stage detector algorithms; SSD MobileNetv2 FPN-lite 320x320, YOLOv3, YOLOv4, YOLOv5l, and YOLOv5s for street-level object detection within real-time images. The experiment utilizes a modified Udacity Self Driving Car Dataset with 3,169 images. Dataset is split into train, validation, and test; Then, it is preprocessed and augmented using rescaling, hue shifting, and noise. Each algorithm is then trained and evaluated. Based on the experiments, the algorithms have produced decent results according to the inference time and the values of their precision, recall, F1-Score, and Mean Average Precision (mAP). The results also shows that YOLOv5l outperforms the other algorithms in terms of accuracy with a [email protected] of 0.593, MobileNetv2 FPN-lite has the fastest inference time among the others with only 3.20ms inference time. It is also found that YOLOv5s is the most efficient, with it having a YOLOv5l accuracy and a speed almost as quick as the MobileNetv2 FPN-lite. This shows that various algorithm are suitable for street-level object detection and viable enough to be used in self-driving car.

Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors

  • Authors: Gongjie Zhang, Zhipeng Luo, Yingchen Yu, Zichen Tian, Jingyi Zhang, Shijian Lu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2208.11356
  • Pdf link: https://arxiv.org/pdf/2208.11356
  • Abstract Multi-scale features have been proven highly effective for object detection, and most ConvNet-based object detectors adopt Feature Pyramid Network (FPN) as a basic component for exploiting multi-scale features. However, for the recently proposed Transformer-based object detectors, directly incorporating multi-scale features leads to prohibitive computational overhead due to the high complexity of the attention mechanism for processing high-resolution features. This paper presents Iterative Multi-scale Feature Aggregation (IMFA) -- a generic paradigm that enables the efficient use of multi-scale features in Transformer-based object detectors. The core idea is to exploit sparse multi-scale features from just a few crucial locations, and it is achieved with two novel designs. First, IMFA rearranges the Transformer encoder-decoder pipeline so that the encoded features can be iteratively updated based on the detection predictions. Second, IMFA sparsely samples scale-adaptive features for refined detection from just a few keypoint locations under the guidance of prior detection predictions. As a result, the sampled multi-scale features are sparse yet still highly beneficial for object detection. Extensive experiments show that the proposed IMFA boosts the performance of multiple Transformer-based object detectors significantly yet with slight computational overhead. Project page: https://github.com/ZhangGongjie/IMFA.

YOLOPv2: Better, Faster, Stronger for Panoptic Driving Perception

  • Authors: Cheng Han, Qichao Zhao, Shuyi Zhang, Yinzi Chen, Zhenlin Zhang, Jinwei Yuan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.11434
  • Pdf link: https://arxiv.org/pdf/2208.11434
  • Abstract Over the last decade, multi-tasking learning approaches have achieved promising results in solving panoptic driving perception problems, providing both high-precision and high-efficiency performance. It has become a popular paradigm when designing networks for real-time practical autonomous driving system, where computation resources are limited. This paper proposed an effective and efficient multi-task learning network to simultaneously perform the task of traffic object detection, drivable road area segmentation and lane detection. Our model achieved the new state-of-the-art (SOTA) performance in terms of accuracy and speed on the challenging BDD100K dataset. Especially, the inference time is reduced by half compared to the previous SOTA model. Code will be released in the near future.

A novel method for data augmentation: Nine Dot Moving Least Square (ND-MLS)

  • Authors: Wen Yang, Rui Wang, Yanchao Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.11532
  • Pdf link: https://arxiv.org/pdf/2208.11532
  • Abstract Data augmentation greatly increases the amount of data obtained based on labeled data to save on expenses and labor for data collection and labeling. We present a new approach for data augmentation called nine-dot MLS (ND-MLS). This approach is proposed based on the idea of image defor-mation. Images are deformed based on control points, which are calculated by ND-MLS. The method can generate over 2000 images for one exist-ing dataset in a short time. To verify this data augmentation method, extensive tests were performed covering 3 main tasks of computer vision, namely, classification, detection and segmentation. The results show that 1) in classification, 10 images per category were used for training, and VGGNet can obtain 92% top-1 acc on the MNIST dataset of handwritten digits by ND-MLS. In the Omniglot dataset, the few-shot accuracy usu-ally decreases with the increase in character categories. However, the ND-MLS method has stable performance and obtains 96.5 top-1 acc in Res-Net on 100 different handwritten character classification tasks; 2) in segmentation, under the premise of only ten original images, DeepLab obtains 93.5%, 85%, and 73.3% m_IOU(10) on the bottle, horse, and grass test datasets, respectively, while the cat test dataset obtains 86.7% m_IOU(10) with the SegNet model; 3) with only 10 original images from each category in object detection, YOLO v4 obtains 100% and 97.2% bottle and horse detection, respectively, while the cat dataset obtains 93.6% with YOLO v3. In summary, ND-MLS can perform well on classification, object detec-tion, and semantic segmentation tasks by using only a few data.

ssFPN: Scale Sequence (S^2) Feature Based Feature Pyramid Network for Object Detection

  • Authors: Hye-Jin Park, Young-Ju Choi, Young-Woon Lee, Byung-Gyu Kim
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.11533
  • Pdf link: https://arxiv.org/pdf/2208.11533
  • Abstract Feature Pyramid Network (FPN) has been an essential module for object detection models to consider various scales of an object. However, average precision (AP) on small objects is relatively lower than AP on medium and large objects. The reason is why the deeper layer of CNN causes information loss as feature extraction level. We propose a new scale sequence (S^2) feature extraction of FPN to strengthen feature information of small objects. We consider FPN structure as scale-space and extract scale sequence (S^2) feature by 3D convolution on the level axis of FPN. It is basically scale invariant feature and is built on high-resolution pyramid feature map for small objects. Furthermore, the proposed S^2 feature can be extended to most object detection models based on FPN. We demonstrate the proposed S2 feature can improve the performance of both one-stage and two-stage detectors on MS COCO dataset. Based on the proposed S2 feature, we achieve upto 1.3% and 1.1% of AP improvement for YOLOv4-P5 and YOLOv4-P6, respectively. For Faster RCNN and Mask R-CNN, we observe upto 2.0% and 1.6% of AP improvement with the suggested S^2 feature, respectively.

Motion Robust High-Speed Light-weighted Object Detection with Event Camera

  • Authors: Bingde Liu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.11602
  • Pdf link: https://arxiv.org/pdf/2208.11602
  • Abstract The event camera produces a large dynamic range event stream with a very high temporal resolution discarding redundant visual information, thus bringing new possibilities for object detection tasks. However, the existing methods of applying the event camera to object detection tasks using deep learning methods still have many problems. First, existing methods cannot take into account objects with different velocities relative to the motion of the event camera due to the global synchronized time window and temporal resolution. Second, most of the existing methods rely on large parameter neural networks, which implies a large computational burden and low inference speed, thus contrary to the high temporal resolution of the event stream. In our work, we design a high-speed lightweight detector called Agile Event Detector (AED) with a simple but effective data augmentation method. Also, we propose an event stream representation tensor called Temporal Active Focus (TAF), which takes full advantage of the asynchronous generation of event stream data and is robust to the motion of moving objects. It can also be constructed without much time-consuming. We further propose a module called the Bifurcated Folding Module (BFM) to extract the rich temporal information in the TAF tensor at the input layer of the AED detector. We conduct our experiments on two typical real-scene event camera object detection datasets: the complete Prophesee GEN1 Automotive Detection Dataset and the Prophesee 1 MEGAPIXEL Automotive Detection Dataset with partial annotation. Experiments show that our method is competitive in terms of accuracy, speed, and the number of parameters simultaneously. Also by classifying the objects into multiple motion levels based on the optical flow density metric, we illustrated the robustness of our method for objects with different velocities relative to the camera.

Detecting the unknown in Object Detection

  • Authors: Dario Fontanel, Matteo Tarantino, Fabio Cermelli, Barbara Caputo
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.11641
  • Pdf link: https://arxiv.org/pdf/2208.11641
  • Abstract Object detection methods have witnessed impressive improvements in the last years thanks to the design of novel neural network architectures and the availability of large scale datasets. However, current methods have a significant limitation: they are able to detect only the classes observed during training time, that are only a subset of all the classes that a detector may encounter in the real world. Furthermore, the presence of unknown classes is often not considered at training time, resulting in methods not even able to detect that an unknown object is present in the image. In this work, we address the problem of detecting unknown objects, known as open-set object detection. We propose a novel training strategy, called UNKAD, able to predict unknown objects without requiring any annotation of them, exploiting non annotated objects that are already present in the background of training images. In particular, exploiting the four-steps training strategy of Faster R-CNN, UNKAD first identifies and pseudo-labels unknown objects and then uses the pseudo-annotations to train an additional unknown class. While UNKAD can directly detect unknown objects, we further combine it with previous unknown detection techniques, showing that it improves their performance at no costs.

AGO-Net: Association-Guided 3D Point Cloud Object Detection Network

  • Authors: Liang Du, Xiaoqing Ye, Xiao Tan, Edward Johns, Bo Chen, Errui Ding, Xiangyang Xue, Jianfeng Feng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.11658
  • Pdf link: https://arxiv.org/pdf/2208.11658
  • Abstract The human brain can effortlessly recognize and localize objects, whereas current 3D object detection methods based on LiDAR point clouds still report inferior performance for detecting occluded and distant objects: the point cloud appearance varies greatly due to occlusion, and has inherent variance in point densities along the distance to sensors. Therefore, designing feature representations robust to such point clouds is critical. Inspired by human associative recognition, we propose a novel 3D detection framework that associates intact features for objects via domain adaptation. We bridge the gap between the perceptual domain, where features are derived from real scenes with sub-optimal representations, and the conceptual domain, where features are extracted from augmented scenes that consist of non-occlusion objects with rich detailed information. A feasible method is investigated to construct conceptual scenes without external datasets. We further introduce an attention-based re-weighting module that adaptively strengthens the feature adaptation of more informative regions. The network's feature enhancement ability is exploited without introducing extra cost during inference, which is plug-and-play in various 3D detection frameworks. We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed. Experiments on nuScenes and Waymo datasets also validate the versatility of our method.

Bugs in the Data: How ImageNet Misrepresents Biodiversity

  • Authors: Alexandra Sasha Luccioni, David Rolnick
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.11695
  • Pdf link: https://arxiv.org/pdf/2208.11695
  • Abstract ImageNet-1k is a dataset often used for benchmarking machine learning (ML) models and evaluating tasks such as image recognition and object detection. Wild animals make up 27% of ImageNet-1k but, unlike classes representing people and objects, these data have not been closely scrutinized. In the current paper, we analyze the 13,450 images from 269 classes that represent wild animals in the ImageNet-1k validation set, with the participation of expert ecologists. We find that many of the classes are ill-defined or overlapping, and that 12% of the images are incorrectly labeled, with some classes having >90% of images incorrect. We also find that both the wildlife-related labels and images included in ImageNet-1k present significant geographical and cultural biases, as well as ambiguities such as artificial animals, multiple species in the same image, or the presence of humans. Our findings highlight serious issues with the extensive use of this dataset for evaluating ML systems, the use of such algorithms in wildlife-related tasks, and more broadly the ways in which ML datasets are commonly created and curated.

Keyword: transformer

The Effects of Correctly Modeling Generator Step-Up Transformer Status in Geomagnetic Disturbance Studies

  • Authors: Jessica L. Wert, Pooria Dehghanian, Jonathan Snodgrass, Thomas J. Overbye
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2208.11232
  • Pdf link: https://arxiv.org/pdf/2208.11232
  • Abstract In order to correctly model the impacts of geomagnetically induced current (GIC) flows, the generator step-up (GSU) transformer status must be properly modeled. In power flow studies, generators are typically removed from service without disconnecting their GSU transformers since the GSU transformer status has little to no impact on the power flow result. In reality, removing a generator from service involves also removing its GSU transformer from service. This difference presents a discrepancy between simulated behavior and system observations during geomagnetic disturbance (GMD) events. This paper presents GMD case studies on 2000-bus and 24,000-bus systems in which reactive power losses and geomagnetically induced currents are compared across scenarios in which the GSU transformers for the disconnected generators are either in-service or out-of-service. The results demonstrate a 3.2 to 15.5 percent error in reactive power losses when the GSU status is modeled incorrectly. Discrepancies of up to 95 A per phase for branch GIC flows and 450 A for transformer neutral GIC flows are also observed and visualized.

SwinFIR: Revisiting the SwinIR with Fast Fourier Convolution and Improved Training for Image Super-Resolution

  • Authors: Dafeng Zhang, Feiyu Huang, Shizhuo Liu, Xiaobing Wang, Zhezhu Jin
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.11247
  • Pdf link: https://arxiv.org/pdf/2208.11247
  • Abstract Transformer-based methods have achieved impressive image restoration performance due to their capacities to model long-range dependency compared to CNN-based methods. However, advances like SwinIR adopts the window-based and local attention strategy to balance the performance and computational overhead, which restricts employing large receptive fields to capture global information and establish long dependencies in the early layers. To further improve the efficiency of capturing global information, in this work, we propose SwinFIR to extend SwinIR by replacing Fast Fourier Convolution (FFC) components, which have the image-wide receptive field. We also revisit other advanced techniques, i.e, data augmentation, pre-training, and feature ensemble to improve the effect of image reconstruction. And our feature ensemble method enables the performance of the model to be considerably enhanced without increasing the training and testing time. We applied our algorithm on multiple popular large-scale benchmarks and achieved state-of-the-art performance comparing to the existing methods. For example, our SwinFIR achieves the PSNR of 32.83 dB on Manga109 dataset, which is 0.8 dB higher than the state-of-the-art SwinIR method.

FashionVQA: A Domain-Specific Visual Question Answering System

  • Authors: Min Wang, Ata Mahjoubfar, Anupama Joshi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.11253
  • Pdf link: https://arxiv.org/pdf/2208.11253
  • Abstract Humans apprehend the world through various sensory modalities, yet language is their predominant communication channel. Machine learning systems need to draw on the same multimodal richness to have informed discourses with humans in natural language; this is particularly true for systems specialized in visually-dense information, such as dialogue, recommendation, and search engines for clothing. To this end, we train a visual question answering (VQA) system to answer complex natural language questions about apparel in fashion photoshoot images. The key to the successful training of our VQA model is the automatic creation of a visual question-answering dataset with 168 million samples from item attributes of 207 thousand images using diverse templates. The sample generation employs a strategy that considers the difficulty of the question-answer pairs to emphasize challenging concepts. Contrary to the recent trends in using several datasets for pretraining the visual question answering models, we focused on keeping the dataset fixed while training various models from scratch to isolate the improvements from model architecture changes. We see that using the same transformer for encoding the question and decoding the answer, as in language models, achieves maximum accuracy, showing that visual language models (VLMs) make the best visual question answering systems for our dataset. The accuracy of the best model surpasses the human expert level, even when answering human-generated questions that are not confined to the template formats. Our approach for generating a large-scale multimodal domain-specific dataset provides a path for training specialized models capable of communicating in natural language. The training of such domain-expert models, e.g., our fashion VLM model, cannot rely solely on the large-scale general-purpose datasets collected from the web.

Molecular Substructure-Aware Network for Drug-Drug Interaction Prediction

  • Authors: Xinyu Zhu, Yongliang Shen, Weiming Lu
  • Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.11267
  • Pdf link: https://arxiv.org/pdf/2208.11267
  • Abstract Concomitant administration of drugs can cause drug-drug interactions (DDIs). Some drug combinations are beneficial, but other ones may cause negative effects which are previously unrecorded. Previous works on DDI prediction usually rely on hand-engineered domain knowledge, which is laborious to obtain. In this work, we propose a novel model, Molecular Substructure-Aware Network (MSAN), to effectively predict potential DDIs from molecular structures of drug pairs. We adopt a Transformer-like substructure extraction module to acquire a fixed number of representative vectors that are associated with various substructure patterns of the drug molecule. Then, interaction strength between the two drugs' substructures will be captured by a similarity-based interaction module. We also perform a substructure dropping augmentation before graph encoding to alleviate overfitting. Experimental results from a real-world dataset reveal that our proposed model achieves the state-of-the-art performance. We also show that the predictions of our model are highly interpretable through a case study.

Long Code for Code Search

  • Authors: Fan Hu, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Xirong Li
  • Subjects: Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2208.11271
  • Pdf link: https://arxiv.org/pdf/2208.11271
  • Abstract Thanks to the Transformer-based pretraining models, the performance of code search has been improved significantly. However, due to the restriction of multi-head self-attention and GPU memory, there is a limit on the input token length. The existing pretrained code models, such as GraphCodeBERT, CodeBERT, RoBERTa (code), take the first 256 tokens by default, which makes them unable to represent the complete information of long code (i.e., code that is greater than 256 tokens). Unlike the long text document that can be regarded as a whole with complete semantics, the semantics of long code is discontinuous as a piece of long code may contain different code modules. Therefore, it is unreasonable to directly apply the long text processing methods to long code. To tackle the long code problem, we propose MLCS (Modeling Long Code for Code Search) to obtain a better representation for long code. Our experimental results show the effectiveness of MLCS for long code retrieval. With MLCS, we could use Transformer-based pretraining models to model long code without changing their internal structure and re-pretraining. Through AST-based splitting and attention-based fusion methods, MLCS achieves an overall mean reciprocal ranking (MRR) score of 0.785, outperforming the previous state-of-the-art result of 0.713 on the public CodeSearchNet benchmark.

Federated Self-Supervised Contrastive Learning and Masked Autoencoder for Dermatological Disease Diagnosis

  • Authors: Yawen Wu, Dewen Zeng, Zhepeng Wang, Yi Sheng, Lei Yang, Alaina J. James, Yiyu Shi, Jingtong Hu
  • Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2208.11278
  • Pdf link: https://arxiv.org/pdf/2208.11278
  • Abstract In dermatological disease diagnosis, the private data collected by mobile dermatology assistants exist on distributed mobile devices of patients. Federated learning (FL) can use decentralized data to train models while keeping data local. Existing FL methods assume all the data have labels. However, medical data often comes without full labels due to high labeling costs. Self-supervised learning (SSL) methods, contrastive learning (CL) and masked autoencoders (MAE), can leverage the unlabeled data to pre-train models, followed by fine-tuning with limited labels. However, combining SSL and FL has unique challenges. For example, CL requires diverse data but each device only has limited data. For MAE, while Vision Transformer (ViT) based MAE has higher accuracy over CNNs in centralized learning, MAE's performance in FL with unlabeled data has not been investigated. Besides, the ViT synchronization between the server and clients is different from traditional CNNs. Therefore, special synchronization methods need to be designed. In this work, we propose two federated self-supervised learning frameworks for dermatological disease diagnosis with limited labels. The first one features lower computation costs, suitable for mobile devices. The second one features high accuracy and fits high-performance servers. Based on CL, we proposed federated contrastive learning with feature sharing (FedCLF). Features are shared for diverse contrastive information without sharing raw data for privacy. Based on MAE, we proposed FedMAE. Knowledge split separates the global and local knowledge learned from each client. Only global knowledge is aggregated for higher generalization performance. Experiments on dermatological disease datasets show superior accuracy of the proposed frameworks over state-of-the-arts.

K-Order Graph-oriented Transformer with GraAttention for 3D Pose and Shape Estimation

  • Authors: Weixi Zhao, Weiqiang Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.11328
  • Pdf link: https://arxiv.org/pdf/2208.11328
  • Abstract We propose a novel attention-based 2D-to-3D pose estimation network for graph-structured data, named KOG-Transformer, and a 3D pose-to-shape estimation network for hand data, named GASE-Net. Previous 3D pose estimation methods have focused on various modifications to the graph convolution kernel, such as abandoning weight sharing or increasing the receptive field. Some of these methods employ attention-based non-local modules as auxiliary modules. In order to better model the relationship between nodes in graph-structured data and fuse the information of different neighbor nodes in a differentiated way, we make targeted modifications to the attention module and propose two modules designed for graph-structured data, graph relative positional encoding multi-head self-attention (GR-MSA) and K-order graph-oriented multi-head self-attention (KOG-MSA). By stacking GR-MSA and KOG-MSA, we propose a novel network KOG-Transformer for 2D-to-3D pose estimation. Furthermore, we propose a network for shape estimation on hand data, called GraAttention shape estimation network (GASE-Net), which takes a 3D pose as input and gradually models the shape of the hand from sparse to dense. We have empirically shown the superiority of KOG-Transformer through extensive experiments. Experimental results show that KOG-Transformer significantly outperforms the previous state-of-the-art methods on the benchmark dataset Human3.6M. We evaluate the effect of GASE-Net on two public available hand datasets, ObMan and InterHand2.6M. GASE-Net can predict the corresponding shape for input pose with strong generalization ability.

Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors

  • Authors: Gongjie Zhang, Zhipeng Luo, Yingchen Yu, Zichen Tian, Jingyi Zhang, Shijian Lu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2208.11356
  • Pdf link: https://arxiv.org/pdf/2208.11356
  • Abstract Multi-scale features have been proven highly effective for object detection, and most ConvNet-based object detectors adopt Feature Pyramid Network (FPN) as a basic component for exploiting multi-scale features. However, for the recently proposed Transformer-based object detectors, directly incorporating multi-scale features leads to prohibitive computational overhead due to the high complexity of the attention mechanism for processing high-resolution features. This paper presents Iterative Multi-scale Feature Aggregation (IMFA) -- a generic paradigm that enables the efficient use of multi-scale features in Transformer-based object detectors. The core idea is to exploit sparse multi-scale features from just a few crucial locations, and it is achieved with two novel designs. First, IMFA rearranges the Transformer encoder-decoder pipeline so that the encoded features can be iteratively updated based on the detection predictions. Second, IMFA sparsely samples scale-adaptive features for refined detection from just a few keypoint locations under the guidance of prior detection predictions. As a result, the sampled multi-scale features are sparse yet still highly beneficial for object detection. Extensive experiments show that the proposed IMFA boosts the performance of multiple Transformer-based object detectors significantly yet with slight computational overhead. Project page: https://github.com/ZhangGongjie/IMFA.

Transformer-Boosted Anomaly Detection with Fuzzy Hashes

  • Authors: Frieder Uhlig, Lukas Struppek, Dominik Hintersdorf, Kristian Kersting
  • Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.11367
  • Pdf link: https://arxiv.org/pdf/2208.11367
  • Abstract Fuzzy hashes are an important tool in digital forensics and are used in approximate matching to determine the similarity between digital artifacts. They translate the byte code of files into computable strings, which makes them particularly interesting for intelligent machine processing. In this work, we propose deep learning approximate matching (DLAM), which achieves much higher accuracy in detecting anomalies in fuzzy hashes than conventional approaches. In addition to the well-known application for clustering malware, we show that fuzzy hashes and deep learning are indeed well-suited to classify files according to the presence of certain content, e.g., malware. DLAM relies on transformer-based models from the field of natural language processing and outperforms existing methods. Traditional fuzzy hashes like TLSH and ssdeep have a limited size and fail to detect file anomalies if they are relatively small compared to the overall file size. DLAM, however, enables the detection of such file correlations in the computed fuzzy hashes of TLSH and ssdeep, even for anomaly sizes of less than 15%. It achieves comparable results to state-of-the-art fuzzy hashing algorithms while relying on more efficient hash computations and can, therefore, be used at a much larger scale.

Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers

  • Authors: Paul Primus, Gerhard Widmer
  • Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2208.11402
  • Pdf link: https://arxiv.org/pdf/2208.11402
  • Abstract Standard machine learning models for tagging and classifying acoustic signals cannot handle classes that were not seen during training. Zero-Shot (ZS) learning overcomes this restriction by predicting classes based on adaptable class descriptions. This study sets out to investigate the effectiveness of self-attention-based audio embedding architectures for ZS learning. To this end, we compare the very recent patchout spectrogram transformer with two classic convolutional architectures. We evaluate these three architectures on three tasks and on three different benchmark datasets: general-purpose tagging on AudioSet, environmental sound classification on ESC-50, and instrument tagging on OpenMIC. Our results show that the self-attention-based embedding methods outperform both compared convolutional architectures in all of these settings. By designing training and test data accordingly, we observe that prediction performance suffers significantly when the `semantic distance' between training and new test classes is large, an effect that will deserve more detailed investigations.

Induced Natural Language Rationales and Interleaved Markup Tokens Enable Extrapolation in Large Language Models

  • Authors: Mirelle Bueno, Carlos Gemmel, Jeffrey Dalton, Roberto Lotufo, Rodrigo Nogueira
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.11445
  • Pdf link: https://arxiv.org/pdf/2208.11445
  • Abstract The ability to extrapolate, i.e., to make predictions on sequences that are longer than those presented as training examples, is a challenging problem for current deep learning models. Recent work shows that this limitation persists in state-of-the-art Transformer-based models. Most solutions to this problem use specific architectures or training methods that do not generalize to other tasks. We demonstrate that large language models can succeed in extrapolation without modifying their architecture or training procedure. Experimental results show that generating step-by-step rationales and introducing marker tokens are both required for effective extrapolation. First, we induce it to produce step-by-step rationales before outputting the answer to effectively communicate the task to the model. However, as sequences become longer, we find that current models struggle to keep track of token positions. To address this issue, we interleave output tokens with markup tokens that act as explicit positional and counting symbols. Our findings show how these two complementary approaches enable remarkable sequence extrapolation and highlight a limitation of current architectures to effectively generalize without explicit surface form guidance. Code available at https://github.com/MirelleB/induced-rationales-markup-tokens

An End-to-End OCR Framework for Robust Arabic-Handwriting Recognition using a Novel Transformers-based Model and an Innovative 270 Million-Words Multi-Font Corpus of Classical Arabic with Diacritics

  • Authors: Aly Mostafa, Omar Mohamed, Ali Ashraf, Ahmed Elbehery, Salma Jamal, Anas Salah, Amr S. Ghoneim
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.11484
  • Pdf link: https://arxiv.org/pdf/2208.11484
  • Abstract This research is the second phase in a series of investigations on developing an Optical Character Recognition (OCR) of Arabic historical documents and examining how different modeling procedures interact with the problem. The first research studied the effect of Transformers on our custom-built Arabic dataset. One of the downsides of the first research was the size of the training data, a mere 15000 images from our 30 million images, due to lack of resources. Also, we add an image enhancement layer, time and space optimization, and Post-Correction layer to aid the model in predicting the correct word for the correct context. Notably, we propose an end-to-end text recognition approach using Vision Transformers as an encoder, namely BEIT, and vanilla Transformer as a decoder, eliminating CNNs for feature extraction and reducing the model's complexity. The experiments show that our end-to-end model outperforms Convolutions Backbones. The model attained a CER of 4.46%.

Diverse Title Generation for Stack Overflow Posts with Multiple Sampling Enhanced Transformer

  • Authors: Fengji Zhang, Jin Liu, Yao Wan, Xiao Yu, Xiao Liu, Jacky Keung
  • Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.11523
  • Pdf link: https://arxiv.org/pdf/2208.11523
  • Abstract Stack Overflow is one of the most popular programming communities where developers can seek help for their encountered problems. Nevertheless, if inexperienced developers fail to describe their problems clearly, it is hard for them to attract sufficient attention and get the anticipated answers. We propose M$_3$NSCT5, a novel approach to automatically generate multiple post titles from the given code snippets. Developers may use the generated titles to find closely related posts and complete their problem descriptions. M$_3$NSCT5 employs the CodeT5 backbone, which is a pre-trained Transformer model having an excellent language understanding and generation ability. To alleviate the ambiguity issue that the same code snippets could be aligned with different titles under varying contexts, we propose the maximal marginal multiple nucleus sampling strategy to generate multiple high-quality and diverse title candidates at a time for the developers to choose from. We build a large-scale dataset with 890,000 question posts covering eight programming languages to validate the effectiveness of M$_3$NSCT5. The automatic evaluation results on the BLEU and ROUGE metrics demonstrate the superiority of M$_3$NSCT5 over six state-of-the-art baseline models. Moreover, a human evaluation with trustworthy results also demonstrates the great potential of our approach for real-world application.

A model-based approach to meta-Reinforcement Learning: Transformers and tree search

  • Authors: Brieuc Pinon, Jean-Charles Delvenne, Raphaël Jungers
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.11535
  • Pdf link: https://arxiv.org/pdf/2208.11535
  • Abstract Meta-learning is a line of research that develops the ability to leverage past experiences to efficiently solve new learning problems. Meta-Reinforcement Learning (meta-RL) methods demonstrate a capability to learn behaviors that efficiently acquire and exploit information in several meta-RL problems. In this context, the Alchemy benchmark has been proposed by Wang et al. [2021]. Alchemy features a rich structured latent space that is challenging for state-of-the-art model-free RL methods. These methods fail to learn to properly explore then exploit. We develop a model-based algorithm. We train a model whose principal block is a Transformer Encoder to fit the symbolic Alchemy environment dynamics. Then we define an online planner with the learned model using a tree search method. This algorithm significantly outperforms previously applied model-free RL methods on the symbolic Alchemy problem. Our results reveal the relevance of model-based approaches with online planning to perform exploration and exploitation successfully in meta-RL. Moreover, we show the efficiency of the Transformer architecture to learn complex dynamics that arise from latent spaces present in meta-RL problems.

On a Built-in Conflict between Deep Learning and Systematic Generalization

  • Authors: Yuanpeng Li
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.11633
  • Pdf link: https://arxiv.org/pdf/2208.11633
  • Abstract In this paper, we hypothesize that internal function sharing is one of the reasons to weaken o.o.d. or systematic generalization in deep learning for classification tasks. Under equivalent prediction, a model partitions an input space into multiple parts separated by boundaries. The function sharing prefers to reuse boundaries, leading to fewer parts for new outputs, which conflicts with systematic generalization. We show such phenomena in standard deep learning models, such as fully connected, convolutional, residual networks, LSTMs, and (Vision) Transformers. We hope this study provides novel insights into systematic generalization and forms a basis for new research directions.

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

DongZhouGu avatar Aug 25 '22 04:08 DongZhouGu