New submissions for Mon, 7 Nov 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Efficient Information Sharing in ICT Supply Chain Social Network via Table Structure Recognition

  • Authors: Bin Xiao, Yakup Akkaya, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/2211.02128
  • Pdf link: https://arxiv.org/pdf/2211.02128
  • Abstract The global Information and Communications Technology (ICT) supply chain is a complex network consisting of all types of participants. It is often formulated as a Social Network to discuss the supply chain network's relations, properties, and development in supply chain management. Information sharing plays a crucial role in improving the efficiency of the supply chain, and datasheets are the most common data format to describe e-component commodities in the ICT supply chain because of human readability. However, with the surging number of electronic documents, it has been far beyond the capacity of human readers, and it is also challenging to process tabular data automatically because of the complex table structures and heterogeneous layouts. Table Structure Recognition (TSR) aims to represent tables with complex structures in a machine-interpretable format so that the tabular data can be processed automatically. In this paper, we formulate TSR as an object detection problem and propose to generate an intuitive representation of a complex table structure to enable structuring of the tabular data related to the commodities. To cope with border-less and small layouts, we propose a cost-sensitive loss function by considering the detection difficulty of each class. Besides, we propose a novel anchor generation method using the character of tables that columns in a table should share an identical height, and rows in a table should share the same width. We implement our proposed method based on Faster-RCNN and achieve 94.79% on mean Average Precision (AP), and consistently improve more than 1.5% AP for different benchmark models.
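
The abstract mentions a cost-sensitive loss that accounts for the detection difficulty of each class but does not give its form. One common way to make a classification loss cost-sensitive is to scale the cross-entropy by a per-class weight. The sketch below illustrates that generic idea only; the function name and the weight values are assumptions, not the paper's formulation.

```python
import math


def weighted_cross_entropy(probs, target, class_weights):
    """Negative log-likelihood of the target class, scaled by that class's weight.

    probs: predicted class probabilities for one sample (sums to 1).
    target: index of the true class.
    class_weights: hypothetical per-class costs; harder classes get larger values.
    """
    return -class_weights[target] * math.log(probs[target])


# Same prediction quality, but class 0 is treated as twice as costly to miss.
loss_hard = weighted_cross_entropy([0.5, 0.5], target=0, class_weights=[2.0, 1.0])
loss_easy = weighted_cross_entropy([0.5, 0.5], target=1, class_weights=[2.0, 1.0])
print(loss_hard > loss_easy)
```

With weights like these, errors on the class assumed to be harder contribute more to the total loss, which pushes training toward that class.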

Deep Learning based Defect classification and detection in SEM images: A Mask R-CNN approach

  • Authors: Bappaditya Dey, Enrique Dehaerne, Kasem Khalil, Sandip Halder, Philippe Leray, Magdy A. Bayoumi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2211.02185
  • Pdf link: https://arxiv.org/pdf/2211.02185
  • Abstract In this research work, we have demonstrated the application of Mask-RCNN (Regional Convolutional Neural Network), a deep-learning algorithm for computer vision and specifically object detection, to the semiconductor defect inspection domain. Stochastic defect detection and classification during semiconductor manufacturing has grown to be a challenging task as we continuously shrink circuit pattern dimensions (e.g., for pitches less than 32 nm). Defect inspection and analysis by state-of-the-art optical and e-beam inspection tools is generally driven by some rule-based techniques, which in turn often cause misclassification and thereby necessitate human expert intervention. In this work, we have revisited and extended our previous deep learning-based defect classification and detection method towards improved defect instance segmentation in SEM images with precise extent of defect as well as generating a mask for each defect category/instance. This also enables us to extract and calibrate each segmented mask and quantify the pixels that make up each mask, which in turn enables us to count each categorical defect instance as well as to calculate the surface area in terms of pixels. We are aiming at detecting and segmenting different types of inter-class stochastic defect patterns such as bridge, break, and line collapse as well as to differentiate accurately between intra-class multi-categorical defect bridge scenarios (as thin/single/multi-line/horizontal/non-horizontal) for aggressive pitches as well as thin resists (High NA applications). Our proposed approach demonstrates its effectiveness both quantitatively and qualitatively.
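
The pixel counting described in the abstract can be pictured with a small sketch: given one binary mask per detected instance, count the instances per defect category and sum the mask pixels to get an area. The data layout (a list of `(category, mask)` pairs) and the function name are assumptions for illustration, not the authors' code.

```python
from collections import Counter


def summarize_masks(instances):
    """Count instances per category and compute each mask's area in pixels.

    instances: list of (category, mask) pairs, where mask is a list of rows
    containing 0/1 values (1 marks a pixel that belongs to the defect).
    """
    counts = Counter()
    areas = []
    for category, mask in instances:
        counts[category] += 1
        areas.append((category, sum(sum(row) for row in mask)))
    return counts, areas


# Two hypothetical "bridge" defects and one "break" defect on a 2x3 grid.
instances = [
    ("bridge", [[0, 1, 1], [0, 1, 0]]),
    ("bridge", [[1, 0, 0], [0, 0, 0]]),
    ("break", [[0, 0, 0], [1, 1, 1]]),
]
counts, areas = summarize_masks(instances)
print(counts["bridge"], counts["break"])  # 2 1
print(areas)  # [('bridge', 3), ('bridge', 1), ('break', 3)]
```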

SSDA-YOLO: Semi-supervised Domain Adaptive YOLO for Cross-Domain Object Detection

  • Authors: Huayi Zhou, Fei Jiang, Hongtao Lu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.02213
  • Pdf link: https://arxiv.org/pdf/2211.02213
  • Abstract Domain adaptive object detection (DAOD) aims to alleviate transfer performance degradation caused by the cross-domain discrepancy. However, most existing DAOD methods are dominated by computationally intensive two-stage detectors, which are not the first choice for industrial applications. In this paper, we propose a novel semi-supervised domain adaptive YOLO (SSDA-YOLO) based method to improve cross-domain detection performance by integrating the compact one-stage detector YOLOv5 with domain adaptation. Specifically, we adapt the knowledge distillation framework with the Mean Teacher model to assist the student model in obtaining instance-level features of the unlabeled target domain. We also utilize the scene style transfer to cross-generate pseudo images in different domains for remedying image-level differences. In addition, an intuitive consistency loss is proposed to further align cross-domain predictions. We evaluate our proposed SSDA-YOLO on public benchmarks including PascalVOC, Clipart1k, Cityscapes, and Foggy Cityscapes. Moreover, to verify its generalization, we conduct experiments on yawning detection datasets collected from various classrooms. The results show considerable improvements of our method in these DAOD tasks. Our code is available at https://github.com/hnuzhy/SSDA-YOLO.
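
In the Mean Teacher scheme referred to above, the teacher's weights are usually kept as an exponential moving average (EMA) of the student's weights. The sketch below shows that update on plain lists of floats. The decay value and the flat-list weight layout are assumptions chosen for readability, and the abstract does not state the settings SSDA-YOLO uses.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved toward the student weights by one EMA step."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


teacher = [0.0, 1.0]
student = [1.0, 1.0]
for _ in range(3):
    # A small decay is used here only to make the drift visible in three steps.
    teacher = ema_update(teacher, student, decay=0.5)
print(teacher)  # [0.875, 1.0]
```

A decay close to 1 makes the teacher change slowly, so its predictions on unlabeled target images are smoother than the student's.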

Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models

  • Authors: Chengcheng Ma, Yang Liu, Jiankang Deng, LingXi Xie, Weiming Dong, Changsheng Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.02219
  • Pdf link: https://arxiv.org/pdf/2211.02219
  • Abstract Pre-trained Vision-Language Models (VLMs) such as CLIP have shown impressive generalization capability in downstream vision tasks with appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has been recently proposed to learn continuous prompts using task-specific training data. Despite the performance improvements on downstream tasks, several studies have reported that CoOp suffers from the overfitting issue in two aspects: (i) the test accuracy on base classes first gets better and then gets worse during training; (ii) the test accuracy on novel classes keeps decreasing. However, none of the existing studies can understand and mitigate this overfitting problem effectively. In this paper, we first explore the cause of overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable and spurious features in the early and later training stages respectively, leading to the non-overfitting and overfitting phenomenon. Given those observations, we propose Subspace Prompt Tuning (SubPT) to project the gradients in back-propagation onto the low-rank subspace spanned by the early-stage gradient flow eigenvectors during the entire training process, and successfully eliminate the overfitting problem. Besides, we equip CoOp with Novel Feature Learner (NFL) to enhance the generalization ability of the learned prompts onto novel categories beyond the training set, without needing image training data. Extensive experiments on 11 classification datasets demonstrate that SubPT+NFL consistently boosts the performance of CoOp and outperforms the state-of-the-art approach CoCoOp. Experiments on more challenging vision downstream tasks including open-vocabulary object detection and zero-shot semantic segmentation also verify the effectiveness of the proposed method. Codes can be found at https://tinyurl.com/mpe64f89.
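
The projection step in SubPT can be illustrated in isolation: given an orthonormal basis for a low-rank subspace, a gradient is projected by summing its components along each basis vector. In this sketch the basis is simply handed in. How the paper obtains it from early-stage gradients is not reproduced here, and the function name is hypothetical.

```python
def project_onto_subspace(grad, basis):
    """Project grad onto span(basis); the basis vectors are assumed orthonormal."""
    projected = [0.0] * len(grad)
    for u in basis:
        # Component of the gradient along this basis direction.
        coeff = sum(g * x for g, x in zip(grad, u))
        projected = [p + coeff * x for p, x in zip(projected, u)]
    return projected


# A 3-D gradient restricted to the plane spanned by the first two axes.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(project_onto_subspace([3.0, 4.0, 5.0], basis))  # [3.0, 4.0, 0.0]
```

The component outside the subspace (here the third coordinate) is discarded, so an optimizer step using the projected gradient can only move the parameters within the chosen directions.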

PP-YOLOE-R: An Efficient Anchor-Free Rotated Object Detector

  • Authors: Xinxin Wang, Guanzhong Wang, Qingqing Dang, Yi Liu, Xiaoguang Hu, Dianhai Yu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.02386
  • Pdf link: https://arxiv.org/pdf/2211.02386
  • Abstract Arbitrary-oriented object detection is a fundamental task in visual scenes involving aerial images and scene text. In this report, we present PP-YOLOE-R, an efficient anchor-free rotated object detector based on PP-YOLOE. We introduce a bag of useful tricks in PP-YOLOE-R to improve detection precision with marginal extra parameters and computational cost. As a result, PP-YOLOE-R-l and PP-YOLOE-R-x achieve 78.14 and 78.28 mAP respectively on DOTA 1.0 dataset with single-scale training and testing, which outperform almost all other rotated object detectors. With multi-scale training and testing, PP-YOLOE-R-l and PP-YOLOE-R-x further improve the detection precision to 80.02 and 80.73 mAP. In this case, PP-YOLOE-R-x surpasses all anchor-free methods and demonstrates competitive performance to state-of-the-art anchor-based two-stage models. Further, PP-YOLOE-R is deployment friendly and PP-YOLOE-R-s/m/l/x can reach 69.8/55.1/48.3/37.1 FPS respectively on RTX 2080 Ti with TensorRT and FP16-precision. Source code and pre-trained models are available at https://github.com/PaddlePaddle/PaddleDetection, which is powered by https://github.com/PaddlePaddle/Paddle.

Keyword: transformer

Real-Time Target Sound Extraction

  • Authors: Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota
  • Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2211.02250
  • Pdf link: https://arxiv.org/pdf/2211.02250
  • Abstract We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner, while also benefiting from the performance transformer-based architectures provide. Our evaluations show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2-4x smaller model size and a 1.5-2x lower runtime. Open-source code and datasets: https://github.com/vb000/Waveformer
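
A dilated causal convolution, the building block the encoder is described as stacking, computes each output sample from the current and earlier input samples only, spaced `dilation` steps apart. The following is a minimal single-channel sketch in plain Python with zero padding on the left. It is not the Waveformer implementation, and the kernel values are arbitrary.

```python
def dilated_causal_conv1d(x, kernel, dilation=1):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before the start of the signal are treated as zero, so y[t]
    never depends on inputs later than time t.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


print(dilated_causal_conv1d([1, 2, 3, 4], kernel=[1, 1], dilation=2))
# [1.0, 2.0, 4.0, 6.0]
```

Because no future samples are needed, a layer like this can run on a stream as audio arrives, and increasing the dilation from layer to layer widens the receptive field without adding kernel weights.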

OSIC: A New One-Stage Image Captioner Coined

  • Authors: Bo Wang, Zhao Zhang, Mingbo Zhao, Xiaojie Jin, Mingliang Xu, Meng Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.02321
  • Pdf link: https://arxiv.org/pdf/2211.02321
  • Abstract Mainstream image caption models are usually two-stage captioners, i.e., calculating object features by pre-trained detector, and feeding them into a language model to generate text descriptions. However, such an operation will cause a task-based information gap to decrease the performance, since the object features in detection task are suboptimal representation and cannot provide all necessary information for subsequent text generation. Besides, object features are usually represented by the last layer features that lose the local details of input images. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms input image into descriptive sentences in one stage. As a result, the task-based information gap can be greatly reduced. To obtain rich features, we use the Swin Transformer to calculate multi-level features, and then feed them into a novel dynamic multi-sight embedding module to exploit both global structure and local texture of input images. To enhance the global modeling of encoder for caption, we propose a new dual-dimensional refining module to non-locally model the interaction of the embedded features. Finally, OSIC can obtain rich and useful information to improve the image caption task. Extensive comparisons on benchmark MS-COCO dataset verified the superior performance of our method.

SPEAKER VGG CCT: Cross-corpus Speech Emotion Recognition with Speaker Embedding and Vision Transformers

  • Authors: A. Arezzo, S. Berretti
  • Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2211.02366
  • Pdf link: https://arxiv.org/pdf/2211.02366
  • Abstract In recent years, Speech Emotion Recognition (SER) has been investigated mainly transforming the speech signal into spectrograms that are then classified using Convolutional Neural Networks pretrained on generic images and fine tuned with spectrograms. In this paper, we start from the general idea above and develop a new learning solution for SER, which is based on Compact Convolutional Transformers (CCTs) combined with a speaker embedding. With CCTs, the learning power of Vision Transformers (ViT) is combined with a diminished need for large volume of data as made possible by the convolution. This is important in SER, where large corpora of data are usually not available. The speaker embedding allows the network to extract an identity representation of the speaker, which is then integrated by means of a self-attention mechanism with the features that the CCT extracts from the spectrogram. Overall, the solution is capable of operating in real-time showing promising results in a cross-corpus scenario, where training and test datasets are kept separate. Experiments have been performed on several benchmarks in a cross-corpus setting as rarely used in the literature, with results that are comparable or superior to those obtained with state-of-the-art network architectures. Our code is available at https://github.com/JabuMlDev/Speaker-VGG-CCT.

Patch DCT vs LeNet

  • Authors: David Sinclair
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.02392
  • Pdf link: https://arxiv.org/pdf/2211.02392
  • Abstract This paper compares the performance of a NN taking the output of a DCT (Discrete Cosine Transform) of an image patch with LeNet for classifying MNIST handwritten digits. The basis functions underlying the DCT bear a passing resemblance to some of the learned basis functions of the Visual Transformer but are an order of magnitude faster to apply.
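
To make the input representation concrete, here is a direct (unoptimized) 2D DCT-II of a small patch in plain Python. The unnormalized form and the tiny patch are assumptions for illustration. The abstract does not specify the paper's patch size or normalization.

```python
import math


def dct2_patch(patch):
    """Unnormalized 2D DCT-II of a rectangular patch given as a list of rows."""
    rows, cols = len(patch), len(patch[0])
    coeffs = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0.0
            for x in range(rows):
                for y in range(cols):
                    total += (
                        patch[x][y]
                        * math.cos(math.pi * (x + 0.5) * u / rows)
                        * math.cos(math.pi * (y + 0.5) * v / cols)
                    )
            coeffs[u][v] = total
    return coeffs


# A constant patch has all of its energy in the (0, 0) coefficient.
coeffs = dct2_patch([[1.0, 1.0], [1.0, 1.0]])
print(round(coeffs[0][0], 6))
```

The flattened coefficients could then serve as the feature vector for a small classifier. Since the DCT basis is fixed, no basis functions need to be learned.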

Multilingual Name Entity Recognition and Intent Classification Employing Deep Learning Architectures

  • Authors: Sofia Rizou, Antonia Paflioti, Angelos Theofilatos, Athena Vakali, George Sarigiannidis, Konstantinos Ch. Chatzisavvas
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
  • Arxiv link: https://arxiv.org/abs/2211.02415
  • Pdf link: https://arxiv.org/pdf/2211.02415
  • Abstract Named Entity Recognition and Intent Classification are among the most important subfields of the field of Natural Language Processing. Recent research has led to the development of faster, more sophisticated and efficient models to tackle the problems posed by those two tasks. In this work we explore the effectiveness of two separate families of Deep Learning networks for those tasks: Bidirectional Long Short-Term Memory networks and Transformer-based networks. The models were trained and tested on the ATIS benchmark dataset for both English and Greek languages. The purpose of this paper is to present a comparative study of the two groups of networks for both languages and showcase the results of our experiments. The models, being the current state-of-the-art, yielded impressive results and achieved high performance.

RCDPT: Radar-Camera fusion Dense Prediction Transformer

  • Authors: Chen-Chou Lo, Patrick Vandewalle
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2211.02432
  • Pdf link: https://arxiv.org/pdf/2211.02432
  • Abstract Recently, transformer networks have outperformed traditional deep neural networks in natural language processing and show a large potential in many computer vision tasks compared to convolutional backbones. In the original transformer, readout tokens are used as designated vectors for aggregating information from other tokens. However, the performance of using readout tokens in a vision transformer is limited. Therefore, we propose a novel fusion strategy to integrate radar data into a dense prediction transformer network by reassembling camera representations with radar representations. Instead of using readout tokens, radar representations contribute additional depth information to a monocular depth estimation model and improve performance. We further investigate different fusion approaches that are commonly used for integrating additional modality in a dense prediction transformer network. The experiments are conducted on the nuScenes dataset, which includes camera images, lidar, and radar data. The results show that our proposed method yields better performance than the commonly used fusion strategies and outperforms existing convolutional depth estimation models that fuse camera images and radar.

A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

  • Authors: Jian Xue, Peidong Wang, Jinyu Li, Eric Sun
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2211.02499
  • Pdf link: https://arxiv.org/pdf/2211.02499
  • Abstract In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into texts of the target language. The backbone of SM2 is Transformer Transducer, which has high streaming capability. Instead of human labeled speech translation (ST) data, SM2 models are trained using weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve comparable or even better ST quality than some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has the truly zero-shot capability when expanding to new target languages, yielding high quality ST results for {source-speech, target-text} pairs that are not seen during training.

BERT for Long Documents: A Case Study of Automated ICD Coding

  • Authors: Arash Afkanpour, Shabir Adeel, Hansenclever Bassani, Arkady Epshteyn, Hongbo Fan, Isaac Jones, Mahan Malihi, Adrian Nauth, Raj Sinha, Sanjana Woonna, Shiva Zamani, Elli Kanal, Mikhail Fomitchev, Donny Cheung
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.02519
  • Pdf link: https://arxiv.org/pdf/2211.02519
  • Abstract Transformer models have achieved great success across many NLP problems. However, previous studies in automated ICD coding concluded that these models fail to outperform some of the earlier solutions such as CNN-based models. In this paper we challenge this conclusion. We present a simple and scalable method to process long text with the existing transformer models such as BERT. We show that this method significantly improves the previous results reported for transformer models in ICD coding, and is able to outperform one of the prominent CNN-based methods.

A Transformer-Based Substitute Recommendation Model Incorporating Weakly Supervised Customer Behavior Data

  • Authors: Wenting Ye, Hongfei Yang, Shuai Zhao, Haoyang Fang, Xingjian Shi, Naveen Neppalli
  • Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2211.02533
  • Pdf link: https://arxiv.org/pdf/2211.02533
  • Abstract The substitute-based recommendation is widely used in E-commerce to provide better alternatives to customers. However, existing research typically uses customer behavior signals like co-view and view-but-purchase-another to capture the substitute relationship. Despite its intuitive soundness, we find that such an approach might ignore the functionality and characteristics of products. In this paper, we adapt substitute recommendation into a language matching problem by taking the product title description as model input to consider product functionality. We design a new transformation method to de-noise the signals derived from production data. In addition, we consider multilingual support from the engineering point of view. Our proposed end-to-end transformer-based model succeeds in both offline and online experiments. The proposed model has been deployed in a large-scale E-commerce website for 11 marketplaces in 6 languages. Our proposed model is demonstrated to increase revenue by 19% based on an online A/B experiment.

A Transformer Architecture for Online Gesture Recognition of Mathematical Expressions

  • Authors: Mirco Ramo, Guénolé C.M. Silvestre
  • Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2211.02643
  • Pdf link: https://arxiv.org/pdf/2211.02643
  • Abstract The Transformer architecture is shown to provide a powerful framework as an end-to-end model for building expression trees from online handwritten gestures corresponding to glyph strokes. In particular, the attention mechanism was successfully used to encode, learn and enforce the underlying syntax of expressions creating latent representations that are correctly decoded to the exact mathematical expression tree, providing robustness to ablated inputs and unseen glyphs. For the first time, the encoder is fed with spatio-temporal data tokens potentially forming an infinitely large vocabulary, which finds applications beyond that of online gesture recognition. A new supervised dataset of online handwriting gestures is provided for training models on generic handwriting recognition tasks and a new metric is proposed for the evaluation of the syntactic correctness of the output expression trees. A small Transformer model suitable for edge inference was successfully trained to an average normalised Levenshtein accuracy of 94%, resulting in valid postfix RPN tree representation for 94% of predictions.
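
The two evaluation ideas mentioned in the abstract can be sketched on token sequences: a normalised Levenshtein accuracy and a check that a postfix (RPN) sequence forms a single valid tree. The normalisation used here (one minus edit distance divided by the longer length) and the set of binary operators are assumptions. The paper's exact metric definition is not given in the abstract.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalised_accuracy(pred, target):
    """Assumed normalisation: 1 - distance / max(len(pred), len(target))."""
    longest = max(len(pred), len(target))
    return 1.0 if longest == 0 else 1.0 - levenshtein(pred, target) / longest


def is_valid_postfix(tokens, binary_ops=frozenset({"+", "-", "*", "/", "^"})):
    """True if the tokens reduce to exactly one expression tree."""
    depth = 0
    for tok in tokens:
        if tok in binary_ops:
            if depth < 2:  # a binary operator needs two operands on the stack
                return False
            depth -= 1
        else:
            depth += 1
    return depth == 1


target = ["x", "2", "^", "1", "+"]  # postfix form of x^2 + 1
pred = ["x", "3", "^", "1", "+"]
print(normalised_accuracy(pred, target))  # 0.8
print(is_valid_postfix(pred))  # True
```

A prediction can be syntactically valid while still differing from the target, which is why the two measures are reported separately in this sketch.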

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

Posted by DongZhouGu, Nov 07 '22 03:11