arxiv-daily New submissions for Tue, 1 Nov 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

Pair DETR: Contrastive Learning Speeds Up DETR Training

Authors: Mehdi Iranmanesh, Xiaotong Chen, Kuo-Chin Lien
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.16476
Pdf link: https://arxiv.org/pdf/2210.16476
Abstract The DETR object detection approach applies the transformer encoder and decoder architecture to detect objects and achieves promising performance. In this paper, we present a simple approach to address the main problem of DETR, the slow convergence, by using a representation learning technique. In this approach, we detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders. By detecting objects as paired keypoints, the model builds up a joint classification and pair association on the output queries from two decoders. For the pair association we propose utilizing a contrastive self-supervised learning algorithm without requiring specialized architecture. Experimental results on the MS COCO dataset show that Pair DETR can converge at least 10x faster than the original DETR and 1.5x faster than Conditional DETR during training, while having consistently higher Average Precision scores.
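
As a small illustration of why a top-left corner plus a center is enough to describe a box, the sketch below recovers the full box from such a pair. It is plain geometry, not the paper's code; the function name `box_from_corner_and_center` and the `(x_min, y_min, x_max, y_max)` output layout are assumptions made for this example.

```python
def box_from_corner_and_center(top_left, center):
    """Return (x_min, y_min, x_max, y_max) from a top-left corner and a center.

    Hypothetical helper for illustration only; not taken from Pair DETR.
    """
    x0, y0 = top_left
    cx, cy = center
    if cx < x0 or cy < y0:
        raise ValueError("center must not lie above or left of the top-left corner")
    # The center is the midpoint of the top-left and bottom-right corners,
    # so the bottom-right corner is the reflection of the top-left through it.
    return (x0, y0, 2 * cx - x0, 2 * cy - y0)


box = box_from_corner_and_center(top_left=(10.0, 20.0), center=(30.0, 50.0))
print(box)
```

How the two keypoints are matched to each other (the pair association) is the part the abstract assigns to contrastive learning; that step is not sketched here.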

iSmallNet: Densely Nested Network with Label Decoupling for Infrared Small Target Detection

Authors: Zhiheng Hu, Yongzhen Wang, Peng Li, Jie Qin, Haoran Xie, Mingqiang Wei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2210.16561
Pdf link: https://arxiv.org/pdf/2210.16561
Abstract Small targets are often submerged in cluttered backgrounds of infrared images. Conventional detectors tend to generate false alarms, while CNN-based detectors lose small targets in deep layers. To this end, we propose iSmallNet, a multi-stream densely nested network with label decoupling for infrared small object detection. On the one hand, to fully exploit the shape information of small targets, we decouple the original labeled ground-truth (GT) map into an interior map and a boundary one. The GT map, in collaboration with the two additional maps, tackles the unbalanced distribution of small object boundaries. On the other hand, two key modules are delicately designed and incorporated into the proposed network to boost the overall performance. First, to maintain small targets in deep layers, we develop a multi-scale nested interaction module to explore a wide range of context information. Second, we develop an interior-boundary fusion module to integrate multi-granularity information. Experiments on NUAA-SIRST and NUDT-SIRST clearly show the superiority of iSmallNet over 11 state-of-the-art detectors.

Boosting Monocular 3D Object Detection with Object-Centric Auxiliary Depth Supervision

Authors: Youngseok Kim, Sanmin Kim, Sangmin Sim, Jun Won Choi, Dongsuk Kum
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.16574
Pdf link: https://arxiv.org/pdf/2210.16574
Abstract Recent advances in monocular 3D detection leverage a depth estimation network explicitly as an intermediate stage of the 3D detection network. Depth map approaches yield more accurate depth to objects than other methods thanks to the depth estimation network trained on a large-scale dataset. However, depth map approaches can be limited by the accuracy of the depth map, and sequentially using two separated networks for depth estimation and 3D detection significantly increases computation cost and inference time. In this work, we propose a method to boost the RGB image-based 3D detector by jointly training the detection network with a depth prediction loss analogous to the depth estimation task. In this way, our 3D detection network can be supervised by more depth supervision from raw LiDAR points, which does not require any human annotation cost, to estimate accurate depth without explicitly predicting the depth map. Our novel object-centric depth prediction loss focuses on depth around foreground objects, which is important for 3D object detection, to leverage pixel-wise depth supervision in an object-centric manner. Our depth regression model is further trained to predict the uncertainty of depth to represent the 3D confidence of objects. To effectively train the 3D detector with raw LiDAR points and to enable end-to-end training, we revisit the regression target of 3D objects and design a network architecture. Extensive experiments on KITTI and nuScenes benchmarks show that our method can significantly boost the monocular image-based 3D detector to outperform depth map approaches while maintaining the real-time inference speed.

On-the-fly Object Detection using StyleGAN with CLIP Guidance

Authors: Yuzhe Lu, Shusen Liu, Jayaraman J. Thiagarajan, Wesam Sakla, Rushil Anirudh
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.16742
Pdf link: https://arxiv.org/pdf/2210.16742
Abstract We present a fully automated framework for building object detectors on satellite imagery without requiring any human annotation or intervention. We achieve this by leveraging the combined power of modern generative models (e.g., StyleGAN) and recent advances in multi-modal learning (e.g., CLIP). While deep generative models effectively encode the key semantics pertinent to a data distribution, this information is not immediately accessible for downstream tasks, such as object detection. In this work, we exploit CLIP's ability to associate image features with text descriptions to identify neurons in the generator network, which are subsequently used to build detectors on-the-fly.

SL3D: Self-supervised-Self-labeled 3D Recognition

Authors: Fernando Julio Cendra, Lan Ma, Jiajun Shen, Xiaojuan Qi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.16810
Pdf link: https://arxiv.org/pdf/2210.16810
Abstract There are a lot of promising results in 3D recognition, including classification, object detection, and semantic segmentation. However, many of these results rely on manually collecting densely annotated real-world 3D data, which is highly time-consuming and expensive to obtain, limiting the scalability of 3D recognition tasks. Thus in this paper, we study unsupervised 3D recognition and propose a Self-supervised-Self-Labeled 3D Recognition (SL3D) framework. SL3D simultaneously solves two coupled objectives, i.e., clustering and learning feature representation to generate pseudo labeled data for unsupervised 3D recognition. SL3D is a generic framework and can be applied to solve different 3D recognition tasks, including classification, object detection, and semantic segmentation. Extensive experiments demonstrate its effectiveness. Code is available at https://github.com/fcendra/sl3d.

Time-rEversed diffusioN tEnsor Transformer: A new TENET of Few-Shot Object Detection

Authors: Shan Zhang, Naila Murray, Lei Wang, Piotr Koniusz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.16897
Pdf link: https://arxiv.org/pdf/2210.16897
Abstract In this paper, we tackle the challenging problem of Few-shot Object Detection. Existing FSOD pipelines (i) use average-pooled representations that result in information loss; and/or (ii) discard position information that can help detect object instances. Consequently, such pipelines are sensitive to large intra-class appearance and geometric variations between support and query images. To address these drawbacks, we propose a Time-rEversed diffusioN tEnsor Transformer (TENET), which i) forms high-order tensor representations that capture multi-way feature occurrences that are highly discriminative, and ii) uses a transformer that dynamically extracts correlations between the query image and the entire support set, instead of a single average-pooled support embedding. We also propose a Transformer Relation Head (TRH), equipped with higher-order representations, which encodes correlations between query regions and the entire support set, while being sensitive to the positional variability of object instances. Our model achieves state-of-the-art results on PASCAL VOC, FSOD, and COCO.

Foreign Object Debris Detection for Airport Pavement Images based on Self-supervised Localization and Vision Transformer

Authors: Travis Munyer, Daniel Brinkman, Xin Zhong, Chenyu Huang, Iason Konstantzos
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.16901
Pdf link: https://arxiv.org/pdf/2210.16901
Abstract Supervised object detection methods provide subpar performance when applied to Foreign Object Debris (FOD) detection because FOD could be arbitrary objects according to the Federal Aviation Administration (FAA) specification. Current supervised object detection algorithms require datasets that contain annotated examples of every to-be-detected object. While a large and expensive dataset could be developed to include common FOD examples, it is infeasible to collect all possible FOD examples in the dataset representation because of the open-ended nature of FOD. Limitations of the dataset could cause FOD detection systems driven by those supervised algorithms to miss certain FOD, which can become dangerous to airport operations. To this end, this paper presents a self-supervised FOD localization by learning to predict the runway images, which avoids the enumeration of FOD annotation examples. The localization method utilizes the Vision Transformer (ViT) to improve localization performance. The experiments show that the method successfully detects arbitrary FOD in real-world runway situations. The paper also provides an extension to the localization result to perform classification; a feature that can be useful to downstream tasks. To train the localization, this paper also presents a simple and realistic dataset creation framework that only collects clean runway images. The training and testing data for this method are collected at a local airport using unmanned aircraft systems (UAS). Additionally, the developed dataset is provided for public use and further studies.

Tech Report: One-stage Lightweight Object Detectors

Authors: Deokki Hong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2210.17151
Pdf link: https://arxiv.org/pdf/2210.17151
Abstract This work designs one-stage lightweight detectors that perform well in terms of mAP and latency. With baseline models that target GPU and CPU respectively, various operations are applied in place of the main operations in the backbone networks of the baseline models. In addition to experiments on backbone networks and operations, several feature pyramid network (FPN) architectures are investigated. Benchmarks and proposed detectors are analyzed in terms of the number of parameters, GFLOPs, GPU latency, CPU latency and mAP on the MS COCO dataset, which is a benchmark dataset in object detection. This work proposes similar or better network architectures considering the trade-off between accuracy and latency. For example, our proposed GPU-target backbone network outperforms that of YOLOX-tiny, which is selected as the benchmark, by 1.43x in speed and 0.5 mAP in accuracy on an NVIDIA GeForce RTX 2080 Ti GPU.

Keyword: transformer

ImplantFormer: Vision Transformer based Implant Position Regression Using Dental CBCT Data

Authors: Xinquan Yang, Xuguang Li, Xuechen Li, Peixi Wu, Linlin Shen, Xin Li, Yongqiang Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.16467
Pdf link: https://arxiv.org/pdf/2210.16467
Abstract Implant prosthesis is the optimal treatment for dentition defect or dentition loss, and it usually involves a surgical guide design process to decide the position of the implant. However, such design relies heavily on the subjective experience of the dentist. To relieve this problem, in this paper, a transformer-based Implant Position Regression Network, ImplantFormer, is proposed to automatically predict the implant position based on oral CBCT data. The 3D CBCT data is first transformed into a series of 2D transverse plane slice views. ImplantFormer is then proposed to predict the position of the implant based on the 2D slices of crown images. A convolutional stem and decoder are designed to coarsely extract image features before the patch embedding operation and to integrate multi-level feature maps for robust prediction. The predictions of our network at the tooth crown area are finally projected back to the positions at the tooth root. As both long-range relationships and local features are involved, our approach can better represent global information and achieves better location performance than the state-of-the-art detectors. Experimental results on a dataset of 128 patients, collected from Shenzhen University General Hospital, show that our ImplantFormer achieves superior performance to the benchmarks.

Pair DETR: Contrastive Learning Speeds Up DETR Training

Authors: Mehdi Iranmanesh, Xiaotong Chen, Kuo-Chin Lien
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.16476
Pdf link: https://arxiv.org/pdf/2210.16476
Abstract The DETR object detection approach applies the transformer encoder and decoder architecture to detect objects and achieves promising performance. In this paper, we present a simple approach to address the main problem of DETR, the slow convergence, by using a representation learning technique. In this approach, we detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders. By detecting objects as paired keypoints, the model builds up a joint classification and pair association on the output queries from two decoders. For the pair association we propose utilizing a contrastive self-supervised learning algorithm without requiring specialized architecture. Experimental results on the MS COCO dataset show that Pair DETR can converge at least 10x faster than the original DETR and 1.5x faster than Conditional DETR during training, while having consistently higher Average Precision scores.

Empirical Evaluation of Post-Training Quantization Methods for Language Tasks

Authors: Ting Hu, Christoph Meinel, Haojin Yang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2210.16621
Pdf link: https://arxiv.org/pdf/2210.16621
Abstract Transformer-based architectures like BERT have achieved great success in a wide range of Natural Language tasks. Despite their decent performance, the models still have numerous parameters and high computational complexity, impeding their deployment in resource-constrained environments. Post-Training Quantization (PTQ), which enables low-bit computations without extra training, could be a promising tool. In this work, we conduct an empirical evaluation of three PTQ methods on BERT-Base and BERT-Large: Linear Quantization (LQ), Analytical Clipping for Integer Quantization (ACIQ), and Outlier Channel Splitting (OCS). OCS theoretically surpasses the others in minimizing the Mean Square quantization Error and avoiding distorting the weights' outliers. That is consistent with the evaluation results of most language tasks of GLUE benchmark and a reading comprehension task, SQuAD. Moreover, low-bit quantized BERT models could outperform the corresponding 32-bit baselines on several small language tasks, which we attribute to the alleviation of over-parameterization. We further explore the limit of quantization bit and show that OCS could quantize BERT-Base and BERT-Large to 3-bits and retain 98% and 96% of the performance on the GLUE benchmark accordingly. Moreover, we conduct quantization on the whole BERT family, i.e., BERT models in different configurations, and comprehensively evaluate their performance on the GLUE benchmark and SQuAD, hoping to provide valuable guidelines for their deployment in various computation environments.
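
To make the idea of low-bit linear quantization concrete, here is a minimal sketch of symmetric uniform quantization of a weight vector. It is a generic textbook form, not the LQ, ACIQ, or OCS implementations evaluated in the paper; the function names, the per-tensor scale, and the symmetric signed range are all assumptions of this example.

```python
def linear_quantize(weights, num_bits):
    """Symmetric uniform quantization; returns (integer codes, scale).

    Generic sketch: one scale per tensor, signed codes in [-qmax, qmax].
    """
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 3 for 3-bit signed codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.9, -0.4, 0.1, -0.02]
codes, scale = linear_quantize(weights, num_bits=3)
restored = dequantize(codes, scale)
mse = sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)
print(codes, scale, mse)
```

With a single large weight setting the scale, the small weights collapse to zero. Outliers stretching the quantization grid in this way is the problem that clipping and outlier-splitting methods target.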

TFormer: 3D Tooth Segmentation in Mesh Scans with Geometry Guided Transformer

Authors: Huimin Xiong, Kunle Li, Kaiyuan Tan, Yang Feng, Joey Tianyi Zhou, Jin Hao, Zuozhu Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.16627
Pdf link: https://arxiv.org/pdf/2210.16627
Abstract Optical Intra-oral Scanners (IOS) are widely used in digital dentistry, providing 3-Dimensional (3D) and high-resolution geometrical information of dental crowns and the gingiva. Accurate 3D tooth segmentation, which aims to precisely delineate the tooth and gingiva instances in IOS, plays a critical role in a variety of dental applications. However, previous methods are error-prone at complicated tooth-tooth or tooth-gingiva boundaries and usually exhibit unsatisfactory results across various patients, and their clinical applicability has not been verified with a large-scale dataset. In this paper, we propose a novel method based on 3D transformer architectures that is evaluated with large-scale and high-resolution 3D IOS datasets. Our method, termed TFormer, captures both local and global dependencies among different teeth to distinguish various types of teeth with divergent anatomical structures and confusing boundaries. Moreover, we design a geometry guided loss based on a novel point curvature to exploit boundary geometric features, which helps refine the boundary predictions for more accurate and smooth segmentation. We further employ a multi-task learning scheme, where an additional teeth-gingiva segmentation head is introduced to improve the performance. Extensive experimental results on a large-scale dataset with 16,000 IOS, the largest IOS dataset to the best of our knowledge, demonstrate that our TFormer can surpass existing state-of-the-art baselines by a large margin, with its utility in real-world scenarios verified by a clinical applicability test.

XNOR-FORMER: Learning Accurate Approximations in Long Speech Transformers

Authors: Roshan Sharma, Bhiksha Raj
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2210.16643
Pdf link: https://arxiv.org/pdf/2210.16643
Abstract Transformers are among the state of the art for many tasks in speech, vision, and natural language processing, among others. Self-attentions, which are crucial contributors to this performance, have quadratic computational complexity, which makes training on longer input sequences challenging. Prior work has produced state-of-the-art transformer variants with linear attention; however, current models sacrifice performance to achieve efficient implementations. In this work, we develop a novel linear transformer by examining the properties of the key-query product within self-attentions. Our model outperforms state-of-the-art approaches on speech recognition and speech summarization, resulting in 1% absolute WER improvement on the Librispeech-100 speech recognition benchmark and a new INTERVIEW speech recognition benchmark, and 5 points on ROUGE for summarization with How2.

Exemplar Guided Deep Neural Network for Spatial Transcriptomics Analysis of Gene Expression Prediction

Authors: Yan Yang, Md Zakir Hossain, Eric A Stone, Shafin Rahman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.16721
Pdf link: https://arxiv.org/pdf/2210.16721
Abstract Spatial transcriptomics (ST) is essential for understanding diseases and developing novel treatments. It measures gene expression of each fine-grained area (i.e., different windows) in the tissue slide with low throughput. This paper proposes an Exemplar Guided Network (EGN) to accurately and efficiently predict gene expression directly from each window of a tissue slide image. We apply exemplar learning to dynamically boost gene expression prediction from nearest/similar exemplars of a given tissue slide image window. Our EGN framework is composed of three main components: 1) an extractor to structure a representation space for unsupervised exemplar retrievals; 2) a vision transformer (ViT) backbone to progressively extract representations of the input window; and 3) an Exemplar Bridging (EB) block to adaptively revise the intermediate ViT representations by using the nearest exemplars. Finally, we complete the gene expression prediction task with a simple attention-based prediction block. Experiments on standard benchmark datasets indicate the superiority of our approach compared with the past state-of-the-art (SOTA) methods.

token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text

Authors: Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2210.16755
Pdf link: https://arxiv.org/pdf/2210.16755
Abstract Self-supervised pre-training has been successful in both text and speech processing. Speech and text offer different but complementary information. The question is whether we are able to perform a speech-text joint pre-training on unpaired speech and text. In this paper, we take the idea of self-supervised pre-training one step further and propose token2vec, a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech. Firstly, due to the distinct characteristics between speech and text modalities, where speech is continuous while text is discrete, we first discretize speech into a sequence of discrete speech tokens to solve the modality mismatch problem. Secondly, to solve the length mismatch problem, where the speech sequence is usually much longer than text sequence, we convert the words of text into phoneme sequences and randomly repeat each phoneme in the sequences. Finally, we feed the discrete speech and text tokens into a modality-agnostic Transformer encoder and pre-train with token-level masking language modeling (tMLM). Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to 17.7% relative WER reduction. Token2vec model is also validated on a non-ASR task, i.e., spoken intent classification, and shows good transferability.
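
The length-mismatch step described above (randomly repeating each phoneme so a text sequence looks more like a longer speech-token sequence) can be sketched as follows. This is an illustration only: the function name `repeat_phonemes`, the repetition range of 1 to 3, and the fixed seed are assumptions of this example, not values from the paper.

```python
import random


def repeat_phonemes(phonemes, min_rep=1, max_rep=3, seed=0):
    """Upsample a phoneme sequence by repeating each phoneme a random number of times.

    The repetition range and seed are hypothetical choices for illustration.
    """
    rng = random.Random(seed)  # local RNG so the example is reproducible
    out = []
    for p in phonemes:
        out.extend([p] * rng.randint(min_rep, max_rep))
    return out


phonemes = ["HH", "AH", "L", "OW"]
expanded = repeat_phonemes(phonemes)
print(expanded)
```

Because repetitions are consecutive, collapsing runs of identical symbols recovers the original phoneme sequence whenever that sequence has no adjacent repeats.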

An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks

Authors: Yuxiang Wu, Yu Zhao, Baotian Hu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.16773
Pdf link: https://arxiv.org/pdf/2210.16773
Abstract Access to external knowledge is essential for many natural language processing tasks, such as question answering and dialogue. Existing methods often rely on a parametric model that stores knowledge in its parameters, or use a retrieval-augmented model that has access to an external knowledge source. Parametric and retrieval-augmented models have complementary strengths in terms of computational efficiency and predictive accuracy. To combine the strength of both approaches, we propose the Efficient Memory-Augmented Transformer (EMAT) -- it encodes external knowledge into a key-value memory and exploits the fast maximum inner product search for memory querying. We also introduce pre-training tasks that allow EMAT to encode informative key-value representations, and to learn an implicit strategy to integrate multiple memory slots into the transformer. Experiments on various knowledge-intensive tasks such as question answering and dialogue datasets show that, simply augmenting parametric models (T5-base) using our method produces more accurate results (e.g., 25.8 -> 44.3 EM on NQ) while retaining a high throughput (e.g., 1000 queries/s on NQ). Compared to retrieval-augmented models, EMAT runs substantially faster across the board and produces more accurate results on WoW and ELI5. Our code and datasets are available at https://github.com/uclnlp/EMAT.
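
Querying a key-value memory by maximum inner product search can be illustrated with the brute-force sketch below. It scores every key exactly, whereas the abstract refers to fast search, which in practice usually means an approximate index. The function names and the toy vectors are assumptions of this example and are not taken from the EMAT code.

```python
def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mips_lookup(query, keys, values, top_k=1):
    """Return the values whose keys have the largest inner product with the query.

    Exhaustive exact search, for illustration only.
    """
    ranked = sorted(
        range(len(keys)),
        key=lambda i: inner_product(query, keys[i]),
        reverse=True,
    )
    return [values[i] for i in ranked[:top_k]]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = ["a", "b", "c"]
top = mips_lookup([2.0, 1.0], keys, values, top_k=2)
print(top)
```

The retrieved values would then be fed back into the model; how the memory slots are integrated into the transformer is learned during pre-training according to the abstract and is not shown here.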

Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition

Authors: Lei Wang, Piotr Koniusz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.16820
Pdf link: https://arxiv.org/pdf/2210.16820
Abstract We propose a Few-shot Learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE). To factor out misalignment between query and support sequences of 3D body joints, we propose an advanced variant of Dynamic Time Warping which jointly models each smooth path between the query and support frames to achieve simultaneously the best alignment in the temporal and simulated camera viewpoint spaces for end-to-end learning under the limited few-shot training data. Sequences are encoded with a temporal block encoder based on Simple Spectral Graph Convolution, a lightweight linear Graph Neural Network backbone. We also include a setting with a transformer. Finally, we propose a similarity-based loss which encourages the alignment of sequences of the same class while preventing the alignment of unrelated sequences. We show state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II.

Combining Attention Module and Pixel Shuffle for License Plate Super-Resolution

Authors: Valfride Nascimento, Rayson Laroca, Jorge de A. Lambert, William Robson Schwartz, David Menotti
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.16836
Pdf link: https://arxiv.org/pdf/2210.16836
Abstract The License Plate Recognition (LPR) field has made impressive advances in the last decade due to novel deep learning approaches combined with the increased availability of training data. However, it still has some open issues, especially when the data come from low-resolution (LR) and low-quality images/videos, as in surveillance systems. This work focuses on license plate (LP) reconstruction in LR and low-quality images. We present a Single-Image Super-Resolution (SISR) approach that extends the attention/transformer module concept by exploiting the capabilities of PixelShuffle layers and that has an improved loss function based on LPR predictions. For training the proposed architecture, we use synthetic images generated by applying heavy Gaussian noise in terms of Structural Similarity Index Measure (SSIM) to the original high-resolution (HR) images. In our experiments, the proposed method outperformed the baselines both quantitatively and qualitatively. The datasets we created for this work are publicly available to the research community at https://github.com/valfride/lpr-rsr/

TT-Net: Dual-path transformer based sound field translation in the spherical harmonic domain

Authors: Yiwen Wang, Zijian Lan, Xihong Wu, Tianshu Qu
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2210.16849
Pdf link: https://arxiv.org/pdf/2210.16849
Abstract In the current method for the sound field translation tasks based on spherical harmonic (SH) analysis, the solution based on the additive theorem usually faces the problem of singular values caused by large matrix condition numbers. The influence of different distances and frequencies of the spherical radial function on the stability of the translation matrix will affect the accuracy of the SH coefficients at the selected point. Due to the problems mentioned above, we propose a neural network scheme based on the dual-path transformer. More specifically, the dual-path network is constructed by the self-attention module along the two dimensions of the frequency and order axes. The transform-average-concatenate layer and upscaling layer are introduced in the network, which provides solutions for multiple sampling points and upscaling. Numerical simulation results indicate that both the working frequency range and the distance range of the translation are extended. More accurate higher-order SH coefficients are obtained with the proposed dual-path network.

Medical Codes Prediction from Clinical Notes: From Human Coders to Machines

Authors: Byung-Hak Kim
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2210.16850
Pdf link: https://arxiv.org/pdf/2210.16850
Abstract Prediction of medical codes from clinical notes is a practical and essential need for every healthcare delivery organization within current medical systems. Automating annotation will save significant time and excessive effort that human coders spend today. However, the biggest challenge is directly identifying appropriate medical codes from several thousands of high-dimensional codes from unstructured free-text clinical notes. This complex medical codes prediction problem from clinical notes has received substantial interest in the NLP community, and several recent studies have shown the state-of-the-art code prediction results of full-fledged deep learning-based methods. This progress raises the fundamental question of how far automated machine learning systems are from human coders' working performance, as well as the important question of how well current explainability methods apply to advanced neural network models such as transformers. This is to predict correct codes and present references in clinical notes that support code prediction, as this level of explainability and accuracy of the prediction outcomes is critical to gaining trust from professional medical coders.

Learning to Decompose: Hypothetical Question Decomposition Based on Comparable Texts

Authors: Ben Zhou, Kyle Richardson, Xiaodong Yu, Dan Roth
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2210.16865
Pdf link: https://arxiv.org/pdf/2210.16865
Abstract Explicit decomposition modeling, which involves breaking down complex tasks into more straightforward and often more interpretable sub-tasks, has long been a central theme in developing robust and interpretable NLU systems. However, despite the many datasets and resources built as part of this effort, the majority have small-scale annotations and limited scope, which is insufficient to solve general decomposition tasks. In this paper, we look at large-scale intermediate pre-training of decomposition-based transformers using distant supervision from comparable texts, particularly large-scale parallel news. We show that with such intermediate pre-training, developing robust decomposition-based models for a diverse range of tasks becomes more feasible. For example, on semantic parsing, our model, DecompT5, improves 20% to 30% on two datasets, Overnight and TORQUE, over the baseline language model. We further use DecompT5 to build a novel decomposition-based QA system named DecompEntail, improving over state-of-the-art models, including GPT-3, on both HotpotQA and StrategyQA by 8% and 4%, respectively.

Time-rEversed diffusioN tEnsor Transformer: A new TENET of Few-Shot Object Detection

Authors: Shan Zhang, Naila Murray, Lei Wang, Piotr Koniusz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.16897
Pdf link: https://arxiv.org/pdf/2210.16897
Abstract In this paper, we tackle the challenging problem of Few-shot Object Detection. Existing FSOD pipelines (i) use average-pooled representations that result in information loss; and/or (ii) discard position information that can help detect object instances. Consequently, such pipelines are sensitive to large intra-class appearance and geometric variations between support and query images. To address these drawbacks, we propose a Time-rEversed diffusioN tEnsor Transformer (TENET), which i) forms high-order tensor representations that capture multi-way feature occurrences that are highly discriminative, and ii) uses a transformer that dynamically extracts correlations between the query image and the entire support set, instead of a single average-pooled support embedding. We also propose a Transformer Relation Head (TRH), equipped with higher-order representations, which encodes correlations between query regions and the entire support set, while being sensitive to the positional variability of object instances. Our model achieves state-of-the-art results on PASCAL VOC, FSOD, and COCO.

Foreign Object Debris Detection for Airport Pavement Images based on Self-supervised Localization and Vision Transformer

Authors: Travis Munyer, Daniel Brinkman, Xin Zhong, Chenyu Huang, Iason Konstantzos
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.16901
Pdf link: https://arxiv.org/pdf/2210.16901
Abstract Supervised object detection methods provide subpar performance when applied to Foreign Object Debris (FOD) detection because FOD could be arbitrary objects according to the Federal Aviation Administration (FAA) specification. Current supervised object detection algorithms require datasets that contain annotated examples of every to-be-detected object. While a large and expensive dataset could be developed to include common FOD examples, it is infeasible to collect all possible FOD examples in the dataset representation because of the open-ended nature of FOD. Limitations of the dataset could cause FOD detection systems driven by those supervised algorithms to miss certain FOD, which can become dangerous to airport operations. To this end, this paper presents a self-supervised FOD localization by learning to predict the runway images, which avoids the enumeration of FOD annotation examples. The localization method utilizes the Vision Transformer (ViT) to improve localization performance. The experiments show that the method successfully detects arbitrary FOD in real-world runway situations. The paper also provides an extension to the localization result to perform classification; a feature that can be useful to downstream tasks. To train the localization, this paper also presents a simple and realistic dataset creation framework that only collects clean runway images. The training and testing data for this method are collected at a local airport using unmanned aircraft systems (UAS). Additionally, the developed dataset is provided for public use and further studies.

ViTASD: Robust Vision Transformer Baselines for Autism Spectrum Disorder Facial Diagnosis

Authors: Xu Cao, Wenqian Ye, Elena Sizikova, Xue Bai, Megan Coffee, Hongwu Zeng, Jianguo Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2210.16943
Pdf link: https://arxiv.org/pdf/2210.16943
Abstract Autism spectrum disorder (ASD) is a lifelong neurodevelopmental disorder with very high prevalence around the world. Research progress in the field of ASD facial analysis in pediatric patients has been hindered due to a lack of well-established baselines. In this paper, we propose the use of the Vision Transformer (ViT) for the computational analysis of pediatric ASD. The presented model, known as ViTASD, distills knowledge from large facial expression datasets and offers model structure transferability. Specifically, ViTASD employs a vanilla ViT to extract features from patients' face images and adopts a lightweight decoder with a Gaussian Process layer to enhance the robustness for ASD analysis. Extensive experiments conducted on standard ASD facial analysis benchmarks show that our method outperforms all of the representative approaches in ASD facial analysis, while the ViTASD-L achieves a new state-of-the-art. Our code and pretrained models are available at https://github.com/IrohXu/ViTASD.

Character-level White-Box Adversarial Attacks against Transformers via Attachable Subwords Substitution

Authors: Aiwei Liu, Honghai Yu, Xuming Hu, Shu'ang Li, Li Lin, Fukun Ma, Yawen Yang, Lijie Wen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2210.17004
Pdf link: https://arxiv.org/pdf/2210.17004
Abstract We propose the first character-level white-box adversarial attack method against transformer models. The intuition of our method comes from the observation that words are split into subtokens before being fed into the transformer models and the substitution between two close subtokens has a similar effect to the character modification. Our method mainly contains three steps. First, a gradient-based method is adopted to find the most vulnerable words in the sentence. Then we split the selected words into subtokens to replace the original tokenization result from the transformer tokenizer. Finally, we utilize an adversarial loss to guide the substitution of attachable subtokens in which the Gumbel-softmax trick is introduced to ensure gradient propagation. Meanwhile, we introduce visual and length constraints in the optimization process to achieve minimum character modifications. Extensive experiments on both sentence-level and token-level tasks demonstrate that our method could outperform the previous attack methods in terms of success rate and edit distance. Furthermore, human evaluation verifies our adversarial examples could preserve their original labels.
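
The Gumbel-softmax trick named in this abstract can be illustrated with a minimal, standard-library sketch. This is the generic textbook relaxation, not the authors' implementation; the function name `gumbel_softmax`, the `temperature` parameter, and the example logits are assumptions made for illustration.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed (soft) one-hot sample over len(logits) candidates.

    Generic Gumbel-softmax relaxation; names and defaults are illustrative.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))  # Gumbel(0, 1) sample
        noisy.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate subtoken substitutions.
weights = gumbel_softmax([1.0, 2.0, 3.0], temperature=0.5, rng=random.Random(0))
print(weights)
```

Lower temperatures push the output closer to a hard one-hot choice, while the output stays a smooth function of the logits, which is what lets gradients pass through a discrete selection.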

Structured State Space Decoder for Speech Recognition and Synthesis

Authors: Koichi Miyazaki, Masato Murata, Tomoki Koriyama
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2210.17098
Pdf link: https://arxiv.org/pdf/2210.17098
Abstract Automatic speech recognition (ASR) systems developed in recent years have shown promising results with self-attention models (e.g., Transformer and Conformer), which are replacing conventional recurrent neural networks. Meanwhile, a structured state space model (S4) has been recently proposed, producing promising results for various long-sequence modeling tasks, including raw speech classification. The S4 model can be trained in parallel, same as the Transformer model. In this study, we applied S4 as a decoder for ASR and text-to-speech (TTS) tasks by comparing it with the Transformer decoder. For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88%/4.25% on LibriSpeech test-clean/test-other set and a character error rate (CER) of 3.80%/2.63%/2.98% on the CSJ eval1/eval2/eval3 set. Furthermore, the proposed model is more robust than the standard Transformer model, particularly for long-form speech on both the datasets. For the TTS task, the proposed method outperforms the Transformer baseline.
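
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how word error rate is conventionally computed: word-level edit distance divided by the number of reference words. The function name and the example sentences are assumptions; the paper's own scoring tooling is not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference; splitting on whitespace is a simplification.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

The character error rate (CER) reported for CSJ follows the same recipe with characters in place of words.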

QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

Authors: Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2210.17114
Pdf link: https://arxiv.org/pdf/2210.17114
Abstract Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. A knowledge distillation approach addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding. However, the performance of these models drops as we reduce the number of layers, notably in advanced NLP tasks such as span question answering. In addition, a separate model must be trained for each inference scenario with its distinct computational budget. Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique onto TinyBERT, achieving x3 speedup over BERT-base with minimal accuracy loss. In this work, we expand the Dynamic-TinyBERT approach to generate a much more highly efficient model. We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to any other efficient approaches per any computational budget on the SQuAD1.1 dataset (up to x8.8 speedup with <1% accuracy loss). The code to reproduce this work will be publicly released on Github soon.

ViT-LSLA: Vision Transformer with Light Self-Limited-Attention

Authors: Zhenzhe Hechen, Wei Huang, Yixin Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.17115
Pdf link: https://arxiv.org/pdf/2210.17115
Abstract Transformers have demonstrated a competitive performance across a wide range of vision tasks, while it is very expensive to compute the global self-attention. Many methods limit the range of attention within a local window to reduce computation complexity. However, their approaches cannot reduce the number of parameters; meanwhile, the self-attention and inner position bias (inside the softmax function) cause each query to focus on similar and close patches. Consequently, this paper presents a light self-limited-attention (LSLA) consisting of a light self-attention mechanism (LSA) to save the computation cost and the number of parameters, and a self-limited-attention mechanism (SLA) to improve the performance. Firstly, the LSA replaces the K (Key) and V (Value) of self-attention with the X (original input). Applying it in vision Transformers, which have an encoder architecture and a self-attention mechanism, can simplify the computation. Secondly, the SLA has a positional information module and a limited-attention module. The former contains a dynamic scale and an inner position bias to adjust the distribution of the self-attention scores and enhance the positional information. The latter uses an outer position bias after the softmax function to limit some large values of attention weights. Finally, a hierarchical Vision Transformer with Light self-Limited-attention (ViT-LSLA) is presented. The experiments show that ViT-LSLA achieves 71.6% top-1 accuracy on IP102 (2.4% absolute improvement over Swin-T); 87.2% top-1 accuracy on Mini-ImageNet (3.7% absolute improvement over Swin-T). Furthermore, it greatly reduces FLOPs (3.5GFLOPs vs. 4.5GFLOPs of Swin-T) and parameters (18.9M vs. 27.6M of Swin-T).
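
The LSA idea stated in the abstract, using the input X in place of the K and V projections, can be sketched roughly as follows. This is a single-head, pure-Python illustration under stated assumptions: the helper names, the `w_q` query projection, and the fixed 1/sqrt(d) scaling are all assumed, and the SLA parts (dynamic scale, inner and outer position biases) are left out entirely.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def light_self_attention(x, w_q):
    """x: n x d token matrix; w_q: d x d query projection (hypothetical name)."""
    d = len(x[0])
    q = matmul(x, w_q)                    # only Q gets a learned projection
    x_t = [list(col) for col in zip(*x)]  # d x n transpose of the input
    scores = matmul(q, x_t)               # X stands in for K
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, x)             # X stands in for V

tokens = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(light_self_attention(tokens, identity))
```

Dropping the K and V projection matrices removes two of the three d x d weight matrices in this simplified single-head setting, which is consistent with the parameter savings the abstract reports.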

Improving Cause-of-Death Classification from Verbal Autopsy Reports

Authors: Thokozile Manaka, Terence van Zyl, Deepak Kar
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.17161
Pdf link: https://arxiv.org/pdf/2210.17161
Abstract In many lower-and-middle income countries including South Africa, data access in health facilities is restricted due to patient privacy and confidentiality policies. Further, since clinical data is unique to individual institutions and laboratories, there are insufficient data annotation standards and conventions. As a result of the scarcity of textual data, natural language processing (NLP) techniques have fared poorly in the health sector. A cause of death (COD) is often determined by a verbal autopsy (VA) report in places without reliable death registration systems. A non-clinician field worker does a VA report using a set of standardized questions as a guide to uncover symptoms of a COD. This analysis focuses on the textual part of the VA report as a case study to address the challenge of adapting NLP techniques in the health domain. We present a system that relies on two transfer learning paradigms of monolingual learning and multi-source domain adaptation to improve VA narratives for the target task of the COD classification. We use the Bidirectional Encoder Representations from Transformers (BERT) and Embeddings from Language Models (ELMo) models pre-trained on the general English and health domains to extract features from the VA narratives. Our findings suggest that this transfer learning system improves the COD classification tasks and that the narrative text contains valuable information for figuring out a COD. Our results further show that combining binary VA features and narrative text features learned via this framework boosts the classification task of COD.

Tables to LaTeX: structure and content extraction from scientific tables

Authors: Pratik Kayal, Mrinal Anand, Harsh Desai, Mayank Singh
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.17246
Pdf link: https://arxiv.org/pdf/2210.17246
Abstract Scientific documents contain tables that list important information in a concise fashion. Structure and content extraction from tables embedded within PDF research documents is a very challenging task due to the existence of visual features like spanning cells and content features like mathematical symbols and equations. Most existing table structure identification methods tend to ignore these academic writing features. In this paper, we adapt the transformer-based language modeling paradigm for scientific table structure and content extraction. Specifically, the proposed model converts a tabular image to its corresponding LaTeX source code. Overall, we outperform the current state-of-the-art baselines and achieve an exact match accuracy of 70.35% and 49.69% on table structure and content extraction, respectively. Further analysis demonstrates that the proposed models efficiently identify the number of rows and columns, the alphanumeric characters, the LaTeX tokens, and symbols.

Multi-Camera Calibration Free BEV Representation for 3D Object Detection

Authors: Hongxiang Jiang, Wenming Meng, Hongmei Zhu, Qian Zhang, Jihao Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.17252
Pdf link: https://arxiv.org/pdf/2210.17252
Abstract In advanced paradigms of autonomous driving, learning Bird's Eye View (BEV) representation from surrounding views is crucial for a multi-task framework. However, existing methods based on depth estimation or camera-driven attention cannot stably obtain the transformation under noisy camera parameters, mainly because of two challenges: accurate depth prediction and calibration. In this work, we present a completely Multi-Camera Calibration Free Transformer (CFT) for robust BEV representation, which focuses on exploring implicit mapping without relying on camera intrinsics and extrinsics. To guide better feature learning from image views to BEV, CFT mines potential 3D information in BEV via our designed position-aware enhancement (PA). Instead of camera-driven point-wise or global transformation, for interaction within a more effective region and lower computation cost, we propose a view-aware attention which also reduces redundant computation and promotes convergence. CFT achieves 49.7% NDS on the nuScenes detection task leaderboard, which is the first work removing camera parameters, comparable to other geometry-guided methods. Without temporal input and other modal information, CFT achieves second highest performance with a smaller image input of 1600 × 640. Thanks to the view-attention variant, CFT reduces memory and transformer FLOPs for vanilla attention by about 12% and 60%, respectively, with improved NDS by 1.0%. Moreover, its natural robustness to noisy camera parameters makes CFT more competitive.

QNet: A Quantum-native Sequence Encoder Architecture

Authors: Wei Day, Hao-Sheng Chen, Min-Te Sun
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.17262
Pdf link: https://arxiv.org/pdf/2210.17262
Abstract This work investigates how current quantum computers can improve the performance of natural language processing tasks. To achieve this goal, we propose QNet, a novel sequence encoder model that performs inference entirely on the quantum computer using a minimum number of qubits. QNet is inspired by Transformer, the state-of-the-art neural network model based on the attention mechanism to relate the tokens. While the attention mechanism requires time complexity of $O(n^2 \cdot d)$ to perform matrix multiplication operations, QNet has merely $O(n+d)$ quantum circuit depth, where $n$ and $d$ represent the length of the sequence and the embedding size, respectively. To employ QNet on the NISQ devices, ResQNet, a quantum-classical hybrid model composed of several QNet blocks linked by residual connections, is introduced. We evaluate ResQNet on various natural language processing tasks, including text classification, rating score prediction, and named entity recognition. ResQNet exhibits a 6% to 818% performance gain on all these tasks over classical state-of-the-art models using the same embedding dimensions. In summary, this work demonstrates the advantage of quantum computing in natural language processing tasks.
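
To give a feel for the gap between the two asymptotic expressions in this abstract, the small sketch below evaluates the leading terms for one illustrative sequence length and embedding size. The function names and the chosen values of n and d are assumptions. Constant factors are ignored, and the two quantities measure different things (classical operation count versus quantum circuit depth), so this is an order-of-magnitude illustration rather than a runtime comparison.

```python
def attention_ops(n, d):
    """Leading term of the O(n^2 * d) attention cost, constants ignored."""
    return n * n * d

def qnet_depth(n, d):
    """Leading term of the O(n + d) circuit depth, constants ignored."""
    return n + d

n, d = 128, 64  # illustrative sequence length and embedding size
print(attention_ops(n, d))  # 1048576
print(qnet_depth(n, d))     # 192
```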

A Long-term Dependent and Trustworthy Approach to Reactor Accident Prognosis based on Temporal Fusion Transformer

Authors: Chengyuan Li, Zhifang Qiu, Yugao Ma, Meifu Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2210.17298
Pdf link: https://arxiv.org/pdf/2210.17298
Abstract Prognosis of the reactor accident is a crucial way to ensure appropriate strategies are adopted to avoid radioactive releases. However, there is very limited research in the field of nuclear industry. In this paper, we propose a method for accident prognosis based on the Temporal Fusion Transformer (TFT) model with multi-headed self-attention and gating mechanisms. The method utilizes multiple covariates to improve prediction accuracy on the one hand, and quantile regression methods for uncertainty assessment on the other. The method proposed in this paper is applied to the prognosis after loss of coolant accidents (LOCAs) in HPR1000 reactor. Extensive experimental results show that the method surpasses novel deep learning-based prediction methods in terms of prediction accuracy and confidence. Furthermore, the interference experiments with different signal-to-noise ratios and the ablation experiments for static covariates further illustrate that the robustness comes from the ability to extract the features of static and historical covariates. In summary, this work for the first time applies the novel composite deep learning model TFT to the prognosis of key parameters after a reactor accident, and makes a positive contribution to the establishment of a more intelligent and staff-light maintenance method for reactor systems.
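
Quantile regression, which the abstract uses for uncertainty assessment, is conventionally trained with the pinball (quantile) loss. The sketch below shows that standard loss for a single prediction. Whether the paper uses exactly this form is an assumption, and the function name and example numbers are illustrative.

```python
def pinball_loss(y_true, y_pred, q):
    """Standard quantile (pinball) loss for one observation, with 0 < q < 1."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)

# For the 0.9 quantile, predicting too low costs more than predicting too high.
print(pinball_loss(10.0, 8.0, 0.9))   # under-prediction by 2
print(pinball_loss(10.0, 12.0, 0.9))  # over-prediction by 2
```

Training one output per quantile (for example 0.1, 0.5, and 0.9) yields a prediction interval around the median forecast, which is how quantile outputs are commonly turned into an uncertainty estimate.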

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Authors: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.17323
Pdf link: https://arxiv.org/pdf/2210.17323
Abstract Generative Pre-trained Transformer (GPT) models set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs to execute, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute a 175 billion-parameter model inside a single GPU. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 2x when using high-end GPUs (NVIDIA A100) and 4x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
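
The storage figures behind this abstract's single-GPU claim can be checked with simple arithmetic. The sketch below counts raw weight bits only; the function name is an assumption, and quantization metadata (such as scales and zero points), activations, and any cache are deliberately ignored, so real memory use would be somewhat higher.

```python
def weight_storage_gb(num_params, bits_per_weight):
    """Raw weight storage in gigabytes (10^9 bytes), ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS = 175_000_000_000  # 175 billion parameters, as stated in the abstract

print(weight_storage_gb(PARAMS, 16))  # FP16 baseline: 350.0
print(weight_storage_gb(PARAMS, 4))   # 4-bit weights: 87.5
print(weight_storage_gb(PARAMS, 3))   # 3-bit weights: 65.625
```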

Probabilistic Decomposition Transformer for Time Series Forecasting

Authors: Junlong Tong, Liping Xie, Wankou Yang, Kanjian Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.17393
Pdf link: https://arxiv.org/pdf/2210.17393
Abstract Time series forecasting is crucial for many fields, such as disaster warning, weather prediction, and energy consumption. The Transformer-based models are considered to have revolutionized the field of sequence modeling. However, the complex temporal patterns of the time series hinder the model from mining reliable temporal dependencies. Furthermore, the autoregressive form of the Transformer introduces cumulative errors in the inference step. In this paper, we propose the probabilistic decomposition Transformer model that combines the Transformer with a conditional generative model, which provides hierarchical and interpretable probabilistic forecasts for intricate time series. The Transformer is employed to learn temporal patterns and implement primary probabilistic forecasts, while the conditional generative model is used to achieve non-autoregressive hierarchical probabilistic forecasts by introducing latent space feature representations. In addition, the conditional generative model reconstructs typical features of the series, such as seasonality and trend terms, from probability distributions in the latent space to enable complex pattern separation and provide interpretable forecasts. Extensive experiments on several datasets demonstrate the effectiveness and robustness of the proposed model, indicating that it compares favorably with the state of the art.

Max Pooling with Vision Transformers reconciles class and shape in weakly supervised semantic segmentation

Authors: Simone Rossetti (1 and 2), Damiano Zappia (1), Marta Sanzari (2), Marco Schaerf (1 and 2), Fiora Pirri (1 and 2) ((1) DeepPlants, (2) DIAG Sapienza)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2210.17400
Pdf link: https://arxiv.org/pdf/2210.17400
Abstract Weakly Supervised Semantic Segmentation (WSSS) research has explored many directions to improve the typical pipeline CNN plus class activation maps (CAM) plus refinements, given the image-class label as the only supervision. Though the gap with the fully supervised methods is reduced, further abating the spread seems unlikely within this framework. On the other hand, WSSS methods based on Vision Transformers (ViT) have not yet explored valid alternatives to CAM. ViT features have been shown to retain a scene layout and object boundaries in self-supervised learning. To confirm these findings, we prove that the advantages of transformers in self-supervised methods are further strengthened by Global Max Pooling (GMP), which can leverage patch features to negotiate pixel-label probability with class probability. This work proposes a new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM. The presented end-to-end network learns, in a single optimization process, refined shape and proper localization for segmentation masks. Our model outperforms the state-of-the-art on baseline pseudo-masks (BPM), where we achieve $69.3\%$ mIoU on PascalVOC 2012 $val$ set. We show that our approach has the least set of parameters, though obtaining higher accuracy than all other approaches. In a sentence, quantitative and qualitative results of our method reveal that ViT-PCM is an excellent alternative to CNN-CAM based architectures.
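
The role of Global Max Pooling described here can be illustrated with a rough sketch: given per-patch class probabilities, the image-level score for each class is the maximum over patches, so an image-level label can supervise whichever patch responds most strongly. The function name and the toy probabilities are assumptions; the actual ViT-PCM head is not specified in the abstract.

```python
def global_max_pool(patch_probs):
    """Reduce per-patch class probabilities to one score per class.

    patch_probs: list of patches, each a list of class probabilities.
    """
    return [max(class_column) for class_column in zip(*patch_probs)]

# Three hypothetical patches scored over two classes.
patches = [
    [0.1, 0.9],
    [0.7, 0.3],
    [0.2, 0.8],
]
print(global_max_pool(patches))  # [0.7, 0.9]
```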

AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning

Authors: Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, Jianfeng Gao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2210.17451
Pdf link: https://arxiv.org/pdf/2210.17451
Abstract Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks requires updating hundreds of millions to billions of parameters, and storing a large copy of the PLM weights for every task resulting in increased cost for storing, sharing and serving the models. To address this, parameter-efficient fine-tuning (PEFT) techniques were introduced where small trainable components are injected in the PLM and updated during fine-tuning. We propose AdaMix as a general PEFT method that tunes a mixture of adaptation modules -- given the underlying PEFT method of choice -- introduced in each Transformer layer while keeping most of the PLM weights frozen. For instance, AdaMix can leverage a mixture of adapters like Houlsby or a mixture of low rank decomposition matrices like LoRA to improve downstream task performance over the corresponding PEFT methods for fully supervised and few-shot NLU and NLG tasks. Further, we design AdaMix such that it matches the same computational cost and the number of tunable parameters as the underlying PEFT method. By only tuning 0.1-0.2% of PLM parameters, we show that AdaMix outperforms SOTA parameter-efficient fine-tuning and full model fine-tuning for both NLU and NLG tasks.

Leveraging Pre-trained Models for Failure Analysis Triplets Generation

Authors: Kenneth Ezukwoke, Anis Hoayek, Mireille Batton-Hubert, Xavier Boucher, Pascal Gounet, Jerome Adrian
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2210.17497
Pdf link: https://arxiv.org/pdf/2210.17497
Abstract Pre-trained Language Models recently gained traction in the Natural Language Processing (NLP) domain for text summarization, generation and question-answering tasks. This stems from the innovation introduced in Transformer models and their overwhelming performance compared with Recurrent Neural Network Models (Long Short Term Memory (LSTM)). In this paper, we leverage the attention mechanism of pre-trained causal language models such as the Transformer model for the downstream task of generating Failure Analysis Triplets (FATs) - a sequence of steps for analyzing defective components in the semiconductor industry. We compare different transformer models for this generative task and observe that Generative Pre-trained Transformer 2 (GPT2) outperformed other transformer models for the failure analysis triplet generation (FATG) task. In particular, we observe that GPT2 (trained on 1.5B parameters) outperforms pre-trained BERT, BART and GPT3 by a large margin on ROUGE. Furthermore, we introduce the Levenshtein Sequential Evaluation metric (LESE) for better evaluation of the structured FAT data and show that it agrees more closely with human judgment than existing metrics.

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

Nov 01 '22 04:11 DongZhouGu


